Possibly random thoughts of an oddly organized dba with a very short attention span

1.26.2009

30 days in the hole ...

So ... not really 30 days (21 to be precise) but it feels like forever.

Late November, early December (while I was in the U.K.), several sites were installed with an updated version of one of the tools. Everything seemed to be going well, until all the databases ran out of space at roughly the same time. (We lost a team member a few months ago and he used to monitor sites until they achieved steady state. god I miss him right now.) So we added lots more storage and the support team set about executing jobs to get the data current.

Within a day or two, there are performance complaints ... really loud performance complaints. Some processes were taking 10 times as long to execute. One site was rumored to still be processing after 30 hours. And as usually happens, the support communications indicated that all processing was always slow, which makes it really difficult to identify the problem. (honestly ... if you can tell your DBA that a job was slow on Tues, but it ran fine the Tues prior, and maybe even give reasonably accurate time estimates, your DBA will adore you. maybe even make you cookies.)

Instrumentation to the rescue! This isn't what I intended to do with it, but checking the elapsed time data prior to the storage failure, things looked pretty good: (times shown in seconds; VMR is the variance-to-mean ratio)

as of Dec 31 2008

ILO_MODULE AVERAGE VMR CPU_AVERAGE CPU_VMR
------------- -------- ---------- ----------- ----------
Load Minutes 3428.565 3583.708 1137.632 888.786
Daily A 104.142 822.448 91.190 833.984
Daily B 832.647 442.812 7.895 .826
Daily C 822.274 557.284 19.967 1.209
Daily Counts .970 1.687 .243 .016
Weekly A 75.454 876.928 73.003 879.594
Weekly B 1210.030 434.168 20.778 2.806
Weekly C 1346.397 429.613 10.459 .582
Monthly A 35.718 388.126 34.149 390.315
Monthly B 1712.982 851.974 22.109 1.912
Monthly C 2500.288 1662.503 11.067 .283



Ha! Proof positive that things were running well at an earlier point in time. However, it was clear that processing had gone south shortly thereafter:

as of Jan 12 2009

ILO_MODULE AVERAGE VMR CPU_AVERAGE CPU_VMR
------------- -------- ---------- ----------- ----------
Load Minutes 6166.723 15667.593 2998.772 8468.038
Daily A 94.692 785.182 80.864 796.596
Daily B 1075.017 673.456 6.945 .553
Daily C 995.049 794.861 18.996 .948
Daily Counts .995 1.483 .284 .026
Weekly A 65.701 875.962 63.121 878.844
Weekly B 1465.957 629.084 18.045 2.017
Weekly C 1619.834 709.329 9.337 .426
Monthly A 31.170 386.997 29.547 389.259
Monthly B 1983.282 866.176 19.172 1.443
Monthly C 2757.768 1484.035 9.916 .227
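
(For the curious: both summaries above come from a query in roughly this spirit. The repository table name and the go_time column below are my stand-ins, so treat it as a sketch rather than the actual script; only ilo_module and the elapsed/cpu figures come straight from the output.)

select ilo_module,
       round(avg(elapsed_time), 3)                          average,
       round(variance(elapsed_time) / avg(elapsed_time), 3) vmr,
       round(avg(cpu_time), 3)                              cpu_average,
       round(variance(cpu_time) / avg(cpu_time), 3)         cpu_vmr
  from ilo_elapsed_time               -- placeholder repository table
 where go_time <= to_date('&as_of_date', 'MM/DD/YYYY')
 group by ilo_module
 order by ilo_module;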



The Load Minutes process clearly shows an order of magnitude increase in variation. Checking the times for just that process, look at what happened on Jan 9th:

SQL> @ET_per_module_by_day.sql
Enter value for ilo_module: Load Minutes
old 7: where ilo_module = '&ilo_module'
new 7: where ilo_module = 'Load Minutes'

RUN_TIME ILO_ACTION ELAPSED_TIME CPU_TIME
---------- -------------- ------------ --------
2008-11-26 11/23/2008 28.57 27.16
2008-11-27 11/24/2008 31.34 28.49
2008-11-28 11/25/2008 11645.59 11514.59
2008-11-29 11/26/2008 20265.22 19317.15
2008-11-30 11/27/2008 4572.09 4542.39
2008-12-01 11/27/2008 3266.5 175.91
2008-12-01 11/28/2008 1151.75 112.01
2008-12-02 11/29/2008 2488.95 199.6
2008-12-03 11/30/2008 2687.63 208.92
2008-12-04 12/01/2008 1984.41 164.51
2008-12-05 12/02/2008 577.16 248.29
2008-12-06 12/03/2008 387.52 244.52
2008-12-07 12/04/2008 382.82 243.59
2008-12-08 12/02/2008 2863.44 170.22
2008-12-08 12/05/2008 199.63 39.66
2008-12-10 12/03/2008 2223.74 174.37
2008-12-10 12/04/2008 2368.83 187.21
2008-12-10 12/06/2008 2607.61 194.31
2008-12-11 12/05/2008 3007.1 199.94
2008-12-11 12/07/2008 3494.08 230.27
2008-12-11 12/08/2008 2964.79 191.71
2008-12-12 12/09/2008 3134.21 205.42
2008-12-13 12/10/2008 3203.94 214.89
2008-12-14 12/11/2008 3389.64 229.35
2008-12-15 12/12/2008 3307.51 235.6
2008-12-16 12/13/2008 3865.3 253.49
2008-12-17 12/14/2008 4245.49 263.02
2008-12-18 12/15/2008 3907.79 243.07
2008-12-19 12/16/2008 3942.16 253.25
2008-12-20 12/17/2008 4211.94 263.08
2008-12-21 12/18/2008 4155.62 264.56
2008-12-22 12/19/2008 4223.73 265.13
2008-12-23 12/20/2008 4803.58 292.74
2008-12-24 12/21/2008 5094.34 299.39
2008-12-28 12/25/2008 2527.33 145.23
2008-12-29 12/26/2008 2493.81 137.34
2009-01-03 12/31/2008 2554.95 140.18
2009-01-04 01/01/2009 2446.72 44.77
2009-01-05 01/02/2009 2461.68 45.54
2009-01-08 01/01/2009 5061.06 202.28
2009-01-09 01/06/2009 40109.15 25982.18
2009-01-10 01/07/2009 23231.61 17965.96
2009-01-11 01/08/2009 30444.29 25191.01
2009-01-12 01/09/2009 44336.19 23280.41
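
(ET_per_module_by_day.sql itself isn't reproduced here; it's essentially a filter and sort over the same repository, something along these lines. Again, the table and date column names are my stand-ins.)

select to_char(go_time, 'YYYY-MM-DD') run_time,
       ilo_action,
       round(elapsed_time, 2)         elapsed_time,
       round(cpu_time, 2)             cpu_time
  from ilo_elapsed_time               -- placeholder repository table
 where ilo_module = '&ilo_module'
 order by go_time;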



So what happened on this very special day? And, by the way, why did it happen on multiple, remote, disconnected sites simultaneously? I checked for changes, checked for trace files, and found a few that mentioned an ORA-12842 error. So I checked a few more sites, and while I didn't necessarily get the error message every time, I was able to confirm that we were no longer getting parallel processing every time. This resulted in jobs that normally executed in 60 to 90 minutes taking an order of magnitude longer to run.
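
One quick way to eyeball whether a statement is still getting its parallel servers (not necessarily how I confirmed it, just a sanity check you can run) is to compare executions against px server usage in v$sql:

-- if executions keeps climbing while px_servers_executions does not,
-- the statement has started running serially
select sql_id,
       executions,
       px_servers_executions,
       round(elapsed_time / 1e6 / greatest(executions, 1), 1) avg_elapsed_secs
  from v$sql
 where sql_text like '%LOAD_MINUTES%'   -- placeholder filter for the job's SQL
 order by last_active_time desc;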

This is a bug and there is apparently also a patch, but if I'm reading the notes on Metalink correctly (751588.1), the patch will cause the process to fail so it can be rescheduled, hopefully staying in parallel mode the second time. This doesn't help us much: this product is supposed to execute completely unattended, so no one will know the process has failed until the data is missing. (if a process fails in a forest ... yeah, it really fails, no philosophical meandering to get out of this one.) And we don't have the time or staff to code something to recognize this issue and reschedule it. All hands are on other decks.

However, the thing that causes the process to stop operating in parallel is a ddl statement touching an object involved in the parallel query. Interesting ... we have jobs that manage partitions, and dbms_stats can trigger the bug too, but both job types are normally scheduled outside the summary window.
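
To make the collision concrete, it looks roughly like this (the table, hint, and partition names are made up for illustration):

-- session 1: the long-running parallel summarization
select /*+ parallel(f, 8) */ site_id, sum(minutes)
  from load_fact f
 group by site_id;

-- session 2, while session 1 is still running: ddl against the same table ...
alter table load_fact add partition p_20090110
  values less than (to_date('2009-01-11', 'YYYY-MM-DD'));

-- ... or a stats gather, which also invalidates the cursor
exec dbms_stats.gather_table_stats(user, 'LOAD_FACT')

-- session 1 may then throw ORA-12842, or just lose its parallel servers,
-- which is what we were seeing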

This brings us to the cause of the massive performance catastrophe on Jan 9th. The databases ran out of storage, lots of space got added, and now the catch-up processes were being executed by people, and people don't do things the same way every time. This increased the chance that a summarization process would collide with an add-partitions job or a gather-stats process.

If a process stops running in parallel when a ddl-type statement touches one of the objects involved, then the collision is preventable. So I increased the serialization of the processes in order to maximize parallelization. (i love a dichotomy) By incorporating the management tasks directly into the summarization procedure, there are no conflicting job schedules, no matter when the summarizations execute or who executes them.
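
The general shape of the change, heavily simplified (these procedure and helper names are made up, not the product's actual code):

create or replace procedure run_daily_summary (p_run_date in date)
as
begin
   -- do the ddl-ish housekeeping first, inside the same job,
   -- so it can never overlap the parallel summarization
   add_partitions(p_run_date);          -- partition maintenance (ddl)
   gather_summary_stats(p_run_date);    -- stats gathering (also invalidates cursors)
   -- only now does the parallel summarization run
   summarize_load_minutes(p_run_date);
end run_daily_summary;
/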

So, this gets coded and implemented, and everything runs beautifully ... for two whole days.

stay tuned ...


disclaimer ...

The views expressed on this page are mine and do not reflect the opinions of my employer, former employers, other associates or the opinions I may hold 3 days from now.