Possibly random thoughts of an oddly organized dba with a very short attention span

8.12.2008

Variance Round 2

I re-executed the analysis of the data warehouse job completion times for FY08 and found the results to be very consistent with last year - this in spite of a major hardware and application upgrade.

Here are the steps:

I'm using 11g, so I used a virtual column to calculate the elapsed time:

-- date arithmetic returns days, so multiply by 24*3600 to store seconds
alter table nightly_job_history
add (elapsed_time as ((date_run - start_dat_time)*24*3600)) ;
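
If you're not on 11g yet, the same calculation can be exposed through a view instead of a virtual column - a minimal sketch, and the view name is mine, not part of the original setup:

create or replace view nightly_job_history_v as
select njh.*,
(date_run - start_dat_time)*24*3600 as elapsed_time -- seconds
from nightly_job_history njh ;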

Once the column has been added, check the date range of the data set:


select job_name, start_dat_time, date_run, elapsed_time
from nightly_job_history
where date_run = (select min(date_run) from nightly_job_history)
or date_run = (select max(date_run) from nightly_job_history)
order by logical_date ;

JOB START_DAT_TIME DATE_RUN ELAPSED_TIME
------ ------------------- ------------------- ------------
dw08 1998-02-27 20:18:42 1998-02-27 20:18:46 4
dw34 2008-08-01 17:15:36 2008-08-01 17:15:52 16



Breakdown of job executions by year:



select to_char(date_run, 'YYYY') Year, count(job_name) Executions
from nightly_job_history
group by to_char(date_run, 'YYYY')
order by to_char(date_run, 'YYYY');

YEAR EXECUTIONS
---- ----------
1998 58751
1999 93557
2000 115224
2001 139643
2002 153217
2003 168760
2004 190320
2005 192119
2006 201882
2007 207139
2008 116842

11 rows selected.

Curious that there are fewer job executions recorded for 2008, although that count only covers runs through the start of August (the max date_run above). Are there fewer jobs, or fewer re-executions? Either could be true after the upgrade.
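
One way to separate those two explanations - my sketch, not part of the original analysis - is to count distinct job names alongside total executions per year:

select to_char(date_run, 'YYYY') year,
count(distinct job_name) distinct_jobs,
count(*) executions
from nightly_job_history
group by to_char(date_run, 'YYYY')
order by to_char(date_run, 'YYYY') ;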


And yes, I'm very fortunate to have such an extensive data set to work with, but this is just an average production environment with a third-party scheduling tool. This data should be obtainable in most data centers, and it's well worth seeking out.

Create a table for the data set; it makes it easier to double-check results later. Last year I created several different versions and attempted to identify job runs that had 'special causes', but I found that made little difference in the results. I also used the data to set control limits - basically a line in the data that says any run time longer than x shall be investigated (a sketch of that check follows the table definition below). We'll get back to that later.

create table fy08_job_stats
as
select job_name,
count(job_name) sample_size,
round(avg(elapsed_time),3) average,
round(median(elapsed_time),3) median,
round(stddev(elapsed_time),3) deviation,
round(variance(elapsed_time),3) variance,
round(stddev(elapsed_time)/avg(elapsed_time),3) volatility,
round(variance(elapsed_time)/avg(elapsed_time),3) randomness
from nightly_job_history
where start_dat_time > to_date('2007-07-31 23:59:59', 'YYYY-MM-DD HH24:MI:SS')
and start_dat_time < to_date('2008-08-01 00:00:01', 'YYYY-MM-DD HH24:MI:SS') -- fiscal year end + y/e processing
group by job_name
order by job_name ;
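
And the control-limit check mentioned above - a minimal sketch that flags any FY08 run longer than average + 3 standard deviations for its job (the 3-sigma limit is my assumption, not a fixed rule):

select h.job_name, h.start_dat_time, h.elapsed_time,
s.average, s.deviation,
round(s.average + 3*s.deviation, 3) control_limit
from nightly_job_history h, fy08_job_stats s
where h.job_name = s.job_name
and h.start_dat_time > to_date('2007-07-31 23:59:59', 'YYYY-MM-DD HH24:MI:SS')
and h.start_dat_time < to_date('2008-08-01 00:00:01', 'YYYY-MM-DD HH24:MI:SS')
and h.elapsed_time > s.average + 3*s.deviation
order by h.job_name, h.start_dat_time ;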


Connect the job name with a description of the executable, and require a minimum sample size of 12.

create or replace view fy08_named_job_stats
as
select a.JOB_NAME, b.EXECUTABLE, a.SAMPLE_SIZE, a.AVERAGE, a.MEDIAN,
a.DEVIATION, a.VARIANCE, a.VOLATILITY, a.RANDOMNESS
from fy08_job_stats a, maestro_jobs b
where a.job_name = b.ba_job
and a.sample_size > 11 ;


Find the top 20 jobs by randomness:

select * from (
select JOB_NAME, EXECUTABLE, SAMPLE_SIZE, AVERAGE, MEDIAN, DEVIATION,
VARIANCE, VOLATILITY, RANDOMNESS, RANK()
over (
order by randomness desc
) TopN from fy08_named_job_stats
)
where TopN <= 20
order by randomness desc ;


Find the top 20 jobs by volatility:

select * from (
select JOB_NAME, SAMPLE_SIZE, AVERAGE, MEDIAN, DEVIATION,
VARIANCE, VOLATILITY, RANDOMNESS, RANK()
over (
order by volatility desc
) TopN from fy08_named_job_stats
)
where TopN <= 20
order by volatility desc ;

This year every job with high volatility was a snapshot job, which indicates a process influenced by the quantity of data being processed. The change in data set size is cyclical, increasing at month end, quarter end and/or year end.
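
One hedged way to look at that cycle is to average elapsed time by calendar day of the month for a single job - the job name below is just a placeholder for whichever snapshot job you want to examine:

select to_char(start_dat_time, 'DD') day_of_month,
count(*) executions,
round(avg(elapsed_time), 3) avg_elapsed
from nightly_job_history
where job_name = 'dw34' -- placeholder; substitute one of the snapshot jobs
and start_dat_time > to_date('2007-07-31 23:59:59', 'YYYY-MM-DD HH24:MI:SS')
and start_dat_time < to_date('2008-08-01 00:00:01', 'YYYY-MM-DD HH24:MI:SS')
group by to_char(start_dat_time, 'DD')
order by to_char(start_dat_time, 'DD') ;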

Jobs with a high level of randomness, as indicated by the variance-to-mean ratio, are the ones whose code has the most optimization potential. These jobs tend to use functions that don't perform well with high volumes of data, resulting in abnormally high CPU usage for some executions, and the effect is exaggerated when the functions are executed by multiple processes within the same time frame (sometimes because it's a commonly used function and sometimes because multiple users request the same reports at the same time). For my current project, which does not have any job schedule data, I'm working to add instrumentation that will record both the elapsed time and the CPU elapsed time. That's one thing I don't have in the FY08 data set and I really wish I did ... I'd like to see how much of the increase in overall elapsed time correlates to CPU elapsed time.
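
The instrumentation itself doesn't need to be fancy. Here's a minimal sketch of the kind of thing I have in mind, using DBMS_UTILITY.GET_TIME and DBMS_UTILITY.GET_CPU_TIME (both return hundredths of a second); the log table and procedure names are hypothetical, not something that exists in either environment:

create table job_timing_log (
job_name varchar2(30),
start_time date,
elapsed_secs number,
cpu_secs number
) ;

create or replace procedure log_job_timing (
p_job_name in varchar2,
p_start_time in date,
p_wall_start in number, -- dbms_utility.get_time captured at job start
p_cpu_start in number -- dbms_utility.get_cpu_time captured at job start
) as
begin
-- convert hundredths of a second to seconds; commit is left to the calling job
insert into job_timing_log (job_name, start_time, elapsed_secs, cpu_secs)
values (p_job_name, p_start_time,
(dbms_utility.get_time - p_wall_start) / 100,
(dbms_utility.get_cpu_time - p_cpu_start) / 100) ;
end ;
/

A job would capture both values in its first step and call the procedure in its last.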

There are a few snapshots that made both lists, which is why I asked our Six Sigma guy to run a sign test last year. It showed that the result sets were not correlated, and the results are similar enough to last year that I'm fairly confident we're not correlated again, but I may ask him to check just in case.

