Possibly random thoughts of an oddly organized dba with a very short attention span

6.18.2010

Instrumentation rocks!!

Those who have read my previous posts know I've got a few favorite topics:

1. Data Models
2. Measuring
3. Variance
4. Instrumentation

This post will be short, but it touches on all of my favorite things. Over the last year and a half, I've had the opportunity to redesign a not-so-great application, working with the amazingly great developers at Method-R. In the process, we redesigned the data model, rewrote the procedural code, and instrumented the code with the ILO package. Next, I added the elapsed time module I mentioned in Chapter 13 of Expert Oracle Practices: Oracle Database Administration from the Oak Table. (The elapsed time module is available for download at Apress.)
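For anyone who hasn't seen ILO, the basic pattern is simply to bracket each unit of work with a begin-task/end-task call so the database records exactly where the time goes. The sketch below shows the idea; the procedure name is made up and the parameter details are from memory of the ILO distribution, so check the version you download rather than trusting me.

-- Sketch only: a hypothetical procedure wrapped in ILO task calls.
-- Verify the exact BEGIN_TASK/END_TASK signatures against your ILO version.
CREATE OR REPLACE PROCEDURE assign_value (p_request_id IN NUMBER) AS
BEGIN
   ilo_task.begin_task(module => 'ASSIGNMENT',
                       action => 'assign_value');

   -- ... the real work goes here ...

   ilo_task.end_task;
EXCEPTION
   WHEN OTHERS THEN
      ilo_task.end_task(error_num => SQLCODE);
      RAISE;
END assign_value;
/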

We are now building a full system capacity test environment that includes the redesigned database on the new, appropriately sized hardware soon to be installed in production, and we will be testing much larger loads than the customer ever expects to reach. The 11g spatial database is just one piece of the complete system, but the instrumentation package is capturing the elapsed time within the database while the total system/application time is captured externally. This means that not only can we measure exactly what is happening in the database, we can also compare it to the time the application *thinks* it spent in the database and determine how much time is spent communicating between the systems.
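As a rough illustration of that comparison (the table and column names below are hypothetical stand-ins, not our actual schema), the externally measured call time minus the in-database elapsed time approximates the communication overhead:

-- Hypothetical comparison: app_call_log holds the externally measured
-- times, ilo_elapsed_log holds the in-database times for the same calls.
SELECT a.call_id,
       a.elapsed_sec                 AS app_measured_sec,
       d.elapsed_sec                 AS in_database_sec,
       a.elapsed_sec - d.elapsed_sec AS communication_sec
FROM   app_call_log    a
JOIN   ilo_elapsed_log d ON d.call_id = a.call_id;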

If you've seen my presentations, you may remember that the variance numbers on the slides are ridiculously large. This is partly because of some very bad performance bottlenecks and some very bad code, but it's also related to the overall length of the process. After all, the bigger the window, the more room there is for the unexpected to slip in and thoroughly screw up your performance.

This new application is a bunch of short, quick procedures, so the performance profile is very different from that of the long-running jobs I've discussed in the past, but what already stands out is how consistently the processes are performing.

Total Elapsed Time: (In seconds, rounded to 3 decimal places)

[table of per-procedure elapsed times not preserved]
These numbers are well under our average run time requirements, as in at least 100 times faster than our initial goal. You'll notice that only 3 of the procedures have enough variance to survive the 3-decimal place rounding, and one of those 3 is the parent process for the other 2. Since most of the queries are very quick and variance doesn't even register on most of the processes, I've added two new calculations: the minimum and maximum processing times.
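The summary itself is nothing fancy: one aggregate query over the recorded timings, along the lines of the one below. Here elapsed_time_log is a stand-in name for the repository table the instrumentation writes to, not necessarily what ships with ILO.

-- Per-procedure timing summary; elapsed_time_log is a placeholder name.
SELECT module,
       action,
       COUNT(*)                        AS executions,
       ROUND(MIN(elapsed_sec), 3)      AS min_sec,
       ROUND(MAX(elapsed_sec), 3)      AS max_sec,
       ROUND(AVG(elapsed_sec), 3)      AS avg_sec,
       ROUND(VARIANCE(elapsed_sec), 3) AS variance_sec
FROM   elapsed_time_log
GROUP  BY module, action
ORDER  BY module, action;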

The variance in child processes 1 and 2 was expected, by the way. These two procedures are responsible for locating and assigning values within specific, sometimes narrow, ranges. As available values decrease, these identities become scarce and the procedure will try multiple times to find an acceptable value. In some cases, the process may fail to find a match at all, so the failures take longer than the successes. In this case, the instrumentation confirmed our expectations of where the next bottleneck will appear. The instrumentation has also shown that this bottleneck is not a factor at anything close to normal system loads, as repeated retries only occur when the resource is scarce AND the procedure is called in a rapid-fire manner. Neither condition is the typical execution pattern.
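In rough outline, those procedures behave like the sketch below: keep picking candidates until one is acceptable or the attempts run out, so a scarce range means more iterations and more elapsed time. The candidate and acceptance checks are simulated with DBMS_RANDOM purely for illustration; the real procedures are nothing like this simple.

-- Rough shape of the retry behavior described above (illustration only).
DECLARE
   l_value    NUMBER;
   l_attempts PLS_INTEGER := 0;
BEGIN
   LOOP
      l_attempts := l_attempts + 1;
      -- pick a candidate from a (pretend) narrow range
      l_value := TRUNC(DBMS_RANDOM.VALUE(1, 100));
      -- pretend only a few of the candidates are still available
      EXIT WHEN MOD(l_value, 17) = 0;
      IF l_attempts >= 10 THEN
         l_value := NULL;   -- give up: no acceptable value found
         EXIT;
      END IF;
   END LOOP;
   DBMS_OUTPUT.PUT_LINE('attempts = ' || l_attempts ||
                        ', value = ' || NVL(TO_CHAR(l_value), 'none'));
END;
/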

Average is a mostly useless number in my opinion: I want to see the best and the worst performance a customer can experience, and the size of the elapsed time range tells me whether there's a potential problem lurking in the system. Variance is still the best way to measure disparate performance levels, as it gives you a single number to reference, but min and max will suffice. I could also extend the decimal places waaaay out there, but that's a little challenging to read on a blog.

Since this is 11g, it's also possible to capture the CPU elapsed time, although it's an incredibly dull set of numbers in this case:

CPU Elapsed Time: (In seconds, rounded to 3 decimal places)

[table of per-procedure CPU times not preserved]
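ILO records the CPU figures for us, but there's nothing exotic underneath; a standalone sketch using DBMS_UTILITY.GET_CPU_TIME, which reports the session's CPU usage in hundredths of a second, looks like this:

-- Standalone CPU-time capture sketch (ILO does the equivalent internally).
DECLARE
   l_cpu_start NUMBER;
   l_cpu_sec   NUMBER;
BEGIN
   l_cpu_start := DBMS_UTILITY.GET_CPU_TIME;

   -- ... call the procedure being measured here ...

   l_cpu_sec := (DBMS_UTILITY.GET_CPU_TIME - l_cpu_start) / 100;
   DBMS_OUTPUT.PUT_LINE('CPU seconds: ' || ROUND(l_cpu_sec, 3));
END;
/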
These measurements are captured in addition to those taken by the system-wide testing tool that measures the complete process, but this data gives me specific information to compare database performance against the overall results. Elapsed time recording will be enabled continuously in production as its overhead is extremely low. Plus, I can compare the ILO elapsed time data to other performance indicators (DB Time, AWR) in case we do see some unexpected behavior in production. (And people are much less likely to make unsubstantiated claims of poor database performance when they know I've got real numbers to verify against.)
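One of those cross-checks is as simple as the time model: v$sys_time_model shows instance-wide DB time and DB CPU right now, and AWR keeps the same statistics historically in DBA_HIST_SYS_TIME_MODEL.

-- Instance-wide DB time / DB CPU from the time model (values are in
-- microseconds), for comparison against the ILO elapsed time data.
SELECT stat_name,
       ROUND(value / 1000000, 1) AS seconds
FROM   v$sys_time_model
WHERE  stat_name IN ('DB time', 'DB CPU');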

So we've managed to turn one of the most poorly designed databases I'd ever seen into a very quick, very scalable, very consistently performing application. But how?

The primary factors in this application's solid performance are the data model and the application code. When I first began working with this database, processing time was highly variable and the primary process was not capable of EVER meeting the performance target. We knew the hardware platform was inappropriate for the spatial processes, but we didn't have the option to upgrade hardware at the time. By changing the data model and the application code, the database was able to meet its original performance goals in spite of the less-than-optimal hardware. Now we have the opportunity to combine appropriate hardware with good code, resulting in an application that will scale.

I started this post about two weeks ago, so I've already got numbers for the first round of testing. They'll be posted soon.

2 comments:

Debra Lilley said...

I like to read these blog postings from my friends but rarely comment, as the deep tech stuff is beyond my reach, but your comment about not having the option to upgrade the hardware really resonated with me.

In January, at the big Sun launch, I overheard a senior exec say that database tuning would be redundant with Exadata. I know Exadata has a very important role to play and, as mutual friends would tell us, 'It Rocks'; but your experience here proves that tuning is not dead and the payoff is well worth the investment.

Robyn said...

Hello Debra,

Good to hear from you! While Exadata may reduce the need for tuning or make a poor design less painful, a smart CIO/architect/DBA/developer will want to combine the best of both worlds. Otherwise they would get less return on their hardware investment than they should, and anyone shelling out the dollars (or pounds) for Exadata ought to be aiming to get the most for them.

In our case, we were able to optimize the schema and the code so that it performed to requirements in spite of inadequate hardware. And as difficult as it may be to believe, we were able to update the software faster than we could have gotten new hardware approved and installed.

Then the customer requested new functionality in a pending release, and that new functionality made new hardware mandatory. Seeing the existing code perform on the new hardware was an eye-opening experience for many on the team. The combination of a good design on the right hardware scaled well beyond expectations.

Our application will never justify Exadata - it's unlikely the database will ever be anywhere near large enough to justify spending that much money, and there are loads of other databases out there that 'probably don't need Exadata'. However, how could someone possibly justify buying Exadata without optimizing their design as part of the upgrade? Think how much more could be done with an optimized design AND the massive machine.

Will 'tuning' become redundant with Exadata? Maybe. I really haven't done anything on these servers that would fall into the standard definition of tuning.

Will optimized designs become unnecessary? Never. Choosing the right schema, the right hardware, the right code and the right interface are important not just for performance, but to ensure that IT isn't throwing money away.

Hope to see you again soon Debra - Open World is coming up so we have to plan a lunch. We just need to avoid any presidential visits this go-round :)

cheers ... Robyn

disclaimer ...

The views expressed on this page are mine and do not reflect the opinions of my employer, former employers, other associates or the opinions I may hold 3 days from now.