Those who have read my previous posts know I've got a few favorite topics:
1. Data Models
This post will be short, but it touches on all of my favorite things. Over the last year and a half, I've had the opportunity to redesign a not-so-great application alongside the amazingly great developers at Method-R. In the process, we redesigned the data model, rewrote the procedural code, and instrumented the code with the ILO package. Next, I added the elapsed time module I mentioned in chapter 13 of Expert Oracle Practices: Oracle Database Administration from the Oak Table. (The elapsed time module is available for download at Apress.)
We are now building a full system capacity test environment that includes the redesigned database on the new, appropriately sized hardware soon to be installed in production, and we will be testing much larger loads than the customer ever expects to generate. The 11g spatial database is just one piece of the complete system: the instrumentation package captures the elapsed time within the database, while the total system/application time is captured externally. This means that not only can we measure exactly what is happening in the database, we can also compare it to the time the application *thinks* it spends in the database, and determine how much time is spent communicating between the systems.
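The arithmetic behind that comparison is simple subtraction. Here's a minimal sketch in Python (the numbers are made up for illustration; the real values would come from the application's external timers and the ILO elapsed time tables):

```python
# Hypothetical illustration: the application's externally measured
# "database" time includes the round trips between systems, while the
# in-database instrumentation captures only time spent inside the
# database. The difference approximates communication overhead.

app_reported_db_time = 0.412   # seconds, measured by the application layer (made-up value)
ilo_db_elapsed_time = 0.389    # seconds, captured inside the database (made-up value)

transit_time = app_reported_db_time - ilo_db_elapsed_time
print(f"time spent between application and database: {transit_time:.3f} s")
```

If that difference grows while the in-database time stays flat, the problem is in the plumbing, not the database.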
If you've seen my presentations, you may remember that the variance numbers on the slides are ridiculously large. This is partly because of some very bad performance bottlenecks and some very bad code, but it's also related to the overall length of the process. After all, the bigger the window, the more room there is for the unexpected to slip in and thoroughly screw up your performance.
This new application is a bunch of short, quick procedures, so the performance profile is very different from the long-running jobs I've discussed in the past. What already stands out, though, is how consistently the processes are performing.
Total Elapsed Time: (In seconds, rounded to 3 decimal places)
These numbers are well under our average run time requirements; in fact, they're at least 100 times faster than our initial goal. You'll notice that only 3 of the procedures have enough variance to survive the 3-decimal-place rounding, and one of those 3 is the parent process for the other 2. Since most of the queries are very quick and variance doesn't even register on most of the processes, I've added two new calculations: the minimum and maximum processing times.
The variance in child processes 1 and 2 was expected, by the way. These two procedures are responsible for locating and assigning values within specific, sometimes narrow, ranges. As available values decrease, these identities become scarce and the procedure will try multiple times to find an acceptable value. In some cases, the process may fail to find a match at all, so the failures take longer than the successes. In this case, the instrumentation confirmed our expectations of where the next bottleneck will appear. The instrumentation has also shown that this bottleneck is not a factor anywhere close to normal system loads, as repeated retries only occur when the resource is scarce AND the procedure is called in a rapid-fire manner. Neither condition is part of the typical execution pattern.
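To make the retry behavior concrete, here's a simplified sketch in Python of that kind of scarce-value assignment. This is not the actual procedure (which is PL/SQL against a spatial database); the function name, pool structure, and attempt limit are all hypothetical, chosen only to show why elapsed time climbs as the pool empties:

```python
import random

def assign_value(available, value_range, max_attempts=10):
    """Try to claim a value inside value_range from the available pool.

    Returns (value, attempts) on success, or (None, attempts) when no
    acceptable value is found. Each extra attempt adds elapsed time,
    which is why failures take longer than successes.
    """
    lo, hi = value_range
    for attempt in range(1, max_attempts + 1):
        candidate = random.randint(lo, hi)
        if candidate in available:
            available.discard(candidate)   # claim it
            return candidate, attempt
    return None, max_attempts

# When the pool is rich, the first attempt usually succeeds; as the
# range is consumed, the attempt count (and the elapsed time) climbs.
nearly_exhausted = set(range(100, 105))    # 5 values left in a 100-value range
value, attempts = assign_value(nearly_exhausted, (100, 199))
print(value, attempts)
```

Under normal loads the pool is rarely this depleted, which matches what the instrumentation showed: retries only matter when scarcity and rapid-fire calls coincide.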
Average is a mostly useless number in my opinion: I want to see the best and the worst performance a customer can experience, and the size of the elapsed time range tells me whether there's a potential problem lurking in the system. Variance is still the best way to measure disparate performance levels, as it gives you a single number to reference, but min and max will suffice. I could also extend the decimal places waaaay out there, but that's a little challenging to read on a blog.
Since this is 11g, it's also possible to capture CPU elapsed time, although it's an incredibly dull set of numbers in this case:
CPU Elapsed Time: (In seconds, rounded to 3 decimal places)
These measurements are captured in addition to those taken by the system-wide testing tool that measures the complete process, but this data gives me specific information to compare database performance against the overall results. Elapsed time recording will remain enabled continuously in production, as its overhead is extremely low. Plus, I can compare the ILO elapsed time data to other performance indicators (DB Time, AWR) in case we do see some unexpected behavior in production. (And people are much less likely to make unsubstantiated claims of poor database performance when they know I've got real numbers to verify against.)
So we've managed to turn one of the most poorly designed databases I'd ever seen into a very quick, very scalable, very consistently performing application. But how?
The primary factors in this application's solid performance are the data model and the application code. When I first began working with this database, processing time was highly variable and the primary process was not capable of EVER meeting the performance target. We knew the hardware platform was inappropriate for the spatial processes, but we didn't have the option to upgrade the hardware at the time. By changing the data model and the application code, the database was able to meet its original performance goals in spite of the less-than-optimal hardware. Now we have the opportunity to combine appropriate hardware with good code, resulting in an application that will scale.
I started this post about two weeks ago, so I've already got numbers for the first round of testing. They'll be posted soon.