Monday, July 14, 2014

The Joy of Low Hanging Fruit, Part 3: Deferred satisfaction key to great code

In the second post of this "Joy" series, I introduced the primary technique we would use to improve the performance of extremememe.info's daily job: bulk processing with FORALL and BULK COLLECT.

I emphasized that whenever you see a loop that contains non-query DML, you should convert it to use these bulk features.

So we've got some problematic code and we know what features of PL/SQL to use. So now it's time to check the documentation, make sure we've the syntax down, and apply the feature, right?

Change the code around and watch the bits and bytes flow like greased lightning!

Not quite.

Sure, we could have done that. But after some 30 years in software, I have grown leery and wary of excitement. When we get all excited about a new feature or a solution to a problem, we tend to then be in a hurry to apply it. When you hurry, you miss critical information. When you hurry, you don't question your assumptions. When you hurry, you are much more likely to make mistakes.

In fact, I have found that deferred satisfaction (holding off as long as you before writing the actual code) has served me well, saved me from creating bigger problems, and definitely saved me time.
With that in mind, let's continue with the story.

Code Review Identifies a Logic Flaw

I shared my dismay (non-query DML inside loop) and delight (a known fix with greatly increased performance) with the extremememe.info team. They were excited about the potential improvements and were eager to get started. Then, however, I noticed something odd about the algorithm they'd presented. Here, let's show you the relevant part of the code; can you see what might be the problem?

The original em_update_status

  1  CREATE OR REPLACE PROCEDURE em_update_status
  2  IS
  3     CURSOR incoming_cur
  4     IS
  5        SELECT * FROM em_incoming;
  6
  7     l_mention   em_mentions%ROWTYPE;
  8     l_status    em_memes.meme_status%TYPE;
  9  BEGIN
 10     FOR incoming_r IN incoming_cur
 11     LOOP
 12        BEGIN
 13           SAVEPOINT new_transaction;
 14
 15           l_mention := em_memes_pkg.unpacked_incoming (incoming_r);
 16
 17           INSERT INTO em_mentions (meme_id,
 18                                    source_name,
 19                                    source_details,
 20                                    occurred_on)
 21                VALUES (l_mention.meme_id,
 22                        l_mention.source_name,
 23                        l_mention.source_details,
 24                        l_mention.occurred_on);
 25
 26           em_memes_pkg.reset_meme_status (l_mention.meme_id,
 27                                           l_status);
 28
 29           IF l_status IS NOT NULL
 30           THEN
 31              UPDATE em_memes
 32                 SET meme_status = l_status
 33               WHERE meme_id = l_mention.meme_id;
 34           END IF;
 35        EXCEPTION
 36           WHEN OTHERS
 37           THEN
 38              em_memes_pkg.log_error;
 39              ROLLBACK TO new_transaction;
 40        END;
 41     END LOOP;
 42 END;

A brief Q&A session soon clarified matters.

"Can the table of incoming data contain more than one reference to the same meme?"

Heads nodded vigorously. They told me that the more "viral" a meme behaved, the more mentions there would be. On some very popular memes, the daily incoming could contain millions of references to the same meme. That led directly to my second question:

"Then why are you updating the meme status after each insert of a mention?" (see lines 26-34)

Now eyebrows shot up and heads turned to look at each other. 

"Wow," said one developer. "Oh. My. Gosh." said another.

The team lead turned to me. "We never thought of that before. We've been doing an enormous amount of unnecessary processing. That's kind of embarrassing. It sure helps having a fresh pair of eyes looking at our code."

Yes, it does.

Never underestimate the power of your brain to hide the obvious from you. Once you've spent a lot of time writing or working with a program, you don't really see the code anymore. You see what you think the code should be. When someone else looks at it, they come at it without any preconceived notions and see very different aspects of the program.

SO REMEMBER!
If you're the only one who's ever looked at your code,
you can be certain it contains bugs,
and perhaps even substantial errors in your algorithms.*

We quickly came to an agreement that the statuses of the memes should only be done after all mentions were inserted. Furthermore, only those memes with new mentions should be updated, so we would need to keep track of those.

The resulting code might be a bit more complicated, but if written correctly we can make the code easy to understand and maintain. Bottom line, though: we are focused on performance and this could be a big help.

* Which reminds me of one of my favorite sayings: If you look around the room and wonder who the chump** is, you are the chump.

** Chump: a foolish or easily deceived person.

Analyze Current Behavior

Before you can think about making changes to a program, you need to make sure that you understand how the current state of the code works - and what it does when errors occur. If you don't do this, how can you be sure that the new version still meets user requirements? Oh, right. You could build a regression test. I address that in the next section.

For now, let's focus simply on the importance of analyzing and understanding (deeply) your code before embarking on changes. This is especially critical for programs that change the contents of tables.

Here is a list of the behaviors we need to reproduce in the new, bulked-up version of reset_meme_status:
  • If the incoming table is empty, then make no changes.
  • If an error occurs when inserting a row into the mentions table, do not update the meme status.      
  • If an error occurs processing either DML statement, log the error and continue processing.


But wait a minute – that certainly reflects the current program, but not the program we want to build. Remember that we figured out that the meme status update should occur after all incoming rows are processed. So in actually the new set of behaviors we need are:

  • If the incoming table is empty, then make no changes.
  • If an error occurs when inserting a row into the mentions table, log the error and continue processing.
  • If an error does not occur, record this meme as being changed.
  • After all incoming rows have been processed, updated the status on all affected memes.

This change actually simplifies greatly the kind of code we will need to write using BULK COLLECT and FORALL. The reason is that FORALL can only be used with a single DML statement. So if my loop contains two DML statements I have to find a way to "communicate" between from FORALL to the next. I will explore this challenge more fully in the next installment of this series.

So now we know what we need to do in the bulked-up version. That means it's time start writing code, right? Wrong.

Build the Regression Test

Now, let's talk about seriously deferring satisfaction.

Let's talk about the importance of building automated regression tests before you embark on a substantial revision to mission-critical code.

The reset_meme_status procedure had been in production for several years. It had been thoroughly tested and was now relied upon implicitly to do its job without errors.

There was no margin for error; the new version had to work just as well (but lots faster). And the only way to ensure that this would be the case was a to build a script that verifies the current behavior, and could then be run against the new version as well.

This is called a regression test. You want to make sure that your code has not regressed (to an earlier, buggier state) after you applied a change.

It isn't easy to build regression tests for database programs. First, you must analyze and document current behavior. Then you need to come up with a list of all the test cases needed to verify that behavior. Then you have to build a script to set up the tables with all the variability of production data. Then you have to write code to check the contents of those tables after your program runs. Finally, you need a way to run all that code so that you can easily identify failures or confirm success.
Some development groups write all of this code by hand. Others use one of several regression test tools, including:

utPLSQL - an open source project that I wrote back in 1999, is still used by many groups around the world and has recently come back to life, under the administration of Paul Walker. You must write your own test packages, but the utPLSQL framework then runs the package and verifies the results.

SQL Developer Unit Testing - Oracle's SQL Developer tool offers integrated unit testing that also allows you to describe the expected behavior of your program and then generates the code needed to verify that behavior.

Code Tester for Oracle - a commercial product from Dell that will generate test packages from your descriptions of expected behavior.

There are others besides these, including PLUTO, PL/Unit and dbFit.

I encourage you to explore these options and to make the greatest effort possible to build regression tests, so that you can avoid getting calls from users that go like this: "Thanks for adding all those new features in version 2.6, but you broke three old features that I rely on every day."

Now, having said all of that, I am also resigned to the fact that it will be the rarest of PL/SQL developers (and teams) who actually make time to build automated regression tests.

I don't do a whole lot of them myself.

So I would like to offer a simple alternative that could give you some nice benefits without taking an enormous amount of time.

Before I do I have a question:

You need to write a new program. Your manager says so. OK, so you think about what you need to do and you start typing. Here's the question: how do you know when you're done?

Seems like an obvious question. But the answer often comes down to a variation of:

"I'll know it when I see it."

Unfortunately, our vision is all too often blurred by the pressures we face to get our work done and our programs coded.

So here is my suggestion: before you start a new program, take out a piece of paper and ask yourself "How will I know when I'm done?"

Write down brief descriptions of distinct scenarios the program should handle and what should happen. Don't worry about too many details; the bulk of the value of this step is to externalize the list from your head (very bad place to store a to-do list when you are pressed for time)

Then (finally), you write your code, and you check off the items on the list as you go.

The checklist would contain reminders of key best practices, but more importantly lists out the things your program should be able to do before you declare it done.

Of course, the most important items on the checklist are taken from the previous section: what does the new version of the program need to do?

Task
Completed?
Program Requirements

If the incoming table is empty, then make no changes.

If an error occurs when inserting a row into the mentions table, log the error and continue processing.

If an error does not occur, record this meme as being changed.

After all incoming rows have been processed, updated the status on all affected memes.


Best Practices

Add instrumentation to trace program activity.

Avoid hard-coded declarations. Use %TYPE or subtypes.

Validate all assumptions you are making (and document them).


I realize that when you look at my little checklist here, it can seem kind of obvious. Actually, that is (partly) the point.

These things should be obvious, they often are obvious, but it is so easy to forget them, to push hard and fast along the path to production, and inadvertently leave unimplemented requirements in your wake. Having a piece of paper (or, better, a tool) to help you remember can make an enormous difference.

OK, so remember what I said about deferred satisfaction?

Here's another deferral: I will post this next episode of my series without even getting to bulk processing!

In my next article, though, I promise to offer a detailed look at how bulk processing involves a shift from row by row to phased processing, and what effect that can have on your code.

Until then, happy PL/SQL coding!

Steven
steven.feuerstein@oracle.com


No comments:

Post a Comment