Monday, June 16, 2014

The Joy of Low Hanging Fruit, Part 1

I publish an article in most Oracle Magazine issues, focusing on the PL/SQL language. I love writing these articles and I enjoy getting feedback from readers, but I also feel constrained by the, well, constraints of the Oracle Magazine format (both in terms of length of the article and also the way we can present code).

So I have decided to complement my articles (some, not necessarily all) with an extended, serialized treatment on my blog. In this first series, I walk through the refactoring of some code that has performance issues, and provide all the code for the initial, intermediate and final forms of the program here.

I hope you find this format interesting and helpful. Please comment!

Struggling to Keep Up

I recently spent a few days with a team of developers at (all names changed to protect the innocent), an up-and-coming Web 3.0 paradigm shifter that analyzes Internet memes, tracks them to their source, and best of all predicts new, near-future memes.

They are very knowledgeable about Oracle and PL/SQL, but as is the case with many developers they have little time to explore new features in technology. They mostly struggle to keep up with the demands of their users. And even when they know about some great new feature, it can be a challenge to convince management to commit the resources to apply those features to stable production code.

We interspersed training on PL/SQL techniques with reviews of their code, and in the process came across a program that updates the status of all their memes, and runs in a daily batch process. Unfortunately, as their data volumes have grown, the time it takes to complete that process was taking longer and longer, currently approaching twenty-three hours. Given how uncomfortably close that was to a fully day, we decided to take a closer look.

I was both dismayed and delighted with what I found (em_update_status0.sql):

Listing 1. The original em_update_status

  1  CREATE OR REPLACE PROCEDURE em_update_status
  2  IS
  3     CURSOR incoming_cur
  4     IS
  5        SELECT * FROM em_incoming;
  7     l_mention   em_mentions%ROWTYPE;
  8     l_status    em_memes.meme_status%TYPE;
  9  BEGIN
 10     FOR incoming_r IN incoming_cur
 11     LOOP
 12        BEGIN
 13           SAVEPOINT new_transaction;
 15           l_mention := em_memes_pkg.unpacked_incoming (incoming_r);
 17           INSERT INTO em_mentions (meme_id,
 18                                    source_name,
 19                                    source_details,
 20                                    occurred_on)
 21                VALUES (l_mention.meme_id,
 22                        l_mention.source_name,
 23                        l_mention.source_details,
 24                        l_mention.occurred_on);
 26           em_memes_pkg.reset_meme_status (l_mention.meme_id,
 27                                           l_status);
 29           IF l_status IS NOT NULL
 30           THEN
 31              UPDATE em_memes
 32                 SET meme_status = l_status
 33               WHERE meme_id = l_mention.meme_id;
 34           END IF;
 35        EXCEPTION
 36           WHEN OTHERS
 37           THEN
 38              em_memes_pkg.log_error;
 39              ROLLBACK TO new_transaction;
 40        END;
 41     END LOOP;
 42 END;

First I will explain some background on references to tables and code, then I will explain the key lines in this procedure.

It works with the following three tables. The columns of those tables have been simplified greatly for purposes of this article.

They have a table for all the memes of which they are aware and for which they track the status (note that setting the default value of the primary key to the next sequential value is a 12.1 feature; in 11.2 and earlier, you should use a trigger to get the same effect - you will find the trigger code in em_memes_setup.sql).

   meme_id         INTEGER PRIMARY KEY,
   meme_name       VARCHAR2 (1000) UNIQUE,
   discovered_on   DATE,
   meme_status     VARCHAR2 (100)

The meme_status might be "VIRAL", "DORMANT", "HOAX", etc.

They have a table for all the "mentions" or references to a meme:

CREATE SEQUENCE em_mentions_seq

CREATE TABLE em_mentions
   mention_id       INTEGER
                       DEFAULT em_mentions_seq.NEXTVAL
                       PRIMARY KEY,
   meme_id          INTEGER    REFERENCES em_memes (meme_id),
   source_name      VARCHAR2 (100),
   source_details   CLOB,
   occurred_on       DATE

Note: the ability to specify the next value of a sequence in the DEFAULT clause of a table is new to 12.1.

Mentions are loaded from many different sources and they rely on a staging table to collect together data from all sources:

CREATE TABLE em_incoming
   meme_name         VARCHAR2 (1000),
   source_name       VARCHAR2 (100),
   occurred_on       DATE,
   mention_details   CLOB

The developers at also use the following em_memes_pkg package to log errors and to perform some critical proprietary computations, based on the content of those tables:

   TYPE incoming_t IS TABLE OF em_incoming%ROWTYPE;

   TYPE mentions_t IS TABLE OF em_mentions%ROWTYPE;

   TYPE meme_ids_t IS TABLE OF em_memes.meme_id%TYPE;

   PROCEDURE log_error (
      error_code_in   IN INTEGER DEFAULT SQLCODE,
      error_msg_in    IN VARCHAR2
         DEFAULT DBMS_UTILITY.format_error_stack);

   FUNCTION unpacked_incoming (
      incoming_in   IN em_incoming%ROWTYPE)
      RETURN em_mentions%ROWTYPE;

   PROCEDURE reset_meme_status (
      meme_id_in       IN     em_memes.meme_id%TYPE,
      new_status_out      OUT em_memes.meme_status%TYPE);

For reasons of space, we will not provide the package body (whoops, you caught me! I no longer have constraints on "space", but the real reason I won't show you the body is that it is totally uninteresting, just stubs to make sure the package body will compile), but here is what the elements in the specification provide:

em_memes_pkg .incoming_t
A nested table type that contains rows of data from the em_incoming table
em_memes_pkg .mentions_t
A nested table type that contains rows of data from the em_mentions table
em_memes_pkg .meme_ids_t
A nested table type that contains primary keys from the em_memes table
em_memes_pkg log_error
A generic (and typical) error logging mechanism. Defined as an autonomous transaction, it writes out the basic error information available from the DBMS_UTILITY functions, and more to a log table and then commits just that insert.
A function that converts a row of incoming data (whose mention_details column is an XML document that holds the different formats of source information) into a record that can be inserted into the me_mentions table
The "heart and soul" of the proprietary process, it analyzes the contents of the mentions table and determines the status of the meme.

Here is an explanation of the key parts of em_update_status (see Listing 1):

For each row in the staging table (em_incoming)…
12, 40
Put the entire body of the loop inside its own nested block, so that any exception can be trapped and logged.
Set a savepoint for this new "transaction" (add a mention, update the status).
Convert the incoming row to a record that can be inserted into the mentions table.
Insert a single row into the mentions table.
Compute the new status for the meme, based on the new mention.
If the status isn't NULL, update the memes table with that status
If anything goes wrong, log the error and then erase the effects of either/both the insert and update by rolling back to the savepoint.

We now have all the information we need to analyze and fix the em_update_status procedure.

Before I tell you about my solution (and, preferably, before you read the Oracle Magazine article!), however, I'd like to stop and offer you, dear reader, the opportunity to think through the situation yourself:

1. What do you see as the key problems with the program and perhaps more deeply the table design?

2. What would you do to fix the problem(s)?

No comments:

Post a Comment