Monday, June 30, 2014

The Joy of Low Hanging Fruit, Part 2: the Dismay and the Delight

In the first post of this "Joy" series, I explained the problem: a procedure that runs in a daily batch job threatening to take more than a day to complete. Not a good situation.

I invited readers to analyze the situation and come up with their own solution before I continued. 

Now I continue!

My Dismay and My Delight – First, the Dismay

I was dismayed when I saw the implementation of reset_meme_status because I found in that procedure a cursor FOR loop that contained two non-query DML statements (. That is a classic "anti-pattern", meaning a general pattern of coding that should be avoided.

It should be avoided because the inserts and updates are changing the tables on a row-by-row basis, which maximizes the number of context switches and consequently greatly slows the performance of the code. Fortunately, this classic anti-pattern has a classic, well-defined solution: use BULK COLLECT and FORALL to switch from row-by-row processing to bulk processing.

Before moving on to explaining why it is also my delight, let's take a closer look at the anti-pattern. I am suggesting to you that you have a problem whenever you see code like this:

BEGIN
   FOR my_record IN (SELECT ...)
   LOOP
      INSERT ....
         and/or
      UPDATE ...
         and/or
      DELETE ...
   END LOOP;
END; 
 
Or like this:

BEGIN
   FOR my_index IN low_value .. high_value
   LOOP
      INSERT ....
         and/or
      UPDATE ...
         and/or
      DELETE ...
   END LOOP;
END; 

Furthermore, those SQL statements could be static (fixed at compile time, also referred to as embedded SQL) or dynamic (finalized at run time).

You might be wondering: why should this code cause a problem? Why should it run the SQL slowly? PL/SQL is supposed to be tightly integrated with SQL. Shouldn't Oracle be able to optimize this sort of code to run blazingly fast?

Well, let's give Oracle its due: it does optimize this code to run blazingly fast. The Oracle Database, SQL and PL/SQL are all very efficient. But however hot the fire, it can always get hotter right? So Oracle has built features into PL/SQL to make your code run even faster (hey, I would have said that even before I rejoined Oracle!).

The fundamental cause for the relative slowness of row-by-row-processing is that SQL statements are executed by a different runtime engine from PL/SQL statements. So Oracle must "switch" between them; these are called context switches and Oracle has paid lots of attention over the years to reduce the cost of a single context switch. That's good, because we are switching back and forth all the time.

For example, I create a simple function on top of SUBSTR:

CREATE OR REPLACE FUNCTION betwnstr (
   string_in   IN   VARCHAR2
 , start_in    IN   INTEGER
 , end_in      IN   INTEGER
)
   RETURN VARCHAR2
IS
BEGIN
   RETURN (SUBSTR (string_in, start_in, end_in - start_in + 1));
END;
/

I can then call it inside an SQL statement:
BEGIN
   FOR rec IN (SELECT betwnstr (last_name, 1, 5)
                 FROM employees)
   LOOP
      NULL;
   END LOOP;
END;

In this case, Oracle must switch from the SQL engine to the PL/SQL engine each time betwnstr is called.

Old timers like me can remember back in the days of Oracle8 and Oracle8i, when you rarely if ever called your own (aka, user-defined) function inside SQL, like I showed you above. It was simply too slow.

Now, we do this all the time, because Oracle sliced the time it takes to do the switch. And continues to look for new ways to do. Which is why, in Oracle Database 12c, you can use the WITH clause to define a function right inside your SQL statement and you can use the UDF pragma in your function definition: two techniques to reduce some context-switching overhead.


Still, another way to get around that overhead is to reduce the number of context switches needed. And that's what the bulk processing features of PL/SQL offer to us.


Consider the following figure:


I have a cursor FOR loop executing an update statement.

Each time the PL/SQL engine gets to the update, it stops, prepares the bind variables, and then passes a packet of info (the SQL string, the variables, and other data as well) to the SQL engine. The SQL engine "does its thing" and passes back a result set (or an error). This occurs for each update statement, whether there be two or two thousand.

Hency, my dismay with extremememe.info's code.

OK, Back to My Delight

The presence of this anti-pattern was also the source of my delight. The developers needed to improve the performance of the procedure significantly. That could be very hard to do, if they have already taken full advantage of performance optimization features in SQL and PL/SQL. Since their code still contained non-query DML inside loops, I knew that was not the case and in fact we were looking at a case of  "low hanging fruit" (a goal that is easily achievable).

It was possible that we could solve their problem with a relatively straightforward change to their existing code. Conesquently, they might not need to make changes to their database design, which would have much greater impact and incur a much greater cost of development and testing.

Well, you've seen the anti-pattern: non-query DML inside the loop. What's the (anti-anti-) pattern? We already saw how Oracle has gone to great lengths to reduce the cost of a context switch. So the next obvious things to look at is: how can we reduce the number of context switches? The answer is: BULK COLLECT and FORALL (bulk processing).

This next figure demonstrates at a very high level the performance potential offered by bulk processing.


I have replaced my cursor FOR loop with a FORALL statement. Under or within the header of FORALL I attach my DML statement.

When the PL/SQL engine hits the meta-DML FORALL statement, it says:

"Oh, I know what you want me to do: for each element in the driving array, generate a DML statement with bind variables set from the array.
"Then I will pass the whole set of DML statements to the SQL engine with one context switch.
"The SQL engine does whatever it is supposed to do (that's none of my business) and then I get the results back....
"Way faster!"
Seriously, that is exactly how the PL/SQL engine talks. I should know. We've had lots of conversations over the years.

And that is the fundamental idea behind bulk processing: the fewer context switches, the faster your program runs. And we are not talking about a 10% or 20% increase. You can see DML statements execute an order of magnitude faster using FORALL.

Of course, that's talk – and  you want to see action. In the coming weeks, I will step through the transformation of the row-by-row solution to a fully compatible and much faster bulk solution.
If you are eager to see that end point, simply point your browser to:


In the meantime, I will end this chapter in my saga with a simple piece of advice that can have a big impact on your code:

Unless working with a trivial amount of data, convert all loops containing non-query DML to a combination of BULK COLLECT and FORALL.

Sunday, June 29, 2014

PL/SQL 201: When do I have to create a schema-level type?

I received this question in my In Box last week:

In your PL/SQL 101: Working with Collections article in Oracle Magazine, the use of "TYPE ... IS ..." is demonstrated.  However I found it's not possible to instantiate an object of the TYPE. It seems that the use of "CREATE TYPE ... AS OBJECT" is required. What are the differences between the 2 ways of creating a user defined data type? 

Yes, it can certainly be  confusing when you use the same TYPE statement for arrays and object types, but they cannot all be used in all the same ways in all the same places.

So let's go over the differences.

PL/SQL

Within PL/SQL, you use the TYPE statement to define two different, ahem, types of user-defined types: record and collection.

Record

A composite datatype consisting of one of more fields. Each field may be of almost any PL/SQL datatype, including another record type or a collection type.

You can use the TYPE statement to define a new record type in any declaration section of a PL/SQL block, including a package.

Here are some examples:

create or replace package my_types
is
   type two_fields_rt is record (
      field1 number, 
      field2 date);
end;
/

declare
   type two_fields_rt is record (
      field1 number, 
      field2 date);
   
   l_record1 two_fields_rt;   
   l_record2 my_types.two_fields_rt;
begin
   l_record1.field1 := 100;
end;
/

Collection

Collections are PL/SQL's versions of arrays. There are different sorts: associative array, nested table and varray. When you define a collection you define the type of data that can be sorted in the collection.

You can use the TYPE statement to define a new collection type in any declaration section of a PL/SQL block, including a package.

Here are some examples:

create or replace package my_types
is
   type two_fields_rt is record (
      field1 number, 
      field2 date);

   type assoc_array_t is table of two_fields_rt
      index by pls_integer;

   type mested_table_t is table of two_fields_rt;

   type varray_t is varray(10) of two_fields_rt;
end;
/

declare
   type assoc_array_t is table of two_fields_rt
      index by pls_integer;
   
   l_list1 assoc_array_t;
   l_list2 my_types.assoc_array_t;
   l_list3 my_types.varray_t;
begin
   l_list1 (100).field1 := 100;
end;
/

You may not, however, declare an object type inside a PL/SQL block. Like a relational table, object types must be defined at the schema level, as a database object, using a SQL DDL statemeant.

SQL

Within SQL, you use the CREATE TYPE DDL statement to define two different, ahem, types of user-defined types: collection and object. You cannot define a record type in SQL.

Collection

SQL implements two types of collections as part of its object model: nested table and varray. When you define a collection you define the type of data that can be sorted in the collection.

Here are some examples:

create or replace type nested_table_t is table of number
/

create or replace type varray_t is varray(10) of date
/

declare
   l_list1 nested_table_t := nested_table_t (1, 2, 3);
   l_list2 varray_t;
begin
   l_list1 (1) := 100;
end;
/

Object

Within SQL, you can also use CREATE TYPE to define an object type, which is Oracle's "version" of a class for object-oriented development.

create or replace type food_t is object (
   name varchar2(100),
   grown_in varchar2(100),
   food_category varchar2(30)
)
/

declare
   l_food food_t := food_t (
      'Brussels Sprouts', 'Soil', 'Vegetable');
begin
   l_food.grown_in := 'Sandy Soil';
end;
/

Even though you create an object type in the database just like a relational table (and the syntax is very similar), an object type is really just a "template" for declaring instances of the type, while a table is a container for data.

You can have a table of object types:

create table our_food (your_food food_t, my_food food_t)

But you cannot declare an object type within a PL/SQL block, for "temporary" use inside that block. Instead, declare a record type.


Friday, June 27, 2014

New Link to My Demo.zip PL/SQL script files

For over a decade, PL/SQL developers have headed over to PL/SQL Obsession (on ToadWorld.com) to download my demo.zip file. They do this because all of my trainings references files from that zip by name. I use them to demonstrate  PL/SQL features, test performance of those features, and even offer some (potentially) reusable utilities.

I have, as hopefully you know, moved on to Oracle Corporation. PL/SQL Obsession will still be in place for another year or so, and it will take me some time to construct a new portal for "all things PL/SQL".

In the meantime, however, and going forward, if you want to download the very latest version of demo.zip, you can use either of these links:
If you encounter any problems with this archive or individual files within it, please don't hesitate to drop me a note.

Monday, June 16, 2014

The Joy of Low Hanging Fruit, Part 1

I publish an article in most Oracle Magazine issues, focusing on the PL/SQL language. I love writing these articles and I enjoy getting feedback from readers, but I also feel constrained by the, well, constraints of the Oracle Magazine format (both in terms of length of the article and also the way we can present code).

So I have decided to complement my articles (some, not necessarily all) with an extended, serialized treatment on my blog. In this first series, I walk through the refactoring of some code that has performance issues, and provide all the code for the initial, intermediate and final forms of the program here.

I hope you find this format interesting and helpful. Please comment!

Struggling to Keep Up

I recently spent a few days with a team of developers at extremememe.info (all names changed to protect the innocent), an up-and-coming Web 3.0 paradigm shifter that analyzes Internet memes, tracks them to their source, and best of all predicts new, near-future memes.

They are very knowledgeable about Oracle and PL/SQL, but as is the case with many developers they have little time to explore new features in technology. They mostly struggle to keep up with the demands of their users. And even when they know about some great new feature, it can be a challenge to convince management to commit the resources to apply those features to stable production code.

We interspersed training on PL/SQL techniques with reviews of their code, and in the process came across a program that updates the status of all their memes, and runs in a daily batch process. Unfortunately, as their data volumes have grown, the time it takes to complete that process was taking longer and longer, currently approaching twenty-three hours. Given how uncomfortably close that was to a fully day, we decided to take a closer look.

I was both dismayed and delighted with what I found (em_update_status0.sql):

Listing 1. The original em_update_status

  1  CREATE OR REPLACE PROCEDURE em_update_status
  2  IS
  3     CURSOR incoming_cur
  4     IS
  5        SELECT * FROM em_incoming;
  6
  7     l_mention   em_mentions%ROWTYPE;
  8     l_status    em_memes.meme_status%TYPE;
  9  BEGIN
 10     FOR incoming_r IN incoming_cur
 11     LOOP
 12        BEGIN
 13           SAVEPOINT new_transaction;
 14
 15           l_mention := em_memes_pkg.unpacked_incoming (incoming_r);
 16
 17           INSERT INTO em_mentions (meme_id,
 18                                    source_name,
 19                                    source_details,
 20                                    occurred_on)
 21                VALUES (l_mention.meme_id,
 22                        l_mention.source_name,
 23                        l_mention.source_details,
 24                        l_mention.occurred_on);
 25
 26           em_memes_pkg.reset_meme_status (l_mention.meme_id,
 27                                           l_status);
 28
 29           IF l_status IS NOT NULL
 30           THEN
 31              UPDATE em_memes
 32                 SET meme_status = l_status
 33               WHERE meme_id = l_mention.meme_id;
 34           END IF;
 35        EXCEPTION
 36           WHEN OTHERS
 37           THEN
 38              em_memes_pkg.log_error;
 39              ROLLBACK TO new_transaction;
 40        END;
 41     END LOOP;
 42 END;

First I will explain some background on references to tables and code, then I will explain the key lines in this procedure.

It works with the following three tables. The columns of those tables have been simplified greatly for purposes of this article.

They have a table for all the memes of which they are aware and for which they track the status (note that setting the default value of the primary key to the next sequential value is a 12.1 feature; in 11.2 and earlier, you should use a trigger to get the same effect - you will find the trigger code in em_memes_setup.sql).

CREATE TABLE em_memes
(
   meme_id         INTEGER PRIMARY KEY,
   meme_name       VARCHAR2 (1000) UNIQUE,
   discovered_on   DATE,
   meme_status     VARCHAR2 (100)
)
/

The meme_status might be "VIRAL", "DORMANT", "HOAX", etc.

They have a table for all the "mentions" or references to a meme:

CREATE SEQUENCE em_mentions_seq
/

CREATE TABLE em_mentions
(
   mention_id       INTEGER
                       DEFAULT em_mentions_seq.NEXTVAL
                       PRIMARY KEY,
   meme_id          INTEGER    REFERENCES em_memes (meme_id),
   source_name      VARCHAR2 (100),
   source_details   CLOB,
   occurred_on       DATE
)
/

Note: the ability to specify the next value of a sequence in the DEFAULT clause of a table is new to 12.1.

Mentions are loaded from many different sources and they rely on a staging table to collect together data from all sources:

CREATE TABLE em_incoming
(
   meme_name         VARCHAR2 (1000),
   source_name       VARCHAR2 (100),
   occurred_on       DATE,
   mention_details   CLOB
)
/

The developers at extremememe.info also use the following em_memes_pkg package to log errors and to perform some critical proprietary computations, based on the content of those tables:

CREATE OR REPLACE PACKAGE em_memes_pkg
IS
   TYPE incoming_t IS TABLE OF em_incoming%ROWTYPE;

   TYPE mentions_t IS TABLE OF em_mentions%ROWTYPE;

   TYPE meme_ids_t IS TABLE OF em_memes.meme_id%TYPE;

   PROCEDURE log_error (
      error_code_in   IN INTEGER DEFAULT SQLCODE,
      error_msg_in    IN VARCHAR2
         DEFAULT DBMS_UTILITY.format_error_stack);

   FUNCTION unpacked_incoming (
      incoming_in   IN em_incoming%ROWTYPE)
      RETURN em_mentions%ROWTYPE;

   PROCEDURE reset_meme_status (
      meme_id_in       IN     em_memes.meme_id%TYPE,
      new_status_out      OUT em_memes.meme_status%TYPE);
END;
/

For reasons of space, we will not provide the package body (whoops, you caught me! I no longer have constraints on "space", but the real reason I won't show you the body is that it is totally uninteresting, just stubs to make sure the package body will compile), but here is what the elements in the specification provide:

em_memes_pkg .incoming_t
A nested table type that contains rows of data from the em_incoming table
em_memes_pkg .mentions_t
A nested table type that contains rows of data from the em_mentions table
em_memes_pkg .meme_ids_t
A nested table type that contains primary keys from the em_memes table
em_memes_pkg log_error
A generic (and typical) error logging mechanism. Defined as an autonomous transaction, it writes out the basic error information available from the DBMS_UTILITY functions, and more to a log table and then commits just that insert.
em_memes_pkg.unpacked_incoming
A function that converts a row of incoming data (whose mention_details column is an XML document that holds the different formats of source information) into a record that can be inserted into the me_mentions table
em_memes_pkg.reset_meme_status
The "heart and soul" of the extremememe.info proprietary process, it analyzes the contents of the mentions table and determines the status of the meme.

Here is an explanation of the key parts of em_update_status (see Listing 1):

Line(s)
Description
10
For each row in the staging table (em_incoming)…
12, 40
Put the entire body of the loop inside its own nested block, so that any exception can be trapped and logged.
13
Set a savepoint for this new "transaction" (add a mention, update the status).
15
Convert the incoming row to a record that can be inserted into the mentions table.
17-24
Insert a single row into the mentions table.
26-27
Compute the new status for the meme, based on the new mention.
29-34
If the status isn't NULL, update the memes table with that status
35-39
If anything goes wrong, log the error and then erase the effects of either/both the insert and update by rolling back to the savepoint.

We now have all the information we need to analyze and fix the em_update_status procedure.

Before I tell you about my solution (and, preferably, before you read the Oracle Magazine article!), however, I'd like to stop and offer you, dear reader, the opportunity to think through the situation yourself:

1. What do you see as the key problems with the program and perhaps more deeply the table design?

2. What would you do to fix the problem(s)?