Thursday, June 16, 2016

Table functions to the rescue....again! A refactoring story

I've published several posts on table functions. Here's another one! Why?

A few days ago I tweeted the following:


Two of my followers immediately asked to hear the story: "those are the fun ones, but pics or it didnt happen ;-)" and "may be interesting to view the step you do to find solutions more then result."

Very good points; I especially agree with the latter. As I was going through the revisions to my code, I was thinking (as I have often done before): "It might be helpful to show the process I go through, because it sure feels like a powerful, positive way to build and improve code."

The problems I run into when actually sitting down to tell the story are:

1. Time: yes, I know, we are all busy. Plus, isn't it my job to share thoughts on PL/SQL programming?  Yes it is! Well, part of my job, anyway. And I have been neglecting my blog. But right now, at this moment, I am very focused on finishing the "early adaptor" release of our new skin over the PL/SQL Challenge.  That's the code I was refactoring, in fact.

2. The details: I am working on very specific tables and logic. Do I explain all of that to my readers? Do I modify the code to simplify/redact it? All of that takes extra time and pushes me away from the post.

3. Saving copies of old code: to tell the story properly, I should take snapshots of my code and app, so that I can go back and reproduce the steps. Sounds easy enough, right? But when I am laser-focused on my code, it is soooo hard to do that. :-( 

OK, so the heck with all that! I will take some time away from programming to share my story. And I will publish my code as scripts on LiveSQL, while offering greatly truncated versions in this post.

Why Did I Need to Refactor?

Well, actually, first, you may be wondering: "What is this 'refactoring' you are talking about?"

Refactoring is, essentially, a fancy word for "the process of restructuring existing computer code without changing its external behavior." In other words, you modify the guts of your code to make it more maintainable or faster, without changing the outcome of running the code.  And you do so in a way that is careful and methodical and verified by a regression test.

So when it comes to refactoring and Steven Feuerstein: I am on board with careful. I am so-so on methodical. And I generally fail when it comes to a regression test. 

Let's get to it. Here's why I needed to make changes to my code: the PL/SQL Challenge (PLCH) development team (consisting of Eli Feuerstein, myself and in key supporting roles a number of others, including Dan McGhan, Koen Lostrie, and Shakeeb Rahman) has been working on a new "skin" over the PLCH quiz platform.

We are calling it the Oracle Dev Gym and it will be available publicly as an "early adaptor" release for current PLCH users on 27 June. Hey, that date is quickly approaching!

So it was time to move the app to stage and test it against production volumes of data. For the most part, it performed excellently. One very important page (page 3) in the app displays quizzes for the selected technology:



And here we had a BIG problem: performance degraded from < 1 second on Dev to over 60 seconds on Prod. 

Talk about unacceptable.

OK, to give the full story: the 60 second time was the first pass. When I immediately ran it again (refreshed the page), elapsed time went down to 20 seconds (caching, adaptive plan, etc.), and after a third time, the elapsed time went down to 5 seconds or so. Even more interestingly, when I executed the query in SQL Developer, it usually would finish in under 2 seconds. 

Bottom line: inside the app, this query was waaaaay tooooo sloooooow - and at least part of the performance problem had to do with materializing the result set on the page.

The page was displaying all the PL/SQL quizzes (more than 1400), along with useful information like player ratings of the quiz, average time to finish the quiz and so on. It was doing lots of work, and it was done by a SELECT statement that was actually a SELECT inside a SELECT inside a SELECT, with a number of PL/SQL function calls (context switches). Check out the full query here.  Below is a very abbreviated version:


So there are a couple of things to note at this point:
  • My innermost query gets all possible quizzes, plus it walks the topics tree, and lots of other activity as well. 
  • The page doesn't really display 1400 quizzes. It displays 18 at a time. You could then page through to the others. 
  • But the user already paid the price for preparing all 1400 cards for viewing. 
  • We present the quizzes in the following order: if you haven't yet taken the quiz it takes precedence over those you've answered in the past, and then we randomize the order. 
  • That's right. We randomize the display of quizzes so you are not constantly shown the same quizzes. 
There is no doubt about the fact that I could make changes to the query to improve its performance, including:
  • Create materialized views to pre-calculate historical data (that is, in fact, what the mv_qdb_question_summaries view shown in the image above does).
  • Remove any PL/SQL functions that can be implemented directly in the SQL statement.
  • Make sure all necessary indexes are in place.
But the performance issue also gave me pause and time to ask a very important question: does the behavior on this page match how a user will use the page? 

We all know the saying about Google: if the search result is not on the first page, it might as well not be there at all. Who ever goes to the 2nd or 3rd or 2,507th page of search results? 

Certainly not me.

And since a big point of the Dev Gym is to make it really easy and fast for a developer to find and take a quiz, it also seemed unlikely they would (or we would want them to) spend time paging through those 1400 quizzes. 

Much more likely is that they will find a quiz right there or they will give up. Or maybe they'd tell the Dev Gym: show me another 20 random quizzes to pick from.

I hope this makes sense to you. It made a ton of sense to me - and it also offered a way to dramatically improve performance on this page:
Don't execute a query with an inner SELECT that returns 1400 quizzes, and then filter them out. Instead, right at the core of that query, return just 20 randomized quizzes, and then apply all the "expensive" processing to that small set. 
Then, on the app page, provide a refresh button so that the user can quickly ask for a new set to peruse. 

Now, of course, as you would expect from your own experience, there were a number of additional complicated aspects to selecting out that set of 20: a user could specify a search filter, a difficulty level, etc. Perhaps I was better at SQL I could have done it all directly inside a SELECT.

But I could instantly see how I could leverage a table function to give me the desired effect in a relatively clean manner.

Especially because I had already written a function to return a single randomly-selected quiz for a given topic.  You can see the function in its entirety here. Here's a shortened version that gives you some sense of the inner complexities:


The function constructed a query dynamically, executed it and returned a single row. I needed a query to return up to N quizzes (currently 20, but who knows for the future?). And I needed a function that returned a collection so it can be executed inside a SELECT within the TABLE operator.

So the steps I resolved to take were:
  1. Create a new, separate function to construct and return the query
  2. Change the current random_question to function to call this function.
  3. Create a new random_questions (notice the "s" at the end!) to return a collection of question IDs, which could then be consumed by the query driving page 3 in my app.
Here's the header of my query function (all the code is available for viewing here):


Here's the refactoring of random_question, utilizing the new function:

And, finally, the the new table function needed to solve the original problem (all code viewable here):


There are several elements of this function I very much like:
  • Use of FETCH FIRST N ROWS - added in Oracle Database 12c Release 1. Nice, easy way to say: just give me the first 20. 
  • Reuse of the random_question_query function. I hate to copy/paste code!
  • The second call to the query generator is only made if there were not enough unplayed quizzes. 
  • For the layer quizzes fetch, I BULK COLLECT into a second array. Notice I just ask for 20 (question_limit_in) again. Since the first one didn't return enough, this second one will "fill in the gap" with played quizzes.
  • Then on line 44 I leverage the MULTISET UNION operator (available only for nested tables) so that the PL/SQL engine can do the "heavy lifting" of finding all the quizzes in the l_questions_played array and add them to the unplayed quizzes.
And then...finally...it is time to use the table function inside my page 3 query. Here's a fragment of the whole, showing the inclusion of the table function:


I have now moved lots of complexity to the table function (including the walk down through the topics tree - the CONNECT BY you saw in the original), but more importantly the outer SELECTs only apply their computations to the 30 quiz IDs returned by the table function.

The resulting performance? Outstanding! Virtually instantaneous. 

So....lessons learned:
  • Test your app as early as possible against production volumes of data so that you don't have to dig yourself out of a performance hole late in the game.
  • Watch out for the overhead of calling PL/SQL functions in SQL (context switching).
  • Leverage table functions - where and when appropriate - to move complexity (especially procedural logic) out of your SELECT.
BUT MOST IMPORTANT OF ALL:
Make sure the behavior of your app matches the way the user will most intuitively and simply want to use it.

I hope you found my story interesting and helpful. I look forward to your comments. Please do feel free to make suggestions for improving code.

Now - can I get back to programming? :-)

8 comments:

  1. Thanks for sharing! As a bonus it looks like you may have rid your code of the sql injection vulnerability where you were appending in user_id instead of binding :)

    ReplyDelete
  2. Thanks, Craig, always good to point out. I would say in this case the vulnerability is very low, since the user cannot type in their "user ID" anywhere on the site. Concatenation of text in dynamic SQL strings is primarily a concern when you are concatenating text provided directly by a user.

    ReplyDelete
  3. Very interesting. Thanks for sharing.

    ReplyDelete
  4. If you've played all but 10 of the quizzes, won't your table function return 30 quizzes instead of 20?

    The unplayed function call fetches the 10 unplayed quizzes; that's less than you asked for, so it pulls up 20 played quizzes as well - making 30 altogether.

    Shouldn't the second call be asking for (question_limit_in - l_questions.COUNT) questions instead?

    ReplyDelete
  5. Thanks and good point, Chris. Isn't public code review fun? :-) I think I will be applying a few tweaks to this code. Not the least of which is to re-evaluate the use of ORDER BY DBMS_RANDOM.VALUE. I am not excited about the potential performance impact of a context switch for every row fetched/ordered - especially if I then only use a small portion of the rows.

    ReplyDelete
  6. Hello Steven,
    Having in my head a kind of "model" for performing pagination queries, I wonder how will your query page behave when a ("stubborn") user will decide to ask for the "next page" of quizzes ?
    It looks to me that a RANDOM ordering puts a "consistent" request for the "next page" out of discussion.
    That is, the player will only have the option of repeating his query, with the same or a different number of rows requested,
    and then the random ordering will practically ensure a new
    set of quizzes returned.

    Another possible approach could be to use a non-random ordering,
    even one that the player could choose from a list of several options, and then use the OFFSET option of the 12c FETCH NEXT
    to easily return the "next set" of quizzes, consistently
    to the chosen ordering.

    Otherwise put:
    The random ordering is nice when a player wants to (re)take a set of
    quizzes for training ... but not so good when a player is maybe
    looking for some specific quizzes he remembers of having played.

    Regarding the performance issue:

    The "classic" way of doing pagination queries used a combination
    of ORDER BY with a ROWNUM < :n condition,
    and for such queries the Oracle optimizer does have a built-in algorithm
    that ensures the best performance, by avoiding the need to fully sort
    the entire result set.
    As far as I am aware the FETCH NEXT feature of 12c is using analytic functions
    under the covers, so it still sorts the entire data set.
    Maybe at a volume of about 1400 rows the performance impact is not effectively visible.


    Best Regards & Enjoy the KScope16 :)
    Iudith

    ReplyDelete
  7. Thanks for your comments, Iudith. The current idea with this page is that you click on the spinner to bring up another set of randomly selected quizzes (for the topic you selected).

    We will add search as well, but I think that with the topic selector (in other words, you can drill down as "tightly" as you want to then see the associated quizzes) most of the need for search and pagination would go away.

    ReplyDelete