Skip to main content

Table Functions, Part 5c: Another use case for Pipelined Table Functions (and simple example)

Please do feel encouraged to read this and my other posts on table functions, but you will learn much more about table functions by taking my Get Started with PL/SQL Table Functions class at the Oracle Dev Gym. Videos, tutorials and quizzes - then print a certificate when you are done!

From Oracle Help Center (a.k.a., documentation), we read:
Data is said to be pipelined if it is consumed by a consumer (transformation) as soon as the producer (transformation) produces it, without being staged in tables or a cache before being input to the next transformation.  
Pipelining enables a table function to return rows faster and can reduce the memory required to cache a table function's results.  
A pipelined table function can return the table function's result collection in subsets. The returned collection behaves like a stream that can be fetched from on demand. This makes it possible to use a table function like a virtual table.
In a nutshell, this means that the calling query can put to use the rows returned by the pipelined table function (PTF) before the function has returns all rows.
A simple way to demonstrate this (and highlight another nice use case for PTFs) is with the SQL IN operator.
IN is used in the WHERE clause to determine if a column or expression is in the specified list. The list can be a literal list:
SELECT * FROM my_table WHERE my_col IN (1,2,3,4)
or the list can be a subquery, which is really important, since you cannot have more than 1,000 literals in a list.

There is no such limit with this kind of IN list:
SELECT * FROM my_table 
WHERE my_col IN (SELECT num_col FROM my_list)

Hey, that's a FROM clause! That means we could call a table function there! And yes you can:


SELECT * FROM my_table 
WHERE my_col IN (
   SELECT num_col FROM TABLE (my_tf))

To evaluate whether or not a value is in the IN list, the SQL engine needs to go through the list, looking for a match. When it finds a match, it stops searching.

Gee, so that should mean that if the table function is pipelined, and the function is returning rows as it generates them, the SQL engine should be able to get to an answer faster with pipelining than without. 

Shall we test that? Yes, we shall!

First, I create a table and two functions, the second of which is pipelined:

CREATE TABLE plch_data (n NUMBER)
/

CREATE OR REPLACE TYPE numbers_t IS TABLE OF NUMBER;
/

CREATE OR REPLACE FUNCTION my_list_tf
   RETURN numbers_t AUTHID DEFINER
IS
   ns   numbers_t := numbers_t ();
BEGIN
   ns.EXTEND (1000000);

   FOR indx IN 1 .. 1000000
   LOOP
      ns (indx) := indx;
   END LOOP;

   RETURN ns;
END;
/

CREATE OR REPLACE FUNCTION my_list_ptf
   RETURN numbers_t
   PIPELINED AUTHID DEFINER
IS
BEGIN
   FOR indx IN 1 .. 1000000
   LOOP
      PIPE ROW (indx);
   END LOOP;
   RETURN;
END;
/

Let's take a closer look at the pipelined version. There are just three items to note:
  1. I add the PIPELINED keyword.
  2. Instead of populating a nested table to return, I pipe the row directly out of the function.
  3. I return nothing but control at the end of my function. 
Oh and here's an odd thing to note: if you leave off the RETURN; statement in a pipelined table function, you will get no complaint from the SQL or PL/SQL engines. Since no data is being returned by RETURN, the function will simply terminates and returns control.

But I suggest you include the RETURN anyway. Looks better, less confusing to someone maintaining the code later.

DECLARE
   l_count   INTEGER;
   l_start   PLS_INTEGER;

   PROCEDURE mark_start
   IS
   BEGIN
      l_start := DBMS_UTILITY.get_cpu_time;
   END mark_start;

   PROCEDURE show_elapsed (NAME_IN IN VARCHAR2)
   IS
   BEGIN
      DBMS_OUTPUT.put_line (
            '"'
         || NAME_IN
         || '" elapsed CPU time: '
         || TO_CHAR (DBMS_UTILITY.get_cpu_time - l_start)
         || ' centiseconds');
      mark_start;
   END show_elapsed;
BEGIN
   INSERT INTO plch_data VALUES (1);

   COMMIT;
   
   mark_start;

   SELECT COUNT (*)
     INTO l_count
     FROM plch_data
    WHERE n IN (SELECT * FROM TABLE (my_list_tf));

   show_elapsed ('TF match on first');

   SELECT COUNT (*)
     INTO l_count
     FROM plch_data
    WHERE n IN (SELECT * FROM TABLE (my_list_ptf));

   show_elapsed ('PTF match on first');

   UPDATE plch_data
      SET n = 1000000;

   SELECT COUNT (*)
     INTO l_count
     FROM plch_data
    WHERE n IN (SELECT * FROM TABLE (my_list_tf));

   show_elapsed ('TF match on last');

   SELECT COUNT (*)
     INTO l_count
     FROM plch_data
    WHERE n IN (SELECT * FROM TABLE (my_list_ptf));

   show_elapsed ('PTF match on last');
END;
/

And when I run this block with server output turned on, I see:

"TF match on first" elapsed CPU time: 11 centiseconds
"PTF match on first" elapsed CPU time: 1 centiseconds
"TF match on last" elapsed CPU time: 13 centiseconds
"PTF match on last" elapsed CPU time: 5 centiseconds

Yep. Pipelining resulted in a signficant boost in performance. Go, PTF, go!

By the way, here are the clean-up steps for the above script:

DROP TYPE numbers_t
/

DROP TABLE plch_data
/

DROP FUNCTION my_list_tf
/

DROP FUNCTION my_list_ptf
/

Comments

  1. Hello Steven,

    This post was really nice and stimulated some research.
    Of course, the natural temptation was to perform a test where table plch_data has a higher number of rows, so, using your code already at hand, I performed the test below, in 11.2.0.3.0.


    DECLARE
    l_count INTEGER;
    l_start PLS_INTEGER;

    PROCEDURE mark_start
    IS
    BEGIN
    l_start := DBMS_UTILITY.get_cpu_time;
    END mark_start;

    PROCEDURE show_elapsed (NAME_IN IN VARCHAR2)
    IS
    BEGIN
    DBMS_OUTPUT.put_line (
    '"'
    || NAME_IN
    || '" elapsed CPU time: '
    || TO_CHAR (DBMS_UTILITY.get_cpu_time - l_start)
    || ' centiseconds');
    mark_start;
    END show_elapsed;
    BEGIN
    -- INSERT INTO plch_data VALUES (1);
    FOR i in 1 .. 1000000 LOOP
    INSERT INTO plch_data VALUES (i) ;
    END LOOP;

    COMMIT;

    DBMS_STATS.GATHER_TABLE_STATS( USER, 'PLCH_DATA' );

    mark_start;

    SELECT COUNT (*)
    INTO l_count
    FROM plch_data
    WHERE n IN (SELECT * FROM TABLE (my_list_tf));

    show_elapsed ('TF match on first');

    SELECT COUNT (*)
    INTO l_count
    FROM plch_data
    WHERE n IN (SELECT * FROM TABLE (my_list_ptf));

    show_elapsed ('PTF match on first');

    -- ----------------------------------------------------

    UPDATE plch_data
    SET n = n + 500000 ;

    COMMIT;

    DBMS_STATS.GATHER_TABLE_STATS( USER, 'PLCH_DATA' );

    mark_start;

    SELECT COUNT (*)
    INTO l_count
    FROM plch_data
    WHERE n IN (SELECT * FROM TABLE (my_list_tf));

    show_elapsed ('TF match on last');

    SELECT COUNT (*)
    INTO l_count
    FROM plch_data
    WHERE n IN (SELECT * FROM TABLE (my_list_ptf));

    show_elapsed ('PTF match on last');
    END;
    /
    "TF match on first" elapsed CPU time: 94 centiseconds
    "PTF match on first" elapsed CPU time: 74 centiseconds

    "TF match on last" elapsed CPU time: 87 centiseconds
    "PTF match on last" elapsed CPU time: 72 centiseconds

    PL/SQL procedure successfully completed.

    ( to be continued )

    ReplyDelete
  2. ( I already posted the continuation yesterday, but it seems to have been lost ...
    so I post it once again ... )

    As we can expect, the execution plan (for both functions) used a hash join and looks as follows:

    ---------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
    ---------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 1 | 7 | 312 (3)| 00:00:04 |
    | 1 | SORT AGGREGATE | | 1 | 7 | | |
    |* 2 | HASH JOIN RIGHT SEMI | | 1 | 7 | 312 (3)| 00:00:04 |
    | 3 | COLLECTION ITERATOR PICKLER FETCH| MY_LIST_PTF | 8168 | 16336 | 19 (0)| 00:00:01 |
    | 4 | TABLE ACCESS FULL | PLCH_DATA | 1000K| 4882K| 289 (2)| 00:00:04 |
    ---------------------------------------------------------------------------------------------------

    The cardinality that we see here for the TABLE function result set is the default of 8168 that the Oracle optimizer is using if we do not "help" it to use a more realistic number
    ( Adrian Billington at http://www.oracle-developer.net has several nice articles on this topic ).

    But, anyway, as we see in the execution plan, the result set of the TABLE function is used as the build table for the HASH join, so intuitively I cannot comprehend how exactly can the pipeline-ness of the function be used as an advantage for the join itself.

    We indeed see that while using the pipelined function is still faster than the non-pipelined one, the two CPU times are not dramatically different.

    I guess that the only advantage of using the pipelined function here may come during the creation of the in-memory hash join "build table" based on the table function result set, that is, hash values can be stored in the hash table while data is still coming in.

    I think that the join step itself to table plch_data is not influenced by the table function
    being pipelined or not.

    Of course, this conclusion can be different for a NESTED LOOPS execution plan, driven by the table function result set.
    In such a case, the join process itself (scanning of the plch_data outer table) can already be performed while rows are still being returned from the pipelined function table.

    But such a plan will probably be chosen by Oracle only for a much bigger data table in comparison with the driving table function data set, with an index on the big data table.


    Just for comparison, for your example with table plch_data having only 1 row,
    the execution plan looks like this:

    ---------------------------------------------------------------------------------------------------
    | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
    ---------------------------------------------------------------------------------------------------
    | 0 | SELECT STATEMENT | | 1 | 5 | 23 (5)| 00:00:01 |
    | 1 | SORT AGGREGATE | | 1 | 5 | | |
    |* 2 | HASH JOIN SEMI | | 1 | 5 | 23 (5)| 00:00:01 |
    | 3 | TABLE ACCESS FULL | PLCH_DATA | 1 | 3 | 3 (0)| 00:00:01 |
    | 4 | COLLECTION ITERATOR PICKLER FETCH| MY_LIST_PTF | 8168 | 16336 | 19 (0)| 00:00:01 |
    ---------------------------------------------------------------------------------------------------

    Here we see that the smaller table plch_data was chosen as the hash build table,
    and then indeed the time difference can be higher between the pipelined and non-pipelined table functions, because the pipeline-ness does influence the join itself.


    I would be glad if the performance gurus reading this post could confirm my conclusions :)

    Thanks a lot & Best Regards,
    Iudith



    ReplyDelete
  3. I got a lot of mileage using this approach to bind arrays of ROWIDs when issuing [C]RUD operations where the ROWIDs are known. This affords me the performance benefits of using ROWIDs but not requiring a dynamic number of bind variables or having to use literals.

    SELECT | UPDATE | DELETE
    .......
    WHERE ROWID IN (SELECT CHARTOROWID(COLUMN_VALUE) FROM TABLE(?))

    (In Java JDBC we reference ROWIDs as strings)

    ReplyDelete

Post a Comment

Popular posts from this blog

Running out of PGA memory with MULTISET ops? Watch out for DISTINCT!

A PL/SQL team inside Oracle made excellent use of nested tables and MULTISET operators in SQL, blending data in tables with procedurally-generated datasets (nested tables).  All was going well when they hit the dreaded: ORA-04030: out of process memory when trying to allocate 2032 bytes  They asked for my help.  The error occurred on this SELECT: SELECT  *    FROM header_tab trx    WHERE (generated_ntab1 SUBMULTISET OF trx.column_ntab)       AND ((trx.column_ntab MULTISET             EXCEPT DISTINCT generated_ntab2) IS EMPTY) The problem is clearly related to the use of those nested tables. Now, there was clearly sufficient PGA for the nested tables themselves. So the problem was in executing the MULTISET-related functionality. We talked for a bit about dropping the use of nested tables and instead doing everything in SQL, to avoid the PGA error. That would, however require lots of work, revamping algorithms, ensuring correctness, you know the score. Then my eyes snagge

How to Pick the Limit for BULK COLLECT

This question rolled into my In Box today: In the case of using the LIMIT clause of BULK COLLECT, how do we decide what value to use for the limit? First I give the quick answer, then I provide support for that answer Quick Answer Start with 100. That's the default (and only) setting for cursor FOR loop optimizations. It offers a sweet spot of improved performance over row-by-row and not-too-much PGA memory consumption. Test to see if that's fast enough (likely will be for many cases). If not, try higher values until you reach the performance level you need - and you are not consuming too much PGA memory.  Don't hard-code the limit value: make it a parameter to your subprogram or a constant in a package specification. Don't put anything in the collection you don't need. [from Giulio Dottorini] Remember: each session that runs this code will use that amount of memory. Background When you use BULK COLLECT, you retrieve more than row with each fetch,

Quick Guide to User-Defined Types in Oracle PL/SQL

A Twitter follower recently asked for more information on user-defined types in the PL/SQL language, and I figured the best way to answer is to offer up this blog post. PL/SQL is a strongly-typed language . Before you can work with a variable or constant, it must be declared with a type (yes, PL/SQL also supports lots of implicit conversions from one type to another, but still, everything must be declared with a type). PL/SQL offers a wide array of pre-defined data types , both in the language natively (such as VARCHAR2, PLS_INTEGER, BOOLEAN, etc.) and in a variety of supplied packages (e.g., the NUMBER_TABLE collection type in the DBMS_SQL package). Data types in PL/SQL can be scalars, such as strings and numbers, or composite (consisting of one or more scalars), such as record types, collection types and object types. You can't really declare your own "user-defined" scalars, though you can define subtypes  from those scalars, which can be very helpful from the p