Two Methods for Batching Records with Oracle SQL

Introduction

It is not often that you need to split records into batches. When you do, however, having a couple of approaches to fall back on is helpful. In the past month, I encountered two problems that required me to batch records, and I solved each with a different approach. This post explains both:

  1. The NTILE Analytic Function. The NTILE function divides an ordered data set into a specified number of roughly equal buckets and assigns the corresponding bucket number to each row.

  2. Using the OFFSET and FETCH (row limiting) clause in SQL to page through a data set.

The NTILE Approach

Several times over my career, I have had to batch records to simplify processing. Sometimes a set of records is just too big, and you need to break it down into manageable chunks.

One reason you may want to do this is so that each batch of records can be processed in parallel to speed up the overall processing time.

In the past, I may have written some complex PL/SQL logic to loop through the target table first, assigning batch numbers to each record before the primary processing is run. With the NTILE function, I can now do this much more quickly.
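
For example, if you did want to persist a batch number on each row, NTILE lets you do it in a single statement rather than a row-by-row loop. The sketch below uses a hypothetical table and batch_number column:

-- Sketch only: some_table and its batch_number column are hypothetical.
MERGE INTO some_table t
USING (SELECT id
       ,      NTILE(3) OVER (ORDER BY id) batch_number
       FROM   some_table) s
ON    (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.batch_number = s.batch_number;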

Let's take the following table called ev_sales_by_year. It contains records of EV sales in California:

[Image: table definition for ev_sales_by_year]
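
Since the definition is only shown as an image, here is a rough sketch to follow along with. Only the id and number_of_vehicles columns are referenced later in this post; the remaining columns are assumptions:

-- Sketch only: id and number_of_vehicles are used later in this post;
--  the other columns are assumptions.
CREATE TABLE ev_sales_by_year
 (id                  NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  data_year           NUMBER(4),
  county              VARCHAR2(50),
  make                VARCHAR2(50),
  model               VARCHAR2(50),
  number_of_vehicles  NUMBER);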

The table has 15,591 records, and I want to split it into batches of about 5,000 records each. The NTILE function expects you to provide an integer specifying the number of buckets you want it to create, so I first need to calculate the approximate number of batches based on a target batch size.

SELECT COUNT(1)                   total_row_count
,      ROUND((COUNT(1) / 5000),0) number_of_batches
FROM   ev_sales_by_year;

We are looking for a target batch size of 5,000 records in the above example. This results in the following:

[Image: query result showing TOTAL_ROW_COUNT = 15591 and NUMBER_OF_BATCHES = 3]

Now that we know the number of batches (3), we can use the NTILE function to split the table into batches.

SELECT id
,      NTILE(3) OVER (ORDER BY id) batch_number 
FROM   ev_sales_by_year;

This results in a list of every record in the table with its ID value and associated batch number.

[Image: query result listing each ID with its assigned BATCH_NUMBER]

This helps, but we can do better. Because I have a unique ID column in the table ev_sales_by_year, I can generate a list of batches with the minimum and maximum ID for each batch as follows:

SELECT batch_number
,      MIN(id)   min_id
,      MAX(id)   max_id
,      COUNT(1)  batch_size
FROM   (SELECT id
        ,      NTILE(3) OVER(ORDER BY id) batch_number 
        FROM   ev_sales_by_year)
GROUP BY batch_number
ORDER BY batch_number;
You must ORDER BY a unique value/key to avoid overlapping IDs in each batch.

This results in the following batches:

[Image: the three batches with their MIN_ID, MAX_ID, and BATCH_SIZE]

Now, I can loop through these three batches, pulling in each set of records using the MIN_ID and MAX_ID values.

-- Batch Number 1
SELECT *
FROM   ev_sales_by_year
WHERE  id BETWEEN 1 AND 5197;

Putting it all Together in PL/SQL

We can put this all together in a PL/SQL block as follows:

DECLARE
  CURSOR cr_batches (cp_num_buckets IN PLS_INTEGER) IS
    SELECT batch_number
    ,      MIN(id)   min_id
    ,      MAX(id)   max_id
    ,      COUNT(1)  batch_size
    FROM   (SELECT id
            ,      NTILE(cp_num_buckets) 
                      OVER (ORDER BY id) batch_number 
            FROM   ev_sales_by_year)
    GROUP BY batch_number
    ORDER BY batch_number;

  CURSOR cr_batch (cp_min_id IN NUMBER, cp_max_id IN NUMBER) IS
    SELECT *
    FROM   ev_sales_by_year
    WHERE  id BETWEEN cp_min_id AND cp_max_id;

  c_batch_size       CONSTANT PLS_INTEGER := 5000;
  l_num_buckets      PLS_INTEGER;

BEGIN

  -- Calculate the number of buckets based on our target batch size.
  SELECT ROUND((COUNT(1) / c_batch_size),0) INTO l_num_buckets
  FROM   ev_sales_by_year;

  -- Generate list of batches with Min and Max ID values.
  FOR r_batch IN cr_batches (cp_num_buckets => l_num_buckets) LOOP  
    dbms_output.put_line('Bucket#: ' || r_batch.batch_number ||
                         ' Min/Max ID: '|| r_batch.min_id ||
                         '/' || r_batch.max_id);
    -- Do something with the batch.
    FOR r_record IN cr_batch 
     (cp_min_id => r_batch.min_id, 
      cp_max_id => r_batch.max_id) LOOP
      NULL;
    END LOOP;
    -- Something else you could do is launch a DBMS_SCHEDULER
    --  job for each batch so they can be processed in parallel. 
  END LOOP;

END;
/

This results in the following:

Bucket#: 1 Min/Max ID: 1/5197
Bucket#: 2 Min/Max ID: 5198/10394
Bucket#: 3 Min/Max ID: 10395/15591

Of course, instead of just looping through the batches, it may be more useful to launch a DBMS Scheduler Job for each batch so that they can be processed in parallel.
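
As a sketch of that idea, each iteration of the cr_batches loop above could submit a run-once job instead of processing the rows inline. Here, process_ev_sales_batch is a hypothetical procedure that accepts the batch's MIN_ID and MAX_ID:

    -- Inside the cr_batches loop: submit one run-once job per batch.
    -- process_ev_sales_batch is a hypothetical procedure.
    dbms_scheduler.create_job
     (job_name   => 'EV_SALES_BATCH_' || r_batch.batch_number,
      job_type   => 'PLSQL_BLOCK',
      job_action => 'BEGIN process_ev_sales_batch(' ||
                    r_batch.min_id || ',' || r_batch.max_id || '); END;',
      enabled    => TRUE,
      auto_drop  => TRUE);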

Offset & Limit Approach

In this example, I want to export the content of the table ev_sales_by_year to JSON and then send the JSON to another system for processing. To make the JSON more manageable for the other system, I will split the export into several similarly sized JSON files.

Sample Code

The code below illustrates how this can be achieved using the Row Limiting Clause in SQL.

I have included comments in the code to explain each step.

DECLARE

  lc_rows_per_file   CONSTANT PLS_INTEGER := 5000;
  l_total_rows       PLS_INTEGER;
  l_offset_row       PLS_INTEGER;
  l_file_count       NUMBER;
  l_json_clob        CLOB;
  l_json_blob        BLOB;
  l_json_blob_zip    BLOB;

BEGIN

  -- Determine Number of Records in the Table.
  SELECT COUNT(1) INTO l_total_rows
  FROM   ev_sales_by_year;

  -- Calculate the number of JSON files to generate.
  l_file_count := CEIL(l_total_rows / lc_rows_per_file);
  dbms_output.put_line ('Generating ['|| TO_CHAR(l_file_count,'fm990') || 
                        '] files.');

  -- Loop through generating a file for each set of 5,000 records.
  FOR i IN 1..l_file_count LOOP

    -- Calculate the Offset for the current iteration.
    l_offset_row := (i * lc_rows_per_file) - lc_rows_per_file;

    -- Select a batch of 5,000 records starting at the current offset.
    -- Note: the final batch may have fewer than 5,000 rows.
    WITH file_rows AS
      (SELECT *
       FROM   ev_sales_by_year
       ORDER  BY id
       OFFSET l_offset_row ROWS
       FETCH  FIRST lc_rows_per_file ROWS ONLY)
    SELECT json_object 
          ('items' value json_arrayagg 
            (json_object (* ABSENT ON NULL RETURNING CLOB) RETURNING CLOB
          ) RETURNING CLOB)
    INTO   l_json_clob
    FROM   file_rows;

    -- Note 1: The 'ABSENT ON NULL' clause omits NULL fields from the
    --   JSON. This can significantly reduce the size of JSON files.
    -- Note 2: The 'RETURNING CLOB' clause is necessary when
    --   generating large JSON CLOBs.

    -- Convert the JSON CLOB to a BLOB.
    l_json_blob := apex_util.clob_to_blob (p_clob => l_json_clob);
    dbms_output.put_line ('JSON BLOB created, Size: '|| 
      apex_string_util.to_display_filesize(dbms_lob.getlength(l_json_blob)));

    -- Add the BLOB to a ZIP File.
    apex_zip.add_file
     (p_zipped_blob => l_json_blob_zip,
      p_file_name   => 'file_'||i||'.json',
      p_content     => l_json_blob);

  END LOOP;

  -- Close out the ZIP File.
  apex_zip.finish (p_zipped_blob => l_json_blob_zip);
  dbms_output.put_line ('ZIP File Size: '|| 
    apex_string_util.to_display_filesize(dbms_lob.getlength(l_json_blob_zip)));

  -- Now you can send the Zipped BLOB containing the JSON
  --  files to OCI Object Store, or any other target system.

END;
/
You must ORDER BY a unique value/key; otherwise, rows could be skipped or duplicated between batches.

Output from the above script:

Generating [4] files.
JSON BLOB created, Size: 1.5MB
JSON BLOB created, Size: 1.5MB
JSON BLOB created, Size: 1.5MB
JSON BLOB created, Size: 180.2KB
ZIP File Size: 167.5KB
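
The final comment in the script mentions sending the zipped BLOB to OCI Object Store. If the database has the DBMS_CLOUD package available (for example, on Autonomous Database), that last step could look something like the sketch below; the credential name, namespace, and bucket are placeholders:

  -- Sketch only: 'OCI_CRED', the namespace, and the bucket are placeholders.
  dbms_cloud.put_object
   (credential_name => 'OCI_CRED',
    object_uri      => 'https://objectstorage.us-ashburn-1.oraclecloud.com' ||
                       '/n/my_namespace/b/my_bucket/o/ev_sales_export.zip',
    contents        => l_json_blob_zip);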

Bonus Approach

Of course, there are many ways to skin a cat. There is another approach I can think of (and I am sure there are more), which is to use BULK COLLECT to fetch batches of records into a PL/SQL array. Bulk collecting into PL/SQL arrays can be a very efficient way of processing large volumes of records in PL/SQL.

The example code below includes comments to explain what it does.

DECLARE

  CURSOR cr_ev_sales IS
    SELECT *
    FROM   ev_sales_by_year;

  -- Declare a table type based on records from the table ev_sales_by_year.
  TYPE lt_ev_sales_t IS TABLE OF ev_sales_by_year%ROWTYPE INDEX BY PLS_INTEGER;
  -- Declare a PL/SQL table to store records.
  lt_ev_sales         lt_ev_sales_t;
  -- The number of records to fetch during each iteration.
  lc_fetch_count      CONSTANT PLS_INTEGER := 5000;
  l_total_vehicles    NUMBER := 0;

BEGIN
  -- Open the Cursor and Start Looping.
  OPEN cr_ev_sales;
  LOOP
    -- Bulk collect 5,000 rows from the table into the PL/SQL 
    --   table lt_ev_sales.
    FETCH cr_ev_sales 
    BULK COLLECT INTO lt_ev_sales LIMIT lc_fetch_count;
    dbms_output.put_line ('Fetched ['|| 
      TO_CHAR(lt_ev_sales.COUNT(),'fm999,999,990') || '] rows.');

    -- Exit the loop when there are no more records to fetch.
    EXIT WHEN lt_ev_sales.COUNT = 0;

    -- Loop through the records in the PL/SQL table.
    FOR i IN 1..lt_ev_sales.COUNT LOOP
      -- Add code here to do something with the rows in the PL/SQL table.
      l_total_vehicles := l_total_vehicles + 
        lt_ev_sales(i).number_of_vehicles;
    END LOOP;

  END LOOP;
  CLOSE cr_ev_sales;
  dbms_output.put_line ('Grand Total Vehicles ['|| 
    TO_CHAR(l_total_vehicles,'fm999,999,990') || '].');

END;
/

One of the advantages of this approach (other than its speed) is that you can limit the amount of memory used for the PL/SQL table. In the above example, we fetch at most 5,000 records at a time.

Conclusion

Hopefully, this post has shown you that although batching records into manageable chunks does not come up often, it is good to have several approaches that you can employ when it does.