5X File Loads from Object Storage with DBMS_CLOUD.COPY_DATA

Introduction

When loading large volumes of data into my APEX Apps, I typically follow this pattern:

  • Have third-party systems post data files to OCI Object Storage.

  • Use OCI Events to call an ORDS Service that informs the Database a file is ready to load.

  • The ORDS Service launches an APEX Automation that fetches the file from OCI Object Storage using APEX_WEB_SERVICE, unzips it with APEX_ZIP, then parses and loads it into a DB table with APEX_DATA_PARSER.PARSE for further processing.

You can read more about this approach in the post 'Event-Driven Integration with OCI Events, ORDS, & APEX'.

This pattern usually works well, but loading very large files from OCI Object Store to a database table using APEX_DATA_PARSER can be slow.

โฉ
In this post, I will run a performance test and show how you can substitute APEX_WEB_SERVICE, APEX_ZIP, and APEX_DATA_PARSER with the dbms_cloud.copy_data API to 5X your data load times. ๐Ÿ’จ

The Approach

When you call the dbms_cloud.copy_data PL/SQL API, the following steps are performed:

How DBMS_CLOUD.COPY_DATA works.

  1. Make the call to dbms_cloud.copy_data.

  2. dbms_cloud.copy_data uses the ORACLE_LOADER external table access driver (similar to SQL*Loader) to parse and stream the file's content (in parallel) to your table in the database directly from Object Storage.

  3. Artifacts from the load (.log and .bad files) are created in the DATA_PUMP_DIR of your database, and external tables (COPY$XXX_LOG and COPY$XXX_BAD) are created pointing to these files.

  4. A record is written to the table user_load_operations with the status, the names of the log and bad files, and a record count.

💡
The reason for the massive speed increase is that dbms_cloud.copy_data streams the content of your file and parses the content in parallel.

Preparing for the Performance Test

We need to put a few things in place before we can compare the performance of dbms_cloud.copy_data vs. APEX_DATA_PARSER.

Test Data Preparation

I am using California EV Sales data to run the tests. You can get the Excel file of the data here. I converted the data to CSV and duplicated the data set a few times to get a total of 94,025 records (about 4 MB).

The first 3 rows of the CSV file look like this:

data_year,county,fuel_type,make,model,number_of_vehicles
1998,Los Angeles,Electric,Ford,Ranger,1
1998,Orange,Electric,Ford,Ranger,1

Using the script below, I created a no-frills table to load the data into.

CREATE TABLE "EV_SALES_BY_YEAR" 
 ("DATA_YEAR"          NUMBER NOT NULL, 
  "COUNTY"             VARCHAR2(50) NOT NULL, 
  "FUEL_TYPE"          VARCHAR2(50) NOT NULL, 
  "MAKE"               VARCHAR2(50) NOT NULL, 
  "MODEL"              VARCHAR2(50) NOT NULL, 
  "NUMBER_OF_VEHICLES" NUMBER NOT NULL);

OCI Setups

You will need to perform the following OCI Setups:

  1. Create an Object Store Bucket

  2. Create an OCI Service Account

  3. Create a Security Group

  4. Create a Security Policy

💡
You can find out how to perform steps 1-4 by reading my post 'Secure APEX Access to OCI Object Storage'.

Database Setups

Grants

You need to perform the following grants for your database user <YOUR_USER>:

GRANT EXECUTE ON dbms_cloud TO <YOUR_USER>;
GRANT READ,WRITE ON directory DATA_PUMP_DIR TO <YOUR_USER>;
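
If you want to confirm the grants took effect, a quick sanity check (a sketch, run as <YOUR_USER>) is to query user_tab_privs:

-- Should list EXECUTE on DBMS_CLOUD and READ/WRITE on DATA_PUMP_DIR.
SELECT table_name, privilege
FROM   user_tab_privs
WHERE  table_name IN ('DBMS_CLOUD', 'DATA_PUMP_DIR');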

Credential

Create a credential using dbms_cloud.create_credential. The values are the same as the ones used for creating the Oracle APEX Web Credential in the post I mentioned previously.

BEGIN
  dbms_cloud.create_credential
   (credential_name => 'obj_store_dbms_cloud_test',
    user_ocid       => 'ocid1.user.oc1..aaaaaaaae...',
    tenancy_ocid    => 'ocid1.tenancy.oc1..aaaaaaaa...',
    private_key     => 'MIIEvQIBADANBgkqhkiG9w0BAQEFAASCBKcwgg...',
    fingerprint     => 'b9:53:5a:44:t3:5a:1e:f1:27:3b:08:81:32:c6:32:46');
END;
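
Before moving on, it is worth a quick check that the credential works. One way (a sketch; swap in your own Object Storage namespace and bucket in the URI) is to list the bucket contents with dbms_cloud.list_objects:

-- Replace the namespace (xxxxx) and bucket with your own values.
SELECT object_name, bytes
FROM   dbms_cloud.list_objects
         ('obj_store_dbms_cloud_test',
          'https://objectstorage.us-ashburn-1.oraclecloud.com/n/xxxxx/b/dbms_cloud_test/o/');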

The Performance Tests

🎬 OK, now we are ready to run the performance tests.

APEX_DATA_PARSER

Here is the code to perform the load using APEX_DATA_PARSER and APEX_ZIP. You may also notice that I used dbms_cloud.get_object to get the zip file from Object Storage. You can use this as an alternative to APEX_WEB_SERVICE if you like.

BEGIN
  -- Truncate the Table before loading.
  EXECUTE IMMEDIATE 'TRUNCATE TABLE EV_SALES_BY_YEAR';
  -- Fetch the File from OCI, Unzip it, Parse it and Load It.
  INSERT INTO EV_SALES_BY_YEAR (DATA_YEAR,COUNTY,FUEL_TYPE,MAKE,MODEL,NUMBER_OF_VEHICLES)
  SELECT col001  data_year
  ,      col002  COUNTY
  ,      col003  FUEL_TYPE
  ,      col004  MAKE
  ,      col005  MODEL
  ,      col006  NUMBER_OF_VEHICLES
  FROM   apex_data_parser.parse
           (p_content => apex_zip.get_file_content 
                          (p_zipped_blob => dbms_cloud.get_object
                                             (credential_name => 'obj_store_dbms_cloud_test',
                                              object_uri      => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/xxxxx/b/dbms_cloud_test/o/CA_EV_Sales.csv.zip'),
                          p_file_name    => 'CA_EV_Sales.csv'),
            p_skip_rows => 1,
            p_detect_data_types => 'N',
            p_file_name => 'CA_EV_Sales.csv');

END;
โฑ
I ran this test 3 times, and the load took an average of 9.1 seconds.
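
If you want to reproduce the timings yourself, a simple wrapper like this (a sketch; any timing method works) reports the elapsed time of a run:

DECLARE
  l_start  PLS_INTEGER;
BEGIN
  l_start := dbms_utility.get_time;  -- hundredths of a second
  -- Run the load block from above here.
  dbms_output.put_line('Elapsed: ' || (dbms_utility.get_time - l_start) / 100 || ' seconds');
END;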

DBMS_CLOUD.COPY_DATA

The code to load the data using dbms_cloud.copy_data looks like this. I have included comments to describe each of the parameters.

DECLARE
  l_operation_id  user_load_operations.id%TYPE;
BEGIN
  -- Truncate the table prior to load.
  EXECUTE IMMEDIATE 'TRUNCATE TABLE EV_SALES_BY_YEAR';
  -- Stream, Parse and Load the File.
  dbms_cloud.copy_data
    -- Table we want to load data into.
   (table_name      => 'EV_SALES_BY_YEAR',
    -- Credential we created above using dbms_cloud.create_credential
    credential_name => 'obj_store_dbms_cloud_test',
    -- CSV List of fields from the CSV file, 
    --   listed in the same order as columns in the table.
    -- The field names do not have to match the column names.
    -- It is a good idea to specify field sizes for VARCHAR2 columns
    --  as I have done here. Otherwise, dbms_cloud.copy_data 
    --  assumes a default of only 240 characters.
    field_list      => 'data_year,county CHAR(50),fuel_type CHAR(50),make CHAR(50),model CHAR(50),number_of_vehicles',    
    -- File name, CSV List of Files, or Wildcard Expression
    --  representing the file(s) to load.
    file_uri_list   => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/xxxx/b/dbms_cloud_test/o/CA_EV_Sales.gzip',
    -- Variable returned by dbms_cloud.copy_data with 
    --  the ID of the load. Track progress of the load by
    -- querying user_load_operations WHERE id = l_operation_id
    operation_id    => l_operation_id,
    -- JSON_OBJECT containing details required to perform the load.
    --   type: CSV File, compression: File is compressed by gzip
    format          => json_object ('type'                 value  'csv',
                                    'compression'          value  'gzip',
                                    'skipheaders'          value  '1',
                                    'delimiter'            value  ',',
                                    'ignoremissingcolumns' value 'true' ));
END;
โฑ
I ran this test 3 times, and the load took an average of 1.9 seconds. This is nearly 5X faster 🎉

More on DBMS_CLOUD.COPY_DATA

Checking the Result of a Load

You can track the result of the load by querying the table user_load_operations. You can get the ID for the current load from the parameter operation_id.

SELECT id
,      status
,      logfile_table
,      badfile_table
,      rows_loaded
FROM   user_load_operations
WHERE  id = 48;

You can then query the Log and Bad File external tables (created by dbms_cloud.copy_data) to get more details of any errors that occurred. These external tables point to the log and bad files on the database file system that ORACLE_LOADER generated.

Log File External Table:

Bad File External Table:
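
For example, if the query above reported COPY$48_LOG and COPY$48_BAD for this load (the COPY$XXX names are generated, so yours will differ; use the values returned in logfile_table and badfile_table), you can inspect them directly:

-- Illustrative names; substitute the values from user_load_operations.
SELECT * FROM COPY$48_LOG;
SELECT * FROM COPY$48_BAD;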

💡
You can use the value in the rows_loaded column of the table user_load_operations to get a count of rows loaded from the file for auditing purposes.
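
For example, a simple audit check (a sketch, using the operation ID from the query above) is to compare rows_loaded with the row count in the target table:

SELECT (SELECT rows_loaded FROM user_load_operations WHERE id = 48) AS rows_loaded
,      (SELECT COUNT(*) FROM ev_sales_by_year)                      AS rows_in_table
FROM   dual;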

Errors & Potential Exceptions

This section describes some scenarios where you may encounter exceptions.

Required Column in Table not Included in File

If I remove the number_of_vehicles column from the CSV file (which is required in the table) as follows:

data_year,county,fuel_type,make,model
1998,Los Angeles,Electric,Ford,Ranger
1998,Orange,Electric,Ford,Ranger

Result: โŒ Error. The following exception is raised by dbms_cloud.copy_data:

Error report -
ORA-20000: ORA-01400: cannot insert NULL into 
   ("CNAPPS"."EV_SALES_BY_YEAR"."NUMBER_OF_VEHICLES")

Extra Column in CSV File not in Table

Having an extra column in the CSV file that is not represented in the table is fine. In this example, the last column (extra) does not exist in the table.

data_year,county,fuel_type,make,model,number_of_vehicles,extra
1998,Los Angeles,Electric,Ford,Ranger,1,2
1998,Orange,Electric,Ford,Ranger,1,2

Result: ✅ Success

File Column Name Does not Match Table Column Name

What if the column names in the CSV file do not match the column names in the DB table? In this example, the header names differ from the table columns (e.g., num_vehicles instead of number_of_vehicles).

bannanas,county,fuel_type_apples,make,model,num_vehicles
1998,Los Angeles,Electric,Ford,Ranger,1
1998,Orange,Electric,Ford,Ranger,1

Result: ✅ Success. The columns in the CSV file must be in the same order as the columns in the DB table, but they do not have to have the same names.

Invalid Value in File

What if you have invalid values in the file? In this example, row 2 has an invalid value (non-numeric) for number_of_vehicles.

data_year,county,fuel_type,make,model,number_of_vehicles
1998,Los Angeles,Electric,Ford,Ranger,ABC123
1998,Orange,Electric,Ford,Ranger,1

Result: โŒ Error. In this case, an exception is raised:

Error report -
ORA-20003: Reject limit reached, query table "CNAPPS"."COPY$59_LOG" for error details
ORA-06512: at "C##CLOUD$SERVICE.DBMS_CLOUD", line 1822

We can check the external LOG table for details of the error:
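
The COPY$59_LOG name comes from the error message above; a minimal query to see the load log, including the rejected record and the error, is:

SELECT * FROM COPY$59_LOG;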

Handling Dates

You can specify a specific date or timestamp format in the field_list parameter like this:

'model_release_date CHAR(22) date_format DATE mask "YYYY-MM-DD"T"HH24:MI:SS"'

If all your date/timestamp fields are formatted the same way, you can set the format strings by passing values for dateformat and/or timestampformat in the format parameter's json_object, e.g.

    format => json_object ('type'            value  'csv',
                           'skipheaders'     value  '1',
                           'timestampformat' value  'YYYY-MM-DD"T"HH24:MI:SS',
                           'delimiter'       value  ','));

Other Considerations

Here are some other things to consider when working with dbms_cloud.copy_data:

  • While dbms_cloud.copy_data does a lot of the heavy lifting for you, you still need to put controls around it, including handling errors gracefully and verifying file record counts vs. loaded record counts, etc.

  • dbms_cloud.copy_data allows you to load multiple files at a time.

  • It is good practice to include the size of your VARCHAR2 fields in the field_list parameter. dbms_cloud.copy_data assumes 240 characters for a VARCHAR2 field by default.

  • dbms_cloud.copy_data does not support ZIP compression. It does support gzip, zlib, and bzip2. If you set the compression option to auto, it will determine the compression type for you.

  • Even though I have featured CSV in this post, you can also load Avro, Datapump, Orc, Parquet, and JSON formatted files using dbms_cloud.copy_data.

  • You can also use dbms_cloud.copy_data to load data from sources other than OCI Object Storage, including GitHub, Azure & AWS S3.

  • Pass a value for logdir in the format parameter to tell Oracle which directory to create the log and bad files in. The default is DATA_PUMP_DIR.

  • Pass a value for logretention in the format parameter to tell Oracle how long you want the log and bad files/external tables to be kept. The default is 2 days.

    • You can also use DBMS_CLOUD.DELETE_OPERATION(<operation_id>); to delete artifacts related to a specific copy, or use DBMS_CLOUD.DELETE_ALL_OPERATIONS; to delete all operations and related artifacts.

  • You can install the DBMS_CLOUD utilities on-premises by following the Oracle Support Note 'How To Setup And Use DBMS_CLOUD Package (Doc ID 2748362.1)', or this post.

  • Documentation for dbms_cloud.copy_data.

  • Documentation for format parameter options. I encourage you to review all of these options!

Conclusion

I hope you found this post useful. The dbms_cloud.copy_data PL/SQL API helps reduce the amount of code you have to write and can significantly improve performance over other methods for loading data.

๐Ÿ™ Thanks

  • A big Thank You to Matt Paine for alerting me to this great utility and providing some excellent insights into what to watch out for when using it.

  • Also, many thanks to Sanket Jain from the Autonomous Database group for reaching out to me and correcting me on how dbms_cloud.copy_data works behind the scenes.
