DBMS_CLOUD : Loading a TEXT file without any DELIMITER

Recently I was working on an issue related to loading data from a flat file on to Autonomous Data Warehouse using dbms_cloud api. While lot has been written on this subject, I thought of writing on this as the challenge here was that the file in this case was without any delimiter. The Original data to be loaded was around 90 Million, but for testing purpose customer was loading around 200k rows.

What is a Delimiter? Delimiter is one or more character that separates two strings. These characters are used to identify different columns.

One of the best blog on this topic is this blog from Nilay Panchal.

Coming back to my topic of this blog. In this case, customer was trying to load 200k rows into a table and the way they were using it, the time it took to load these 200k rows was around 25 minutes which is too huge considering that actual production number of rows would be 90 Million.

As mentioned, the challenge here was that the content in this flat file had no delimiter or any character to identify column break. The original flat file had around 500 columns, but for testing the file was truncated to 5 columns. The idea here is to demonstrate an efficient way of loading the data into a table when the data in the flat file is without any delimiter. The sample data is as under:

CCODEDEBUTRUNSHUNDREDSPLAYER_NAME
IND151119891592151SACHIN TENDULKAR
AUS081219951337841   RICKY PONTING
 SA141219951328945  JACQUES KALLIS
 SL200720001240038 KUMAR SANGAKARA
 WI061219901195334      BRIAN LARA
IND20062011 720227     VIRAT KOHLI

Our very first step will be to create cloud object storage credentials using dbms_cloud.create_credentials. If you have already created a credential, then you can skip this. Check dba_credentials for the list of Credentials.

begin
  dbms_cloud.create_credential(
    credential_name=>'VIVEK_CREDENTIAL',
    username=>'VIVEKS',
    password=>'paste_auth_token_here'
  );
end;
/

SQL> column owner for a20
SQL> column credential_name for a30
SQL> column username for a20
SQL> select owner, credential_name, username from dba_credentials;

OWNER		     CREDENTIAL_NAME		    USERNAME
-------------------- ------------------------------ --------------------
ADMIN		     VIVEK_CREDENTIAL		    VIVEKS

I have uploaded by data.txt file that contains the required data to be loaded on the Object Storage. Next Step will be create a table in which the data is to be loaded. For COPY_DATA, you need to ensure the Target Table is already created as the procedure would fail.

CREATE TABLE CRICKETER_DATA
   (CCODE VARCHAR2(3) ,
   DEBUT DATE,
   RUNS  NUMBER(5),
   HUNDREDS NUMBER(2),
   PLAYER_NAME VARCHAR2(16));

Now, I will create by COPY_DATA procedure with the required inputs. It is important to note the format and field_list option. In my case, the date was in DDMMYYYY format and hence I had to specify the dateformat explicitly to allow copy_data to parse the date accurately. Without the dateformat value, my copy_data failed with an error. The other parameter is field_list, which ensures that in the absence of delimiter, the copy_data parses the column values based on the start and end position. You need to ensure that these position are accurate.

begin
dbms_cloud.copy_data(
    table_name =>'CRICKETER_DATA',
    credential_name =>'VIVEK_CREDENTIAL',
    file_uri_list =>'&MY_OBJECT_STORAGE_URL/data.txt',
    format => json_object('skipheaders' value '1', 'dateformat' value 'ddmmyyyy'),
    field_list =>'CCODE POSITION(1:3), DEBUT POSITION(4:11), RUNS POSITION(12:16), HUNDREDS POSITION(17:18), PLAYER_NAME POSITION(19:34)');
end;
/

PL/SQL procedure successfully completed.

Elapsed: 00:00:01.50
SQL> SQL> select * from cricketer_data;

CCO DEBUT		       RUNS   HUNDREDS PLAYER_NAME
--- -------------------- ---------- ---------- ----------------
IND 15-NOV-1989 00:00:00      15921	    51 SACHIN TENDULKAR
AUS 08-DEC-1995 00:00:00      13378	    41	  RICKY PONTING
 SA 14-DEC-1995 00:00:00      13289	    45	 JACQUES KALLIS
 SL 20-JUL-2000 00:00:00      12400	    38	KUMAR SANGAKARA
 WI 06-DEC-1990 00:00:00      11953	    34	     BRIAN LARA
IND 20-JUN-2011 00:00:00       7202	    27	    VIRAT KOHLI

6 rows selected.

If for some reason the copy_data procedure fails, it creates a error log table, which is reported immediately along with the error number. To drill down to the actual error, you need to query the log table. For example, when I omitted FORMAT option from my COPY_DATA procedure, it failed with following error. I queried the COPY$22_LOG to extract the actual error.


select * from COPY$21_LOG;

RECORD
------------------------------------------------------------------------------------------------------------------------------------
 LOG file opened at 01/28/20 05:15:18

Total Number of Files=1

Data File: &MY_OBJECT_STORAGE_URL/data.txt

Log File: COPY$21_331103.log

 LOG file opened at 01/28/20 05:15:18

Bad File: COPY$21_331103.bad

Field Definitions for table COPY$Q18IZ07MUUMYRU3IG6MU
  Record format DELIMITED BY
  Data in file has same endianness as the platform
  Rows with all null fields are accepted

  Fields in Data Source:

    CCODE			    CHAR (3)
      Record position (1, 3)
      Terminated by "|"
    DEBUT			    CHAR (8)
      Record position (4, 11)
      Terminated by "|"
    RUNS			    CHAR (5)
      Record position (12, 16)
      Terminated by "|"
    HUNDREDS			    CHAR (2)
      Record position (17, 18)
      Terminated by "|"
    PLAYER_NAME 		    CHAR (16)
      Record position (19, 34)
      Terminated by "|"
error processing column DEBUT in row 1 for datafile &MY_OBJECT_STORAGE_URL/data.txt

ORA-01858: a non-numeric character was found where a numeric was expected

40 rows selected.

Autonomous Database ! ADW and ATP. Why do I need both ?

Autonomous Data Warehouse (ADW) was announced in March 2018 and then Autonomous Transaction Processing (ATP) is the latest addition to the Autonomous Database family. ATP is for Mixed Workload and therefore provides 4 Service Names. These are :

HIGH :Highest level of CPU and IO Resources. Concurrency with this Service Name depends upon the number of OCPUs and will scale accordingly.

MEDIUM :Lowest level of CPU and IO Resources. Concurrency depends on the number of CPU’s and will scale accordingly.

LOW :Least amount of CPU and IO Resources. Again, Concurrency depends on the number of CPU’s and the Scaling is directly proportional to the number of OCPU’s.

ADW provides these three Service Names as well. However, unlike ADW (see my previous blog), the definition of these Services are different. In ADW, HIGH provides highest level of Resources but less concurrency, whereas, in ATP, HIGH do provide highest level of Resources but also provides Highest level of Concurrency. Since ATP is designed for Mixed Workload, it provides an additional Service Name PARALLEL. It provides high level of resources but lowest concurrency and is useful for Batch Processes or Heavy Reporting.

Now, many of the technical folks have raised this question on “Why do we need Both? Can’t we use ATP to take care of Data Warehouse kind of a load?”. ATP very well does that by providing an option to run Mixed Workload. However, there are many implementations that are purely Data Warehouse. The problem is that the amount of data that we generate, Warehouse Databases have grown massive in size supporting large amount of complex queries involving many nested joins and expecting sub-second response time. Further, these Databases house data from many different sources. Optimization Strategy for an OLTP and Data Warehouses are completely different. Therefore, an OLTP kind of a setup cannot (in most of the cases) provide the response time needed for DWH Queries.

Row Store v/s Columnar Store

Now, coming back to ATP v/s ADW argument. In an ADW, the data loaded is by default compressed to Oracle Hybrid Columnar Compression, whereas ATP uses a Row Store Format. What this means is that in an ADW, the data stored at disk is in a Columnar format, which is a proven format for speeding up Analytics kind of a Workload where we query few columns. All Data Warehouse Implementations, like Redshift, Google Big Query etc. store data in Columnar format. On the other hand, ATP stores the data in Row format which is ideal for OLTP that demands Single or few row(s) updates or deletes or queries small number of rows.

Since the data in ADW is stored Compressed in Columnar format, it allows scanning of Billions of Rows per CPU cycle. The data for a particular column is stored contiguously making it possible to scan through large number of rows in a single CPU cycles. This technique is called SIMD (Single Instruction Multiple Dataset) processing. To meet a sub-second response time, this technique is plays a significant role.

Such large scans (scanning billion of rows per second) is not possible with Row Store and one of the reasons for this can be found in one of my blog published in 2015 which is still very much relevant. Skipping of Columns consume CPU Cycles which is demonstrated in the blog post that can be found here.

Through this short blog, I wanted to clear the confusion around ADW and ATP. Both have their own use cases and thus are optimized to manage different workloads. Both of these Autonomous Databases have some exciting features. Stay tuned for more on these.

Autonomous Database Tech Day ! Gurgaon

 

#Autonomous #AIOUG Presenting a Full Day Event on 8th September 2018 for North India Oracle User Group in Gurgaon. This is on Oracle Autonomous Database. Would be covering some interesting technical capabilities of Autonomous Databases. I am covering the 2 offerings i.e. Autonomous Data Warehouse and Autonomous Transaction Processing. For Registration, click on the following link :

Meraevents Link

This being an Oracle User Group Session, focus would be on the Technical Capabilities of ADW / ATP, like Parallel Processing, Concurrency, Optimizer Enhancements and Behaviour and most importantly, Competition.

So, North India Folks : See you all on 8th September 2018.