DBMS_CLOUD : Loading a TEXT file without any DELIMITER
January 28, 2020 1 Comment
Recently I was working on an issue related to loading data from a flat file on to Autonomous Data Warehouse using dbms_cloud api. While lot has been written on this subject, I thought of writing on this as the challenge here was that the file in this case was without any delimiter. The Original data to be loaded was around 90 Million, but for testing purpose customer was loading around 200k rows.
What is a Delimiter? Delimiter is one or more character that separates two strings. These characters are used to identify different columns.
One of the best blog on this topic is this blog from Nilay Panchal.
Coming back to my topic of this blog. In this case, customer was trying to load 200k rows into a table and the way they were using it, the time it took to load these 200k rows was around 25 minutes which is too huge considering that actual production number of rows would be 90 Million.
As mentioned, the challenge here was that the content in this flat file had no delimiter or any character to identify column break. The original flat file had around 500 columns, but for testing the file was truncated to 5 columns. The idea here is to demonstrate an efficient way of loading the data into a table when the data in the flat file is without any delimiter. The sample data is as under:
CCODEDEBUTRUNSHUNDREDSPLAYER_NAME IND151119891592151SACHIN TENDULKAR AUS081219951337841 RICKY PONTING SA141219951328945 JACQUES KALLIS SL200720001240038 KUMAR SANGAKARA WI061219901195334 BRIAN LARA IND20062011 720227 VIRAT KOHLI
Our very first step will be to create cloud object storage credentials using dbms_cloud.create_credentials. If you have already created a credential, then you can skip this. Check dba_credentials for the list of Credentials.
begin
dbms_cloud.create_credential(
credential_name=>'VIVEK_CREDENTIAL',
username=>'VIVEKS',
password=>'paste_auth_token_here'
);
end;
/
SQL> column owner for a20
SQL> column credential_name for a30
SQL> column username for a20
SQL> select owner, credential_name, username from dba_credentials;
OWNER CREDENTIAL_NAME USERNAME
-------------------- ------------------------------ --------------------
ADMIN VIVEK_CREDENTIAL VIVEKS
I have uploaded by data.txt file that contains the required data to be loaded on the Object Storage. Next Step will be create a table in which the data is to be loaded. For COPY_DATA, you need to ensure the Target Table is already created as the procedure would fail.
CREATE TABLE CRICKETER_DATA (CCODE VARCHAR2(3) , DEBUT DATE, RUNS NUMBER(5), HUNDREDS NUMBER(2), PLAYER_NAME VARCHAR2(16));
Now, I will create by COPY_DATA procedure with the required inputs. It is important to note the format and field_list option. In my case, the date was in DDMMYYYY format and hence I had to specify the dateformat explicitly to allow copy_data to parse the date accurately. Without the dateformat value, my copy_data failed with an error. The other parameter is field_list, which ensures that in the absence of delimiter, the copy_data parses the column values based on the start and end position. You need to ensure that these position are accurate.
begin
dbms_cloud.copy_data(
table_name =>'CRICKETER_DATA',
credential_name =>'VIVEK_CREDENTIAL',
file_uri_list =>'&MY_OBJECT_STORAGE_URL/data.txt',
format => json_object('skipheaders' value '1', 'dateformat' value 'ddmmyyyy'),
field_list =>'CCODE POSITION(1:3), DEBUT POSITION(4:11), RUNS POSITION(12:16), HUNDREDS POSITION(17:18), PLAYER_NAME POSITION(19:34)');
end;
/
PL/SQL procedure successfully completed.
Elapsed: 00:00:01.50
SQL> SQL> select * from cricketer_data;
CCO DEBUT RUNS HUNDREDS PLAYER_NAME
--- -------------------- ---------- ---------- ----------------
IND 15-NOV-1989 00:00:00 15921 51 SACHIN TENDULKAR
AUS 08-DEC-1995 00:00:00 13378 41 RICKY PONTING
SA 14-DEC-1995 00:00:00 13289 45 JACQUES KALLIS
SL 20-JUL-2000 00:00:00 12400 38 KUMAR SANGAKARA
WI 06-DEC-1990 00:00:00 11953 34 BRIAN LARA
IND 20-JUN-2011 00:00:00 7202 27 VIRAT KOHLI
6 rows selected.
If for some reason the copy_data procedure fails, it creates a error log table, which is reported immediately along with the error number. To drill down to the actual error, you need to query the log table. For example, when I omitted FORMAT option from my COPY_DATA procedure, it failed with following error. I queried the COPY$22_LOG to extract the actual error.
select * from COPY$21_LOG;
RECORD
------------------------------------------------------------------------------------------------------------------------------------
LOG file opened at 01/28/20 05:15:18
Total Number of Files=1
Data File: &MY_OBJECT_STORAGE_URL/data.txt
Log File: COPY$21_331103.log
LOG file opened at 01/28/20 05:15:18
Bad File: COPY$21_331103.bad
Field Definitions for table COPY$Q18IZ07MUUMYRU3IG6MU
Record format DELIMITED BY
Data in file has same endianness as the platform
Rows with all null fields are accepted
Fields in Data Source:
CCODE CHAR (3)
Record position (1, 3)
Terminated by "|"
DEBUT CHAR (8)
Record position (4, 11)
Terminated by "|"
RUNS CHAR (5)
Record position (12, 16)
Terminated by "|"
HUNDREDS CHAR (2)
Record position (17, 18)
Terminated by "|"
PLAYER_NAME CHAR (16)
Record position (19, 34)
Terminated by "|"
error processing column DEBUT in row 1 for datafile &MY_OBJECT_STORAGE_URL/data.txt
ORA-01858: a non-numeric character was found where a numeric was expected
40 rows selected.