External Partition ! Oracle Autonomous Database

Autonomous Database was released around March 2018. The first was a Data Warehouse Implementation that supports business Intelligence and Analytical Workload. It was Autonomous Data Warehouse. Next, the transaction workload was added to this family, called as Autonomous Transaction Processing followed by Autonomous JSON and Autonomous APEX.

One of the coolest feature of Autonomous Data Warehouse (ADW) is easy integration with Data Lake, which means, while we can store our data natively within our Autonomous Database – for faster processing, this feature allows us to store and access our data from an external storage. On Cloud, when we talk about external storage, we usually refer to an Object Storage. This is by way of creating an External Table. In this, the Table Metadata is stored in the Data Dictionary and the actual data is stored externally. This allows us to save on Storage cost. Refer to Oracle Documentation for more details on this feature.

External Table can be Non-Partitioned, Partitioned and Hybrid Partitioned. As with Oracle Partitioning that improves Query performance by pruning the partitions that are not needed for our Query. This same improvement applies to External Tables as well. With Partitioned External Table, you get the benefit of Partition Pruning thus scanning only required external data. Remember, Queries on External Data will not be as fast as queries on database tables. However, if I have an Archived Table that is in-frequently accessed and is not performance sensitive, I can always move these to External Storage. And, to improve the query performance on External Storage, External Partitioning is highly recommended. As mentioned earlier, the other benefit of this feature is COST. Hybrid Partitioning is basically mix of both i.e. I can have few historical partitions on an External Storage and latest performance sensitive data within the database.

In this blog, I will talk about External Partition Tables, and subsequent blogs would take this topic further.

Assuming you have a large table with around 5 year of Data. Users frequently query latest data. The table is currently non-partitioned. As the data is growing, I want to ensure consistent performance. How do I do that?

There are multiple strategies, like :

  • Archive the Data to a New Table and Purge it from the Original Table. Reports requiring latest data would run on the Original Table and Reports on the Historical Data would access the Archived Table.
  • Partition the Table and leverage the benefits of Partition Pruning. This is the most common strategy.

My topic for today is on Strategy#1 i.e. Archiving to a New Table. This solution can also be applied if a table has only historical data. Anyways, many customers do implement strategy#1 and move the table to some other database, called as Archival Database. With Autonomous Database, We don’t need to do that and we will see this in action. My subsequent blogs will cover both the strategies that can be optimized further on Autonomous Databases.

For this demonstration, I will use LINEORDER table which is under SSB (Sample Star Benchmark) Schema. This schema is created by-default when you provision an Autonomous Database. It’s a huge table. Therefore, I will create another table as Sample 2% and will use it as my Source table. The very first step would be to identify the partition strategy. In this case, monthly partition on LO_ORDERDATE column is a first choice. However, the other requirement is to have sub-partition it on LO_TAX column, based on Tax Codes. So, what we need here is an External Table with Monthly partitions and Sub-Partitions on LO_TAX.

OT  OWNER		 NAME				  NUM_ROWS     BLOCKS 
--- -------------------- ------------------------------ ---------- ----------
1T  ADMIN		 LINEORDER			 119993317     644744

Next step is to move the data on to the External Storage (an Object Storage). This data will be moved keeping our partitioning strategy in mind. Before I move to the next step, it is important to know that Autonomous Database support many different file formats and structure. On Autonomous Database you can create partitioned external tables from HIVE style partitioned data or from simple folder partitioned data stored on your Cloud Object Store. For example, a Hive format partitioned file will look like :

table/partition1=partition1_value/partition2=partition2_value/data_file.csv

The advantage of this approach is the partitioned information is available in the data file path name itself and therefore, the partitioned columns need not be part of the actual data. Partition pruning is done directly based on the path name. So, in our case the path name would be :

LINEORDER/MONTH_YEAR=JUN1993/TAX=2/filename.gz

In this case, the files will be compressed so that these occupy less space. MONTH_YEAR is a derived column, but TAX is LO_TAX column of LINEORDER Table. Since these columns and column values are part of my file path name, my CSV files need not contain these columns and column values. With Simple Folder partitioned data, the columns and column values of the partitioned columns have to be in the data files. So, in this case since I am using HIVE style partitioned data, I will dynamically create folders with column name and column values.

In order to access my Object Storage, the very first mandatory requirements is to create a credentials. Alternatively, you can use OCI Resource Principals. I am assuming you have all the required credentials. If not, please follow the documentation.

Next, I will use dbms_cloud.export_data api to export the data into compressed CSV files on to my Object Storage. To speed up the process, I created a CONTROL_TABLE with distinct MONYYYY values from LINEORDER and then scheduled a pl/sql block from three different sessions. This pl/sql block will export the data for a specified Month and a Tax Code. The data inserted into the control table has a condition to_char(lo_orderdate,’MONYYYY’)!=’AUG1998′. I will explain this later.

create table control_table (
    partition_name varchar2(100),
    partition_number  number,
    status char(1)
);

insert into control_table(partition_name, status)
SELECT DISTINCT TO_CHAR(LO_ORDERDATE,'MONYYYY') MONTH_YEAR, 'N' FROM LINEORDER WHERE to_char(lo_orderdate,'MONYYYY')!='AUG1998';
COMMIT;
UPDATE control_table set partition_number=rownum;
commit;


DECLARE
  FILE_URI_O	VARCHAR2(1000);
  FILE_URI_F	VARCHAR2(128);
  FILE_URI	VARCHAR2(1000);
  L_STATEMENT	VARCHAR2(1000);
BEGIN
   FOR OUTER_LOOP IN (SELECT PARTITION_NAME MONTH_YEAR, PARTITION_NUMBER FROM CONTROL_TABLE WHERE PARTITION_NUMBER>=1 AND PARTITION_NUMBER<=25 AND STATUS='N' ORDER BY PARTITION_NUMBER)
   LOOP
      FOR INNER_LOOP IN (SELECT DISTINCT LO_TAX TAX FROM LINEORDER WHERE TO_CHAR(LO_ORDERDATE,'MONYYYY')=OUTER_LOOP.MONTH_YEAR)
      LOOP
            L_STATEMENT:='select  lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_orderpriority, lo_shippriority, lo_quantity, lo_extendedprice, lo_ordtotalprice, lo_discount, lo_revenue, lo_supplycost, lo_commitdate, lo_shipmode from lineorder where to_char(lo_orderdate,''MONYYYY'')='||''''||OUTER_LOOP.MONTH_YEAR||''''||' and lo_tax='||INNER_LOOP.TAX||'';
            FILE_URI_F:='MONTH_YEAR='||OUTER_LOOP.MONTH_YEAR||'/TAX='||INNER_LOOP.TAX||'/';
    	    FILE_URI_O:='https://objectstorage.us-ashburn-1.oraclecloud.com/n/my_tenancy/b/External_Partitions/o/LINEORDER/'||FILE_URI_F;
    	    FILE_URI:=FILE_URI_O||OUTER_LOOP.MONTH_YEAR||'_'||INNER_LOOP.TAX;
    	    DBMS_CLOUD.EXPORT_DATA(CREDENTIAL_NAME => 'MY_CREDENTIAL', FILE_URI_LIST => FILE_URI, QUERY=>L_STATEMENT, FORMAT=> JSON_OBJECT('TYPE' VALUE 'CSV', 'quote' value '"', 'COMPRESSION' VALUE 'GZIP', 'MAXFILESIZE' VALUE '20485760'));
      END LOOP;
      UPDATE CONTROL_TABLE SET STATUS='Y' WHERE PARTITION_NUMBER=OUTER_LOOP.PARTITION_NUMBER;
      COMMIT;
   END LOOP;
END;
/

L_STATEMENT variable has the query that is used to export the data. I have the required columns in the SELECT Clause alongwith the relevant WHERE predicate. Each run of the statement will create a CSV file for a required partition and sub-partition. Further, I dynamically create the folders. If the folders doesn’t exist, the api will create it for us else it will create the relevant file within it’s folder. The folder name has the partition column name and it’s value. For example, I will have folder names as:

HIVE Format Folders and Files

Now, I have all the data on to my Object Storage and these are HIVE Format. I have compressed CSV files for each Month and Tax. I will now create an External Partition Table. This is done using dbms_cloud.create_external_part_table api.

BEGIN
    DBMS_CLOUD.CREATE_EXTERNAL_PART_TABLE (
        table_name => 'LINEORDER_EXT_PART',
        credential_name => 'MY_CREDENTIAL',
        file_uri_list => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/my_tenancy/b/External_Partitions/o/LINEORDER/*.gz',
        column_list =>
            'LO_ORDERKEY NUMBER,
             LO_LINENUMBER number,
             LO_CUSTKEY NUMBER,
             LO_PARTKEY NUMBER,
             LO_SUPPKEY NUMBER,
             LO_ORDERDATE DATE,
             LO_ORDERPRIORITY CHAR(15),
             LO_SHIPPRIORITY CHAR(1),
             LO_QUANTITY NUMBER,
             LO_EXTENDEDPRICE NUMBER,
             LO_ORDTOTALPRICE NUMBER,
             LO_DISCOUNT NUMBER,
             LO_REVENUE NUMBER,
             LO_SUPPLYCOST NUMBER,
             LO_COMMITDATE NUMBER,
             LO_SHIPMODE CHAR(10),
             MONTH_YEAR VARCHAR2(12),
             TAX NUMBER',
        format => '{"type":"csv", "skipheaders":0, "dateformat":"DD-MON-YYYY HH24:MI:SS", "compression":"gzip", "partition_columns":["MONTH_YEAR","TAX"]}');
END;
/

This creates our external partition table. The data resides on our Object Storage and the metadata is stored in the data dictionary. Now, you can run your queries against this table and the data will be retrieved from the Object Storage.

Time to run some queries and check the run time plan. Run time pan will help us validate whether the queries are able to leverage the benefits of partition pruning. We will join this external partitioned table to another table with data stored natively and from the run time plan, we see that Adaptive Optimization kicks in as well.

# Query 1
select lo_custkey, lo_orderdate, lo_shipmode, lo_quantity from lineorder_ext_part where month_year='AUG1993' and tax=2 and lo_orderkey=3517992324;

LO_CUSTKEY LO_ORDERDATE 	LO_SHIPMOD LO_QUANTITY
---------- -------------------- ---------- -----------
  29663183 10-AUG-1993 00:00:00 MAIL		     9

Elapsed: 00:00:01.75

SQL> select * from table(dbms_xplan.display_cursor(format=>'ALLSTATS LAST -bytes +predicate +note +cost +adaptive'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	dmak3yzpw71a2, child number 0
-------------------------------------
select lo_custkey, lo_orderdate, lo_shipmode, lo_quantity from
lineorder_ext_part where month_year='AUG1993' and tax=2 and
lo_orderkey=3517992324

Plan hash value: 1982497926

--------------------------------------------------------------------------------
| Id  | Operation		    | Name		 | E-Rows | Cost (%CPU)|
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	    |			 |	  |	3 (100)|
|   1 |  PARTITION LIST SINGLE	    |			 |	1 |	3   (0)|
|*  2 |   EXTERNAL TABLE ACCESS FULL| LINEORDER_EXT_PART |	1 |	3   (0)|
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("MONTH_YEAR"='AUG1993' AND "TAX"=2 AND
	      "LO_ORDERKEY"=3517992324))
	 
     
# Query 2
select lo_custkey, c_name, lo_orderdate, lo_shipmode, lo_quantity 
from 	lineorder_ext_part,
	vivek.customer 
where 	month_year='DEC1997' 
and 	tax=0
and 	lo_orderkey=3187415872
and	c_custkey = lo_custkey;

LO_CUSTKEY C_NAME		     LO_ORDERDATE	  LO_SHIPMOD LO_QUANTITY
---------- ------------------------- -------------------- ---------- -----------
  11640322 Customer#011640322	     04-DEC-1997 00:00:00 FOB		       6

Elapsed: 00:00:02.27
SQL> SQL> @apc
SQL> set echo on
SQL> set lines 200
SQL> set pages 1000
SQL> select * from table(dbms_xplan.display_cursor(format=>'ALLSTATS LAST -bytes +predicate +note +cost +adaptive'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	6pn53c37c2dfb, child number 0
-------------------------------------
select lo_custkey, c_name, lo_orderdate, lo_shipmode, lo_quantity from
lineorder_ext_part,  vivek.customer where  month_year='DEC1997' and
tax=0 and  lo_orderkey=3187415872 and c_custkey = lo_custkey

Plan hash value: 1739419959

--------------------------------------------------------------------------------------
|   Id	| Operation			  | Name	       | E-Rows | Cost (%CPU)|
--------------------------------------------------------------------------------------
|     0 | SELECT STATEMENT		  |		       |	|     5 (100)|
|- *  1 |  HASH JOIN			  |		       |      1 |     5   (0)|
|     2 |   NESTED LOOPS		  |		       |      1 |     5   (0)|
|     3 |    NESTED LOOPS		  |		       |      1 |     5   (0)|
|-    4 |     STATISTICS COLLECTOR	  |		       |	|	     |
|     5 |      PARTITION LIST SINGLE	  |		       |      1 |     3   (0)|
|  *  6 |	EXTERNAL TABLE ACCESS FULL| LINEORDER_EXT_PART |      1 |     3   (0)|
|  *  7 |     INDEX UNIQUE SCAN 	  | CUSTOMER_PK        |      1 |     1   (0)|
|     8 |    TABLE ACCESS BY INDEX ROWID  | CUSTOMER	       |      1 |     2   (0)|
|-    9 |   TABLE ACCESS FULL		  | CUSTOMER	       |      1 |     2   (0)|
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("C_CUSTKEY"="LO_CUSTKEY")
   6 - filter(("MONTH_YEAR"='DEC1997' AND "TAX"=0 AND
	      "LO_ORDERKEY"=3187415872))
   7 - access("C_CUSTKEY"="LO_CUSTKEY")

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1
   - this is an adaptive plan (rows marked '-' are inactive)
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level

A query against all the sub-partitions or partitions works as well.

SQL> select lo_orderkey, lo_orderdate, lo_shipmode, lo_quantity from lineorder_ext_part where month_year='AUG1993' and lo_custkey=29663183;

LO_ORDERKEY LO_ORDERDATE	 LO_SHIPMOD LO_QUANTITY
----------- -------------------- ---------- -----------
 3517992324 10-AUG-1993 00:00:00 MAIL		      9
 2251158885 21-AUG-1993 00:00:00 REG AIR	     29

Elapsed: 00:00:12.28
SQL> @apc
SQL> set echo on
SQL> set lines 200
SQL> set pages 1000
SQL> select * from table(dbms_xplan.display_cursor(format=>'ALLSTATS LAST -bytes +predicate +note +cost +adaptive'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	5m709tc22ua7u, child number 0
-------------------------------------
select lo_orderkey, lo_orderdate, lo_shipmode, lo_quantity from
lineorder_ext_part where month_year='AUG1993' and lo_custkey=29663183

Plan hash value: 146054153

--------------------------------------------------------------------------------
| Id  | Operation		    | Name		 | E-Rows | Cost (%CPU)|
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	    |			 |	  |    14 (100)|
|   1 |  PARTITION LIST ITERATOR    |			 |	7 |    14  (15)|
|*  2 |   EXTERNAL TABLE ACCESS FULL| LINEORDER_EXT_PART |	7 |    14  (15)|
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("MONTH_YEAR"='AUG1993' AND "LO_CUSTKEY"=29663183))

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level


27 rows selected.

From the queries that we executed against this External Partitioned Table, we can clearly see that for the application it doesn’t matter whether the data is fetched from within the database storage or from an External Storage. It is completely transparent to the application. One important point to note here is that these External Tables are READ ONLY and the Data is not managed by the Database. Now, lets add some more data to this External Partition Table. While generating thh CSV files, I excluded the data for AUGUST 1998. I will not generate the CSV files for AUG1998 and we will see whether I am able to query the data.

DECLARE
  FILE_URI_O	VARCHAR2(1000);
  FILE_URI_F	VARCHAR2(128);
  FILE_URI	VARCHAR2(1000);
  L_STATEMENT	VARCHAR2(1000);
BEGIN
      FOR INNER_LOOP IN (SELECT DISTINCT LO_TAX TAX FROM LINEORDER WHERE TO_CHAR(LO_ORDERDATE,'MONYYYY')='AUG1998')
      LOOP
            L_STATEMENT:='select  lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_suppkey, lo_orderdate, lo_orderpriority, lo_shippriority, lo_quantity, lo_extendedprice, lo_ordtotalprice, lo_discount, lo_revenue, lo_supplycost, lo_commitdate, lo_shipmode from lineorder where to_char(lo_orderdate,''MONYYYY'')='||''''||'AUG1998'||''''||' and lo_tax='||INNER_LOOP.TAX||'';
            FILE_URI_F:='MONTH_YEAR=AUG1998/TAX='||INNER_LOOP.TAX||'/';
    	    FILE_URI_O:='https://objectstorage.us-ashburn-1.oraclecloud.com/n/my_tenancy/b/External_Partitions/o/LINEORDER/'||FILE_URI_F;
    	    FILE_URI:=FILE_URI_O||'AUG1998_'||INNER_LOOP.TAX;
    	    DBMS_CLOUD.EXPORT_DATA(CREDENTIAL_NAME => 'MY_CREDENTIAL', FILE_URI_LIST => FILE_URI, QUERY=>L_STATEMENT, FORMAT=> JSON_OBJECT('TYPE' VALUE 'CSV', 'quote' value '"', 'COMPRESSION' VALUE 'GZIP', 'MAXFILESIZE' VALUE '20485760'));
      END LOOP;
END;
/

Now, let me run a query against this newly added CSV file. While the files are uploaded and available in the correct HIVE format folders, you can see that Oracle could not access the data and the run time plan says PARTITION LIST EMPTY.

SQL> select lo_orderkey, lo_orderdate, lo_shipmode, lo_quantity from lineorder_ext_part where month_year='AUG1998' and tax=6 and lo_custkey=16287940;

no rows selected

Elapsed: 00:00:00.11
SQL> @apc
SQL> set lines 200
SQL> set pages 1000
SQL> select * from table(dbms_xplan.display_cursor(format=>'ALLSTATS LAST -bytes +predicate +note +cost +adaptive'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	5s7wagh8za9bx, child number 1
-------------------------------------
select lo_orderkey, lo_orderdate, lo_shipmode, lo_quantity from
lineorder_ext_part where month_year='AUG1998' and tax=6 and
lo_custkey=16287940

Plan hash value: 3670407871

--------------------------------------------------------------------------------
| Id  | Operation		    | Name		 | E-Rows | Cost (%CPU)|
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	    |			 |	  |	2 (100)|
|   1 |  PARTITION LIST EMPTY	    |			 |	1 |	2   (0)|
|*  2 |   EXTERNAL TABLE ACCESS FULL| LINEORDER_EXT_PART |	1 |	2   (0)|
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(("MONTH_YEAR"='AUG1998' AND "TAX"=6 AND
	      "LO_CUSTKEY"=16287940))

SQL> select count(*) from lineorder_ext_part where month_year='AUG1998';

  COUNT(*)
----------
	 0

As I said earlier, External Tables are READ Only and the Data is not managed by Oracle Database. For this, we need to Refresh the External Table using SYNC_EXTERNAL_PART_TABLE procedure of dbms_cloud. Once done, the newly added CSV files would be considered as a new partition and the data can then be queried.

BEGIN
   DBMS_CLOUD.SYNC_EXTERNAL_PART_TABLE(table_name => 'LINEORDER_EXT_PART');
END;
/

SQL> select lo_orderkey, lo_orderdate, lo_shipmode, lo_quantity from lineorder_ext_part where month_year='AUG1998' and tax=6 and lo_custkey=16287940;

LO_ORDERKEY LO_ORDERDATE	 LO_SHIPMOD LO_QUANTITY
----------- -------------------- ---------- -----------
 3004306725 02-AUG-1998 00:00:00 REG AIR	     14

Elapsed: 00:00:42.34
SQL> @apc
SQL> set echo on
SQL> set lines 200
SQL> set pages 1000
SQL> select * from table(dbms_xplan.display_cursor(format=>'ALLSTATS LAST -bytes +predicate +note +cost +adaptive'));

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID	5s7wagh8za9bx, child number 1
-------------------------------------
select lo_orderkey, lo_orderdate, lo_shipmode, lo_quantity from
lineorder_ext_part where month_year='AUG1998' and tax=6 and
lo_custkey=16287940

Plan hash value: 917046205

----------------------------------------------------------------------------------
| Id  | Operation		      | Name		   | E-Rows | Cost (%CPU)|
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT	      | 		   |	    |	  2 (100)|
|   1 |  PX COORDINATOR 	      | 		   |	    |		 |
|   2 |   PX SEND QC (RANDOM)	      | :TQ10000	   |	  1 |	  2   (0)|
|   3 |    PX BLOCK ITERATOR	      | 		   |	  1 |	  2   (0)|
|*  4 |     EXTERNAL TABLE ACCESS FULL| LINEORDER_EXT_PART |	  1 |	  2   (0)|
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(("MONTH_YEAR"='AUG1998' AND "TAX"=6 AND
	      "LO_CUSTKEY"=16287940))

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 16 because of degree limit
   - Warning: basic plan statistics not available. These are only collected when:
       * hint 'gather_plan_statistics' is used for the statement or
       * parameter 'statistics_level' is set to 'ALL', at session or system level


31 rows selected.

Elapsed: 00:00:00.08
SQL> select count(*) from lineorder_ext_part where month_year='AUG1998';

  COUNT(*)
----------
     99688

Elapsed: 00:00:01.44
SQL> 

Wow! a cool feature and easy to use. Isn’t it? Last, with the latest enhancement to this feature, two new hidden columns are added to the External Table. These are file$path and file$name. These columns can be queried to check the Source File Path and Name.

SQL> select lo_orderkey, file$path, file$name from lineorder_ext_part where month_year='AUG1993' and tax=2 and lo_custkey=29663183;

LO_ORDERKEY FILE$PATH															     FILE$NAME
----------- -------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------
 3517992324 https://objectstorage.us-ashburn-1.oraclecloud.com/n/my_tenancy/b/External_Partitions/o/LINEORDER/MONTH_YEAR=AUG1993/TAX=2	     AUG1993_2_10_20220704T111509Z.csv.gz

This is very easy to implement on Autonomous Database and is a Cost Saving option. Performance insensitive and historical data can be easily moved to an external storage. Accessing this data is as simple as accessing natively stored data. Do try it out and let me know how it works.

Create Table like External – Hive

There is no end to the Technical Learning. I have been working on Oracle Database Technologies for almost 15 years and thought of learning something new. This thirst got me into Big Data and Hadoop. I started a kind of a self-learning and believe me, I found it very interesting. This blog is about one of my encounter during a practice session on Hive. Usually, I blog to help my readers to know about my experiences. This is again on the same line, but will need resolution or validation from experts across the globe.

My understanding on External and Internal Table is as under:

  • Internal Tables store the data populated into a directory specified under metastore.warehouse.dir or if the location is explicitly specified during table creation. Once the table is dropped, the underlying data is removed as well. This gives you full control over the data. I can relate this to Oracle Tables with a difference that the data is stored inside the database and it is a logical structure.
  • External Tables do not store data but points to the data. Hive doesn’t have the control over the data as it is shared by other tools like Pig etc. Dropping of External table does not  remove the data from the storage. This is similar to the External Tables of Oracle, where we create the structure of the table similar to the format of the txt or csv file.

I created an Internal Table and could observe the behavior explained above. Dropping a table removes the data as well. External table, on the other hand, does not remove it and therefore, my understanding looks fine here. Further, I was also working on a CREATE TABLE syntax and the understanding here was :

  • CREATE TABLE will create an Internal Table as this is the default.
  • For External Table, we need to specify CREATE EXTERNAL TABLE command
  • However, CREATE TABLE table_name like external_table_name will create an External table as I am creating a Table from an External Table.

 

The first 2 bullet points are fine. The problem was with the third bullet point. The outcome is as under :

## let me use the database that I created for my own practice
use vivek;

drop table emp;

create table emp (
empno int comment 'This is Employee Number',
ename string comment 'Employee Name',
country string,
city string,
zipcode int)
row format delimited fields terminated by ',';

load data local inpath '/home/cloudera/emp.txt' into table emp;

hive (vivek)> select * from emp limit 5;
OK
emp.empno       emp.ename       emp.country     emp.city        emp.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005

## Describing the table. See the Bold and underlined text

hive (vivek)> describe formatted emp;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:22:40 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/hive/warehouse/vivek.db/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        totalSize               3501
        transient_lastDdlTime   1486016570

Time taken: 0.255 seconds, Fetched: 34 row(s)

## Now, lets create an External Table. The data for which is already copied onto Hadoop

create external table emp_ext (
empno int comment 'This is Employee Number',
ename string comment 'Employee Name',
country string,
city string,
zipcode int)
row format delimited fields terminated by ','
location '/user/cloudera/emp';

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.116 seconds, Fetched: 5 row(s)

## Describing the table. See the Bold and underlined text

hive (vivek)> describe formatted emp_ext;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:31:25 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486017085

Time taken: 0.216 seconds, Fetched: 37 row(s)

Both the tables contain same data with the difference that EMP is Internal Table and EMP_EXT is an External Table. Internal Table created a directory into HDFS as can be seen from the DESCRIBE FORMATTED OUTPUT for EMP. The directory is /user/hive/warehouse/vivek.db/emp. Dropping EMP will remove this data as well, which is tried and tested. Next, I will create another table from EMP_EXT and will not specify INTERNAL or EXTERNAL keyword.


hive (vivek)> create table emp_ext1 like emp_ext location '/user/cloudera/emp';
OK
Time taken: 0.21 seconds
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.107 seconds, Fetched: 5 row(s)

## Next, Describe the new table. See the bold and underlined text

hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:41:23 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486017683

Time taken: 0.275 seconds, Fetched: 36 row(s)

The table metadata says that its an Internal Table also referred as MANAGED Tables. My understanding that creating a table from an external table will implicitly create an External table looks to be wrong here. Now, in this case, the location is /user/cloudera/emp which is also shared by EMP_EXT table. I dropped the EMP_EXT1 table to check the impact of this on EMP_EXT.

hive (vivek)> drop table emp_ext1;
OK
Time taken: 0.482 seconds
hive (vivek)> show tables;
OK
tab_name
emp
emp_ext
Time taken: 0.097 seconds, Fetched: 2 row(s)
hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
Time taken: 0.122 seconds

The drop command removed the data from HDFS and querying EMP_EXT fetched no results. This means, even if you create a table from an External table, without specifying EXTERNAL keyword for the CREATE TABLE command, it will create an INTERNAL TABLE. I reconstructed the data again i.e. dropped EMP_EXT and EMP_EXT1.

[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/emp
Found 1 items
-rw-r--r--   1 cloudera cloudera       3501 2017-02-01 22:49 /user/cloudera/emp/emp.txt

use vivek;

hive (vivek)> create external table emp_ext (
            > empno int comment 'This is Employee Number',
            > ename string comment 'Employee Name',
            > country string,
            > city string,
            > zipcode int)
            > row format delimited fields terminated by ','
            > location '/user/cloudera/emp';
OK
Time taken: 0.601 seconds

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.698 seconds, Fetched: 5 row(s)

hive (vivek)> describe formatted emp_ext;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:51:46 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486018306

Time taken: 0.287 seconds, Fetched: 37 row(s)

## Will Create EMP_EXT1

hive (vivek)> create table emp_ext1 like emp_ext location '/user/cloudera/emp';
OK
Time taken: 0.155 seconds
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.131 seconds, Fetched: 5 row(s)

hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:53:31 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        totalSize               0
        transient_lastDdlTime   1486018411

Time taken: 0.258 seconds, Fetched: 36 row(s)

The output is same as what it was during our previous execution. I am investigating it further whether my understanding “CREATE TABLE table_name like external_table_name will create an External table as I am creating a Table from an External Table” is wrong or whether it was valid for the earlier versions of Hive. Just for curiosity, I made following changes and Wow..dropping the table didn’t remove the underlying data from HDFS.

hive (vivek)> alter table emp_ext1 set TBLPROPERTIES('table type'='EXTERNAL_TABLE');
OK
Time taken: 0.375 seconds
hive (vivek)> alter table emp_ext1 set TBLPROPERTIES('EXTERNAL'='TRUE');
OK
Time taken: 0.268 seconds
hive (vivek)> describe formatted emp_ext1;
OK
col_name        data_type       comment
# col_name              data_type               comment

empno                   int                     This is Employee Number
ename                   string                  Employee Name
country                 string
city                    string
zipcode                 int

# Detailed Table Information
Database:               vivek
Owner:                  cloudera
CreateTime:             Wed Feb 01 22:53:31 PST 2017
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://quickstart.cloudera:8020/user/cloudera/emp
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        last_modified_by        cloudera
        last_modified_time      1486019158
        numFiles                0
        numRows                 -1
        rawDataSize             -1
        table type              EXTERNAL_TABLE
        totalSize               0
        transient_lastDdlTime   1486019158

Time taken: 0.206 seconds, Fetched: 40 row(s)
 
hive (vivek)> select * from emp_ext1 limit 5;
OK
emp_ext1.empno  emp_ext1.ename  emp_ext1.country        emp_ext1.city   emp_ext1.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.157 seconds, Fetched: 5 row(s)
hive (vivek)> drop table emp_ext1;
OK
Time taken: 0.207 seconds

hive (vivek)> select * from emp_ext limit 5;
OK
emp_ext.empno   emp_ext.ename   emp_ext.country emp_ext.city    emp_ext.zipcode
3980    Lancaster       USA     California      118718
3981    Fort Collins    USA     Colorado        118652
3982    Coral Springs   USA     Florida 117549
3983    Stamford        USA     Connecticut     117083
3984    Thousand Oaks   USA     California      117005
Time taken: 0.159 seconds, Fetched: 5 row(s)

The ALTER TABLE command changed the properties of the Table from Managed to External and droping this table didn’t drop tbe underlying data from HDFS, which was evident from the fact that I could query from EMP_EXT.

Looking for the answer to my query – CREATE TABLE table_name LIKE external_table – Will it create an Internal Table or an External Table ?