Loading data from AWS S3 bucket to Vertica made easy

The COPY command is an obvious choice when you need to load a significant amount of data into a Vertica database. In the past, a lot of people ran into difficulties when their data was stored in an AWS S3 bucket. At that time, the COPY command assumed that all of the data files were located on a Vertica node, and it expected the absolute path to those files on the local file system as one of its parameters. To meet those expectations, the loading process needed additional steps to make the files available locally on the Vertica node. There were a number of creative solutions: using s3cmd to pre-copy files before loading, or using s3fs to mount the S3 bucket directly on the Vertica node. However, those extra steps made the loading process much more complicated than it needed to be.

Fortunately for us, those days are in the past. Starting with Vertica 7.2 we can use the Vertica Library for Amazon Web Services, which makes the loading process much simpler. Now we can put an HTTPS URL or an S3 URL directly into the COPY command and Vertica will do the rest. Let me take you through the loading steps to show you how simple the process has become.

Do I have the Vertica Library for Amazon Web Services in my cluster?

The Vertica Library for Amazon Web Services was released as part of Vertica 7.2.2. The official release date for that version is 3/30/16. If your version is 7.2.2 or higher, you should have it.

dbadmin=> select version();
              version
------------------------------------
 Vertica Analytic Database v8.0.0-0
(1 row)

dbadmin=>

A more specific way is to check whether the AWS library functions are present in your database. Generally you will see four functions: AWS_GET_CONFIG, AWS_SET_CONFIG, S3 and S3EXPORT. This can be done in vsql with the \df command.

testdb=> \df AWS*
                         List of functions
 procedure_name | procedure_return_type | procedure_argument_types
----------------+-----------------------+--------------------------
 aws_get_config | Varchar               | Varchar
 aws_set_config | Varchar               | Varchar, Varchar
(2 rows)

testdb=> \df S3*
                         List of functions
 procedure_name | procedure_return_type | procedure_argument_types
----------------+-----------------------+--------------------------
 s3             |                       |
 s3export       | Integer, Varchar      | Any
(2 rows)

testdb=>

If you see output similar to the above, your Vertica cluster should have a functional Vertica Library for AWS.

Configuring the Vertica Library for AWS to load one file

If all you need to do is manually load one file, then configuring the Vertica Library is very simple: just supply your AWS security credentials by setting session parameters. You will need to provide your Access Key ID and Secret Access Key.

testdb=> ALTER SESSION SET UDPARAMETER FOR awslib aws_id='YOUR-AWS-ACCESS-KEY-ID';
ALTER SESSION
testdb=> ALTER SESSION SET UDPARAMETER FOR awslib aws_secret='Your123Secret123Access123Key1234567890';
ALTER SESSION
testdb=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv');
 Rows Loaded
-------------
           4
(1 row)

testdb=> select * from test_schema.test_load;
  sample_data
----------------
 First record
 SECOND RECORD
 Third Record
 Much more data
(4 rows)

testdb=>

Configuring the Vertica Library for AWS for recurring batch loads

When you are planning to set up a loader that will do regular data loads, you want to make sure that your access key and secret are safe. Hardcoding them could lead to security breaches, so you need a method that keeps the credentials readily available to the loader but inaccessible to everyone else. This can be done by storing the credentials in a table and securing them with a row access policy. Before each run, the loader configures its session by pulling the credentials from the table. The credentials are available within the session and disappear once the session is terminated.

We will start by creating a table with a row access policy. In the first statement we create the table that will hold our credentials. The first column, user_name, will be used in the row-level access policy and determines which user is able to see the credentials.

dbadmin=> CREATE TABLE credentials (user_name varchar, aws_access_key_id varchar, aws_secret_access_key varchar);
CREATE TABLE
dbadmin=>

Once the table is created, we will insert a record with our credentials. We will use ‘loader’ as the value for the column user_name. This value must be identical to the username that the loader process will use.

dbadmin=> insert into credentials values('loader','YOUR-AWS-ACCESS-KEY-ID','Your123Secret123Access123Key1234567890');
 OUTPUT
--------
      1
(1 row)

dbadmin=> commit;
COMMIT
dbadmin=>

The next step is to secure the inserted credentials. We will create a row-level access policy that checks for a match between the value in the user_name column and the currently logged-in username. We add the keyword ENABLE at the end of the statement so the policy takes effect immediately.

dbadmin=> CREATE ACCESS POLICY on credentials
dbadmin-> for rows
dbadmin-> WHERE credentials.user_name = CURRENT_USER()
dbadmin-> enable;
CREATE ACCESS POLICY
dbadmin=> select * from credentials;
 user_name | aws_access_key_id | aws_secret_access_key
-----------+-------------------+-----------------------
(0 rows)

dbadmin=>

As you can see, the SELECT statement now returns zero records.

Note: the dbadmin user still has the ability to disable or drop the policy. If that happens, users with the SELECT privilege on the credentials table will be able to see the stored credentials. Think through your operational scenarios and make sure that your AWS access keys remain secure at all times.
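One prerequisite worth calling out: in my setup the loader user already exists, and it needs the SELECT privilege on the credentials table (plus USAGE on the schema, assumed here to be public) before it can pull the keys. A minimal sketch of the grants, run as dbadmin:

-- assumes the loader user already exists and the credentials table lives in the public schema
GRANT USAGE ON SCHEMA public TO loader;
GRANT SELECT ON credentials TO loader;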

At this point we are done with the one-time configuration steps. Now we can log in as the loader user and import data from the S3 bucket.

$ vsql -U loader
Password:
loader=> select count(*) from credentials;
 count
-------
     1
(1 row)
loader=>

When we are connected as the user loader we can see one record in the credentials table. The row-level access policy matches the username of this session and allows us to see the credentials we are going to use.

loader=> SELECT AWS_SET_CONFIG('aws_id', aws_access_key_id),
loader->        AWS_SET_CONFIG('aws_secret', aws_secret_access_key) FROM credentials;
 AWS_SET_CONFIG | AWS_SET_CONFIG
----------------+----------------
 aws_id         | aws_secret
(1 row)

loader=>

At this point we set the AWS credentials for our session by running a SELECT statement with the AWS_SET_CONFIG function. This statement pulls the credentials from the protected table and sets them in our session. As you can see, the actual credentials are not visible anywhere and are not hardcoded in our loader code. The only thing the loader code contains is the SQL statement itself.

loader=> truncate table test_schema.test_load;
TRUNCATE TABLE
loader=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv');
 Rows Loaded
-------------
           4
(1 row)

loader=>

We truncated the table test_load first so that it is clear our COPY command works properly, and then ran the COPY command to load the data.

loader=> select * from test_schema.test_load;
  sample_data
----------------
 First record
 SECOND RECORD
 Third Record
 Much more data
(4 rows)

loader=>

SUCCESS!!! The table is populated with data from the S3 bucket, and we did not have to put the credentials in our code or in a configuration file. The only access to the credentials is through the dbadmin user or the loader user itself.

Other AWS_SET_CONFIG parameters

The AWS_SET_CONFIG function allows you to set a number of additional parameters besides the access keys. At the time of release it had seven additional parameters. Using those you can set the AWS region, configure a proxy server and SSL, enable debug messages, and cap send/receive speeds.
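For example, assuming your release exposes an aws_region parameter (check the documentation for your version for the exact parameter names), the region can be set and verified within the same session:

-- parameter name aws_region is an assumption; verify it against your Vertica version
SELECT AWS_SET_CONFIG('aws_region', 'us-east-1');
SELECT AWS_GET_CONFIG('aws_region');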

Management Console Integration

Starting with Vertica 7.2.3, the Vertica Library for Amazon Web Services is integrated into one of the Management Console screens. If you are using the Management Console, you can take advantage of easy S3 loads through the GUI. Navigate to your target database and click the “Load” tab at the bottom of the page. You should now see the “New S3 Data Load” button. My Management Console is version 8.0.0, and you need to be on the “Instance” portion of the page to see the “New S3 Data Load” button.

Once you click it, you will be presented with a screen where you can provide the parameters for the load, which are very similar to the parameters we used for the COPY command.

Click “Submit Load Request” to initiate the load. Once the load is completed, you will see a record in the “Load History” section of the page, which gives you detailed information about the status of your load.

The status of the load is a link to a popup with details, which displays the actual SQL command that was executed. Behind the scenes, Vertica uses a COPY command very similar to the ones we used above. Here is a screenshot of the command used in my tests.
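For our sample file, the generated statement boils down to roughly this form (the exact options reflect what you entered in the load screen, so treat this only as an illustration):

COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv');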

Exporting data from Vertica to S3 bucket

I am sure you noticed the reference to the S3EXPORT function at the beginning of this post. By now you have most likely guessed that it is also very easy to push data in the reverse direction.

loader=> SELECT s3export(sample_data USING PARAMETERS url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_export.csv') OVER () FROM test_schema.test_load;
 rows |                       url
------+-------------------------------------------------
    4 | s3://bucket-3g7vm6kd/sample_export.24289448.csv
(1 row)

loader=>

Here are the contents of the exported file:

Conclusion

After the release of the Vertica Library for AWS, integrating Vertica with AWS S3 became simple and pleasant. New Vertica loaders that read from S3 will be much simpler to build, with significantly less complexity. If you already have a Vertica loader that reads from S3 and are not using this new feature, you should revisit its architecture. I am sure you will be able to simplify your current environment significantly.

3 thoughts on “Loading data from AWS S3 bucket to Vertica made easy”

  1. I am trying to load a Vertica table with S3 files that are gzipped. Is that an option in the S3 COPY command for Vertica?

    1. The COPY command itself has an option to load GZIPed files, but that option is only available when the files are located on a node and you specify the path to them, so you can use the GZIP option to load files the old way. The Vertica Library for Amazon Web Services is different: it is implemented on top of the User-Defined Load (UDL) feature, and the S3 function is implemented as a User-Defined Source (UDSource), which passes its input to the default parser. I know for sure that the default parser handles CSV, but it will not handle GZIPed files correctly. Potentially you could write your own User-Defined Parser (UDParser) that can handle GZIP files and add a clause to the COPY command to make it use your parser. A sketch of the old way is shown below.
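       A minimal sketch of loading a local gzipped file the old way (the path and delimiter here are just examples):

       -- the file must already be present on a Vertica node; path and delimiter are illustrative
       COPY test_schema.test_load FROM '/home/dbadmin/sample_load.csv.gz' GZIP DELIMITER ',';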
