Expanding the power of Vertica Library for AWS with Filter Functions

After publishing the “Loading data from AWS S3 bucket to Vertica made easy” post, I received questions about using the S3 function to load data in other formats. The most frequently asked question was about loading gzipped files from an S3 bucket.

The Vertica COPY command expects uncompressed CSV input by default. The GZIP option is available when the source data files are located on a node, but not when using the S3 function, because that function is based on the User-Defined Load (UDL) feature.

Recently I came across the UDx-Examples repo on GitHub, a public repo published by the Vertica team that contains User-Defined Extension (UDx) examples. In that repo we can find code for filter functions, which will let us expand the power of the Vertica Library for AWS.

In this tutorial we will walk through use cases where we need to:

  • load BZip- and GZip-compressed files;
  • load data in a different character set;
  • modify data while loading it.


My Environment
  • Ubuntu Server 14.04 LTS (used ubuntu-14.04.4-server-amd64.iso)
  • Vertica 8.0.0 (used vertica_8.0.0-0_amd64.deb)


Installing Filter Functions to Vertica Database

After doing some research I realized that the UDx-Examples repo contains sample code from the “HP Vertica User Defined Extensions Software Development Kit”. The irony of the situation is that this SDK is part of the standard Vertica installation, so it is already sitting on every node of every Vertica cluster. All I needed to do was start using it. 🙂

For the purposes of this tutorial, all I need is the Filter Functions portion of the examples.

root@ubuntu:~# ll /opt/vertica/sdk/examples/FilterFunctions
total 36
drwxr-xr-x  2 root root 4096 Oct 24  2016 ./
drwxr-xr-x 15 root root 4096 May 29 16:20 ../
-rw-r--r--  1 root root 3386 Aug 19  2016 BZip.cpp
-rw-r--r--  1 root root 4583 Aug 19  2016 GZip.cpp
-rw-r--r--  1 root root 5728 Aug 19  2016 Iconverter.cpp
-rw-r--r--  1 root root 4361 Aug 19  2016 SearchAndReplaceFilter.cpp

I will compile the C++ portion of the examples and ignore errors from the Java portion.

We will start by installing the required dependencies:

root@ubuntu:~# apt-get install g++ make zlib1g-dev libbz2-dev libcurl4-openssl-dev libboost-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
g++ is already the newest version.
libboost-dev is already the newest version.
libbz2-dev is already the newest version.
make is already the newest version.
zlib1g-dev is already the newest version.
libcurl4-openssl-dev is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 118 not upgraded.

Note: When functions are installed, the libraries are copied to every node of the cluster, and SQL statements that reference those functions execute on every node. Make sure all dependencies are installed on every node of your cluster. I strongly recommend keeping all nodes identical to each other, including installed OS packages.

Next we will compile the examples:

root@ubuntu:/opt/vertica/sdk/examples# make
test -d /opt/vertica/sdk/examples/build || mkdir -p /opt/vertica/sdk/examples/build
touch /opt/vertica/sdk/examples/build/.exists
...
g++  -I /opt/vertica/sdk/include -I HelperLibraries -g -Wall -Wno-unused-value -shared -fPIC  -o /opt/vertica/sdk/examples/build/IconverterLib.so FilterFunctions/Iconverter.cpp /opt/vertica/sdk/include/Vertica.cpp
g++ -I /opt/vertica/sdk/include -I HelperLibraries -g -Wall -Wno-unused-value -shared -fPIC -I /usr/include -o /opt/vertica/sdk/examples/build/GZipLib.so FilterFunctions/GZip.cpp /opt/vertica/sdk/include/Vertica.cpp -lz
g++ -I /opt/vertica/sdk/include -I HelperLibraries -g -Wall -Wno-unused-value -shared -fPIC -I /usr/include -o /opt/vertica/sdk/examples/build/BZipLib.so FilterFunctions/BZip.cpp /opt/vertica/sdk/include/Vertica.cpp -lbz2
g++  -I /opt/vertica/sdk/include -I HelperLibraries -g -Wall -Wno-unused-value -shared -fPIC  -o /opt/vertica/sdk/examples/build/SearchAndReplaceFilter.so FilterFunctions/SearchAndReplaceFilter.cpp /opt/vertica/sdk/include/Vertica.cpp
...
/../third-party/jdk/jdk1.6.0_45/bin/javac -g -cp /opt/vertica//bin/VerticaSDK.jar /opt/vertica/sdk/BuildInfo.java -d /opt/vertica/sdk/examples/build/JavaScalarLib
make: /../third-party/jdk/jdk1.6.0_45/bin/javac: Command not found
make: *** [/opt/vertica/sdk/examples/build/JavaScalarLib.jar] Error 127
root@ubuntu:/opt/vertica/sdk/examples#

My output shows success for the Filter Functions, so I am going to ignore the “javac: Command not found” error: Java is not installed on this system, and nothing from the Java portion of the examples is needed in this tutorial.

The file FilterFunctions.sql in /opt/vertica/sdk/examples contains:

  • SQL statements to load compiled libraries to Vertica database
  • SQL statements to create filter functions
  • Commands to generate sample data
  • COPY statements to use/test functions
  • SQL statements to drop all created functions and libraries.


You can run FilterFunctions.sql as-is, and if it completes without errors you know you are on the right track. However, once the script finishes, your Vertica database will not contain any filter functions: FilterFunctions.sql drops them as its final step.

root@ubuntu:/opt/vertica/sdk/examples# /opt/vertica/bin/vsql -f FilterFunctions.sql -U dbadmin
Password:
CREATE LIBRARY
CREATE LIBRARY
CREATE LIBRARY
CREATE LIBRARY
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE TABLE
… a lot of successful statements!!!
DROP TABLE
DROP LIBRARY
DROP LIBRARY
DROP LIBRARY
DROP LIBRARY
root@ubuntu:/opt/vertica/sdk/examples#

I am going to create a copy of FilterFunctions.sql that keeps the statements from Step 1 and Step 2 and removes everything in Step 3 and Step 4. This is easy to do because the file contains comments marking the beginning of each step.
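Before editing the real file, the range extraction can be sanity-checked on a toy script with the same step-comment layout (the contents below are illustrative, not the real FilterFunctions.sql). A simple sed address range prints everything from the Step 1 marker up to the Step 3 marker:

```shell
# Toy SQL file laid out like FilterFunctions.sql, one comment per step
printf '%s\n' \
  '-- Step 1: Create the library' 'CREATE LIBRARY demo_lib;' \
  '-- Step 2: Create the filter function' 'CREATE FILTER FUNCTION demo();' \
  '-- Step 3: Generate sample data' 'COPY demo_table;' \
  '-- Step 4: Cleanup' 'DROP LIBRARY demo_lib;' > demo.sql

# Keep Steps 1 and 2; the "-- Step 3:" marker line itself slips through,
# but it is only a comment, so the extracted script still runs cleanly
sed -n '/-- Step 1:/,/-- Step 3:/p' demo.sql
```

The same range logic is what the command below applies to the real file.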

root@ubuntu:/opt/vertica/sdk/examples# sed -n '/-- Step 1:/{p; :loop n; p; /-- Step 3:/q; b loop}' FilterFunctions.sql > create_FilterFunctions.sql
root@ubuntu:/opt/vertica/sdk/examples# /opt/vertica/bin/vsql -f create_FilterFunctions.sql -U dbadmin
Password:
CREATE LIBRARY
CREATE LIBRARY
CREATE LIBRARY
CREATE LIBRARY
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
CREATE FILTER FUNCTION
root@ubuntu:/opt/vertica/sdk/examples# /opt/vertica/bin/vsql -U dbadmin
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
 
Type:  \h or \? for help with vsql commands
       \g or terminate with semicolon to execute query
       \q to quit
 
dbadmin-> \df *Zip
                         List of functions
 procedure_name | procedure_return_type | procedure_argument_types
----------------+-----------------------+--------------------------
 BZip           |                       |
 GZip           |                       |
(2 rows)
 
dbadmin-> \df Iconverter
                         List of functions
 procedure_name | procedure_return_type | procedure_argument_types
----------------+-----------------------+--------------------------
 Iconverter     |                       |
(1 row)
 
dbadmin-> \df SearchAndReplace
                          List of functions
  procedure_name  | procedure_return_type | procedure_argument_types
------------------+-----------------------+--------------------------
 SearchAndReplace |                       |
(1 row)
 
dbadmin->

Now we can test our setup. I will use the table and the data file from the “Loading data from AWS S3 bucket to Vertica made easy” post to make it easier to follow.
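For reference, here is how the compressed test files can be produced locally before uploading them to the bucket (the CSV contents mirror the four sample rows used below; the file names are the ones that appear in the COPY statements):

```shell
# Recreate the four-row sample file and compress it both ways
printf '%s\n' 'First record' 'SECOND RECORD' 'Third Record' 'Much more data' > sample_load.csv
gzip -kf sample_load.csv     # -> sample_load.csv.gz  (-k keeps the original)
bzip2 -kf sample_load.csv    # -> sample_load.csv.bz2

# Verify both archives before uploading to S3
gzip -t sample_load.csv.gz && bzip2 -t sample_load.csv.bz2 && echo 'archives OK'
```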

Loading GZip data from AWS S3 bucket to Vertica

Loading GZipped data from the S3 bucket without the filter function will result in corrupted data in the table.

dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv.gz');
 Rows Loaded
-------------
           3
(1 row)
 
dbadmin=> select * from test_schema.test_load;
             sample_data
-------------------------------------
/�jXsample_load.csvs�,*.Q(JM�/J�
 vu��sRA.!�E)
 A)���
      ���T��ĒD.�Sߠ7
(3 rows)

If we apply the GZip() filter function, the data loads correctly.

dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv.gz') FILTER GZip();
 Rows Loaded
-------------
           4
(1 row)
 
dbadmin=> select * from test_schema.test_load;
  sample_data
----------------
 First record
 SECOND RECORD
 Third Record
 Much more data
(4 rows)

Loading BZip data from AWS S3 bucket to Vertica

Loading BZ2 data from the S3 bucket without the filter function will result in corrupted data in the table.

dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv.bz2');
 Rows Loaded
-------------
           1
(1 row)
 
dbadmin=> select * from test_schema.test_load;
                             sample_data
---------------------------------------------------------------------
 BZh91AY&SYk~��׀@�.b� TP4�ɐJ��M4h�j=E�h'�*]�[��c3n>�QUc�&{)ʪ��w$S�	��
(1 row)

If we apply the BZip() filter function, the data loads correctly.

dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load.csv.bz2') FILTER BZip();
 Rows Loaded
-------------
           4
(1 row)
 
dbadmin=> select * from test_schema.test_load;
  sample_data
----------------
 First record
 SECOND RECORD
 Third Record
 Much more data
(4 rows)

Loading files saved in a non-standard character set from AWS S3 bucket to Vertica

Vertica expects to receive all input data in UTF-8; it stores character data in UTF-8 and outputs results in UTF-8. To illustrate the Iconverter() function, I added a Cyrillic translation next to each English string and saved the file as UTF-16. Loading this data from the S3 bucket without the filter function will result in corrupted data in the table.
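The conversion can be previewed locally with the iconv command-line tool, which performs the same kind of UTF-16-to-UTF-8 conversion the filter applies to the byte stream during COPY (the sample string and file name here are illustrative):

```shell
# Round-trip a UTF-8 string through UTF-16 to mimic the uploaded file
printf 'First record - Первая запись\n' | iconv -f UTF-8 -t UTF-16 > utf16_sample.txt

# Converting back yields readable UTF-8 again
iconv -f UTF-16 -t UTF-8 utf16_sample.txt
```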

dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load_UTF16-BE.csv');
 Rows Loaded
-------------
           4
(1 row)
 
dbadmin=> select * from test_schema.test_load;
                   sample_data
-------------------------------------------------
 ��First record - 5@20O 70?8AL
 SECOND RECORD - B>@0O 70?8AL
 Third Record - "@5BLO 70?8AL
 Much more data - I5 <=>3> 40==KE
(4 rows)

If we apply the Iconverter() filter function, the data is correct and readable.

dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load_UTF16-BE.csv') FILTER Iconverter(from_encoding='UTF-16');
 Rows Loaded
-------------
           4
(1 row)
 
dbadmin=> select * from test_schema.test_load;
            sample_data
-----------------------------------
 First record - Первая запись
 SECOND RECORD - Вторая запись
 Third Record - Третья запись
 Much more data - Еще много данных
(4 rows)

Searching and replacing data while loading from AWS S3 bucket to Vertica

In this example we will search for a specific substring and replace it with something else. First, we load the original, unchanged data set.

dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load_2_lang.csv');
 Rows Loaded
-------------
           4
(1 row)
 
dbadmin=> select * from test_schema.test_load;
            sample_data
-----------------------------------
 First record - Первая запись
 SECOND RECORD - Вторая запись
 Third Record - Третья запись
 Much more data - Еще много данных
(4 rows)

Now let’s use the SearchAndReplace() function to replace ‘rd’ with ‘222’.

dbadmin=> truncate table test_schema.test_load;
TRUNCATE TABLE
dbadmin=> COPY test_schema.test_load SOURCE S3(url='https://s3.amazonaws.com/bucket-3g7vm6kd/sample_load_2_lang.csv') FILTER SearchAndReplace(pattern='rd', replace_with='222');
 Rows Loaded
-------------
           4
(1 row)
 
dbadmin=> select * from test_schema.test_load;
            sample_data
-----------------------------------
 First reco222 - Первая запись
 SECOND RECORD - Вторая запись
 Thi222 Reco222 - Третья запись
 Much more data - Еще много данных
(4 rows)
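The replacement semantics are easy to preview outside the database with sed. Note in the table above that every occurrence of the pattern is replaced (Third Record became Thi222 Reco222) and that the match is case-sensitive (SECOND RECORD is untouched). A local equivalent:

```shell
# Case-sensitive, replace-all behaviour, mirroring the loaded rows above
printf '%s\n' 'First record' 'SECOND RECORD' 'Third Record' | sed 's/rd/222/g'
# First reco222
# SECOND RECORD
# Thi222 Reco222
```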

Conclusion

Filter functions expand the set of tasks we can accomplish with the help of the Vertica Library for AWS. The example functions the Vertica team ships in the SDK fill in missing functionality and serve as a very good starting point, encouraging Vertica customers to go further by writing filter functions for their own use cases.
