Exporting Vast Amounts of Data Using Parallel Export for HPE Vertica

Sometimes you find yourself needing to export data out of Vertica. That is not a problem for small data sets, but it can be challenging when you have millions of records.

If you try to use vsql for this purpose, the export goes through a single node. Every node in the Vertica cluster sends its portion of the data over the network to the node you are running vsql from, and that node accepts, formats, and writes the data to its attached disks. The majority of the cluster's resources sit idle while the network is heavily utilized. Such an export skews resource utilization across the cluster and can have a negative impact on overall cluster performance.
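For contrast, a single-node export through vsql typically looks something like this (a sketch; the table and output file names are made up):

vsql -U dbadmin -At -F ',' \
     -c "select * from big_table" \
     -o /tmp/big_table.csv

All of the result data funnels through the one machine this command runs on.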

The export is much more efficient if every node in the cluster exports its own portion of the data. It is also much faster, since the nodes do all of the work in parallel.

Fortunately for us, HPE Vertica has the right tool for the job, and it is freely downloadable from the HPE Marketplace. The tool is called “Parallel Export”.

Here is the link to HPE Marketplace: https://saas.hpe.com/marketplace/big-data/parallel-export

“Parallel Export” is a user-defined function (UDF) that can export data concurrently on all nodes of the cluster. It can export data to files or to external commands.

Downloading “Parallel Export”

To download “Parallel Export” you will need an HPE Marketplace account. If you do not have one, you can sign up; the sign-up process is short and easy.

At the time of writing, Parallel Export ver. 1.3.10 is the latest available version, and I will use it throughout this post. I will install it on a cluster running Vertica ver. 7.1.2-10.

Once you get through all of the download screens you will have a ParallelExport-1.3.10.zip file.

What is included in the downloaded package?

The downloaded ZIP contains the following:

  • makefile – instructions for the make utility
  • README.txt – documentation provided with the UDF
  • ddl folder – DDL scripts to install/uninstall the UDF
  • src folder – source code for Parallel Export
  • test folder – DDL and a data file used by the make utility for tests

Installing “Parallel Export”

As the first step, upload the ParallelExport-1.3.10.zip file to one of the nodes of the Vertica cluster and unzip it.

Now navigate to the “ParallelExport-1.3.10” folder and execute “make install”. You will be asked for a password several times; in my case it was the password for the dbadmin user. Once the install completes, the UDF functions are created.
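Put together, the shell steps look like this (assuming the ZIP was uploaded to the dbadmin home directory):

unzip ParallelExport-1.3.10.zip
cd ParallelExport-1.3.10
make install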

You can verify that these functions exist with the following SQL:

select * from user_functions
where function_name in ('ParallelExport','exportdata');

You can see that the installation created two functions. They are identical and backed by the same library, ParallelExport.so. The function “exportdata” is the old name of “ParallelExport” and exists for backward compatibility. You can use either one.

ParallelExport and its parameters

To export data you execute a SQL query that calls the ParallelExport function. The query can look like this:

select ParallelExport(column_1, column_2, column_3
       using parameters path='/tmp/ParEx_test.csv.${nodeName}')
over (partition auto) from ParEx_test_table;

Inside the function parentheses we need to provide the following:

  • a comma-separated list of columns we want to export
  • the keywords “using parameters”
  • the parameters and their corresponding values

When listing columns you can use expressions and other functions; the examples below show how.

When exporting data you have two options. The first is to export the data to a plain file; in this case you specify the path: parameter and omit cmd:. The second is to do something with the output (such as compress it); in this case you omit the path: parameter and specify cmd:.

In my experience, you have to specify either path: or cmd:; the remaining parameters are optional and can be omitted.

Additional available parameters are buffersize:, separator:, fromcharset: and tocharset:. They have default values that work in most cases and are explained in detail in the README.txt provided with the install package.
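For illustration, here is a hypothetical invocation that overrides two of these parameters (the values are made up; consult README.txt for the accepted values and defaults):

select ParallelExport(column_1, column_2
       using parameters
       path='/tmp/example.csv.${nodeName}',
       separator='|',        -- use a pipe instead of the default separator
       buffersize=1048576)   -- hypothetical 1 MB buffer
over (partition auto) from ParEx_test_table;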

Sample table and dataset

For the examples below I will use the following table:

create table public.ParEx_test_table
(column_int         int,
 column_datetime    timestamp,
 column_varchar     varchar(64));

I will have six records in this table.
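Here is a hypothetical set of rows to make the examples concrete. The values are made up, but the third row contains a comma and the sixth an embedded double quote, which the formatting examples below rely on:

insert into public.ParEx_test_table values (1, '2016-05-01 10:00:00', 'first value');
insert into public.ParEx_test_table values (2, '2016-05-01 10:05:00', 'second value');
insert into public.ParEx_test_table values (3, '2016-05-01 10:10:00', 'value, with a comma');
insert into public.ParEx_test_table values (4, '2016-05-01 10:15:00', 'fourth value');
insert into public.ParEx_test_table values (5, '2016-05-01 10:20:00', 'fifth value');
insert into public.ParEx_test_table values (6, '2016-05-01 10:25:00', 'value with a " quote');
commit;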

Exporting data to a file

To export data from our test table I will issue the following command:

select ParallelExport(column_int, column_datetime, column_varchar
 using parameters
 path='/tmp/ParEx_test.csv.${nodeName}', separator=',')
over (partition auto) from ParEx_test_table;

In the output you can see that the data has been exported on all nodes of my 3-node cluster. The number of exported records appears in square brackets after the word “exported”.

Notice the ${nodeName} construct in the path: parameter. Vertica will replace it with the actual Vertica node name. This is useful when you later copy all files to the same folder after the export completes, because it prevents file names from conflicting with each other.

Vertica also has a ${hostName} construct, which will be replaced with the machine’s actual hostname.

Let’s examine our output. I will ssh into the vertica-testdb-03 node, since it holds the majority of the data.

You can see that the filename ends with the Vertica database node name. If I copy the files ParEx_test.csv.v_testdb_node0001, ParEx_test.csv.v_testdb_node0002 and ParEx_test.csv.v_testdb_node0003 from their respective nodes, I will have the full dataset from the exported table.
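A minimal sketch of pulling the per-node files onto one machine (assuming passwordless ssh for dbadmin, and that the other hosts follow the vertica-testdb-03 naming):

for node in vertica-testdb-01 vertica-testdb-02 vertica-testdb-03; do
    scp dbadmin@$node:/tmp/ParEx_test.csv.\* /data/export/
done

Because each file name carries its node name, nothing gets overwritten in the target folder.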

Exporting data with extra formatting

You may have noticed from the previous example that the CSV file we just exported is not properly formatted. Our data contains commas and double quotes, which will cause errors on data import. To fix the issue with commas, we can put double quotes around our data. That fixes CSV line 3, but breaks CSV line 6, which has an extra double quote in one of the columns. To deal with the unexpected double quote, we will put an escape character in front of it.

To fix these issues I will modify my column list. I will put double quotes around the data for all columns, and for the varchar column I will also add an escape character ‘\’ before double quotes. Here is how my query will look:

select ParallelExport('"'||column_int||'"',
                      '"'||column_datetime||'"',
                      '"'||REPLACE(column_varchar, '"', '\"')||'"'
 using parameters
 path='/tmp/ParEx_test_spec.csv.${nodeName}', separator=',')
over (partition auto) from ParEx_test_table;

Let’s examine our output now.

Now all of our data in the CSV file is well formatted. If on import we specify a comma separated file with all data enclosed in double quotes and ‘\’ as the escape character, we will load all data without any issues.
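For illustration, a COPY statement along these lines should load such a file back (a sketch: it assumes one of the per-node files is local to the initiator node, and relies on backslash being Vertica's default escape character):

copy public.ParEx_test_table
from '/tmp/ParEx_test_spec.csv.v_testdb_node0001'
delimiter ','
enclosed by '"'
abort on error;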

Exporting data to a command

When exporting a lot of data, you will want to get into the habit of compressing it. In this case, use the cmd: parameter instead of path:. This tells Vertica to send the exported data to a command. You can use any compression command; in the query below I will use gzip.

select ParallelExport('"'||column_int||'"',
                      '"'||column_datetime||'"',
                      '"'||REPLACE(column_varchar, '"', '\"')||'"'
 using parameters
 cmd='gzip -9 -c - > /tmp/ParEx_test_gz.csv.${nodeName}.gz', separator=',')
over (partition auto) from ParEx_test_table;

The query above exports the data and compresses it in a single step.

The cmd: parameter is not limited to compression utilities. You can get creative and send the export output to any suitable Unix command.
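For example, instead of landing compressed files on each node's local disk, you could stream each node's slice straight to another host (a sketch; archive-host and the target path are made up, and this assumes passwordless ssh from the Vertica nodes):

select ParallelExport(column_int, column_datetime, column_varchar
 using parameters
 cmd='ssh dbadmin@archive-host "gzip -c > /data/exports/ParEx_test.${nodeName}.gz"',
 separator=',')
over (partition auto) from ParEx_test_table;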

Conclusion

“Parallel Export” is a robust and flexible tool that will help whenever you need to export a significant amount of data out of Vertica. It balances the extra load on the Vertica cluster by distributing it across all cluster nodes, which minimizes the performance impact on other queries while the export is running.
