Optimizing Storage Footprint in HP Vertica with Database Designer

One of Vertica’s unique features is encoding. Encoding is an attribute, which you can specify in the column definition in your CREATE PROJECTION statement. This attribute is optional. If you omit it Vertica will use default encoding with the name AUTO. AUTO encoding is made to be universal and as any universal thing it is not the best answer in every possible situation. Besides AUTO Vertica supports a lot of other encodings tuned to be the best encoding for very specific datasets. A detailed description of supported encodings can be found in the Vertica documentation.

If you skim through the detailed description of supported encodings you will see that it is not a trivial task to find the best suitable encoding for a dataset in your column. Luckily for us, Vertica provides us with Database Designer. Database Designer is a tool which is included in the Vertica software package. It will help you build the most optimal schema for your database. One of the things Database Designer does for you is it makes a recommendation on the best encoding for your data for each individual column.

Let me take you through the steps which will show how to choose the best encoding for your data.

Load Data for the First Time

For this post I am going to use a test dataset with about 140 million records in it. Below you can find the contents of one sample record out of the dataset.

CLIENT_ID	79381183
PRODUCT_CD	Auto\|Item:1515
MANUFACTURER_ID	Brand:Z5687550
ORDER_DATE	3/17/03 12:40
QUANTITY	1
ORDER_TOTAL	222
SHIPPED_DATE	12/1/14 0:00
SOURCESYSTEM_CD	NorthEast

I am going to create a new table to load this data. At this stage I am not specifying any encodings, constraints or sort orders. All I want is to load the data into Vertica and let the Database Designer analyze it and give me encoding recommendations. Therefore, I am not going to create projections. I will let Vertica auto create a superprojection for me.

Here is the definition of table structure, which I am going to use.

CREATE TABLE ORDER_FACTS ( 
CLIENT_ID            INT             NOT NULL,              
PRODUCT_CD           VARCHAR(50)     NOT NULL,
MANUFACTURER_ID      VARCHAR(50)     NOT NULL,
ORDER_DATE           TIMESTAMP       NOT NULL,
QUANTITY             INT default (1) NOT NULL,
ORDER_TOTAL          DECIMAL(18,5)       NULL,
SHIPPED_DATE         TIMESTAMP           NULL,
SOURCESYSTEM_CD      VARCHAR(50)         NULL);

Notice that this table has a nice mix of different data types.

Now we need to load the data. I have my test dataset in one big CSV file. First, I will copy the CSV file to one of the nodes in my cluster and will make sure that the dbadmin user has read permissions to directory and CSV file. Loading it into Vertica is as simple as running one COPY command.

COPY ORDER_FACTS FROM '/tmp/encoding_testing_data.csv' ON v_testdb_node0001
EXCEPTIONS '/tmp/encoding_testing_data_bad.log' ON v_testdb_node0001
REJECTED DATA '/tmp/encoding_testing_data_bad.csv' ON v_testdb_node0001
DELIMITER AS ','
NULL AS '' 
REJECTMAX 10000
DIRECT
SKIP 1;

Run Database Designer

To run the Database Designer you need to start Vertica admintools.

ssh to one of the nodes of your Vertica cluster
run sudo su – dbadmin
Run admintools

You will see Administration Tools Main Menu

On the “Main Menu”: Navigate to “Configuration Menu”.

Next, hit “Run Database Designer” on the Configuration Menu.

Select the database you loaded your data into, provide the password for the dbadmin user and enter the directory for the Database Designer output (this can be any directory dbadmin can write to, for example, I use /tmp in a lot of cases).

On “Database Designer” screen: Provide a name for the design we are working on (this name will be the prefix of all output files).

Next, on the “Design Type” screen, select “Comprehensive” design type and choose which schema your table resides in.

On “Design Options” screen:

[ ] If you already know what SQL queries you will be running you can paste them into a text file and provide it to the Database Designer. I am not going to do it in this post so I will uncheck “Optimize with queries” option.

[ * ] The Database Designer needs statistics updated to provide good advice on encoding so I will make sure that “Update statistics” is checked.

[ ] I do not want the Database Designer to make modifications to schema at this point so I will make sure that “Deploy design” is unchecked.

At this point we have provided all the required info. Now the Database Designer will confirm if we want to “Proceed” and after giving us a series of progress messages it will return us to admintools.

We can exit admintools at this point.

Analyze the Recommendations

Database Designer will generate a series of files as a result of its work. Filenames will start with the name of design we provided above. In my case, the files are located in /tmp and the filenames start with “encode”.

encode_design.sql contains a clean schema design if you need to create all objects from scratch. Below is a fragment of CREATE PROJECTION statement with encoding types that Database Designer thinks will be the most optimal.

CREATE PROJECTION ORDER_FACTS_DBD_4_rep_encode 
(
CLIENT_ID ENCODING DELTARANGE_COMP,
PRODUCT_CD ENCODING AUTO,
MANUFACTURER_ID ENCODING AUTO,
ORDER_DATE ENCODING GCDDELTA,
QUANTITY ENCODING RLE,
ORDER_TOTAL ENCODING RLE,
SHIPPED_DATE ENCODING RLE,
SOURCESYSTEM_CD ENCODING RLE
)
AS
SELECT CLIENT_ID,
...

Note that Database Designer kept AUTO encoding on 2 columns and came up with different encodings for all the remaining columns.

Implement Recommendations and Compare Storage Footprint

To verify the results I will run the CREATE PROJECTION ORDER_FACTS_DBD_4_rep_encode statement manually and will run “select refresh(‘ORDER_FACTS’);” to populate new projection with data.

Here is the information on the projection storage footprint for ORDER_FACTS table.

Projection ORDER_FACTS_super is the one with all columns with AUTO encoding and consumes 1.9 Gb of space. Projection ORDER_FACTS_DBD_4_rep_encode is the one with encoding recommended by Database Designer, and it consumes 1.1 Gb of space.

By doing this little exercise we found a way to save 40% of hard drive space. Obviously your mileage will vary depending on what kind of data you are storing, but the potential space savings make this worthwhile to experiment with.

Load Data for the First Time

Run Database Designer

Analyze the Recommendations

Implement Recommendations and Compare Storage Footprint

Leave a Reply Cancel reply