Using Database Designer to Optimize Specific Class of Queries in HP Vertica

A lot of people who are new to Vertica ask if you can create an index on a table in a Vertica database. The answer they get is NO. Vertica does not have a CREATE INDEX statement and does not have the concept of an index. Let’s see if we can come up with something equivalent to an index in Vertica.

What is a relational database index?

Let’s start by looking at a table which does not have any indexes or constraints. In relational databases all of the data is stored in data blocks. RDBMS engine will take all of the columns of a record and will write them into the same data block. Generally, data blocks are big enough to fit multiple records. So when a block is full it contains multiple records from a single table. Since the table columns will have different datatypes, the block will be a mix of different columns of different data types.

If your query will need to scan data of a single column then you are in a tough situation, because the data of that column is scattered all over data blocks of the entire table. The column storage footprint could be small by itself, but the query will need to read every single block of a table and use only the relevant portions of blocks. Without indexes all RDBMS can do is a FULL TABLE SCAN, which is very costly and slow.

This is where indexes come to help. Index is an extra copy of data, which is stored separately in its own blocks. Those blocks have data for indexed column(s) only, which results in a much smaller storage footprint. When you have an index which is matching your query, RDBMS can do a FULL INDEX SCAN, which is way more efficient.

One additional feature of an index is that it will sort data before writing it to blocks and will keep it sorted when new data is inserted. When your data is sorted RDBMS can look at the query conditions and try to use a RANGE INDEX SCAN.

To summarize: Index is an extra copy of data for columns we are interested in. Index data is sorted. We create indexes to give the database a means to come up with optimal execution plans.

What is an HP Vertica projection?

Projection is an object which stores data and defines physical attributes like encoding, segmentation and sort order. There are 2 kinds of projections. One is a superprojection and the other is a query-specific projection.

Superprojection is a projection which contains all of the columns of a table and will store the table’s data. For redundancy purposes, each superprojection should have a mirrored copy called a buddy projection. You must have at least one superprojection with a buddy to be able to insert any data into the table. Since a superprojection has every column of the base table it resembles a heap table in RDBMS. Take note of the word “resemble”. I do not want you to be under the false impression that a superprojection and a heap table are virtually the same. Even if they look similar conceptually, they are still very different inside.

The other type of projection is a query-specific projection. We will examine it much more closely. A query-specific projection usually has a small subset of table columns. It is a separate structure so it will create an extra copy of the data. As any other projection, it will allow you to define its own encoding, segmentation and sort order. Generally it is built to match a specific class of queries. When built, it will allow Vertica to come up with more efficient execution plans in relation to the execution plans created based on the superprojection.

Can I create an index in Vertica?

Answer will be NO, because Vertica does not have a structure called an index. However, I think that it is very clear from the above analysis that an index in RDBMS is very similar conceptually to a query-specific projection in Vertica.

How can I design a query-specific projection?

You can use Database Designer if you want to design one. It has an option called “Incremental” on the “Design type” screen. The process is very similar to the one I described in “Using Database Designer to Build Projections with Optimal Sort Order in HP Vertica” post. I strongly recommend reading the mentioned post since it has a very detailed description.

You will need a running Vertica cluster with existing tables. The table should have some data already loaded. You will need to prepare a file with the SQL queries you want to create the query-specific projection for. When it is all ready, run Database Designer and follow the instructions from the post mentioned above. When you reach the “Design Type” screen choose “Incremental” instead of “Comprehensive”.

The screens after this will have less questions, but will be very similar. At the end you will have a set of files with additional projection(s).

For this post I used the following query as input:

select ORDER_DATE, MANUFACTURER_ID, sum(ORDER_TOTAL) as CLIENT_TOTAL_BY_YEAR
from ORDER_FACTS
group by ORDER_DATE, MANUFACTURER_ID
order by ORDER_DATE, MANUFACTURER_ID;

Database Designer created the following query-specific projection for me.

CREATE PROJECTION ORDER_FACTS_DBD_1_rep_qsp /*+createtype(D)*/
  (
     MANUFACTURER_ID ENCODING RLE,
     ORDER_DATE ENCODING RLE,
     ORDER_TOTAL ENCODING RLE
  )
AS
SELECT MANUFACTURER_ID,
     ORDER_DATE,
     ORDER_TOTAL
FROM ORDER_FACTS
ORDER BY ORDER_DATE,
     MANUFACTURER_ID,
     ORDER_TOTAL
UNSEGMENTED ALL NODES;

As you can see, this projection has only 3 columns out of the 8 columns that the table has. Those 3 columns were chosen because the query I supplied uses them. Other columns are omitted in this projection. The sort order was aligned with GROUP BY / ORDER BY clauses. Segmentation is recommended based on the amount of data in the table and the number of nodes in the cluster.

When this projection is created and refreshed it will be ready to serve the data. Vertica optimizer will start taking it into consideration and will use it for queries similar to the query we supplied during its design.

Creating query-specific projections is easy and can give a performance boost to a specific class of queries. Having extra projection(s) is not going to create an extra burden during data loads. When new data is inserted into the table, Vertica will populate all of the projections associated with it automatically.