Building Projections with Optimal Segmentation in HP Vertica

The performance of your queries in Vertica depends on the design of your projections. In the projection definitions you specify how your data will be segmented and how it will be sorted. If you make the right choices, your queries will perform very well. In this post we will look at projection segmentation.

Why good segmentation is important

The main goal of segmentation is to make sure that an equal portion of your data is placed on each node in the Vertica cluster. In Vertica, each node processes its own portion of the data, and the results of a query are returned to you only after all nodes are done.

When the nodes hold an identical amount of data, it takes each node roughly the same amount of time to process its portion. As a result, the nodes finish almost simultaneously and there is no waiting before the results are served to you.

If the nodes get an unequal amount of data, the node with the largest portion takes the most time. In this situation Vertica waits for that single node, since it needs results from all nodes to produce the query result. It is in your best interest to minimize or eliminate this wait time.

How to define segmentation in Vertica

Segmentation is defined by the SEGMENTED BY clause in the CREATE TABLE or CREATE PROJECTION statement.

Here is a sample DDL:

CREATE TABLE my_table (
    column_key                   integer NOT NULL,
    column_data_01               varchar(8),
    column_data_02               varchar(8)
)
SEGMENTED BY HASH(column_key) ALL NODES;

If segmentation is defined at the table level, it is applied to the automatically created superprojection and its buddy projections. The CREATE PROJECTION statement has similar syntax and applies only to the projection it defines:

CREATE PROJECTION my_projection (
    column_key,
    column_data_01,
    column_data_02
)
AS SELECT column_key,
          column_data_01,
          column_data_02
     FROM my_table
...
SEGMENTED BY HASH(column_key, column_data_01, column_data_02) ALL NODES;
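
If you want to confirm how an existing table or projection is actually segmented, you can dump its DDL with the EXPORT_OBJECTS meta-function. A quick sketch; passing an empty string as the first argument returns the script instead of writing it to a file, and the schema name here is an assumption:

select export_objects('', 'my_schema.my_table');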

How to choose columns for good segmentation

The best candidates for segmentation are columns with a very high number of unique values. If your table has a primary key, it is a good candidate for the segmentation expression.

In my CREATE TABLE example above I am using a single column as the parameter to the Vertica HASH function. That column is the primary key for my table, so segmentation should be good.
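
To evaluate a candidate column before committing to it, you can compare its distinct value count to the total row count. Here is a quick sketch against the sample table from above; the closer the two numbers are, the better the candidate:

select count(*)                   as total_rows,
       count(distinct column_key) as distinct_keys
from my_table;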

A common mistake with single-column segmentation is to allow NULL values in the column. Keep in mind that records with the same value in the segmentation column are placed on the same node. If your segmentation column contains a lot of NULLs, all of those records end up on the same node and create very significant skew. In this case your performance gets worse as more and more records with NULL in the segmentation column are inserted. Note that the column I am using as the single segmentation column in the CREATE TABLE sample above is defined NOT NULL. This helps prevent a “skew by NULL records” scenario.
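
If your segmentation column is nullable, a simple count shows how many rows would collapse onto a single node. A sketch reusing the sample table, where column_key is assumed nullable for the sake of illustration:

select count(*) as null_keys
from my_table
where column_key is null;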

If your table has a multi-column primary key (a composite key), you can still use those columns for segmentation, because the Vertica HASH function accepts multiple parameters. In the CREATE PROJECTION sample above I define segmentation with multiple columns listed as parameters to the HASH function.

When defining segmentation you should avoid situations where many records share the same values in the segmentation key. All of those records are placed on the same node and create data skew. For the same reason, you should not use columns with a small number of distinct values and a high number of records per value.
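
A quick way to spot such heavy-hitter values in a candidate column is to group by it and look at the largest buckets. A sketch using the sample table from above:

select column_key, count(*) as records_per_value
from my_table
group by column_key
order by records_per_value desc
limit 10;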

Here is an example of bad segmentation: you have a 24-node Vertica cluster and you decide to use a single column as the segmentation key. That column has 15 distinct values with millions of records for each value. In this scenario all of the data is placed on at most 15 nodes, so at least 9 nodes get no data at all and are not used for query processing. With this kind of segmentation you effectively lose almost half of the computing resources of your cluster.

How to verify data skew

The easy way is to query the Vertica system tables.

You can use a query similar to this one:

select node_name, projection_name, row_count
from projection_storage
where anchor_table_name = 'my_table'
  and anchor_table_schema = 'my_schema'
order by projection_name, node_name;

I inserted 3,000 records with unique values in column_key and my data distribution across the nodes was very close to equal. As millions of records are inserted, the distribution becomes practically identical in percentage terms.
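
To summarize skew in a single row per projection, you can aggregate the same system table; the gap between the minimum and maximum per-node row counts is a direct measure of the skew. A sketch:

select projection_name,
       min(row_count) as min_rows,
       max(row_count) as max_rows,
       max(row_count) - min(row_count) as row_count_gap
from projection_storage
where anchor_table_name = 'my_table'
  and anchor_table_schema = 'my_schema'
group by projection_name
order by projection_name;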

In conclusion, segmentation is one of the key decisions when you are designing your cluster. If you make the right choices about segmentation, you will distribute data evenly across the cluster nodes and get the best possible performance from them.
