Working with IDENTITY columns in HP Vertica

HP Vertica has 2 different types of columns that are incremented automatically by the database: AUTO_INCREMENT and IDENTITY. The key difference is in the level of control the user has. An AUTO_INCREMENT column always starts with 1 and increments by 1; those values are hardcoded and cannot be changed. IDENTITY provides more control: you can specify the start value, the increment value and the cache size. In other respects both of them work in a similar fashion. In this article I am going to cover IDENTITY columns, but most of the concepts apply just as easily to AUTO_INCREMENT columns.
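
For comparison, here is what a table with an AUTO_INCREMENT column looks like. This is just a minimal sketch; the table and column names are my own examples:

CREATE TABLE auto_increment_tests
(  auto_column    AUTO_INCREMENT,
   data_column    INT);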

What is the IDENTITY column?

IDENTITY is a column on a table that the database increments automatically. All you need to do is provide a column name and specify that it is an IDENTITY column. Behind the scenes Vertica uses the INT data type for this kind of column, so you can have up to roughly 2^63 unique values. That is about 9.2 quintillion, a 19 digit number.

Here is a sample CREATE TABLE statement:

CREATE TABLE identity_tests
(  identity_column    IDENTITY,
   data_column        INT);
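
If you want to double-check how Vertica stores this column, you can query the COLUMNS system table. This is only a sketch; it should show the underlying int data type and flag the column as an identity:

select column_name, data_type, is_identity
from columns
where table_name = 'identity_tests';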

If you want to specify the IDENTITY attributes explicitly, you can do the following:

CREATE TABLE identity_tests
(  identity_column    IDENTITY(1,5,10000),
   data_column        INT);

The above statement will make the IDENTITY column start with 1, increment by 5 and cache 10000 values per session.
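
To illustrate, with the definition above the values generated within a single session would step by 5. This is only a sketch of what I would expect, not output from a real run:

insert into identity_tests (data_column) values (1);
insert into identity_tests (data_column) values (2);
insert into identity_tests (data_column) values (3);
-- identity_column gets 1, 6 and 11 for these three rows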

How does value caching work for IDENTITY columns?

Caching is a very important component in understanding how Vertica generates values for IDENTITY columns. Vertica has 2 goals when generating values: a) make sure that the generated values are unique; b) make sure that value generation does not create a performance bottleneck. To achieve those 2 goals, Vertica is ready to make a sacrifice, and in this case the sacrifice is perfectly sequential numbering. Values stored in an IDENTITY column will have gaps in them, and the CACHE value determines how big those gaps can be. Let's look at how this works and why it happens.

The default value for the cache attribute is 250K. When you open a session and insert at least one record into a table with an IDENTITY column, your session immediately takes a block of values for the IDENTITY column equal to the cache size. If the cache attribute is set to the default, your session gets a block of 250K values (from 1 to 250K). If you insert fewer than 250K records and exit your session, you will have a gap in the identity values. This gap represents the numbers that were assigned to your session but never used.

If someone else opens a second session and inserts at least one record into the table, that session will be assigned the range of values from 250001 to 500K. It does not matter whether the sessions run in parallel or sequentially: each session takes values from its own range. If the sessions run in parallel, the values inserted into the IDENTITY column will also be out of order.

Here is a snippet illustrating this:

$ vsql
Password: *****
dbadmin=> CREATE TABLE identity_tests(identity_column IDENTITY, data_column INT);
CREATE TABLE
dbadmin=> insert into identity_tests (data_column) values (1);
dbadmin=> insert into identity_tests (data_column) values (2);
dbadmin=> commit;
COMMIT
dbadmin=> \q

$ vsql
Password: *****
dbadmin=> insert into identity_tests (data_column) values (3);
dbadmin=> insert into identity_tests (data_column) values (4);
dbadmin=> insert into identity_tests (data_column) values (5);
dbadmin=> commit;
COMMIT
dbadmin=> select * from identity_tests order by identity_column;

 identity_column | data_column 
-----------------+-------------
               1 |           1
               2 |           2
          250001 |           3
          250002 |           4
          250003 |           5
(5 rows)

The above listing has 2 sessions. In the first session we created the table and inserted 2 records. In the second session we inserted 3 more records. If we analyze the table content we can see that we created a gap in the numbers.

Values generated for an IDENTITY column are not guaranteed to be contiguous or inserted in order, but they are guaranteed to be unique. These sacrifices are made intentionally for the sake of performance.

If you really want clean, sequential and contiguous numbers you can set the cache size to 1, but you will need to assess the performance impact of this change. With a cache size of 1, Vertica has to acquire an exclusive lock on the sequence for every inserted record. Vertica is a distributed system: to acquire a lock it needs to coordinate with every node in the cluster, and this coordination is generally done through the Global Catalog lock. Excessive Global Catalog locking will have a negative impact on every process that needs the same kind of lock.
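
If you decide the trade-off is acceptable, a cache of 1 can also be set right in the table definition by passing it as the third IDENTITY argument. A minimal sketch, using a hypothetical table name:

CREATE TABLE identity_no_gaps
(  identity_column    IDENTITY(1,1,1),
   data_column        INT);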

Having contiguous numbers is nice, but performance is much more valuable. One needs to find a good balance between perfectly sequenced numbers in identity columns and performance.

It may be possible to get contiguous numbers if you can control data loading. A batch of 250K records will use all of the numbers from the 250K range assigned to the session. If you batch your data loads so that each batch has the same number of records as your cache value, you will have no gaps.
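
For example, if you load with COPY in batches that match the cache size, each batch consumes exactly one block of values. This is only a sketch and the file path is made up:

COPY identity_tests (data_column)
FROM '/data/batch_with_250000_rows.csv' DELIMITER ',';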

How to modify IDENTITY properties?

Before making any changes we have to clearly understand that you cannot change values that have already been inserted into an IDENTITY column. Modifying the IDENTITY properties has no effect on data already in the table; it only affects newly inserted data. When you insert more data, Vertica will continue from the highest unassigned number and increment according to the new IDENTITY properties.

The IDENTITY column in Vertica uses a sequence behind the scenes to generate its values. To change the IDENTITY properties we can modify that underlying sequence. You can find the sequence with the following query:

select sequence_schema, sequence_name, minimum, increment_by, 
session_cache_count
from sequences where identity_table_name = 'identity_tests';
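
On my test table the output looks roughly like this (illustrative only, the exact values depend on your definition):

 sequence_schema |            sequence_name            | minimum | increment_by | session_cache_count 
-----------------+-------------------------------------+---------+--------------+---------------------
 public          | identity_tests_identity_column_seq  |       1 |            1 |              250000
(1 row)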

Modifying the sequence:

alter sequence public.identity_tests_identity_column_seq CACHE 1;

$ vsql
Password: ***** 
dbadmin=> insert into identity_tests (data_column) values (6);
dbadmin=> insert into identity_tests (data_column) values (7);
dbadmin=> commit;
COMMIT
dbadmin=> \q

$ vsql
Password: *****
dbadmin=> insert into identity_tests (data_column) values (8);
dbadmin=> insert into identity_tests (data_column) values (9);
dbadmin=> insert into identity_tests (data_column) values (10);
dbadmin=> commit;
COMMIT
dbadmin=> select * from identity_tests order by identity_column;
 identity_column | data_column 
-----------------+-------------
               1 |           1
               2 |           2
          250001 |           3
          250002 |           4
          250003 |           5
          500001 |           6
          500002 |           7
          500003 |           8
          500004 |           9
          500005 |          10
(10 rows)

After modifying the sequence we opened 2 more sessions and inserted 2 records in one and 3 records in the other. We can see that the data loaded into the table before the sequence change remains unchanged. The values from 250004 to 500000 are left unused, and the first value inserted into the IDENTITY column after the sequence modification is 500001, since all of the previous values had already been assigned to the previous session. Sessions opened after the change produce contiguous numbers.

Note that I am using a cache of 1 to illustrate how the IDENTITY column functions. A cache of 1 may not be suitable for your environment for performance reasons; refer to the discussion above to understand the implications.
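
Once you are done experimenting you can raise the cache back the same way, for example to the default of 250K:

alter sequence public.identity_tests_identity_column_seq CACHE 250000;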

Here is an example of truncating the table and restarting the sequence:

dbadmin=> truncate table identity_tests;
TRUNCATE TABLE
dbadmin=> alter sequence public.identity_tests_identity_column_seq RESTART WITH 1;
ALTER SEQUENCE
dbadmin=> insert into identity_tests (data_column) values (11);
dbadmin=> commit;
COMMIT
dbadmin=> select * from identity_tests order by identity_column;

 identity_column | data_column 
-----------------+-------------
               1 |          11
(1 row)

dbadmin=>

You can see now that the old data was truncated and newly inserted data starts with 1.

I strongly recommend truncating the underlying table if you are going to restart the sequence. Vertica does not automatically check or enforce constraints. If you restart the sequence while data is still present in the table, you will end up with duplicate values in the IDENTITY column.

dbadmin=> alter sequence public.identity_tests_identity_column_seq RESTART WITH 1;
ALTER SEQUENCE
dbadmin=> insert into identity_tests (data_column) values (12);
dbadmin=> commit;
COMMIT
dbadmin=> select * from identity_tests order by identity_column;
 identity_column | data_column 
-----------------+-------------
               1 |          11
               1 |          12
(2 rows)

Now our actions created a data consistency issue. We have a duplicate value in our identity column.
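
A simple way to spot this kind of problem is to look for identity values that appear more than once. Just a sketch:

select identity_column, count(*)
from identity_tests
group by identity_column
having count(*) > 1;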

Possible errors related to the IDENTITY column

You cannot modify the data in an IDENTITY column:

dbadmin=> update table_test set identity_column = 2;
ERROR 2444:  Cannot insert into or update IDENTITY/AUTO_INCREMENT column "identity_column"

You can insert a lot of records into a table with an IDENTITY column, but there is a limit: by default it is about 9.2 quintillion. I know that number is high, but modern computers are very fast, especially if you have an automated test INSERTING/DELETING in a loop. Think about your strategy to prevent the error below, just in case!

dbadmin=> insert into table_test (data_column) values (9999999999999999);
ERROR 4704:  Sequence exceeded max value
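
To keep an eye on how much headroom is left, you can compare the current value of the underlying sequence with its maximum. This is a sketch based on the same SEQUENCES system table query used earlier:

select sequence_name, current_value, maximum
from sequences
where identity_table_name = 'identity_tests';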

 
