Teradata Columnar
Teradata 14.0 introduces Teradata Columnar – a new option to organize the data of a user-defined table or join index on disk.
Teradata Columnar offers the ability to partition a table or join index by column. It introduces column-storage as an alternative choice to row-storage for a column partition and autocompression. Column partitioning can be used alone in a single-level partitioning definition or with row partitioning in a multilevel partitioning definition.
Teradata Columnar is a new paradigm for partitioning, storing data, and compression that changes the cost-benefit tradeoffs of the available physical database design choices and their combinations. Teradata Columnar provides a benefit to the user by reducing I/O for certain classes of queries while at the same time decreasing space usage.
A column-partitioned (CP) table or join index has several key characteristics:
- It does not have a primary index (a future blog entry will discuss why).
- Each of its column partitions can be composed of a single column or multiple columns.
- Each column partition usually contains multiple physical rows. Physical rows are the disk-based structures that the Teradata file system uses to store data in order based on the rowid associated with each physical row. The first part of a rowid indicates the partition number so that physical rows are ordered by partition number first and within partition by the rest of the rowid.
- A new physical row format COLUMN may be utilized for a column partition; such a physical row is called a container. This is used to implement column-storage, row header compression, and autocompression for a column partition. This provides a compact way to store a series of column partition values.
- Alternatively, a column partition may have physical rows with ROW format that are used to implement row-storage; such a physical row is called a subrow. Each column partition value is in its own physical row. Usually a subrow is wide (multicolumn, large character strings, etc.) where the row header overhead for each column partition value is insignificant and having each column partition value in its own physical row provides more direct access to the value.
- A CP table is just another type of table that can be accessed by a query. A single query can access multiple kinds of tables.
Learn more about Teradata Columnar by listening to the presentation from the 2011 Teradata PARTNERS User Group Conference
Rows vs. Columns? Why not Both?
http://developer.teradata.com/database/training/rows-versus-columns-why-not-both
Also, refer to the Orange Book: Teradata Columnar
PPI stands for partitioned primary index which means the table has a primary index and the rows are partitioned on the AMPs (and within a partition, the rows are ordered by a hash of the primary index columns).
A CP table is not a PPI table since a CP table doesn't have a primary index. But a CP table can have RANGE_N and CASE_N row partitioning (the kind of partitioning of rows that is used in PPI) but since there is no primary index, the rows within a row partition are not ordered by a hash of some columns of each row -- they are just in insert order.
A CP table could have a join index on it where the join index does have primary index (but not column partitioning).
A PI or PPI table could have a join index on it where the join index has column partitioning (but not a primary index) plus optionally 1 or more levels of row partitioning.
A NoPI table can't have row partitioning unless it also has column partitioning.
For example:
CREATE TABLE SALES (
TxnNo INTEGER,
TxnDate DATE,
ItemNo INTEGER,
Quantity INTEGER )
PARTITION BY COLUMN,
UNIQUE INDEX (TxnNo);
This creates a column-partitioned (CP) table that partitions the data of the table vertically. Each column is in its own column partition that is stored using column-storage with row header compression and autocompression. All the data for TxnNo comes first, followed by the data for TxnDate, followed by the data for ItemNo, and then the data for Quantity. Note that a primary index is not specified so this is NoPI table. Moreover, a primary index must not be specified if the table is column partitioned.
The following adds a level of row partitioning (so the table has multilevel partitioning). All the data for TxnNo for the first day comes first, followed by the next day of data for TxnNo, etc. then all the data for TxnDate for the first day, the second day, etc, ending with the last day of data for Quantity.
CREATE TABLE SALES (
TxnNo INTEGER,
TxnDate DATE,
ItemNo INTEGER,
Quantity INTEGER )
PARTITION BY (
COLUMN,
RANGE_N(TxnDate BETWEEN
DATE '2011-01-01' AND DATE '2011-12-31' EACH INTERVAL '1' DAY) ),
UNIQUE INDEX (TxnNo);
P - Paul's
A - architected
R - release 5
T - Teradata
I - innovation
T - to reduce
I
O
N - now
Paul, a very nice article and a great presentation at Partners on columnar. We hope to get TD 14 in the near future to be able to use it.
Hi Paul,
In regards to the columnar feature, I would like to understand clearly about the limits of "Container & Subrow". As far as I know, as the container/subrow is similar to the physical row, it is limited by the maximum physical row size of 65KB. But I saw the below sentence in one of your articles: "Note that the table row limit of 65KB would be exceeded before the subrow limit could be exceeded". Can you please be eloborated about this.
Also one more question about the subrow. Subrow means it can only holds one column parition value right? OR it can hold subset of the columns for a table row? I am confused about subrow. Can you give me some more info on this please?
Thank you.
Geeta Kudumula.
A container can contain one or more column partition values. A subrow contains only one column partition value. A column partition value can consist of one or more column values. Containers and subrows are both physical rows for the file system (one or more physical rows occur in a data block). A "row" of a table (consisting of values for all the columns of the table) is limited to 65KB. For a column-partitiond (CP) table, a table row is broken up into one more column partition values (one for each column partition). For a subrow to be bigger than 65KB, the table row that it is part of would have to be bigger than 65KB but that would exceed the limit for a table row.
A subrow is the same as a regular row (in regard to physcial rows in which they are stored) except that subrow contains values for a subset of one or more columns of the table while a regular row has values for all the columns of the table.
CREATE TABLE t1 (COLULMN(a1 INT, a2 INT), ROW(a3 int, a4 CHAR(60000)) PARTITION BY COLUMN;
This has 2 column partitions: one with a1 and a2 that is stored in containers and another with a3 and a4 that is stored in subrows.
Insert t1 (1, 2, 3, 'big string');
The values 1 and 2 are appended into a container for the first column partition as one column partition value (other column partition values can be in the container -- another container is started when it gets full) while the value 3 and 'big string' are appended as a subrow of the second partition (only one column partition value can go in this subrow -- the next column partition value for this column partition would go into a separate subrow).
Paul Sinclair
Thanks Paul, Great clarification.
I am reiterating the things about what i understood on CONTAINER & SUBROW with below example. Please correct me if i apprehend anything wrong. I am using the same table that you have defined in your explanation with 4 rows inserted
***Table Data***
a1 a2 a3 a4
1 2 3 big string
4 5 6 big string2
7 8 9 big string3
So the data would be arranged into the container and subrows as shown below right?
CONTAINER#1: 1,2,4,5,7,8,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_....up to 65KB
CONTAINER#2: _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_....up to 65KB
SUBROW#1: 3, big string
SUBROW#2: 6, big string
SUBROW#3: 9, big string
SUBROW#4: _, _
I am thinking I got a clear picture if I am correct on above example.
But i still didnt understand about the 65K limit of the subrow. I am thinking a CONTAINER cannot exceed 65K limit. When it gets full (upto 65K), one more container will be started.
On subrow you said ""For a subrow to be bigger than 65KB, the table row that it is part of would have to be bigger than 65KB but that would exceed the limit for a table row""". That means to exceed the 65K limit of subrow, first off all the table row limit of 65KB should be exceeded?
***Table Data***
a1 a2 a3 a4
1 2 3 big string <-- up to 65KB
4 5 6 big string2 <-- up to 65KB
7 8 9 big string3 <-- up to 65KB
So the data would be arranged into the container and subrows as shown below right?
CONTAINER#1: 1,2,4,5,7,8,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_, _,_,_....up to 65KB
CONTAINER#2: _,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_, _,_,_....up to 65KB
SUBROW#1: 3, big string <-- subrow can't be bigger than 65KB; if a1, a2, a3, a4 can't be bigger than 65KB then
this will be less than 65KB since it only has values for a3 and a4.
SUBROW#2: 6, big string
SUBROW#3: 9, big string
SUBROW#4: _, _
That means to exceed the 65K limit of subrow, first off the table row limit of 65KB should be exceeded? Yes.
Paul Sinclair
Perfect, thank you.
Hi Paul,
Can you help me with my question regarding RowID of the CP table (NoPI table).
I know the CP table RowID is the combination of "Partition#, HashBucket# & Uniquenessvalue ". The first row going into the NoPI table starts off with a uniqueness of 1. The uniqueness value increases as more rows are inserted into the table. So that means
the row partition number will be unique for each CP, hash bucket will be same, but the uniqueness value will be differed (i mean increamented by 1) for each CP value.
Now my question is how the below statement will be true?
The rowids(combination of "Partition#, HashBucket# & Uniquenessvalue ") for the different column partition values of a logical row are the same except for the column partition number which is unique to each column partition.
I am assuming for the different CP values of a row, the row id should be different, isn't it?
Hi Paul,
Great article. I have a question with regards to Autocompression, I'm just wondering why Autocompression is most effective for single-column partitions with COLUMN format and less so for multicolumn partitions with COLUMN format? Is this only because of significant CPU increase or there are any other reasons?
Thanks in advance.
For each of the column partition values of a row, the rowid for each of those column partition values is different but the only portion of the rowid that is different is the column partition number (they have the same row partition numbers, hash bucket, and uniqueness value). The column partition values are stored in the file system in rowid order (that is, in order by the internal partition number, and then for the same internal partition number, in order by the hash bucket value, and within the same internal partition number and hash bucket pair by the uniquness value).
By the name, one might think that a uniqueness value is unique for a table but it is really only unique within an internal partition number/hash bucket pair (or for a PI table, internal partition number/hash value pair). It is the combination of internal partition/hash bucket/uniqueness (that is, a rowid) that is unique for the column partition values for a table.
Note that the uniqueness value starts at one and is incremented by one for each internal partition / hash bucket pair for which a column partition value is inserted. If the table is not also row partitioned, this is the same as starting at one and incrementing by one for each row inserted. But with row partitioning, it starts at one and incremented by one for each row inserted into a combined row partition. So the uniqueness value is not unique across all rows of the table -- it is only unique within an internal partition number/hash bucket pair.
For more information on rowids for a NoPI table and a CP table, see the Orange Book on Columnar.
Paul Sinclair
Autocompression is most effective for single-column partitions with COLUMN format and less so for multicolumn partitions with COLUMN format because of the compression techniques used. Effective here means effective in providing a reduction in disk space usage.
For run length compression and value-list compression, compression is better if column partition values are very nonunique. Adding more column values into a column partition value tends to make the column partition values more unique and makes the compression less effective.
The trimming compression techniques only trim the column partition value so adding more column values into a column partition value usually means only the last column value in the column partition value is being trimmed. If the column values were in separate column partitions, trimming might be applicable to each of the column values.
Similiar issues occur for the other techniques if there are multiple columns in a column partition.
Note that with ROW format there are currently no techniques being used for autocompression. So to a have chance that autocompression is effective, currently the column partition must have COLUMN format.
Autocompression and other forms of compression have some CPU overhead so a consideration can be whether the compression ratio achieved, reduction in disk space usage, and the reduced I/O due to compression is worth the extra CPU overhead to compress and uncompress the data. On an I/O bound system or disk space limited system, compression may be worth it but if CPU bound, it may not.
Paul Sinclair
Will there be an in-database SQL or Java utility that will allow pivoting a table or subset of a table from row to column in V14?
The simple answer to my knowledge is no. A user-defined table function could possibly be used to do this. Also, this might be done with some join logic though that might be complex and not perform well. If readers have other suggestions, please let us know.
Aside: Note that an insert does take a row and properly insert the data into a column-partitioned table but this isn't changing a row to be column.
Paul Sinclair
Hi Paul, Thanks for your great articel and explonation. I have one Question here. I am going to partition the table based on the date columns. And we will have daily feed, and it is one of the long running project. So while creating the table, if i give '01-01-2012 To 2299-12-31' Interval by '1' month, and till date i have the data so the data will go to that particular partition. My Question is, will teradata create the partition for all the ranges that specified that is 2299-12-31 in the initial stage or it will create the partition once the data comes in the feed. If all the partition is created on the very initial stage, then will create any performance issues. Could you please clarify me.
Thanks!
A table is just a series of physical rows, each with a rowid, where the physical rows are kept in in rowid order. The first portion of the rowid is the partition number (or combined partition number for multilevel partitioning) so rows with the same partition number would be grouped together (with no gaps between if there are empty partitions between them). If there are no rows for a particular partition, it does not take any space (and just isn't there). A partition is not "created" until a physical row is inserted into it (there is really no actual creation process for a partition other than to insert a row for it). For the CREATE TABLE statement, you are just defining the partitioning expression which is later to used to compute the partition number (and part of determining the rowid) for a row when it is inserted. So there are not any performance issues related to "creating" partitions. Similarily, if all the rows for partition are deleted, there is no special action -- you just delete the rows for it and the partition just no longer has any rows.
When creating a table, a table header row is written to each AMP and the data dictionary is updated. But no other space is actually used until rows start being inserted.
This is unlike some other database systems where a partition has to have a data space allocated for it prior to insertion of rows into the partition.
Paul Sinclair
Hi Paul
This is a good article about the TD Columnar.
When I refer through, there are some areas in which needed more clarity.
As I understand, though it is Column Partition, the data is stored as Row only. But it contains the Row or Column values in the Subrow or Containers.
Can you please explain how the the Containers and Subrows are distributed in the System?
is it distributed among AMPs equally?
Will this cause any Hot AMP Conditions?
Is the CP Tables are good candidates for Huge Volume of Data?
Thanks,
Mani
Hi Paul,
I have a small clarification..
consider a column partitioned table with three column partitions C1,C2 and C3.
The table has say a billion rows.
Since it is big table it will have multiple datablocks.
so will each datablock contains only only one kind of Container C1,C2 or C3.
or is it mandatory that a datablock should have combination of all Container C1,C2 and C3.
If the Former is true then a logical row in the table will split across different datablocks right?
is that allowed? or else should a logical row must exist within a single datablock?
Mani, I am working on a response to a your comments (delayed because I just came back from a sailing around some exotic islands in French Polynesia).
In the meantime, here is a quick response to Cvinodh's comment:
A partition can end in the middle of data block and then another partition can start in the data block. The physical rows are ordered by their internal partition number in the Teradata file system -- a data block is just one or more physical rows (and a data block is the unit of what can be written or read at a time). For a large table, there will be, on an AMP, data blocks for containers of only C1, then possibly a data block with some containers for C1 and then some containers for C2, then data blocks for only containers of C2, then possibly a data block with some containers for C2 and then some containers of C3, and finally data blocks for only containers of C3. That is, partition boundaries are not necessarily on data block boundaries but all the C1 containers comes first then the containers for C2 and then the containers for C3. So usually a data block only has containers for one column partition but, at the boundaries, it will change from one partition to the next partition. So for your question, the "former" is true except for this minor caveat and the "latter" is false.
All the column values of a logical row do not usually exist in same data block -- the column values of a logical row are split across column partitions. For a large table, the column partitions would be large and span many data blocks. It would be only in very small table that the column values of a logical row could exist in the same data block (that is, when all the column partitions fit in the same data block). However, do note that all the column values for a logical row would be on the same AMP.
Note that this is not a "hybrid" columnar approach where the columns are only partitioned within a "data block" or "compression unit" and all the column values of a logical row can be found within that data block or compression unit. Teradata supports a fully columnar approach where each column is separate from the others so to read the data for a particular column only entails reading the data blocks for that column (where possibly the first and the last data block read may contain some values for a preceding column or a following column, respectively). Teradata also supports multicolumn partitions.
For more information, refer to the Orange Book: Teradata Columnar and the Orange Book: Increased Partition Limit and other Partitioning enhancements.
Paul Sinclair
Thanks for the reply Paul..
but one final question..
if the logical row is now distributed across different datablocks or cylinders in a column partitioned NOPI table... Then conventional Master Index and Cylinder Index (with starting RowID and highest Row hash value) entries are not applicable for these tables right? so these tables dont have any of these structures?
Mani, below I try to answer your questions:
1. As I understand, though it is Column Partition, the data is stored as Row only.
>> All data in Teradata is stored in what I refer to as physical rows by the Teradata file system. A physical row does not always correspond to what is considered to be a row in a table (or to what I refer to as a table row or logical row). The data in a physical row can be a table header, a portion of table header, a regular row (which does correspond to a table row), an index row, container (a series of column partition values; that is, a column partition with COLUMN format), a subrow (one column partition value; that is, a column partition with ROW format), and other miscellaneous data structures. A physical row always has a row header which includes the length of the physical row and a rowid. The Teradata file system keeps physical rows in order by rowid (of which the internal partition number is the first component and therefore, phsyical rows with the same internal partition number are grouped together). Information in the tableid and in the table header is used to determine what kind of data is in a physical row.
2. But it contains the Row or Column values in the Subrow or Containers.
>> If a column partition has ROW format, the physical rows for that column partition will have subrow format (that is, a one column partition value). If a column partition has COLUMN format, the physical rows for that column partition will have container format (that is, one or more column partition values – usually a lot of column partition values). For a single-column partition, a column partition value is one value of the column. For a multicolumn partition, a column partition value is the concatenation of the values of those columns from the same table row.
3. Can you please explain how the Containers and Subrows are distributed in the System?
>> Since a CP table is a NoPI table, single-row and array inserts are randomly distributed to the AMPs. On the AMP, the table row is then split into column partition values for each of the column partitions and, if the column partition has COLUMN format, a column partition value is appended to last existing container if there is one and there is room, or a new container is appended with that column partition value. If the column partition has ROW format, the column partition value for the column partition is appended to the column partition as a subrow. The random distribution should usually provide a fairly even distribution of the rows to the AMPs.
>> For an INSERT-SELECT, rows from the source are not distributed by default but rather are locally copied on each AMP from the source (the result of the SELECT) to the CP table and then each table row is split up as described above. If the source is distributed fairly equally, the data for the CP table should be fairly evenly distributed among the AMPs; but if not, the CP table would have the same skew as the source. However, there is an option to force the INSERT-SELECT to randomly distribute the rows from the source before inserting them into a NoPI or CP table and an option to specify columns on which to hash distribute rows – these options (using the HASH BY clause) should be used in the case the source is or might be skewed to make sure table rows are distributed equally among the AMPs.
4. Is it distributed among AMPs equally?
>> This depends on how the table rows are inserted into the CP table as described above. Also, a copy/restore to a different configuration or a reconfig can cause a NoPI or CP table to become skewed.
5. Will this cause any Hot AMP Conditions?
>> This can cause hot AMP conditions for SELECTs from the table if an INSERT-SELECT is used to load the table from a skewed source without using the HASH BY clause. However, as mentioned above the HASH BY option should be used when the source is skewed in order to avoid an unequal distribution of the rows among the AMPs.
6. Is a CP Table a good candidate for Huge Volume of Data?
>> Having a large volume of data is an indication that CP should be considered but there are also a number of other things to consider in regard to whether a table is actually a good candidate to be CP. In particular, the kinds of queries that will occur is an important factor to consider. There is also a cost to loading data into a CP table (that is, to partition the rows of the table into the column partitions) which needs to be considered. There is also usually a trade-off in that I/O may decrease but CPU usage may increase. For other considerations and more information about Teradata Columnar, see the Orange Book: Teradata Columnar.
Paul Sinclair
Cvinodh asked:
If the logical row is now distributed across different datablocks or cylinders in a column partitioned NOPI table... Then conventional Master Index and Cylinder Index (with starting RowID and highest Row hash value) entries are not applicable for these tables right? so these tables dont have any of these structures?
>> Actually, they do have these structures. These structures are used by the Teradata file system to maintain physical rows in rowid order and to find physical rows based on a rowid (for instance, rowids coming from an index or rowid spool). These structures have the starting rowid of a cylinder or data block and the internal partition number (this is zero for a nonpartitioned table) and the first 32 bits of the rest of the highest rowid for the cylinder or data block. In a primary index table, these 32 bits would correspond to a row hash; however, for a NoPI table (and a CP table is a NoPI table), part of these 32 bits are a hash bucket value chosen from the NoPI hash map and part of them are the high order bits of the uniqueness (a NoPI table has more uniqueness bits in the rowid than a PI table). The Teradata file system really doesn't care what these bits represent -- it just orders the physical rows by the rowid it is given for each of the physical rows. For more information about the rowid structure for a NoPI and a CP table, see the Orange Book: No Primary Index (NoPI) Table User's Guide and the Orange Book: Teradata Columnar.
Paul Sinclair
Hi Paul,
One question :
Why the columnar is not allowed on PI tables? what is the reason/logic behind this?
Why the columnar is not allowed on PI tables? What is the reason/logic behind this?
There is a discussion about this in the blog entry http://developer.teradata.com/blog/paulsinclair/2011/10/why-no-primary-index-nopi-for-a-column-partitioned-table. There are definitely valid use cases. This is on the roadmap for implementation in a future release.
Paul Sinclair