Teradata 13.10 Statistics Collection Recommendations
Statistical information is vital for the optimizer when it builds query plans. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.
The following statistics collection recommendations are intended for sites that are on any of the Teradata 13.10 software release levels. Most of these recommendations apply to releases earlier than Teradata 13.10, however some may be specific to Teradata 13.10 only.
Collect Full Statistics
- Non-indexed columns used in predicates
- Single-column join constraints, if the column is not unique
- All NUSIs (but drop NUSIs that aren’t needed/used)
- USIs/UPIs if used in non-equality predicates (range constraints)
- Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
- Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
- PARTITION for all tables whether partitioned or not[1]
Can Rely on Dynamic AMP Sampling [2]
- USIs or UPIs if only used with equality predicates
- NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
- See “Other Considerations” for additional points related to dynamic AMP sampling
Option to use USING SAMPLE
- Unique index columns
- Nearly-unique [3] columns or indexes
Collect Multicolumn Statistics
- Groups of columns that often appear together with equality predicates, if the first 16 bytes of the concatenated column values are sufficiently distinct. These statistics are used for single-table estimates.
- Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more that the combination of actual values are correlated, the greater the value of collecting multicolumn statistics.
Other Considerations
- Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
- NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) benefits from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, dynamic AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT nupi-column FROM table;
- For a partitioned primary index table, it is recommended that you always collect statistics on:
- PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
- The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
- For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
- (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
- (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey-based merge join costing.
- Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). For small tables, with less than 25 rows per AMP, all-AMP sampling is done automatically. It is also the default for volatile tables and sparse join indexes. All-AMP sampling comes with these tradeoffs:
- Dynamic all-AMP sampling provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
- Statistics extrapolation for any column in a table is triggered only when the optimizer detects that the table has grown. The growth is computed by comparing the current row count with the last known row count to the optimizer. If the default single-AMP dynamic sampling estimate of the current row count is not accurate (which can happen if the primary index is skewed), it is recommended to enable all-AMP sampling or re-collect PARTITION statistics.
- Parsing times for queries may increase when all AMPs are involved, as the queries that perform dynamic AMP sampling will have slightly more work to do. Note that dynamic AMP samples will stay in the dictionary cache until the periodic cache flush, or unless they are purged from the cache for some reason. Because they can be retrieved once and re-used multiple times, it is not expected that dynamic all-AMP samping will will cause additional overhead for all query executions.
- For temporal tables, follow all collection recommendations made above. However, statistics are currently not supported on BEGIN and END period types. That capability is planned for a future release.
[1] Collecting PARTITION statistics on tables that have grown supports more accurate statistics extrapolations. Collecting on PARTITION is an extremely quick operation (as long as table is not over-partitioned).
[2] This is sometimes referred to as a random AMP sample since, in early releases of Teradata, a random AMP was picked for obtaining the sample; however, in current releases, the term dynamic AMP sample is more appropriate.
[3] Any column which is over 95% unique is considered as a nearly-unique column.
These recommendations were compiled by: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair
Do you have script for this?
This is a list of recommendations only. Sorry, but this posting does not include accompanying scripts.
In some cases, implementing these recommendations could probably be automated (collecting on PARTITION for all tables), but other recommendations will involve some judgement calls (is the table small? Are USIs used with range constraints in some queries? Are the NUPI values evenly distributed?). Decisions on whether or not to collect multicolumn statistics in particular do not lend themselves very easily to automation at this time.
There are several postings on developer exchange that do discuss techniques for automating statistics collection, but I believe that is after decisions as to which columns/indexes to collect on have already been made.
For example, check out Marcio Moura's blog (mtmoura) postings on refreshing statistics.
Thanks, -Carrie
Carrie, thanks for the explanations. Would it be possible to understand the precedence of datatypes used internaly while parsing a multicolumn stats definition longer than 16bytes.
The columns involved in the multicolumn stat will be ordered in the order they appear in the CREATE TABLE statement both when the statistics are collected and as they are represented in the histogram. This is true no matter what their data types are. All varchar columns are expanded out to their full length during the collection process. This means if a varchar appears earlier in a table definition than the other columns in a multicolumn statistic, and it is specified as varchar(16) or larger, then any subsequent columns will not be represented in the histogram values fields. They would be truncated as a result of the 16-byte size limit for the histogram values fields.
In Teradata 14.0, this 16-byte limit for histogram values fields has been lifted.
Do you know the specific lifted limitation for the histogram values fields In Teradata 14.0? The columns involved in the multicolumn stat will be ordered in the order we defined when we collect stat in Teradata14.0. Is that true?
The new default limit for the size of the histogram values fields is 25 bytes in Teradata 14.0. However, the limit can be increased by specifying a greater value on the COLLECT STATISTICS statement itself. The ordering of the multicolumn statistics can now be specified on the COLLECT STATISTICS statement as well (with the exception of statements that specify INDEX).
In additon, each participating column in a multicolumn stat will be represented by at least 25 bytes (if it needs it), and trunctation will only take place after the column's leading 25 bytes have been represented, on a column by column basis.
Prior to the 14.0, not only was the truncation at 16 bytes, but it was the first 16 bytes in the combined values strung together. If you had several different columns in the statistic, often the later ones would not be represented at all. That will no longer be the case.
Thanks, -Carrie
PLEASE NOTE:
A small change was made to the above text on April 13, to add this new bullet under the "Collect full statistics" list. It was left out in the original posting:
• Single-column join constraints, if the column is not unique
Thank you so much for the detailed information.
You should also consider the CPU taken by collect stats operation.
You should practice stats collection on a basis of % change of data.
I say a minimum of 10% of data load should be happening to make sure stats are refreshed.
Also for unique columns like "cust_id, txn_id" can be refreshed once in a week
For date columns like "txn_date" stats should be collected daily.
multo column stats are more proper when you have non-uniform data distribution over the combination of collumns. for Eg: tcn_type, country_cd
there may be a lot of txns for a country_cd, but on a txn_type = 99 there may be no txns at all. in those cases multi-column stats become more important.
Speaking of the resources the collect stats operations can use, there are some enahncements available in Teradata 14.0 that will reduce that.
In Teradata 14.0 performance improvements to stats collection include the ability to submit a single statement with multiple statistics collections, and where it is possible, the optimizer will perform rollup aggregations. For example if you collect on STATE and on ZIPCODE and (STATE,ZIPCODE) the optimizer may choose to do one scan of the table for all 3 different statistics. The decision whether to use this optimization or others in 14.0 is cost-based.
Thanks for the great tips. Would you please explain the reasoning behind collecting stats in following situations?
1) non-index columns used in predicates: Even with stats collected, FTS can't be avoided. Is the idea behind collect stats to give optimizer better cardinality estimates that may affect some other steps within the same query?
2) range predicate utilizing UPI and USI: Since neither UPI nor USI can be value-ordered, is the rationale behind this recommendation same as previous item?
Also is there any general guidance on what would be called as "over-partitioned"?
Thanks again.
For your point #1, if all your queries against a table are always and only going to do a full table scan and never join to other tables, then you are correct that collecting statistics on predicate columns may be a waste of time. Understanding query usage is always an important aspect of decising which statistics to collect on. The purpose of such stats is to let the optimizer estimate more accurately the size of the spool coming out of that step. This can be useful for joins that may happen later in the plan, and even for being able to cost just that one step. If it's the only step in the plan, you still might want the optimizer to establish an accurate cost if, for example, you are using estimated processing time as classification criteria in TASM for either a workload or a throttle.
For point #2, the rationale is the same. The stats in such as case would be used to estimate the number of rows that qualify when the range constraint is applied against the UPI or USI. Without a histogram, there is no way the optimizer can make a reasonable estimate of processing time or spool size, and it could be needed in subsequent steps.
The best source of guidance for over-partitioning is to read through Paul Sinclair's orange book titled:
Partitioned Primary Index Usage (Single-Level and Multilevel Partitioning)
In particular, multilevel partitioning, or column partitioning and row partitioning combined, can significantly increase the number of combined partitions. If you have a lot of AMPs in your configuration, this could lead to a situation where there ends up being just a few data blocks per AMP in each combined partition. According to the Columnar Orange Book, over-partitioning exists if non-empty combined partitions per AMP typically have less than 10 data blocks.
Thanks, -Carrie
Please note: This posting was updated on May 29, 2012, to reflect more accurate information on dynamic AMP sampling.
Hi Carrie -- are there any plans to lift the limit of 32 multi-column stats per table in v14 or a future release?
Revising the leading, 16-byte limitation on the histograms is great news.
Thanks, Bill
Hi Bill,
Good question. That limit has already been lifted.
In 14.0 you may define up to 512 single or multicolumn statistics.
Thanks, -Carrie
Thank you very much for the recommandations regarding stats.
I would like to know how the cost profiling related to optimizer estimation process.
To add into this, i can see that there are many cost profiling available(not sure how many). But would like to know if the each cost profile will change the estimation process and if yes, then what are your recommandations for the same.
let me give you some issue background and why above question raised in my mind.
when i was working in earlier project, Client has upgraded from TD 12 to TD 13 and then some of the SP's were running slower and Client request to go for backout. TD CS people suggested to alternative solution and they have changed the cost profiling after then those sp's were running fine.
Thanks in adavance for some light on this issue.
Regards
Pinal Patel
Pinal,
Different cost profiles can have an impact on optimizer estimates. However, I do not have any direct experience trying out different cost profiles, so I am not able to make a recommendation in this area. Please consult the support center if you believe you are having issues related to cost profiles in use at your site. In addition, there is an orange book titled "OCES Type2 Costing" that should help to provide more information.
Thanks, -Carrie
Hi Carrie,
Thanks for the information.
I can see that there are two table under DBC which are having the information regarding the cost profiles.
i.e dbc.costprofiles and dbc.costprofilestype.
Thanks
Pinal
Carrie, may I reproduce this content on an internal website for our developers? I would like of course to give full credit to you and those you named at the end of the article. This content has great educational value for my developers, but not all of them will have internet access.
Dave Hays
Senior Analyst
Enterprise Information Management
Caterpillar, Inc.
Hi David,
Per the DevX Terms of Use you may freely redistribute/re-host this content (but give attribution to Carrie!). This is the relevant section from the TOU:
>> Teradata hereby dedicates all present and future intellectual property rights (except patent rights) it holds in Teradata Public Content to the public domain. For example, Teradata agrees that any Person obtaining a copy of Teradata Public Content may freely reproduce, distribute, publish, transmit, publicly perform, publicly display, modify, prepare derivative works of, lease, license, sublicense, or otherwise transfer all or part of Teradata Public Content, as well as to authorize others to do any or all of the above, for any purpose, commercial or non-commercial, and in any way, including by methods that have not yet been conceived. Teradata agrees that this dedication relinquishes all rights it and its successors have to enforce such rights with respect to Teradata Public Content, whether by lawsuit or otherwise.
When exercising the above rights with respect to Content (even Public Content), attribution should be given to the authors of Content. <<
DevX Platform Architect
Thanks, Neil!
Carrie thank you !
I'll of course credit you, Paul Sinclair and Rama Krishna Korlapati.
-- Dave Hays
Dave Hays
Senior Analyst
Enterprise Information Management
Caterpillar, Inc.
Hello Carrie,
Thanks a lot for providing a quick summary for statistics collection recommendations.
I have following doubts in statistics collection using Teradata Statistics Wizard:-
1. What are stale statistics?
2. Collection of statistics is advisable to be collected as per recommendation for a particular workload or for all tables which are part of the workload?
3. If I define a workload using "Create from SQL statement" and then get recommendations for collecting statistics on that particular workload, recommendation would include statistics collection on columns/ indexes for the query mentioned in workload. Here it might happen that statistics are collected on some columns/ indexes which may be not required for other queries. In other words, would the statistics collected for a particular query be stale for other queries?
4. Is it advisable to collect statistics for all tables that are frequently used regularly?
Thanks,
Shrey
Hi Carrie,
I did some tests by USING SAMPLE option.
I set the SAMPLESIZE to 50%.
Each time I collect stat on column of my secondary index, the sampled pourcent is always 1%.
Can you explain why ?
My table IDV_SI is define with INDEX ( GIN_IDV );
Column GIN_IDV is nearly unique (over 99%).
DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=50" ON FOR SESSION;
collect statistics using sample on IDV_SI column GIN_IDV;
help statistics IDV_SI column GIN_IDV; : the sampled pourcent remains to 1% ???
It's always 1% for secondary index.
thanks
I am only guessing, but it may be possible that the NUSI sample code is not recognizing the diagnostic percentage. Once you set the diagnostic as you did, it should effect all stats collections in the session. I know that that is how it works for non-NUSI columns.
Please contact the support center and open an incident on this, so we can evaluate what is actually happening.
Thanks, -Carrie
Hi Carrie,
I am running Teradata 13.10 and was wandering if it is possible to identify stats that are not being used by the optimizer?
There is no mechanism in 13.10 that would allow you to identify stats that are not being used by the optimizer. A future release of Teradata will be collecting that information because it has been recognized as important. But unfortunately we're not there yet.
Thanks, - Carrie
Hi Carrie,
Thanks a lot for providing a quick summary for statistics collection recommendations.
Your one of full statistics recommandation is "PARTITION for all tables whether partitioned or not".
Is it need to collect stats "PARITION" on lots of small table (like code or dimension table) ?
I think that you mean "PARTITION" on some large table because of extrapolation (PARTITION on NPPI table collect table row count very quickly).
You do not need to collect PARTITION stats on small tables that do not undergo growth, such as most dimension tables. You are correct that PARTITION stats are most important on large tables that do grow, and where stats are more likely to be out of date. If your small tables undergo growth, then recollect fulls stats on them, because it will be very fast with small tables.
Thanks, -Carrie
I've always had this question and unable to find a clear cut answer. For a partitioned table, is collect stats on "PARTITION" the same as collecting stats on the column on which PPI is defined?
I'm trying to understand whether we should do both(collect stats on partition and collect stats on ppi column) or just one of these?
Pardon me if the question sounds silly.
-Suhail
Hi Carrie,
We are running TD 13.10 and I remember from earlier versiosn that there was a recommendation to not have too many empty partitions, ie not to over-partition the table. Does the collecting of stats on PARTITION mean that this is less restrictive now? We have just had to go through an excersise to alter tables and increase the end date of many date range partitions and have this tinme consuming, so I am wondering if we were to specify a end date 5 years or 10 years from now that has interval '1' day would be OK, if the stats on PARTITION were refreshed reguarlay!
Suhail,
PARTITION stats counts the rows in each partition of the table, so the optimizer knows how many are empty, and if not empty, how full. This is used for estimating I/O costs for the optimizer.
Stats on the partitioning column is used to determine the number of qualifying rows for a given query.
You should collect on both PARTITION and the partitioning column for PPI tables.
Thanks, -Carrie
Hi Mark,
Empty partitions are not an issue if PARTITION stats are collected. PARTITION stats are important because they tell the optimizer how many and which partitions are empty. So you should be fine.
Thanks, -Carrie
Carrie,
We have a 13.10 TD system and have TASM configured based on estimatedCPU in some cases. Please let me know if there is a reason for the estimate of a stats collection is always 0. Since we have TASM classification criteria based on the EstimateCPU Times and these stats collections go into the <1 sec bucket when they really take up more than 100 CPU ticks.
The below query comes back with 0 rows.
sel
logdate,
Cast(TDWMEstTotalTimeas int) estime1,
cast(TDWMEstMaxRowsas int),
cast(TDWMEstLastRowsas int) ,
cast(EstResultRowsas int) ,
cast(EstProcTime as int) estime2 ,
cast(EstMaxRowCount as int)
from
pdcrdata.dbqlogtbl_hst
whereStatementType ='Collect Statistics'
groupby 1,2,3,4,5,6,7
havingestime1>0 or estime2 >0
orderby 1,7,6,5,4,3,2
Statistics collection statements never produce a total estimated time. It has always been that way. So you don’t want to try to use estimated time for classification of those types of requests.
I know this does not help you today, but in 14.10 you will see a new classification criteria under Query Characteristics in Workload Designer for statement type = "Statistics Collection". That will allow you to run all stats collections, no matter who submits them, in the same workload if you like.
Thanks, -Carrie
Hi Carrie,
Thanks for the info.
I have a huge partitioned table (10+TB). I am collecting stats on "partition" and primary index.
Do I still need stats on combination of (pattition, pi)?
i don't see any benefit of it.
Please reply.
This is an optional recommendation. The only reason you might want to collect on both as a multicolumn statistic is if a given PI value exists in multiple partitions, and you do sliding window and rowkey based merge joins. It can also help for dynamic partition elimination. This is pointed out in the section above under partitioned primary index recommendations.
But if your plans are acceptable without this stat, then there is no need to collect it.
Thanks, -Carrie