Statistical information is vital for the optimizer when it builds query plans.  But collecting statistics can involve time and resources.  By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.

The following statistics collection recommendations are intended for sites that are on any of the Teradata 13.10 software release levels.  Most of these recommendations apply to releases earlier than Teradata 13.10, however some may be specific to Teradata 13.10 only.

Collect Full Statistics

  • Non-indexed columns used in predicates
  • Single-column join constraints, if the column is not unique
  • All NUSIs (but drop NUSIs that aren’t needed/used)
  • USIs/UPIs if used in non-equality predicates (range constraints)
  • Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
  • Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
  • PARTITION for all tables whether partitioned or not[1]
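
For example, using the legacy syntax available in 13.10, full statistics on a non-indexed predicate column, a NUSI, and PARTITION might be collected as follows (the table and column names here are hypothetical):

	COLLECT STATISTICS ON Sales COLUMN Txn_Date;
	COLLECT STATISTICS ON Sales INDEX (Store_Id);
	COLLECT STATISTICS ON Sales COLUMN PARTITION;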

Can Rely on Dynamic AMP Sampling [2]

  • USIs or UPIs if only used with equality predicates
  • NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
  • See “Other Considerations” for additional points related to dynamic AMP sampling

Option to use USING SAMPLE

  • Unique index columns
  • Nearly-unique [3] columns or indexes  
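
Sampled collection uses the same statement with the USING SAMPLE clause, for example (hypothetical names):

	COLLECT STATISTICS USING SAMPLE ON Orders COLUMN Order_Id;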

Collect Multicolumn Statistics

  • Groups of columns that often appear together with equality predicates, if the first 16 bytes of the concatenated column values are sufficiently distinct. These statistics are used for single-table estimates.
  • Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them.  With no multicolumn statistics collected, the optimizer assumes complete independence among the column values.  The more correlated the actual column values are, the greater the value of collecting multicolumn statistics.
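
A multicolumn statistic is collected by listing the columns together in a single statement, for example (hypothetical names):

	COLLECT STATISTICS ON Customer COLUMN (State_Cd, Zip_Cd);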

Other Considerations

  • Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination are not chosen unless statistics have been collected on the relevant columns.
  • NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that.  A NUPI that is far from being 75% unique (for example, 90% unique, or on the other side, 60% unique or less) benefits from having statistics collected, including a NUPI composed of multiple columns, regardless of the length of the concatenated values.  However, if it is close to being 75% unique, dynamic AMP samples are adequate.  To determine the uniqueness of a NUPI before collecting statistics, you can issue this SQL statement:
	EXPLAIN SELECT DISTINCT nupi-column FROM table;
  • For a partitioned primary index table, it is recommended that you always collect statistics on:
    • PARTITION.  This tells the optimizer how many partitions are empty, and how many rows are in each partition.  This statistic is used for optimizer costing.
    • The partitioning column.  This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
  • For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
    • (PARTITION, PI).  This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions.  It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
    • (PARTITION, PI, partitioning column).   This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination.  It is used in rowkey-based merge join costing.
  • Dynamic AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default).  For small tables, with less than 25 rows per AMP, all-AMP sampling is done automatically.   It is also the default for volatile tables and sparse join indexes.  All-AMP sampling comes with these tradeoffs:
    • Dynamic all-AMP sampling provides a more accurate row count estimate for a table with a NUPI.  This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
    • Statistics extrapolation for any column in a table is triggered only when the optimizer detects that the table has grown. The growth is computed by comparing the current row count with the last known row count to the optimizer. If the default single-AMP dynamic sampling estimate of the current row count is not accurate (which can happen if the primary index is skewed), it is recommended to enable all-AMP sampling or re-collect PARTITION statistics.
    • Parsing times for queries may increase when all AMPs are involved, as the queries that perform dynamic AMP sampling will have slightly more work to do. Note that dynamic AMP samples will stay in the dictionary cache until the periodic cache flush, or until they are purged from the cache for some reason.  Because they can be retrieved once and re-used multiple times, it is not expected that dynamic all-AMP sampling will cause additional overhead for all query executions.
  • For temporal tables, follow all collection recommendations made above.  However, statistics are currently not supported on BEGIN and END period types.  That capability is planned for a future release.
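
As an illustration of the partitioned primary index recommendations above, for a hypothetical table Sales_PPI with primary index Store_Id and partitioning column Txn_Date, the collections would look like:

	COLLECT STATISTICS ON Sales_PPI COLUMN PARTITION;
	COLLECT STATISTICS ON Sales_PPI COLUMN Txn_Date;
	COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION, Store_Id);
	COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION, Store_Id, Txn_Date);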

[1]  Collecting PARTITION statistics on tables that have grown supports more accurate statistics extrapolations.  Collecting on PARTITION is an extremely quick operation (as long as the table is not over-partitioned).

[2]  This is sometimes referred to as a random AMP sample since, in early releases of Teradata, a random AMP was picked for obtaining the sample; however, in current releases, the term dynamic AMP sample is more appropriate.

[3]  Any column that is over 95% unique is considered nearly unique.

These recommendations were compiled by: Carrie Ballinger, Rama Krishna Korlapati, Paul Sinclair

Discussion
shamsul 1 comment Joined 10/09
03 Apr 2012

Do you have script for this?

carrie 412 comments Joined 04/08
06 Apr 2012

This is a list of recommendations only. Sorry, but this posting does not include accompanying scripts.

In some cases, implementing these recommendations could probably be automated (collecting on PARTITION for all tables), but other recommendations will involve some judgement calls (is the table small? Are USIs used with range constraints in some queries? Are the NUPI values evenly distributed?). Decisions on whether or not to collect multicolumn statistics in particular do not lend themselves very easily to automation at this time.

There are several postings on developer exchange that do discuss techniques for automating statistics collection, but I believe that is after decisions as to which columns/indexes to collect on have already been made.

For example, check out Marcio Moura's blog (mtmoura) postings on refreshing statistics.

Thanks, -Carrie

TeraAbe 37 comments Joined 08/09
07 Apr 2012

Carrie, thanks for the explanations. Would it be possible to understand the precedence of data types used internally while parsing a multicolumn stats definition longer than 16 bytes?

carrie 412 comments Joined 04/08
10 Apr 2012

The columns involved in the multicolumn stat will be ordered in the order they appear in the CREATE TABLE statement both when the statistics are collected and as they are represented in the histogram. This is true no matter what their data types are. All varchar columns are expanded out to their full length during the collection process. This means if a varchar appears earlier in a table definition than the other columns in a multicolumn statistic, and it is specified as varchar(16) or larger, then any subsequent columns will not be represented in the histogram values fields. They would be truncated as a result of the 16-byte size limit for the histogram values fields.

In Teradata 14.0, this 16-byte limit for histogram values fields has been lifted.
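
To illustrate with a hypothetical table, suppose a varchar(20) column appears before an integer column in the table definition:

	CREATE TABLE T1 (C1 VARCHAR(20), C2 INTEGER) PRIMARY INDEX (C1);
	COLLECT STATISTICS ON T1 COLUMN (C1, C2);

Prior to 14.0, C1 is expanded to its full 20 bytes during collection, so the 16-byte histogram values fields hold only a truncated portion of C1 and nothing of C2.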

amy 2 comments Joined 02/11
11 Apr 2012

Do you know what the new limit for the histogram values fields is in Teradata 14.0? Also, is it true that the columns involved in a multicolumn stat will be ordered in the order we define them when we collect stats in Teradata 14.0?

carrie 412 comments Joined 04/08
13 Apr 2012

The new default limit for the size of the histogram values fields is 25 bytes in Teradata 14.0. However, the limit can be increased by specifying a greater value on the COLLECT STATISTICS statement itself. The ordering of the multicolumn statistics can now be specified on the COLLECT STATISTICS statement as well (with the exception of statements that specify INDEX).

In addition, each participating column in a multicolumn stat will be represented by at least 25 bytes (if it needs it), and truncation will only take place after the column's leading 25 bytes have been represented, on a column-by-column basis.

Prior to 14.0, not only was the truncation at 16 bytes, but it applied to the first 16 bytes of the combined values strung together. If you had several different columns in the statistic, often the later ones would not be represented at all. That is no longer the case.

Thanks, -Carrie

carrie 412 comments Joined 04/08
13 Apr 2012

PLEASE NOTE:

A small change was made to the above text on April 13, to add this new bullet under the "Collect full statistics" list. It was left out in the original posting:

• Single-column join constraints, if the column is not unique

amy 2 comments Joined 02/11
17 Apr 2012

Thank you so much for the detailed information.

rajanimandava 5 comments Joined 01/11
26 Apr 2012

You should also consider the CPU consumed by the collect stats operation.

You should base stats collection on the percentage of data that has changed.

I'd say a minimum of 10% data change should have occurred before stats are refreshed.

Also, stats on unique columns like "cust_id, txn_id" can be refreshed once a week.

For date columns like "txn_date", stats should be collected daily.

Multicolumn stats are most appropriate when you have a non-uniform data distribution over the combination of columns, e.g. txn_type, country_cd.

There may be a lot of txns for a country_cd, but for txn_type = 99 there may be no txns at all. In those cases multicolumn stats become more important.

carrie 412 comments Joined 04/08
26 Apr 2012

Speaking of the resources that collect stats operations can use, there are some enhancements available in Teradata 14.0 that will reduce them.

In Teradata 14.0 performance improvements to stats collection include the ability to submit a single statement with multiple statistics collections, and where it is possible, the optimizer will perform rollup aggregations. For example if you collect on STATE and on ZIPCODE and (STATE,ZIPCODE) the optimizer may choose to do one scan of the table for all 3 different statistics. The decision whether to use this optimization or others in 14.0 is cost-based.
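
As a sketch of the bundled 14.0 form (hypothetical table name), the three collections from the example above could be submitted as one statement:

	COLLECT STATISTICS
		COLUMN (STATE),
		COLUMN (ZIPCODE),
		COLUMN (STATE, ZIPCODE)
	ON Customer;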

padhia 11 comments Joined 06/10
21 May 2012

Thanks for the great tips. Would you please explain the reasoning behind collecting stats in following situations?

1) non-index columns used in predicates: Even with stats collected, FTS can't be avoided. Is the idea behind collect stats to give optimizer better cardinality estimates that may affect some other steps within the same query?

2) range predicate utilizing UPI and USI: Since neither UPI nor USI can be value-ordered, is the rationale behind this recommendation same as previous item?

Also is there any general guidance on what would be called as "over-partitioned"?

Thanks again.

carrie 412 comments Joined 04/08
22 May 2012

For your point #1, if all your queries against a table are always and only going to do a full table scan and never join to other tables, then you are correct that collecting statistics on predicate columns may be a waste of time. Understanding query usage is always an important aspect of deciding which statistics to collect. The purpose of such stats is to let the optimizer estimate more accurately the size of the spool coming out of that step. This can be useful for joins that may happen later in the plan, and even for being able to cost just that one step. If it's the only step in the plan, you still might want the optimizer to establish an accurate cost if, for example, you are using estimated processing time as classification criteria in TASM for either a workload or a throttle.

For point #2, the rationale is the same. The stats in such a case would be used to estimate the number of rows that qualify when the range constraint is applied against the UPI or USI. Without a histogram, there is no way the optimizer can make a reasonable estimate of processing time or spool size, and it could be needed in subsequent steps.

The best source of guidance for over-partitioning is to read through Paul Sinclair's orange book titled:

Partitioned Primary Index Usage (Single-Level and Multilevel Partitioning)

In particular, multilevel partitioning, or column partitioning and row partitioning combined, can significantly increase the number of combined partitions. If you have a lot of AMPs in your configuration, this could lead to a situation where there ends up being just a few data blocks per AMP in each combined partition. According to the Columnar Orange Book, over-partitioning exists if non-empty combined partitions per AMP typically have less than 10 data blocks.

Thanks, -Carrie

carrie 412 comments Joined 04/08
29 May 2012

Please note: This posting was updated on May 29, 2012, to reflect more accurate information on dynamic AMP sampling.

wgregg 1 comment Joined 09/10
31 May 2012

Hi Carrie -- are there any plans to lift the limit of 32 multi-column stats per table in v14 or a future release?

Revising the leading, 16-byte limitation on the histograms is great news.

Thanks, Bill

carrie 412 comments Joined 04/08
01 Jun 2012

Hi Bill,

Good question. That limit has already been lifted.

In 14.0 you may define up to 512 single or multicolumn statistics.

Thanks, -Carrie

pinaldba 5 comments Joined 09/10
11 Jun 2012

Thank you very much for the recommendations regarding stats.

I would like to know how cost profiles relate to the optimizer's estimation process.
To add to this, I can see that there are many cost profiles available (not sure how many), but I would like to know whether each cost profile changes the estimation process, and if so, what your recommendations are.

Let me give you some background on the issue and why the above question came to mind.
When I was working on an earlier project, the client upgraded from TD 12 to TD 13 and some of the SPs then ran slower, so the client requested a backout. Teradata CS suggested an alternative solution and changed the cost profile, after which those SPs ran fine.

Thanks in advance for some light on this issue.

Regards
Pinal Patel

carrie 412 comments Joined 04/08
11 Jun 2012

Pinal,

Different cost profiles can have an impact on optimizer estimates. However, I do not have any direct experience trying out different cost profiles, so I am not able to make a recommendation in this area. Please consult the support center if you believe you are having issues related to cost profiles in use at your site. In addition, there is an orange book titled "OCES Type2 Costing" that should help to provide more information.

Thanks, -Carrie

pinaldba 5 comments Joined 09/10
14 Jun 2012

Hi Carrie,

Thanks for the information.

I can see that there are two tables under DBC that hold information about the cost profiles,

i.e. dbc.costprofiles and dbc.costprofilestype.

Thanks
Pinal

28 Jun 2012

Carrie, may I reproduce this content on an internal website for our developers? I would like of course to give full credit to you and those you named at the end of the article. This content has great educational value for my developers, but not all of them will have internet access.

Dave Hays
Senior Analyst
Enterprise Information Management
Caterpillar, Inc.

neilotoole 23 comments Joined 11/08
28 Jun 2012

Hi David,

Per the DevX Terms of Use you may freely redistribute/re-host this content (but give attribution to Carrie!). This is the relevant section from the TOU:

>> Teradata hereby dedicates all present and future intellectual property rights (except patent rights) it holds in Teradata Public Content to the public domain. For example, Teradata agrees that any Person obtaining a copy of Teradata Public Content may freely reproduce, distribute, publish, transmit, publicly perform, publicly display, modify, prepare derivative works of, lease, license, sublicense, or otherwise transfer all or part of Teradata Public Content, as well as to authorize others to do any or all of the above, for any purpose, commercial or non-commercial, and in any way, including by methods that have not yet been conceived. Teradata agrees that this dedication relinquishes all rights it and its successors have to enforce such rights with respect to Teradata Public Content, whether by lawsuit or otherwise.

When exercising the above rights with respect to Content (even Public Content), attribution should be given to the authors of Content. <<

DevX Platform Architect

carrie 412 comments Joined 04/08
28 Jun 2012

Thanks, Neil!

29 Jun 2012

Carrie thank you !
I'll of course credit you, Paul Sinclair and Rama Krishna Korlapati.

-- Dave Hays

Dave Hays
Senior Analyst
Enterprise Information Management
Caterpillar, Inc.

Shrey Takkar 4 comments Joined 06/12
30 Jun 2012

Hello Carrie,

Thanks a lot for providing a quick summary for statistics collection recommendations.

I have following doubts in statistics collection using Teradata Statistics Wizard:-

1. What are stale statistics?

2. Collection of statistics is advisable to be collected as per recommendation for a particular workload or for all tables which are part of the workload?

3. If I define a workload using "Create from SQL statement" and then get recommendations for collecting statistics on that particular workload, recommendation would include statistics collection on columns/ indexes for the query mentioned in workload. Here it might happen that statistics are collected on some columns/ indexes which may be not required for other queries. In other words, would the statistics collected for a particular query be stale for other queries?

4. Is it advisable to collect statistics for all tables that are frequently used regularly?

Thanks,
Shrey

vincent91 13 comments Joined 02/10
16 Jul 2012

Hi Carrie,

I did some tests with the USING SAMPLE option.
I set the SAMPLESIZE to 50%.
Each time I collect stats on a column of my secondary index, the sampled percent is always 1%.
Can you explain why?

My table IDV_SI is defined with INDEX ( GIN_IDV );

Column GIN_IDV is nearly unique (over 99%).

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=50" ON FOR SESSION;
collect statistics using sample on IDV_SI column GIN_IDV;
help statistics IDV_SI column GIN_IDV; : the sampled percent remains at 1% ???

It's always 1% for a secondary index.

thanks

carrie 412 comments Joined 04/08
18 Jul 2012

I am only guessing, but it may be possible that the NUSI sample code is not recognizing the diagnostic percentage. Once you set the diagnostic as you did, it should affect all stats collections in the session. I know that is how it works for non-NUSI columns.

Please contact the support center and open an incident on this, so we can evaluate what is actually happening.

Thanks, -Carrie

Wam10698 1 comment Joined 03/08
17 Oct 2012

Hi Carrie,
I am running Teradata 13.10 and was wondering if it is possible to identify stats that are not being used by the optimizer?

carrie 412 comments Joined 04/08
18 Oct 2012

There is no mechanism in 13.10 that would allow you to identify stats that are not being used by the optimizer. A future release of Teradata will be collecting that information because it has been recognized as important. But unfortunately we're not there yet.

Thanks, - Carrie

esnam 1 comment Joined 06/11
14 Jan 2013

Hi Carrie,
Thanks a lot for providing a quick summary for statistics collection recommendations.

One of your full statistics recommendations is "PARTITION for all tables whether partitioned or not".
Is it necessary to collect "PARTITION" stats on lots of small tables (like code or dimension tables)?
I think you mean "PARTITION" on large tables, because of extrapolation (PARTITION stats on an NPPI table collect the table row count very quickly).

carrie 412 comments Joined 04/08
21 Jan 2013

You do not need to collect PARTITION stats on small tables that do not undergo growth, such as most dimension tables. You are correct that PARTITION stats are most important on large tables that do grow, and where stats are more likely to be out of date. If your small tables undergo growth, then recollect full stats on them, because that will be very fast with small tables.

Thanks, -Carrie

suhailmemon84 51 comments Joined 09/10
04 Feb 2013

I've always had this question and unable to find a clear cut answer. For a partitioned table, is collect stats on "PARTITION" the same as collecting stats on the column on which PPI is defined?

I'm trying to understand whether we should do both(collect stats on partition and collect stats on ppi column) or just one of these?

Pardon me if the question sounds silly.

-Suhail

MarkVYoung 20 comments Joined 03/12
04 Feb 2013

Hi Carrie,
We are running TD 13.10 and I remember from earlier versions that there was a recommendation not to have too many empty partitions, i.e. not to over-partition the table. Does collecting stats on PARTITION mean that this is less restrictive now? We have just had to go through an exercise to alter tables and increase the end date of many date-range partitions, and found it time consuming, so I am wondering whether specifying an end date 5 or 10 years from now with INTERVAL '1' DAY would be OK, if the stats on PARTITION were refreshed regularly!

carrie 412 comments Joined 04/08
06 Feb 2013

Suhail,

PARTITION stats counts the rows in each partition of the table, so the optimizer knows how many are empty, and if not empty, how full. This is used for estimating I/O costs for the optimizer.

Stats on the partitioning column is used to determine the number of qualifying rows for a given query.

You should collect on both PARTITION and the partitioning column for PPI tables.

Thanks, -Carrie

carrie 412 comments Joined 04/08
06 Feb 2013

Hi Mark,

Empty partitions are not an issue if PARTITION stats are collected. PARTITION stats are important because they tell the optimizer how many and which partitions are empty. So you should be fine.

Thanks, -Carrie

mukkamal 1 comment Joined 01/12
22 Feb 2013

Carrie,
We have a 13.10 TD system and have TASM configured based on estimated CPU in some cases. Please let me know if there is a reason why the estimate for a stats collection is always 0. Since we have TASM classification criteria based on the estimated CPU times, these stats collections go into the <1 sec bucket when they really take up more than 100 CPU ticks.

The below query comes back with 0 rows.

sel
  logdate,
  cast(TDWMEstTotalTime as int) estime1,
  cast(TDWMEstMaxRows as int),
  cast(TDWMEstLastRows as int),
  cast(EstResultRows as int),
  cast(EstProcTime as int) estime2,
  cast(EstMaxRowCount as int)
from pdcrdata.dbqlogtbl_hst
where StatementType = 'Collect Statistics'
group by 1,2,3,4,5,6,7
having estime1 > 0 or estime2 > 0
order by 1,7,6,5,4,3,2

carrie 412 comments Joined 04/08
25 Feb 2013

Statistics collection statements never produce a total estimated time. It has always been that way. So you don’t want to try to use estimated time for classification of those types of requests.

I know this does not help you today, but in 14.10 you will see a new classification criteria under Query Characteristics in Workload Designer for statement type = "Statistics Collection". That will allow you to run all stats collections, no matter who submits them, in the same workload if you like.

Thanks, -Carrie

praving 1 comment Joined 02/13
25 Feb 2013

Hi Carrie,

Thanks for the info.
I have a huge partitioned table (10+ TB). I am collecting stats on "partition" and the primary index.
Do I still need stats on the combination of (partition, pi)?
I don't see any benefit from it.
Please reply.

carrie 412 comments Joined 04/08
27 Feb 2013

This is an optional recommendation. The only reason you might want to collect on both as a multicolumn statistic is if a given PI value exists in multiple partitions, and you do sliding window and rowkey based merge joins. It can also help for dynamic partition elimination. This is pointed out in the section above under partitioned primary index recommendations.

But if your plans are acceptable without this stat, then there is no need to collect it.

Thanks, -Carrie

MarkVYoung 20 comments Joined 03/12
21 Jun 2013

Carrie,
    How do we know if we are running with all-AMP sampling? Is this in the dbscontrol settings? We are running TD 13.10 and finding that some table statistics refreshes take over an hour, so we are wondering if we might be running single-AMP sampling, as we could drop some stats if we were using all-AMP sampling!

carrie 412 comments Joined 04/08
24 Jun 2013

Hi Mark,
 
All-AMP sampling is set in DBS Control.   It is Internal Field #65:
 
   65.  RandomAmpSampling      = D
 
RandomAmpSampling - This field determines the number of amps to be sampled for getting the row estimates of a table. The valid values are D,L,M,N or A.
    D - The default one AMP sampling.
    L - Maximum Two AMPs sampling.
    M - Maximum Five AMPs sampling.
    N - Node Level sampling i.e. all the AMPs in a node would be sampled.
    A - System Level sampling i.e. all the AMPs in a system would be sampled.
  
Option D is the default. Note that the higher the number of AMPs sampled, the better the estimates will be, but this can cause short-running queries to run slower and long-running queries to run faster.
 
That is what the documentation on DBS Control says.
 
However, this setting only controls random AMP sampling (RAS), which happens infrequently, mainly when the table header is accessed from the data dictionary.   (Random AMP samples are likely to stay in cache for up to 4 hours). RAS is used to help produce estimates when no collected statistics exist.  RAS is not directly related nor does it influence performance when collecting statistics statements are issued.  The statistics collection process always uses all AMPs no matter how you have set RandomAMPSampling.  
 
More information on random AMP sampling can be found in the Teradata documentation:   See the SQL Transaction Processing manual.  For the 13.10 edition there is an entire section on random AMP sampling starting on page 177.
 
You may be confusing RAS with the option to sample a subset of the data when you issue a statistics collection statement. With the USING SAMPLE clause you can reduce the effort made and the time in collecting statistics, but it is different functionality from random AMP sampling.
 
Thanks, -Carrie

teradatauser2 13 comments Joined 04/12
09 Jul 2013

Hi  Carrie,
We have a table with a PPI on Col1, which is a date column. We have a query that uses the condition
Col1 >= DATE '2013-07-06'
AND Col1 <= DATE '2013-07-08'
 
There was no data for 2013-07-06 and 2013-07-07 (as no data comes in on weekends).
On Monday 2013-07-08, we have 4 million records. Data for 2013-07-05 was 3.5 million.
We have a weekly stats strategy, i.e. stats were collected on this table on 2013-07-07 (and we have stats defined on Col1 and PARTITION as well).
But if you look at the explain plan for the query, for the above-mentioned condition, it assumes 1 row, with high confidence. Now the query goes for a toss and doesn't complete.
When I collect stats on Col1 again today, the query runs in 7 minutes.
If you look at help stats from the 7th, unique values were 3152, and today after collecting stats it is 3153 (just 2013-07-08 was added, I suppose). I am unable to understand why it is not able to use the weekend stats, as the data has not changed much. This is a 1 TB table and 4 million rows is not a huge addition for a day.
So do we need to collect stats on this column daily, since my query runs daily (is this recommended?)? On a broader perspective, what is the recommendation for daily stats collection? We are facing this problem with this table only; all other tables are fine (with only weekly stats). Could there be any other reason?
-Thanks !

carrie 412 comments Joined 04/08
11 Jul 2013

Based on what you say, when stats were collected on Sunday (day 07), there are no rows in the partition for day 06 and no rows in the partition for day 07 and no rows yet in the partition for day 08.   So the explain text row count estimate of 1 row for a query that selects only partitions for day 06, 07 and 08 seems accurate to me.   (You will always get a one row estimate even when no rows are detected in a table or a partition).
 
The recommendation is to re-collect statistics if either the table or any partition in the table changes by 10% or more. Going from an empty to a full partitions qualifies as more than a 10% change to a partition.  So you need to recollect on PARTITION and the partitioning column after any load activity into an empty partition. 
 
If you ever find yourself getting bad plans on your current schedule of recollections, but are able to get good plans with more frequent recollections, then it makes a lot of sense to recollect more frequently. 
 
Thanks, -Carrie

surup.sarit 2 comments Joined 08/13
26 Aug 2013

Hi Carrie,
I am in trouble. I am declaring collect statistics at the column level. The syntax below works in TD 14.0 but does not work in TD 13.0:
Collect statistics using system sample INDEX (<columnname1>,<columnname2>....) on <databasename>.<tablename>;
Please advise.

surup.sarit 2 comments Joined 08/13
26 Aug 2013

Hi Carrie,
Could you please tell me whether the TD 14.0 syntax will work in TD 13.0?

carrie 412 comments Joined 04/08
26 Aug 2013

Surup,
 
You cannot use the new 14.0 collect statistics syntax on a database with earlier software, such as 13.10.  It will not be recognized.
 
 
Here is an example of the legacy style syntax, pre-Teradata 14.0, that works in 13.0.  It can also continue to be used in 14.0:
 
COLLECT STATISTICS ON supplier column s_suppkey;
 
Here is the new style in Teradata 14.0 (it is not backward compatible). 
 
COLLECT STATISTICS COLUMN s_suppkey ON supplier;
 
You will have to manually re-write the collect stats statements that are written using the new syntax for 14.0 to the legacy syntax shown above.
 
Thanks, -Carrie

padhia 11 comments Joined 06/10
03 Dec 2013

Hello Carrie,
 
Is collecting stats on (PARTITION, PI) or (PARTITION, PI, partitioning column), subject to the 16 byte limit of multi-column stats? That is, if a table has a PI that is on a really long set of column(s), will the stats on PARTITION+PI be misleading to the optimizer if the first 16 bytes aren't distinct enough?
 
Thanks!

carrie 412 comments Joined 04/08
05 Dec 2013

The number of distinct values is more significant for the types of multicolumn statistics you are asking about than is the length of the values stored in the statistics histogram.
 
When you issue a COLLECT STATISTICS statement, the aggregation activity calculates the number of distinct values before any truncation is performed, so that metric is a more reliable value than would be an interval value truncated at 16 bytes.
 
Thanks, -Carrie

rraghuram64 3 comments Joined 10/13
2 months ago

Hi carrie,
I have some confusion in choosing the right columns for a multicolumn statistic.
I have a table which has a multicolumn USI, multicolumn partitioning columns, multiple columns participating in a NUSI, and multiple columns in non-equality join conditions.
 
So which ones should I collect the multicolumn stats on? I am really confused....

carrie 412 comments Joined 04/08
2 months ago

You do not usually need statistics on a USI.  The optimizer knows a unique column is unique and that there will be one row for each distinct value, so statistics will not add any benefit, unless you are referencing the USI with range constraints in queries.    (See the 4th bullet under Full Statistics category above.)
 
Also, you do not require statistics on multiple columns that make up a join constraint if you are joining with non-equality conditions.   Multicolumn stats won't help non-equality joins.
 
It is recommended that you do collect statistics on partitioning columns, whether multicolumn or single column, and on multiple columns participating in a NUSI (see the 5th bullet under Full Statistics above).
 
Thanks, - Carrie

cmedved 1 comment Joined 02/14
1 week ago

Hi Carrie. Do you have any recommendations for statistics on a multi-level PPI where the columns used in the PPI are not in the PI?
 
If Col1 is the PI and Col2 and Col3 are the MLPPI columns, would we want to collect statistics on: (PARTITION, Col1, Col2, Col3)? Are there any others that would be important?
 
Thanks in advance!

carrie 412 comments Joined 04/08
1 week ago

Whether the PPI table uses a single level of partitioning, or multiple level of partitioning, the guidelines are still the same.  The fourth bullet under "Other Considerations" above provides a couple of recommendations that would apply to MLPPI tables.  Whether or not you want to implement those recommendations depends on how partitioning is being used by queries and the type of joins being performed against the PPI table.    And of course, whether or not they improve your query plans!
 
•For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
 
◦(PARTITION, PI).  This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions.  It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
 
 ◦(PARTITION, PI, partitioning column).   This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination.  It is used in rowkey-based merge join costing.
 
Thanks, -Carrie
