Teradata 14.0 offers some very helpful enhancements to the statistics collection process.   This posting discusses a few of the key ones, with an explanation of how these enhancements can be used to streamline your statistics collection process and help your statistics be more effective.

For more detail on these and other statistics collection enhancements, please read the orange book titled Teradata 14.0 Statistics Enhancements, authored by Rama Korlapati, Teradata Labs. 

New USING options add greater flexibility

In Teradata 14.0 you may optionally specify a USING clause within the COLLECT STATISTICS statement.  As an example, here are the three new USING options available in 14.0, with parameters you might use:

. . . USING MAXINTERVALS 300

. . . USING MAXVALUELENGTH 50

. . . USING SAMPLE 10 PERCENT

MAXINTERVALS allows you to increase or decrease the number of intervals for one statistic at a time in the new version 5 statistics histogram.  The default maximum number of intervals is 250.  The valid range is 0 to 500.  A larger number of intervals can be useful if you have widespread skew on a column or index you are collecting statistics on, and you want more individual high-row-count values to be represented in the histogram.  Each statistics interval highlights its single most popular value, which it designates as its “mode value”, and lists the number of rows that carry that value.  By increasing the number of intervals, you provide the optimizer an accurate row count for a greater number of popular values.
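For instance, here is a sketch of a collection that raises the interval limit for a heavily skewed column (the Sales table and store_id column are hypothetical):

COLLECT STATISTICS 
        USING MAXINTERVALS 500 
        COLUMN ( store_id ) 
ON Sales;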

MAXVALUELENGTH lets you expand the length of the values contained in the histogram for that statistic.  The new default length is 25 bytes, where previously it was 16.  If needed, you can specify well over 1000 bytes for a maximum value length.  No padding is done to the values in the histogram, so only values that actually need that length will incur the space (which is why the parameter is named MAXVALUELENGTH instead of VALUELENGTH).  The 16-byte limit on value sizes in earlier releases was always padded to full size: even if your statistics value was one character, you used the full 16 bytes to represent it.

Another improvement around value lengths has to do with multicolumn statistics.  In earlier releases the 16-byte limit for values in the intervals was taken from the beginning of the combined value string, and truncation took place at the end.  Truncation still takes place at the end of the combined string in 14.0; however, in this more current release you can support a much longer combined value.  In addition, during the aggregation process that builds up the statistical histogram, each column within the statistic will have its first 32 bytes represented when determining the number of distinct values, ensuring that a significant number of bytes in each column can contribute to understanding how distinct the combination of column values is.
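As a sketch (the Customer table and its columns are hypothetical), here is a multicolumn collection that widens the maximum value length so more of the combined value string survives in the histogram:

COLLECT STATISTICS 
        USING MAXVALUELENGTH 100 
        COLUMN ( c_city, c_name ) 
ON Customer;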

SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection level, rather than at the system level.  This allows you to easily apply different levels of statistics sampling to different columns and indexes.

Here's an example of how this USING syntax might look:

COLLECT STATISTICS 
        USING MAXVALUELENGTH 50 
        COLUMN ( P_NAME ) 
ON CAB.product; 
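Sampling is requested the same way; a sketch, assuming a hypothetical o_custkey column on an Orders table:

COLLECT STATISTICS 
        USING SAMPLE 10 PERCENT 
        COLUMN ( o_custkey ) 
ON Orders;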

Combining multiple collections in one statement

Statistics collection statements for the same table that share the same USING options, and that request full statistics (as opposed to sampled), can now be grouped syntactically.  In fact, once you are on 14.0, it is recommended that you collect all such statistics on a table as one group.  The optimizer will then look for opportunities to overlap the collections wherever possible, reducing the time to perform the statistics collection and the resources it uses.

Here is an example:

The old way:

COLLECT STATISTICS COLUMN 
(o_orderdatetime,o_orderID) 
ON Orders;
COLLECT STATISTICS COLUMN 
(o_orderdatetime) 
ON Orders;
COLLECT STATISTICS COLUMN 
(o_orderID) 
ON Orders;

The new, recommended way: 

COLLECT STATISTICS 
  COLUMN (o_orderdatetime,o_orderID) 
, COLUMN (o_orderdatetime) 
, COLUMN (o_orderID) 
ON Orders;

This is particularly useful when the same column appears in both single-column and multicolumn statistics, as in the example above.  In those cases the optimizer will perform the most inclusive collection first (o_orderdatetime, o_orderID), and then re-use the spool built for that step to derive the statistics for the other two columns.  Only a single table scan is required, instead of the three table scans the old approach needed.

Sometimes the optimizer will choose to perform separate collections (scans of the table) the first time it sees a set of bundled statistics. But based on demographics it has available from the first collection, it may come to understand that it can group future collections and use pre-aggregation and rollup enhancements to satisfy them all in one scan.   

But you have to remember to re-code your statistics collection statements when you get on 14.0 in order to realize these savings.

Summary statistics

New in Teradata 14.0, table-level statistics known as “summary statistics” are collected alongside the column or index statistics you request.  Summary statistics do not cause their own histogram to be built; rather, they create a short listing of facts about the table undergoing collection that are held in the new DBC.StatsTbl.  It is a very fast operation.  Summary stats report on things such as the table’s row count, average block size, and some metrics around block-level compression and (in the future) temperature.  An example of actual execution times that I ran is shown below, comparing regular column statistics collection against summary statistics collection for the same large table.  Time is reported in MM:SS:

COLLECT STATISTICS ON Items COLUMN I_ProductID;
Elapsed time (mm:ss):  09:55

COLLECT SUMMARY STATISTICS ON Items;
Elapsed time (mm:ss):  00:01

You can request summary statistics for a table explicitly, but even if you never do, each individual statistics collection statement causes summary stats to be gathered.  For this reason, it is recommended that you group your statistics collections against the same table into one statement, as sketched below, in order to avoid even the small overhead involved in building summary stats repetitively for the same table within the same script.
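For instance (a sketch with hypothetical column names c1 and c2 on a table t), rather than issuing separate statements that each refresh the table's summary stats:

COLLECT STATISTICS COLUMN (c1) ON t;
COLLECT STATISTICS COLUMN (c2) ON t;

group them so the summary collection happens once:

COLLECT STATISTICS 
  COLUMN (c1) 
, COLUMN (c2) 
ON t;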

There are several benefits to having summary statistics.  One critical advantage is that the optimizer now uses summary stats to get the most up-to-date row count from the table in order to provide more accurate extrapolations.  It no longer needs to depend on primary index or PARTITION stats, as was the case in earlier releases, to perform good extrapolations when it finds statistics on a table to be stale.

Here’s an example of what the most recent summary statistic for the Items table looks like:

SHOW SUMMARY STATISTICS VALUES ON Items;

COLLECT SUMMARY STATISTICS 
            ON CAB.Items 
            VALUES
(
 /** TableLevelSummary **/ 
 /* Version           */ 5, 
 /* NumOfRecords      */ 50, 
 /* Reserved1         */ 0.000000, 
 /* Reserved2         */ 0.000000, 
 
 /* SummaryRecord[1] */ 
 /* Temperature           */ 0, 
 /* TimeStamp             */ TIMESTAMP '2011-12-29 13:30:46', 
 /* NumOfAMPs             */ 160, 
 /* OneAMPSampleEst       */ 5761783680, 
 /* AllAMPSampleEst       */ 5759927040, 
 /* RowCount              */ 5759985050, 
 /* DelRowCount           */ 0, 
 /* PhyRowCount           */ 5759927040, 
 /* AvgRowsPerBlock       */ 81921.871617, 
 /* AvgBlockSize          */ 65024.000000, 
 /* BLCPctCompressed      */ 0.00, 
 /* BLCBlkUcpuCost        */ 0.000000, 
 /* BLCBlkURatio          */ 0.000000, 
 /* RowSizeSampleEst      */ 148.000000, 
 /* Reserved2             */ 0.000000, 
 /* Reserved3             */ 0.000000, 
 /* Reserved4             */ 0.000000 
);
Discussion
carrie (07 May 2014)

Srini,
 
The asterisk that you see in the first row of output for HELP STATS, under the "Column Names" column, represents the row count of the table at the time the statistics were collected.  The value in the "Unique Values" column of that first row is the number of rows in the table.
 
Thanks, -Carrie

ssrini (14 May 2014)

Thank you Carrie.

Regards,
Srini

17 Jun 2014

Hi Carrie,
Starting with Teradata 12 or 13 (cannot remember), we collected stats on column PARTITION
on ALL our tables (partitioned and non-partitioned tables), as we were advised.
With TD 14.10 now, should we still keep that column PARTITION collection on non-partitioned tables?
Should we replace it with summary stats?
 
Is the * we see when doing help stats the result of the summary stats?
 
Will collection on column PARTITION also update the summary stats?
 
Thanks

carrie (18 Jun 2014)

Hi Nazy,
 
On 14.10 (and 14.0 as well) you no longer need to collect statistics on PARTITION, unless your table is a partitioned table.  For partitioned tables you should definitely collect on PARTITION.
 
You may not need to consciously replace PARTITION stats on non-partitioned tables with summary stats, as summary stats will automatically be collected at the time any column/index on the table has its stats collected.  Collection on PARTITION will update summary stats as well.  But if you have not collected any stats on the table for a while, you can explicitly collect just summary stats for the table by issuing the statement below.  It executes very quickly, and it is a good idea to use it after load activity on a table when you don't have time for full stats recollections.
 
COLLECT SUMMARY STATISTICS ON table-name;
 
The asterisk in HELP STATS tells you the number of rows in the table on which the statistics were taken.  It will change as summary stats are refreshed, even if other statistics on the table have not been refreshed.  So you could say that number is a result of summary stats.  That's the correct way to look at it.
 
Thanks, -Carrie

pinaldba (21 Aug 2014)

Hi Carrie,
 
Thanks for the nice explanation.
I have a query related to summary statistics for staging tables.
The following steps are performed before loading data into the final model tables.

Step 1:- Loading data into Staging table 

 

CREATE SET TABLE staging.ABC, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
     (
      RECORD_TYPE CHAR(2) CHARACTER SET LATIN NOT CASESPECIFIC,
      ACCOUNT_NO INTEGER NOT NULL,
      PARENT_ID INTEGER,
      CHILD_COUNT INTEGER,
      HIERARCHY_ID INTEGER
     )
PRIMARY INDEX ( ACCOUNT_NO );

 

Step 2:- Collecting statistics on staging table - staging.ABC

 

Step 3: Rename staging.ABC to staging.ABC_XYZ; the renamed table staging.ABC_XYZ will then be used in batch processing and in transformation.

 

Step 4: Re-create the staging table ipstaging.AR_A100 for the next run.
Questions:
1) Does Teradata recommend collecting stats after renaming the table to staging.ABC_XYZ? If the answer is yes, I would like to know the reason.
2) Can renaming the table drop the random-AMP sample from the summary statistics history, since the table id changes during the staging process? Might the optimizer extrapolate wrongly here, as stats have not been refreshed after renaming the staging table staging.ABC_XYZ?
 
Thanks in advance for your help; eagerly awaiting your answer.
 
Regards
Pinal

carrie (22 Aug 2014)

Pinal,
 
Statistics rows in the new DBC.StatsTbl in 14.0 refer to table id not table name. So, there is no problem with renaming a table, the stats from the old name (collected and RAS) will still be usable.
 
See page 893 of the SQL Data Definition Language Detailed Topics manual for validation of this: 
 
Function of RENAME Table Requests
 
When you rename a table, Teradata Database only changes the table name. All statistics and
privileges belonging to the table remain with it under the new name.
 
It's easy to see for yourself that the stats are still there: create a table, add some rows, collect some stats, rename the table, and you will see the stats are associated with the renamed table.  If they are outdated, it's always a good idea to recollect them whether or not you have renamed the table.  But the act of renaming does not in and of itself require a recollection.
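A minimal sketch of that experiment (table and column names are hypothetical):

CREATE TABLE t_old (c1 INTEGER, c2 INTEGER) PRIMARY INDEX (c1);
INSERT INTO t_old VALUES (1, 1);
COLLECT STATISTICS COLUMN (c2) ON t_old;
RENAME TABLE t_old TO t_new;
HELP STATISTICS t_new;   -- the c2 statistic is still listed under the new name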
 
Thanks, -Carrie

carrie (29 Aug 2014)

Srini,
 
My apologies, but I did not get notified of your comment.
 
The asterisk in the first line of the HELP STATS output represents the row count of the table.  It comes from the SUMMARY statistics that are collected for every table at the time any column or index stat is collected or recollected.  So you can think of it as representing the SUMMARY statistics for that table.  You will see it change as SUMMARY stats are refreshed.   It does not represent anything you have explicitly asked for.   But if you have even one stat collected on the table, you will see that row in HELP STATS.
 
Thanks, -Carrie

02 Sep 2014

Ciao Carrie,
I have a very simple and quick question regarding STATS and CURRENT STATS.
I create a table T and define some stats without inserting any rows.
If I run
     HELP STATS ON T
I get the right estimation: 0 rows for the summary ('*') row and for the columns where I defined the stats.
If I run
     HELP CURRENT STATS ON T
I get 1 row for each stat.
Why?
I'm asking this because, in a query with a lot of joins, the plan built by the optimizer can be very different with an estimation of 0 rows instead of 1 row!
Thanks & regards,
Pietro.

Pietro Nardella
pietro.nardella@teradata.com
Teradata Italia

carrie (02 Sep 2014)

Hi Pietro,
 
Even when a table has no rows at the time you collect stats, the optimizer will always adjust that up to one row before it attempts to build the query plan.   The cost calculations used by the optimizer include many formulas.  Some of these formulas include division or multiplication by row count, so zero as a row count could be problematic.
 
You can see this adjustment in your HELP CURRENT STATS output, which displays the extrapolated row count.  The optimizer always uses what it extrapolates.
 
To validate this, run an explain of a SELECT * on an empty table and you will see the plan specifying that one row will be returned.
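For example, with T still empty:

EXPLAIN SELECT * FROM T;
-- the final step of the plan reports an estimate of 1 row, not 0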
 
Thanks, -Carrie
 

07 Oct 2014

Hi Carrie
I have SQL Assistant v.15.00 and I cannot drop STAT.
Can you help me with the syntax for dropping STAT?

CREATE MULTISET TABLE T1 (val INTEGER);
COLLECT STATISTICS T1 COLUMN val;
DROP STATISTICS T1;

then the line DROP STATISTICS T1; fails with the error "Query is invalid".
The code works fine when I use SQL Assistant 13.11.
Peter Schwennesen

carrie (08 Oct 2014)

Hi Peter,
 
I tried your syntax on a pre-14.0 version of SQLA and it worked fine for me.  
 
Unfortunately, I'm not going to be able to help you with SQLA behavior or release changes.   You will need to find someone with client side background or experience with SQLA for that. 
 
I suggest that you post your question on the Teradata Forum, or see if you can find any blog postings here in Dev X related to client functionality.
 
Best regards, -Carrie

09 Oct 2014

I already did. Because there was new syntax, I thought there was also a new one for drop, but I now know that there is not, and that it is a bug in SQL Assistant version 15.00.00.00; it was fixed in version 15.00.00.03.
But this version is not to be found under downloads.
br
Peter Schwennesen

30 Oct 2014

Hi Carrie,
In the 14.0 orange book, section 2.1.5 Column Ordering for Multicolumn Stats, it says that column ordering is honored with the exception of the statistics collected with Index specification.
Would you please explain...  What order is used for indexes?

carrie (31 Oct 2014)

Nazy,
 
If the INDEX specification is included in the multicolumn collect stats statement, the columns will be reordered to match the index.  This is because the index columns are already ordered in a certain way, and stats that are being collected on it need to use column ordering that is the same as the index itself, or the statistics will be less useful for the index-oriented optimizations.  But if you collect using the COLUMN specification on the same columns, you can order them however you like.
 
Here is a simple example: 
 
Create table t1(x1 int, y1 int, d1 date);
Create index (x1, d1) on t1;
--COLLECT STATS with INDEX spec. Columns get reordered to match the index on these two columns.
 
collect stats index (d1, x1) on t1;
help stats t1;
 
*** Help information returned. 2 rows.
*** Total elapsed time was 1 second.
 
Date     Time     Unique Values        Column Names
-------- -------- -------------------- ------------
14/10/30 11:59:07                    1      *
14/10/30 11:59:07                    1      x1,d1
 
--Stats are collected with COLUMN spec. Column ordering is maintained.
 
collect stats column (d1, x1) on t1;
help stats t1;
 
*** Help information returned. 2 rows.
*** Total elapsed time was 1 second.
 
Date     Time     Unique Values        Column Names
-------- -------- -------------------- ------------------------------------
14/10/30 12:00:05                    1      *
14/10/30 12:00:05                    1      d1,x1
 
Even though you have asked for COLUMN statistics, if they are on the same columns as the index uses, but just in a different order, they will in many cases get used on behalf of the index even though they are not explicitly INDEX stats.  But there are cases where the COLUMN-specified stats would not be useful when considering the index, such as if you have truncation of the combined values in the histogram and the optimizer needs the full values to determine whether or not to use an index.
 
Thanks, -Carrie

HateOra (31 Oct 2014)

Hi Carrie,
I have a MERGE query which runs 2-3 times a day on a PPI table. Suddenly yesterday the query ran for more than 5 hours, where the average execution time for the query is 15 minutes.
When I compared the latest EXPLAIN plan with the older one, where the query completed in 15 minutes, I could see that the optimizer chose a different plan for each run.
 
Could you please help me with this so that I can nail down this problem and avoid this situation in the future?
 
Thanks

04 Nov 2014

Thanks Carrie. That was very helpful...
I have another question.
If I understand correctly, if we drop stats on a table, we lose history.
If we drop a table, we will also lose history.
So, when we modify an existing table in a way that requires dropping and recreating it (redo compression, redo partitioning, change the PI, etc.), for the most part the stats are the same as before; maybe we add a couple of column stats if columns were added, for example.
So, is there a way to still keep the history on stats, since it's still the same table?
The orange book says to export stats by using SHOW STATISTICS VALUES and then import.
How do you import stats back to the new table?
Thanks
Nazy

carrie (05 Nov 2014)

Response to HateOra:
 
From a statistics point of view, make sure you have followed all the recommendations for PPI table statistics that are provided in this blog posting:
 
http://developer.teradata.com/blog/carrie/2013/02/statistics-collection-recommendations-teradata-14-0
 
 
Unfortunately, I am not in a position where I can debug your query, but if you post your question on Teradata Forum, and include the CREATE TABLE, query syntax, and the explains before and after, there might be someone there with strong query tuning experience that will look at your detail and come up with suggestions.  
 
A different plan may or may not be responsible for the slower execution time.  Other things to check include:
- Did the query selection criteria require reading more data?
- Had the table and/or partition(s) increased in row count?
- Was the system heavily-loaded at the time the longer-running query executed?
- Were any system resources exhausted at the time of execution (such as AMP worker tasks or CPU)?
- Was the bad query executing at a lower priority?
- Was the query delayed by a throttle?
 
Thanks, -Carrie
 

carrie (05 Nov 2014)

Response to Nazy:
 
You can export stats and then, after you have inserted all rows into the new table, import them back.    When you issue this command:
 
SHOW STATISTICS VALUES ON table-name;
 
You get COLLECT STATISTICS statements that include a VALUES clause, which causes all the values and histogram detail and history records that have been collected for each statistic in that table to be produced in the output.
 
When you submit those COLLECT STATISTICS commands in a script (they are just regular SQL) after the new table has been created (assuming the new table is named the same as the table you exported from, or that you have modified the table name in the COLLECT STATISTICS commands), it will copy in that histogram and history detail just as if you had manually collected the statistics yourself, one by one.
 
Using the export (which creates the SQL to repopulate statistics after the table has been recreated) followed by an import (which can be submitted as regular SQL) you will preserve the histograms and the history records.
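Put together, the round trip looks roughly like this (a sketch; the table names are hypothetical, and the VALUES body stands in for the generated histogram detail):

SHOW STATISTICS VALUES ON table_A;
-- in the saved output, change table_A to table_B, then submit the
-- generated statements, which have this general shape:
COLLECT STATISTICS COLUMN ( c1 ) ON table_B VALUES ( /* histogram detail */ );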
 
Thanks, -Carrie

13 Nov 2014

Hi Carrie,
I am not sure I quite understand the Syssampleoption in DBSControl.
When it's set to 0 or 1 (default), does it mean all stats collections are sampled based on the optimizer deciding on the percent, or does it mean that only when we specify a sample percent will all recollections use that percent without having to specify it?

13 Nov 2014

Asking because I see this:
 
 
show statistics column (orig_sys_id) on NA_EDW.WHSE_INV_LOC
COLLECT STATISTICS
                   -- default SYSTEM SAMPLE PERCENT
                   -- default SYSTEM THRESHOLD PERCENT
            COLUMN ( ORIG_SYS_ID )
                ON NA_EDW.WHSE_INV_LOC ;

carrie (17 Nov 2014)

Hi Nazy,
 
The SysSampleOption is to enable or disable system-determined sample percent globally.  When it is enabled, the system decides whether or not to use sampling, and at what percent.
 
When an explicit sample percent is specified at the COLLECT STATS statement level, it overrides the system-determined sampling decisions and uses the specified sample percent for all recollections.
 
In your second comment, those clauses (default SYSTEM SAMPLE PERCENT, for example) indicate that global defaults have been set for SYSTEM SAMPLE and SYSTEM THRESHOLD defined in DBSControl.  What you show above in the SHOW STATS output indicates that this collect statistics statement is eligible for system-determined sample and also system-determined thresholds.
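An explicit override at the statement level would look something like this (a sketch; the 5 percent figure is arbitrary):

COLLECT STATISTICS 
        USING SAMPLE 5 PERCENT 
        COLUMN ( ORIG_SYS_ID ) 
ON NA_EDW.WHSE_INV_LOC;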
 
Thanks, -Carrie

SudhakarP (22 Dec 2014)

Hi Carrie,
Thanks for introducing me to the world of collect stats in TD 14.10.
With the TD 14.1 recommendations on collect stats, I could see savings of up to 33% in TotalIOCount, 47% in AMPCPUTime, 2% in SpoolUsage, and 38% in ReqIOKB. However, ReqPhysIO and ReqPhysIOKB showed drops in performance of up to -790% and -977% respectively.
Our box is an EDW 67xx series. Is this acceptable performance for this configuration? Can I recommend that my team use the new way of collecting stats (bundling the columns in a single collect)? Please advise.
 

carrie (02 Jan 2015)

Sudhakar,
 
Different statistics collections on different tables will behave differently when you move to either 14.0 or 14.10 from an earlier release.    I am glad to hear that you are seeing resource savings in some areas.  Under varying circumstances it is possible to experience greater resource requirements for certain stats collections on the newer releases, but I would expect this to be counterbalanced by savings elsewhere, with other stats.
 
It is advisable to recollect at the table level so the optimizer can attempt to bundle the stats collections for the entire table.
 
If you detect any resource usage increase with stats collections that is greater than 10%, I'd suggest you open an incident with the support center.
 
Thanks, -Carrie

gonela.ram (15 Jan 2015)

I read the guidelines below from Developer Exchange earlier (for 13.10). Wondering whether these guidelines are still good with the 14.10 version too, or has anything changed?

Guidelines for Stats Collection
For a partitioned table, it is recommended that you always collect statistics on:
- PARTITION: This tells the optimizer how many row partitions are empty, a histogram of how many rows are in each row partition, and the compression ratio for column partitions. This statistic is used for optimizer costing.
- Any partitioning columns. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
- (PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
- (PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.

carrie (15 Jan 2015)

There is a posting here on Dev X that addresses 14.10 statistics recommendations.  The information contained there should answer your question:
 
http://developer.teradata.com/blog/carrie/2014/09/statistics-collection-recommendations-teradata-database-14-10
 
Thanks, -Carrie

24 Mar 2015

Hello,
I just heard from someone that in TD 14.10 stats can be collected on only the NEW partition of a table.
And they are not talking about collecting stats on column PARTITION.
 
Can we collect stats on a specific partition of a table?
 

carrie (27 Mar 2015)

Hi Nazy,
 
Statistics are always collected on the entire table.  There is no capability in 14.10 to just collect stats on a single partition.  That capability may be added at some future time, but I cannot tell you when that might be.
 
Thanks, -Carrie

Sandeepyadav (14 May 2015)

Hi Carrie,
I have some doubts, please help:
1. If a table is accessed only by its PI, do we still need to collect stats on the table?
2. If a table is well distributed (zero skewness) and is accessed by PI or non-PI columns, do we need to collect stats on this table? Will random AMP sampling help in this case the way stats do?
3. Is the process by which the optimizer searches for statistics in Teradata 14/14.10 the same as in Teradata 12? If anything differs, please help on this.

Thanks, -Sandeep.

carrie (19 May 2015)

Hi Sandeep,
 
Here are responses to your questions:
 
1.  Whether or not you will need to collect stats depends on whether it is a UPI or a NUPI.  For example, a UPI does not generally require statistics, unless it is used in range constraints by the queries.  There are recommendations on stats collection for both UPIs and NUPIs on the Dev X blog posting titled: 
 
Statistics Collection Recommendations – Teradata Database 14.10
 
2.  Random AMP sampling only delivers an estimate on the table's row count and is not a substitute for stats collected on a column or an index that belongs to that table.   The same blog posting mentioned above has guidelines for when column-level statistics are recommended.   A column may have an uneven distribution of values across the rows in a table, even if the table's rows themselves are spread evenly across AMPs.   Zero skewness at the table level is not the same thing as zero skewness for a statistic's values.
 
3.  The process the optimizer uses for searching for statistics is the same today as it was for 12.0. However, there are many new features added after 12.0 which may look for more stats or different kinds of stats.
 
Thanks, -Carrie

jcsheu (23 Jul 2015)

I believe I was told that I should collect statistics on both join columns and predicate columns.  But I am not sure if I should collect statistics on the combined columns, or collect one statistic for the join columns and another for the predicate.  Also, does column order matter?  For the example below, which COLLECT STATISTICS statement should I use?

SELECT     a.C1, b.C2
FROM       tableA a
INNER JOIN tableB b ON a.JC1 = b.JC1 AND a.JC2 = b.JC2
WHERE      a.C3 BETWEEN 1 AND 100;

(1) COLLECT STATISTICS COLUMN(JC1,JC2), COLUMN(C3) ON tableA;
(2) COLLECT STATISTICS COLUMN(JC1,JC2,C3) ON tableA;
(3) COLLECT STATISTICS COLUMN(C3,JC1,JC2) ON tableA;

 
 

carrie (27 Jul 2015)

 
The following recommendations for multicolumn statistics are made in my blog posting on statistics collection recommendations for 14.10:
 
Collect Multicolumn Statistics:
- Groups of columns that often appear together with equality predicates. These statistics are used for single-table estimates.
- Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them. With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more the combination of actual values is correlated, the greater the value of collecting multicolumn statistics in this situation.
- Specify a name for such statistics, for ease of recollection, viewing, and/or dropping.
 
In the absence of column correlation information, the optimizer assumes that columns in a multicolumn statistic are completely independent.  The optimizer assumes that the number of unique values of the join columns doesn’t get reduced after applying the single table predicates.  Further, it assumes that the single table predicates remove the rows evenly from every value group of the join column until there is one row per value. 
 
If the single-table predicate columns and join columns have a high degree of correlation, and the join selectivity estimate by itself is not accurate, then you want to collect base-table multicolumn stats with the single-table predicate columns as leading columns, followed by the join columns.
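Applied to the earlier example, that guideline points to option (3), with predicate column C3 leading and the join columns following:

COLLECT STATISTICS COLUMN(C3,JC1,JC2) ON tableA;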
 
Thanks, -Carrie

super25 (27 Oct 2015)

Hi Carrie, I believe summary statistics are not allowed on volatile tables, yet they are still recommended when you do HELP STATS ON, and collecting them throws the error "COLLECT failed. 3706: summary option is not allowed on volatile tables". I am on TD14.
Does it do any good if stats are collected on volatile tables, given what the engine recommends and allows?

carrie (28 Oct 2015)

Statistics on volatile tables are not saved in DBC.StatsTbl, as other statistics on base tables are.  Volatile table stats are saved in memory for the given session.
 
Currently, there is no slot in memory for SUMMARY stats. By nature, volatile tables are specific to a session and get deleted when the session is logged off. So the assumption made by developers is that volatile table statistics don’t need any history records, extrapolations, etc.   Hence the restriction.
 
In fact, the optimizer always does an all-AMP random AMP sampling for volatile tables.  This makes explicitly collecting SUMMARY stats less important.
 
Thanks, -Carrie

saif (28 Dec 2015)

Hi Carrie,
I have a question.
Does creating new stats on a modified table, versus copying the stats from the existing table, make any difference from a time perspective?
The new stats I am creating are exactly the same as on the old table.
Actually I prefer to create new stats when moving a modified table to a new database, but if copying the stats from the existing table takes less time than collecting them, then it's better for me to copy. Can you please clarify my doubt about this?
 

carrie (04 Jan 2016)

If you wish to copy the statistics that already exist on table_A to a new table that is identical to table_A, call it table_B, the fastest and easiest way to do this is to export the stats from table_A and import them to table_B.   There is no collection overhead when you do this, just the statistics histograms being copied from Table_A and associated to Table_B.   
 
Before you do this, you need to be sure that the tables are identical, with the same number of rows and values, etc. 
 
Issue this command:   SHOW STATISTICS VALUES ON table_A
 
Save the output to a text file, change the name "table_A" to "table_B" in all of the collect stats statements in the output, then submit the SQL statements.    This will copy all the histograms and their values onto table_B.
 
Try it on a small test table first to get familiar with how it works.
 
Thanks, -Carrie

firstj (14 Jan 2016)

Carrie
Just wanted to check whether there is a need to use the keyword TEMPORARY when collecting stats on GLOBAL TEMPORARY tables from V14.0 onwards?
 
COLLECT STATISTICS
        INDEX ( COLUMN1, COLUMN2 )
        ,COLUMN (COLUMN3)
     ON $DATABASEVARIABLE.GT_TABLENAME       
;
 
Thanks
 

carrie (19 Jan 2016)

If you want to collect statistics on the instance of a global temp table within your session, you would need to include the word TEMPORARY.  If you do not include TEMPORARY, then stats will be collected on the GTT definition that is in the data dictionary (which has no rows).  That will have no impact on your session's copy of the GTT.   If you collect on several columns on the GTT definition by omitting the word TEMPORARY, then later, when you have a session with an instance of that GTT you can collect stats by the GTT table name (using TEMPORARY) and all of the stats associated with the GTT definition will be included in the table level collection for that instance.
 
So if you only want to collect stats on your session's instance of the GTT, include the word TEMPORARY.   But if you want to set up a template for stats collection on multiple columns of the GTT to be available for all instances of this GTT, then omit the word TEMPORARY.  After collecting the column stats against the dictionary definition, then when a session is using an instance of the GTT you can collect stats (with TEMPORARY) at the table level and all those predefined stats will be collected for your session.
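So, to collect on the session's instance, the earlier statement would name the table with TEMPORARY (a sketch of the keyword placement):

COLLECT STATISTICS
        INDEX ( COLUMN1, COLUMN2 )
        ,COLUMN (COLUMN3)
     ON TEMPORARY $DATABASEVARIABLE.GT_TABLENAME
;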
 
Thanks, -Carrie

firstj (25 Jan 2016)

Thanks Carrie, very much appreciated

firstj (25 Jan 2016)

Carrie, one more question - would the same rules apply to volatile tables, and would the syntax be:
COLLECT STATISTICS
        INDEX ( COLUMN1, COLUMN2 )
        ,COLUMN (COLUMN3)
     ON $DATABASEVARIABLE.VT_TABLENAME       
;
Thanks

carrie (28 Jan 2016)

The rules that apply to a global temp table in terms of stats collection do not apply to volatile tables.   Global temp tables have a definition in the dictionary, but volatile tables do not.  Volatile tables only exist within one session; they do not have a global version.  Stats for VTTs are held only in the memory of the session, not in a DBC table.
 
Your syntax should work.   Stats collection success on a VTT can be validated by running an explain that uses the column that has stats for selection and checking the confidence level and the estimated row counts.
 
Thanks, -Carrie

jman0670 (10 May 2016)

Carrie,

When collecting stats in the following format, it was my understanding that the table is read once; similar to what I have been told about just collecting at the table level.
COLLECT STATS
COLUMN(COL_A),
COLUMN(COL_B),
COLUMN(COL_C),
COLUMN(PARTITION)
ON <databasename>.<tablename>;
However, when reading the explain, it appears that it is doing steps similar to the following for each individual column:
We do a SUM step to aggregate from <database>.table by way of an all-rows scan.
Aggregate Intermediate Results are computed globally, then placed in Spool 35472.
Then we save the UPDATED STATISTICS from Spool 35472 (Last Use) into Spool 35474, which is built locally on a single AMP derived from the hash of the table id.
Questions:
1.  Is my understanding incorrect? 
2.  Will Teradata only do one pass when collecting at the table level only?
Thanks,
Mark

carrie (12 May 2016)

Hi Mark,
 
The optimizer will ATTEMPT to collect multiple stats in one pass of the table when you use coding like yours, but often it decides there is no cost savings in doing so and will revert to individual table scans/aggregations.  The cases where it tends to bundle up multiple collections in a single table scan are when there are collections on multicolumn stats as well as collections on the component columns that make up the multicolumn stat.  The other case where the optimizer frequently bundles up collections is when the columns have a low number of distinct values.
 
The optimizer has to be able to re-use the spool file it creates in the single pass to make the bundling a viable choice.  For that reason, columns with higher numbers of distinct values that have no relationship one to another almost always involve their own full table scan.  
 
It works the same whether you are using that syntax or recollecting at the table level.  If the optimizer can detect a benefit in bundling multiple collect statements within a table, or a subset of them, it will.  Otherwise it won't.
 
Thanks, -Carrie
 

27 Jun 2016

Hi carrie,
Is there a benefit in collecting stats on the ROWID column in a join index?
For the first time today, we saw a recommendation from Stats Manager to collect stats on the ROWID column of a join index.  We have never collected stats on ROWID before.
The stats I collected on ROWID have the same number of unique values as the summary stats.
 
 

carrie (27 Jun 2016)

Hi Nazy,
 
I can't think of any reason there would be a value in collecting stats on rowID.   RowIDs are often used in single-table join indexes as a means for the database code to get from the join index row directly to the unique base table row, such as:
 
CREATE JOIN INDEX  PriceJI   AS
SELECT  o_price, o_orderkey , ordertbl.ROWID
FROM  ordertbl
PRIMARY INDEX ( o_price );
 
Statistics will not add any performance advantage for a query that uses the above join index or one similar to it.   There might be some relevance in stats collection if the rowID value was passed in the query, but that is not something we usually recommend doing.  And even if that happened, since RowID directly accesses a single row, similar to a unique primary index, collected statistics are never required because they would not provide any useful information to the optimizer.
 
Thanks, -Carrie 

06 Sep 2016

Thanks For the clarification on ROWID......
 
I have another question.  In the original blog you have the explanation below.  Could you please expand on this?  I am not sure I quite understand where that 32 bytes comes from.  In order for the optimizer to use the first 32 bytes of each column in the multicolumn stats, do we need the USING MAXVALUELENGTH to be > 32 for as many columns as are used in the multicolumn stats?
 
ex: tableA has stats on 4 columns (I am exaggerating these column lengths so I can make my question more clear - I hope!):
A is 10 bytes
B is 35 bytes
C is 42 bytes
D is 50 bytes
 
collect stats COLUMN (A,B,C,D) on TableA;     do we get the real number of unique values here, since B is > 25 and therefore the whole length is > 25?
collect stats USING MAXVALUELENGTH 70 COLUMN (A,B,C,D) on TableA;     does this give the correct number, since we only have 70 characters (instead of 137)?
collect statistics USING MAXVALUELENGTH 140 COLUMN (A,B,C,D) on TableA;     this should work since we cover the full length... but does it only look at the first 32 bytes of columns B, C & D?
 
.....  In addition, during the aggregation process that builds up the statistical histogram, each column within the statistics will be able to have its first 32 bytes represented when determining the number of distinct values, ensuring that a significant number of bytes in each column can contribute to understanding how distinct the combination of column values is.......
 
 

carrie (07 Sep 2016)

Hi Nazy,
 
The 32 bytes I mentioned is not directly related to MAXVALUELENGTH.   MAXVALUELENGTH only applies to the number of characters that are stored in the histogram intervals value fields at the time those values are moved into the histogram intervals.     
 
The 32 bytes is used during the aggregation process to determine the number of distinct values and the interval buckets.  The stats aggregation process uses the first 32 bytes of each column in the multicolumn stat when it performs its aggregation.  This happens automatically, and you don't need to set anything up to make it happen.  Also, you cannot increase or decrease that value.
 
After the aggregation process is complete, MAXVALUELENGTH is used to determine how many characters should be moved into the value fields in the histogram intervals.   The default is 25.
 
So you do not need to change MAXVALUELENGTH to be 32 bytes for each column included.   
 
For your four columns (A, B, C, D), the number of unique values for the entire statistic will be calculated by taking the first 32 bytes from the beginning of each column and combining them.   Depending on your demographics, the number of unique values that results should be pretty close to accurate.
 
With no MAXVALUELENGTH specified, the histogram value columns will contain all of column A (10 bytes) plus part of column B (the first 15 bytes) and nothing from column C or D.  The greater you make the MAXVALUELENGTH, the more complete the histogram interval information will be.   And as you say, specifying MAXVALUELENGTH 140 will cover all four columns completely.
 
However, increasing MAXVALUELENGTH too greatly will tend to increase the size of the histogram.   Usually the optimizer will reduce the number of intervals or the number of biased values to compensate, so the entire histogram doesn't grow too large.   So it's best to only use a MAXVALUELENGTH up to the size that you actually need for differentiation.
 
Thanks, -Carrie

07 Sep 2016
Hi Carrie,  Thanks for the explanation... 
We sometimes have tables with several multicolumn stats having the same number
 of unique values.

In this example:

collect stats column (A,B)     on tableA;
collect stats column (A,B,C)   on tableA;
collect stats column (A,B,C,D) on tableA;

are collected, because any of those combinations could be used in a
join or filter....

All three collections give the same number of unique values.
Even when I increase the MAXVALUELENGTH to the max length of 140,
I still see the same number of unique values for all 3...
And columns C & D do not have constant values...
So that's kind of strange to me...

Should we still collect all 3 sets?

carrie (08 Sep 2016)

Hi Nazy,
 
If you don't collect stats for each separate combination, then there will be no histogram where the optimizer can see the number of uniques and the interval values.  So if you are using those different combinations of columns for different joins, it would be advisable to collect stats on them.
 
However, since there are overlaps in the columns represented, the optimizer should choose to scan the table one time for the collection of all three stats.  It will first collect stats on the broadest set (A,B,C,D) and use the spool from that collection for the (A,B,C) and (A,B) stats.   But for that to happen you will need to bundle all three statistics in the same collect statistics statement.
 
Thanks, -Carrie
