# New opportunities for statistics collection in Teradata 14.0

Teradata 14.0 offers some very helpful enhancements to the statistics collection process. This posting discusses a few of the key ones, with an explanation of how these enhancements can be used to streamline your statistics collection process and help your statistics be more effective.

For more detail on these and other statistics collection enhancements, please read the orange book titled *Teradata 14.0 Statistics Enhancements*, authored by Rama Korlapati, Teradata Labs.

### New USING options add greater flexibility

In Teradata 14.0 you may optionally specify a `USING`

clause within the collect statistics statement. As an example, here are the 3 new `USING`

options that are available in 14.0 with parameters you might use:

. . . USING MAXINTERVALS 300 . . . USING MAXVALUELENGTH 50 . . . USING SAMPLE 10 PERCENT

`MAXINTERVALS`

allows you to increase or decrease the number of intervals one statistic at a time in the new version 5 statistics histogram. The default maximum number of intervals is 250. The valid range is 0 to 500. A larger number of intervals can be useful if you have widespread skew on a column or index you are collecting statistics on, and you want more individiual high-row-count values to be represented in the histogram. Each statistics interval highlights its single most popular value, which is designates as its “mode value” and lists the number of rows that carry that value. By increasing the number of intervals, you will be providing the optimizer an accurate row count for a greater number of popular values.

`MAXVALUELENGTH`

lets you expand the length of the values contained in the histogram for that statistic. The new default length is 25 bytes, when previously it was 16. If needed, you can specify well over 1000 bytes for a maximum value length. No padding is done to the values in the histogram, so only values that actually need that length will incur the space (which is why the parameter is named `MAXVALUELENGTH`

instead of `VALUELENGTH`

). The 16-byte limit on value sizes in earlier releases was always padded to full size. Even if you statistics value was one character, you used the full 16 bytes to represent it.

Another improvement around value lengths has to do with multicolumn statistics. In earlier releases the 16 byte limit for values in the intervals was taken from the beginning of the combined value string, and truncation took place at the end. Truncation will still take place at the end of the combined string in 14.0, however in this more current release you can support a much longer combined value. In addition, during the aggregation process that builds up the statistical histogram, each column within the statistic will be able to have its its first 32 bytes represented when determining the number of distinct values, ensuring that a significant number of bytes in each column can contribute to understanding how distinct the combination of column values are.

`SAMPLE n PERCENT`

allows you to specify sampling at the individual statistics collection level, rather than at the system level. This allows you to easily apply different levels of statistics sampling to different columns and indexes.

Here's an example of how this `USING`

syntax might look:

COLLECT STATISTICS USING MAXVALUELENGTH 50 COLUMN ( P_NAME ) ON CAB.product;

### Combining multiple collections in one statement

Statistic collection statements for the same table that share the same `USING`

options, and that request full statistics (as opposed to sampled), can now be grouped syntactically. In fact it is recommended that once you get on 14.0 that you collect all such statistics on a table as one group. The optimizer will then look for opportunities to overlap the collections, wherever possible, reducing the time to perform the statistics collection and the resources it uses.

Here is an example

**The old way:**

COLLECT STATISTICS COLUMN (o_orderdatetime,o_orderID) ON Orders; COLLECT STATISTICS COLUMN (o_orderdatetime) ON Orders; COLLECT STATISTICS COLUMN (o_orderID) ON Orders;

**The new, recommended way: **

COLLECT STATISTICS COLUMN (o_orderdatetime,o_orderID) , COLUMN (o_orderdatetime) , COLUMN (o_orderID) ON Orders;

This is particularly useful when the same column appears in single and also multicolumn statistics, as in the example above. In those cases the optimizer will perform the most inclusive collection first (`o_orderdatetime,o_orderID`

), and then re-use the spool built for that step to derive the statistics for the other two columns. Only a single table scan is required, instead of 3 table scans using the old approach.

Sometimes the optimizer will choose to perform separate collections (scans of the table) the first time it sees a set of bundled statistics. But based on demographics it has available from the first collection, it may come to understand that it can group future collections and use pre-aggregation and rollup enhancements to satisfy them all in one scan.

But you have to remember to re-code your statistics collection statements when you get on 14.0 in order to experience this savings.

### Summary statistics

New in Teradata 14.0, table-level statistics known as “summary statistics” are collected alongside of the column or index statistics you request. Summary statistics do not cause their own histogram to be built, but rather they create a short listing of facts about the table undergoing collection that are held in the new `DBC.StatsTbl`

. It is a very fast operation. Summary stats report on things such as the table’s row count, average block size, and some metrics around block level compression and (in the future) temperature. An example of actual execution times that I ran are shown below, comparing regular column statistics collection against summary statistics collection for the same large table. Time is reported in `MM:SS`

:

COLLECT STATISTICS ON Items COLUMN I_ProductID; Elapsed time (mm:ss): 9:55

COLLECT SUMMARY STATISTICS ON Items; Elapsed time (mm:ss): 00:01

You can request summary statistics for a table, but even if you never do that, each individual statistics collection statement causes summary stats to be gathered. For this reason, it is recommended that you group your statistics collections against the same table into one statement, in order to avoid even the small overhead involved in building summary stats repetively for the same table within the same script.

There are several benefits in having summary statistics. One critical advantage is that the optimizer now uses summary stats to get the most up-to-date row count from the table in order to provide more accurate extrapolations. It no longer needs to depend on primary index or `PARTITION`

stats, as was the case in earlier releases, to perform good extrapolations when it finds statistics on a table to be stale.

Here’s an example of what the most recent summary statistic for the Items table looks like:

SHOW SUMMARY STATISTICS VALUES ON Items; COLLECT SUMMARY STATISTICS ON CAB.Items VALUES ( /** TableLevelSummary **/ /* Version */ 5, /* NumOfRecords */ 50, /* Reserved1 */ 0.000000, /* Reserved2 */ 0.000000, /* SummaryRecord[1] */ /* Temperature */ 0, /* TimeStamp */ TIMESTAMP '2011-12-29 13:30:46', /* NumOfAMPs */ 160, /* OneAMPSampleEst */ 5761783680, /* AllAMPSampleEst */ 5759927040, /* RowCount */ 5759985050, /* DelRowCount */ 0, /* PhyRowCount */ 5759927040, /* AvgRowsPerBlock */ 81921.871617, /* AvgBlockSize */ 65024.000000, /* BLCPctCompressed */ 0.00, /* BLCBlkUcpuCost */ 0.000000, /* BLCBlkURatio */ 0.000000, /* RowSizeSampleEst */ 148.000000, /* Reserved2 */ 0.000000, /* Reserved3 */ 0.000000, /* Reserved4 */ 0.000000 );

As always, thank you for bringing new updates to us.

Can you please clarify

1) Will existing table level collect stats statements utilize multiple stats in one scan when possible?

2) Is collecting stats at table level still the best practice? That is, if only collecting stats that are known to have changed, will that affect optimizer's extrapolation or other ability to choose best plan?

3) I wasn't clear on SYSTEM MAXVALUELENGTH option. Does SYSTEM option refer to default value of 25 or the calculated max width of concatenated columns?

4) With many cost based optimizationns applied to COLLECT STATS statement, does INDEX specification have any advantage over COLUMN specification for COLLECT STATS?

5) You mentioned that summary level stats are used for extrapolation. Does TD14 still use Dynamic AMP Sampling for extrapolation? Under what circumstances will the optimizer prefer to use Dynamic over Summary stats?

6) Also, is there now an ability to determine when stats on a table need to be refreshed? It would be nice, for example, if TD can conditionally collect full stats if % data change exceeds given threshold or collect summary level stats if not.

Thanks

Padhia, here are responses to your questions:

1. First, you have to re-issue statistics collections individually when you get on 14.0 using the grouped (multiple collections in one statement) format. Only statistics that share the same USING options can be grouped. After collecting statistics using this new syntax, then you can recollect at the table level and where possible, multiple stats will be collected in a single scan.

2. If you only collect on statistics that have changed, you may not benefit from the grouping capability in 14.0 to the same degree as if you collected all statistics. However, if the demographics of a statistic have not changed, there will be no advantage to a recollection in terms of extrapolation potential or anything else related to the optimizer.

3. The USING SYSTEM MAXVALUELENGTH option refers to the default of 25 bytes. This could be useful if you previously collected with a max value length of 40, for example, and you wanted to revert back to the system default.

4. The optimizer considers all access paths in 14.0, including index access, in order to minimize the read cost during a collect statistics operation.

5. First the optimizer determines the current table row count based on the table-level summary stats and random AMP sample (RAS). The current row count is computed as summary stats row count + delta row difference from saved RAS and current RAS. Once the current table row count is determined, it is compared with the histogram row count to detect stale histograms. Only if stale stats are detected will extrapolation be attempted.

6. In 14.0 there is no such capabiity. More enhancements to statistics capabilities will be available in future releases.

Thanks, -Carrie

Thanks! it is veryhelpful information.

Few questions still around and have a doubt on this

There is always question while collecting a stat which one is good or best cost optimization

*) COLLECT STATS statement do i need to use INDEX COLUMN NAME OR COLUMN COLUMN NAME

*) COLLECT STAT ON TABLE COLUMN PARTITION; or COLLECT STAT ON TABLE COLUMN partiton column name;

*) How to identify "STALE STATs"

*) One tabel we have 10 collected STAT if i want to delete only selected 2 ..is it possible?

*) Can you elaborate on Intervals ..why it is required for STAT

Regards,

Sam

Hi Sam,

I have tried to respond to your questions below:

Q) COLLECT STATS statement do i need to use INDEX COLUMN NAME OR COLUMN COLUMN NAME

A) You can use either INDEX OR COLUMN syntax when you collect statistics, assuming an index exists. If it is a single column index, the stats will remain, even if you drop the index.

However, if it is a multicolumn index, and the index pre-exists the stats collection, the stats will be lost if you drop the index. This is true whether you use INDEX or COLUMN. So it doesn't matter which form you use in this case either.

But if the stats are collected BEFORE the index is built, and COLUMN syntax was used for the stats collection, then if you create the index and some time later drop the index, the stats will remain (assuming you have continued to use the COLUMN syntax to do any recollections). This is the only difference I know of between the two syntax options.

Q) COLLECT STAT ON TABLE COLUMN PARTITION; or COLLECT STAT ON TABLE COLUMN partiton column name;

Collecting stats on PARTITION, if the table is partitioned, tells the optimizer how many rows are in each partition. It is very fast because it reads the cylinder indexes, rather than the data tables. Collecting on the partitioning column, on the other hand, provides estimates to the optimizer when the partitioning column is part of the query selection criteria, just like any other selection column. Base table is data is accessed. If a table is partitioned, always collect both. Read the blog posting on Statistics Collection Recommendations for more on this topic.

Q) How to identify "STALE STATs"

A) If you want to know if your stats are stale, one way is to collect stats on PARTITION for the table (which will give you the current row count), and then compare the row counts within the PARTITION histogram against the row counts in the older stats histogram for the statistic that you are concerned about (which will give you the row count at the time of the last statistics collection for that stat). If they are off by 10% or more they are usually considered stale, and should be recollected.

Q) One tabel we have 10 collected STAT if i want to delete only selected 2 ..is it possible?

A) You can drop any statistics you wish, by issuing the DROP STATISTICS command.

Q) Can you elaborate on Intervals ..why it is required for STAT

A) When you collect statistics a histogram will be produced. Statistics histograms are made up of intervals. When a query executes that uses that statistics, the histogram and its intervals are used to get demographic information about the column. You can read more about histograms and intervals in the Interval Histogram section of the SQL Request and Transaction Processing manual, available at www.info.teradata.com.

Thanks, -Carrie

Hi,

Can you plz explain about DBQL and PDCR?How DBQLl and PDCR are useful for performance tuning?what are the DBQL tables we are using mostly?How to access DBQL Tables with queries?

Thnaks&Regards,

Anil Kumar

can you plz write shell script for collecting statistics automatically?

Thanks&Regards,

Anil Kumar

Anil,

PDCR is a performance and capacity reporting offering from Teradata Professional Services that I believe is offered as part of a service. You should contact your Teradata Account Team or your PDCR PS representative to find out more about this service. This offering creates summary tables of from various Teradata monitoring tables that then allow custom performance and capacity planning reports to be generated. I believe some of the reports (and there are dozens of them) access historical DBQL data. You could also check for questions and answers on PDCR on the Teradata Forums, or elsewhere in Developer Exchange.

Thanks, -Carrie

Anil,

Sorry, but I do not have such a script. You can post the question on one of the forums and see if anyone else does. You can also look at mtmoura's blog on Developer Exchange, he has an approach to automating statistics collection that he describes there.

Thanks, -Carrie

Q ) Is there a plan to build an automatic COLLECT STATS for tables. At least for some critical tables in the database ?

can anybody tell us that how to make teradata DB to send its logs ( audit logs ) to a syslog system like RSA envision .

Response to rayravi:

There is an AutoStats feature that will be available in Teradata 14.10, which will support the automation of stats collection and also analysis for which stats to collect on for a given table or set of tables. There is a completely new Viewpoint front end to make the management of stats easy. You will hear more about that feature as it gets closer to 14.10 GCA, maybe next year.

Thanks, -Carrie

Response to abhishekjain25:

Why don't you try posting your question on the Teradata Forum? Sorry, but it is not in my area of experience or expertise.

Thanks, -Carrie

Hi Carrie,

when join conditions require conversion between data types or character sets will collected statistics on those columns be eligible for use when the optimiser chooses the execution plan ?

Thanks

Hi Andrew,

Yes, that should be fine.

For simple data type/char set conversions, the optimizer uses statistics from the underlying columns.

Thanks, -Carrie

Hi Carrie,

we get the error below in Teradata 14, but not in Teradata 13, could you help explain and any suggestion?

COLLECT STATISTICS DB.mytable COLUMN (C1 , C2);

COLLECT STATISTICS DB.mytable COLUMN (C1 , C2);

$

*** Failure 3706 Syntax error: Multiple statistics with different column ordering on the same set of columns are not allowed.

Statement# 1, Info =0

I asked someone in my office to do the same thing you did in your comment (the same column order in both cases for two different stats collections) right after each other on a Teradata 14.0 system, and we did not get the error that you did. Not being able to replicate the problem, I am afraid I cannot provide any suggestions to you.

However, I cannot think of any practical reason you would want to repeat the same identical collect stats statement on the same exact columns, in the same order, at the same time. If those two stats collection statements are indeed identical, you only ever need one of them on the two columns per cycle. The solution is to must not issue the second one until its time for the next stats collection cycle.

However, if you did in fact collect stats on those two columns but in a different order the second time, (C1, C2) and then (C2, C1), and it just got incorrectly transcribed so it looks like it used the same sequence in your comment, then you should be getting that error. You are not allowed to collect on multicolumn stats in different sequences without dropping the stats with the original sequence. And you will not get the error in releases earlier than 14.0, because stats collection will just silently ignore any column order that does not reflect Field ID order.

Thanks, -Carrie

Carrie,

How will the table level stats be collected in TD 14? If we collect stats using the new syntax, will the table level stats continue to reuse the spool ?

Table level recollections of stats in Teradata 14.0 will remember the USING options at each statistics level and apply them at recollection time. To continue to use the same USING options, do not specify USING options at recollection time.

To address your question, only statistics that share the same USING options will be able to use the rollup aggregation and reusabel spool enhancements. And only statistics that are full collections (no sampling). This is true during the initial collections as well as recollections.

Example: Consider the following as your initial collection statements with different USING options:

COLLECT STATISTICS

USING SAMPLE 20 PERCENT

COLUMN (p_avail_date)

ON Product;

COLLECT STATISTICS

USING MAXINTERVALS 100

AND MAXVALUELENGTH 30

COLUMN ( P_BRAND ) ,

COLUMN ( P_CONTAINER ) ,

COLUMN ( P_SIZE ) ,

COLUMN ( P_BRAND ,P_CONTAINER ,P_SIZE )

ON Product;

COLLECT STATISTICS

USING MAXINTERVALS 400

COLUMN (p_name)

ON Product;

For table level statement recollects all the three statistics collections on the Product table applying each of the USING options as it recollects.

COLLECT STATISTICS ON Product;

However, only the second statement of the three will be considered for the rollup optimization where spool can be re-used. Since the 3 statements have different USING specifications they cannot be spooled together.

Thanks, -Carrie

Hi Carrie,

Thanks for posting this useful article.

After collecting column level statistics in TD14. Ended up with error, when I tried to check the histogram.

HELP STATS table_name COLUMN column_name;

HELP Failed. 3706: Syntax error: INDEX and COLUMN option not allowed in HELP Statistics.

I tried to explore 14 document but no luck. Can we get the histogram details in 14? If so, how to get the information. Appreciate your help.

Thanks,

Velu N

Hi Velu,

In Teradata 14.0 the HELP STATISTICS statement will no longer accept column or index specification. You will want to use the new SHOW STATISTICS command with the VALUES option to view the histogram detail.

For example, to see the histogram for a column customer.expr_dt you would issue this command:

SHOW STATISTICS VALUES COLUMN expr_dt ON customer;

If you have access to the orange book repository, there is an orange book that explains all the changes in statistics collection in 14.0 titled: Teradata Database 14.0 Statistics Enhancements.

I found a section in Chapter 9 in the SQL Data Definition Language - Detailed Topics manual for 14.0 that discusses the SHOW STATISTICS command, starting on page 1072.

Thanks, -Carrie

Carrie:

A little guidance on how Version14 may (or may not) be able to help with out environment. We have tables that are multbillion row tables and the data in those tables becomes cold quickly. After one month's time, the data is not updated. We had thought in the new release that 14 was going to allow us to gather stats at a partition level. In our case, this would be perfect because we know our data goes cold only after one month. Are there any plans to ever allow us to gather stats at a partition level so that I only have to update for new acquired since the last stats run

Thanks

What you can do in 14.0 is to recollect SUMMARY stats on the table after each load winder. SUMMARY stats on the table are very quick and carry information that will help the optimizer produce extrapolations for statistics that were not able to be recollected.

However, to get the optimizer to extrapolate statistics, you will need to already have collected full statistics on the table's columns. In 14.0 it usually takes a little while to get to robust extrapolations because the optimizer will look for and use history of recollections for a statistic to better understand the underlying patterns of change. So if you can recollect full statistics a few times before relying on SUMMARY stats, that will provide the optimizer with better quality information.

If you have access to the orange book repository, there is a section on extrapolations in 14.0 that will provide some additional information in the document on Teradata 14.0 Statistics Enhancements.

You can see how accurate the table row count extrapolations that are being made are by issuing the command:

HELP CURRENT STATISTICS ON table-name;

The output you see will show you the extrapolated values are for the stats that have already been collected fully, rather than the values based on the last histogram.

Once you get that extrapolated data, you can compare it against actual demographics in the table and see how close you are based with extrapolation. You do not want to rely on extrapolation for long periods of time, but it is a means of avoiding frequent recollections on large tables that are undergoing row count changes.

Thanks, -Carrie

Hi Carrie,

I am using Teradata 12. I created a table with single column NUSI and collected stats on the index. Then i dropped the index. Still i able to see the stats using help stats and in dbc.tvfields, so from this i infer that even after dropping index, stats associated with the index remains. My question is whether this stats can be used by the parser while generating the plan after dropping the index? Whether the stats are valid after index is dropped? Please advise

- Ganapathy

When you drop a single-column index, the single-column statistics will remain and can be used by the optimizer in generating a plan. The stats are still valid.

However, if you have an index on multicolumn statistics, and you have collected statistics on the index, those statistics will go away when you drop the multicolumn index, at least at your software release.

In Teradata 14.0 that changes. In 14.0 when you drop an index on multiple columns, any stats you have collected will remain.

Thanks, -Carrie

Thank you very much Carrie. Got clarified now.

Hi Carrie ,

1.I would like to understand , why do the figures of using pre aggregation optimization alogrithm vary ? for one 100 GB table I get 75% saving and for another 1TB table I get 15 % for same number of cols . Below are the candidates for stats:

(col a, col b , col c) , col a , col b , col c

2, Index level confidence is based on the fact that one of the col is indexed and stats are missing on it. But I have come accross a scenarion when both the cols were not indexed and still the explain showed index level confidence .

The optimizer chooses whether or how to use pre-aggregation or rollups at the time of the statistics collection statement. I do not have any insights I can share with you concerning how those algorithms work or why they might lead to different results with different tables. But it makes sense to me that they could, because demographics are usually different in different tables.

You can look at the explain text of the EXPLAIN COLLECT STATISTICS... statement and see where the optimizer choose to overlap collections and where it did not.

If you are on 14.0, the columns you mention below look like good candidates to be pre-aggregated during statistics collection, but you would have to look at the explain output to validate that.

My understanding of Index Join Confidence is that it is for join steps only and then only when one side of the join has an index with or without stats, and the other side has no stats and no index. Also, steps lower in the plan can inherit "Index Join Confidence" from a step higher in the plan, so that may be what you are seeing. If you think you are getting this confidence level in error, please open an incident with the support center.

Thanks, -Carrie

Hi Carrie,

Is it not possible to collect stats inside a SP??

Yes, you can collect statistics inside of a stored procedure.

Thanks, -Carrie

Thank You Carrie .. That was informative .

1> I was going through the orange book , which mentions the aggregation algorithm being used only when the stats are recollected . But I have tried defining stats using new syntax and every time it has used the syntax . Could you please share your thoughts on that ?

2> Also , I have a volatile table , and I have collected stats on all the possible columns of that volatile table ...still I end up with NO CONFIDENCE explain plan ... I have seen this on GTT , but does that happen for Volatile tables as well ?

3>I have tried to replicate the problem of limited maxx value length .. the column's vaue had same first 49 characters and the 50th character was different . Considering the default of 25 bytes was defined , I should have got distinct value as 1 .. but exact number of distinct values were displayed .

Could you suggest a better test to get the real advantage of MAXXVALUE length ?

Thank You in advance .

Carrie,

But everytime i try to collect stats inside a SP for VT's i get this error:

"CALL Failed. 3598: STATS_CHECK:Concurrent change conflict on database -- try again. "

Also raised the concern in here but no solution yet --http://forums.teradata.com/forum/database/stats-collection-inside-sps

@Barani_Sachin : I have come across this error ...this happens when you conurrently try to define stats on same table ...

Not related to stored proc ... To confirm , Just try keeping one collect stats statement in your SP and run it..it should collect .

As far as I remember , TD documentation says ... This error would only be while defining stats ...not while recollecting it

@samyyy : I have 10 volatle tables, so i have to define and collect stats only in runtime. And also i am not trying to concurrently define stats on the same table. having only one statement per VT.

Response to Samyyy's three questions from 3 days ago:

#1:

I am not sure I am understanding what you are saying when you say: "I have tried defining stats using new syntax and every time it has used the syntax." If it is accepting the new syntax, then the new syntax should result in statistics being collected. So that seems fine to me. Just collecting statistics using the new syntax will not provide any pre-aggregation or rollup advantage, but only if you use the new syntax AND bundle up several related statistics collection statements into one request.

If you are submitting multiple collect stats statements in one request and NOT seeing any pre-aggregation or overlap of table scans then it means that the optimizer is not able to overlap the collections because they are unrelated. Only under some conditions will you see that enhancement provide value. Do an explain on the collect stats statement to see what is happening in your particular case..

#2:

Collecting all possible statitsics ona VTT may not improve the confidence level. There are many things that go into producing confidence levels, including how deep in the plan the step is located. Table's accessed at the top of the plan have more predictable confidence levels in their steps than do tables accessed low in the plan. I do not have a specific answer to this particular question, but it is less important what the confidence level is, and more important whether or not the plan is a good one.

#3:

The number of distinct values in a statistic is accurately determined at the time of the collection and aggregation process. The full length of the field is used during that process, so it can know about all the distinct values in their entirety. It is only after the process is complete, and the number of distinct values is determined that the values will be moved into the histogram. It is the histogram value fields that cause truncation, but that will not impact the distinct number of values that have already been calculated. The problem around this truncation within the histogram is that it may impact single table selection row estimates, as the optimizer may not be able to see which values are actually being represented in which field or which interval, if they have been truncated too severely. That is the advantage of the the max value length enhancement, to improve single table selection.

Thanks, -Carrie

I am not familiar with the error mentioned above. If you are still not sure what is happening, consider opening an incident with the support center.

Thanks, -Carrie

Thank You Carrie .... That was really helpful . #3 was a beautiful explanation :)

For #2 , The volatile table is getting joined to itself .. so there are no other tables involved .What amazes me is the fact that Optimizer has all the possible information for the demographics of the volatile table , but still it says no confidence . I will have to read more into confidence levels . Thank You for your suggestions on it .

Hi Carrie

In the SQL Data Definition Langauge - Detailed Topics manual for TD14 it lists:

- SYSTEM MAXINTERVALS

- SYSTEM MAXVALUESLENGTH

With SYSTEM SAMPLE the manual suggests that Teradata Database will automatically adjust the sample being taken the more times you run it, However, there is no mention of any automatic adjustment of values specified for SYSTEM MAXINTERVALS or SYSTEM MAXVALUESLENGTH.

My question is when will these SYSTEM options be useful in comparison to not specifying them at all.

Many thanks

Hi Ian,

That auto adjustment to the sampling percent is not done in 14.0. This is implemented in 14.10. Somehow, this statement made it into the document prematurely, and I have put in a request for it to be corrected.

The purpose of the SYSTEM option in 14.0 is the following: If you have, for example, explicitly asked for sampling at 10% and now want to return to whatever is stated in DBS Control (by default this is 2%), then by stating SYSTEM SAMPLE, you will be returned to the higher, system-level specification. If the system percent as represented in DBS Control should change, this statistic will be sensitive to that change. In later releases SYSTEM SAMPLE will take on additional meaning related to automating the system percent based on what the optimizer sees as appropriate, but we are not at that point yet. So SYSTEM in the using clauses is somewhat of a place-holder for future functionality.

Thanks, -Carrie

For multi-column statistics, the order is still determined by the order in which they appear in the table, correct?

This is a very helpful blog.

Hi Carrie,

I want to clarify my doubts. If I have tables with

a) UPI b) NUPI c) USI d) NUSI. What types of stats are recommended best for each based on these USING options ?

Thanks and regards,

Raja

Raja K Thaw

Street Children suffer not by their fault. We can help them if we want.

Raja,

There is a blog posting that lists the recommendations for collecting statistics.

It makes these recommendations for indexed columns:

Collect Full Statistics

- Non-indexed columns used in predicates

- All NUSIs

- USIs/UPIs if used in non-equality predicates (range constraints)

- Most NUPIs (see below for a fuller discussion of NUPI statistic collection)

Can Rely on Random AMP Sampling

- USIs or UPIs if only used with equality predicates

- NUSIs with an even distribution of values

- NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness

Those recommendations are the same no matter what statistics USING clauses you attach to the statistic when you are collecting it. I can't think of any difference between and indexed column and a non-indexed column when it comes to which USING options to apply, or how to apply them.

Thanks, -Carrie

Hi Carrie,

Thanks a lot for the information. I always tell people " Keep statistics fresh always" even if I have equality predicates of UPIs or USIs. I hope I make a 100% correct statement. Please let me know.

Thanks and regards,

Raja

Raja K Thaw

Street Children suffer not by their fault. We can help them if we want.

Raja,

Fresh statistics are always the best. But extrapolation capabilities, particularly in releases 14.0 and above, can provide some relief for sites that simply do not have the time or the resources to recollect everything to a high standard of freshness.

When data distribution is relatively uniform extrapolation can offer reasonably accurate estimates, as long as you recollect SUMMARY statistics on the table frequently.

In 14.0 and above you can see what the extrapolated statistics will look like by issuing this command:

HELP CURRENT STATISTICS ON MyTable;

Sampling is another technique you can use to save resources, but prior to 14.10 you want to be careful not to rely on sampling too heavily for very skewed columns.

Thanks, -Carrie

Thanks, it was immensly useful.

Hi Carrie,

Just two questions:-

1. In TD14, Step 1- I defined stats on one column X in a table, Step 2- I did "SHOW STATISTICS ON table" ... I see that summary stats are already available. So, do I need to re-define summary stats by issuing "COLLECT SUMMARY STATISTICS"?

2. Regarding naming stats on index- Can we write something like this-

COLLECT STATS USING THRESHOLD 5 DAYS AND THRESHOLD 5 PERCENT INDEX(COLUM) AS idx_SMENAME ON DB.TABLE;

Correct me if I am missing something please.

Thanks,

SS

Hi Carrie,

I am having some doubts regarding the automatic stats collection process. Currently i am working on Version 13.10 and we are doing the stats collection on big/important tables on daily basis. We automated the complete process using unix shell scripts and teradata macros using parameter tables approach.

Now, i started hearing from V14.0 stats can be automated using viewpoint. Using viewpoint we can automate the stats by specifying the frequency on which stats needs to be collected for the mentioned tables. Whether this is possible? Please explain.

Response to sk73:

1. You do not need to recollect summary stats. Summary stats are automatically collected on the table any time column/index stats are collected.

2. In terms of the threshold option, that is described pretty thoroughly in another blog posting, so instead of repeating that detail here, let me point you to that write up.

Statistics Threshold Functionality 101

Yes, you can name a statistic by adding an "AS stat_name" immediately following the statistic column(s) collect statistic statement. The Teradata manuals should be able to provide you with the exact syntax and examples.

Thanks, -Carrie

Vasudev,

I have written a blog posting that provides a summary of the AutoStats feature. I believe your question is answered in that posting. It is titled:

Easing into Using the New AsutoStats feature, created in December 2013.

AutoStats is optional to use, if you have your own home grown stats management application you can continue to use that if you wish.

There is also an orange book called Automated Statistics Management which provides more detail on this feature, which is available in 14.10.

Thanks, -Carrie

Thank you very much carrie....

Hi All,

When I do help stats over my table, I do see the result set containing '*' in column names, Column dictinary names, & Column Sql names. Does it mean that the stats have been done for the all the column in the table? I do see the unique value with the full count of the table.

Regards,

Srini

## Pages