Attachment: CollectStatsQueue.txt (13.25 KB)

When you start designing a statistics collection process, two topics quickly come up:

  • stats on some tables are more important than others.
  • some stats on a single table are more important than others.

Both problems involve some kind of prioritization process, which finally leads to a sorted list of COLLECT STATS statements (see Marcio Moura's blog "When is the right time to refresh statistics?").

Processing this list (using a BTEQ script or a cursor in a Stored Procedure) could lead to another problem:

  • only a limited time frame for collection is available.

Therefore you might have to stop the process before all those stats are collected.

But now you have to take care that the remaining stats will be collected during the next run (and probably with a higher priority).


Introducing the queue table

Queue tables were introduced in V2R6, but are still one of Teradata's least-known features. Their main purpose is to allow asynchronous event processing in a first-in-first-out (FIFO) fashion, and this perfectly matches our requirements:

  • When the process stops, the remaining rows are still in the queue and will be processed first when the next run starts.
  • More statistics might be collected in the same period when multiple processes consume rows from the queue (this is really hard to implement without a queue: how do you decide which stats should be collected by which process?).

As an additional benefit, ETL jobs requiring some final COLLECT STATS steps might finish earlier when those statements can run asynchronously via the queue.
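To illustrate the feature itself: a queue table is an ordinary table created with the QUEUE option whose first column must be the insertion timestamp (QITS). The following is a minimal sketch only — the table and column definitions are made up for illustration and are not the attached script's DDL:

```sql
-- Minimal queue table sketch (names and sizes are assumptions):
-- the first column of a queue table must be defined exactly like QITS below.
CREATE SET TABLE CollectStatsQueueDemo, QUEUE
 (
   QITS         TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
   DatabaseName VARCHAR(30),
   TableName    VARCHAR(30)
 )
PRIMARY INDEX (DatabaseName, TableName);

-- A consumer removes the oldest row (by QITS) in one atomic step;
-- on an empty queue the request simply waits until a row arrives.
SELECT AND CONSUME TOP 1 * FROM CollectStatsQueueDemo;
```

It is the atomic SELECT AND CONSUME that lets multiple consumers share one queue without ever processing the same row twice.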


The attached script creates this queue table (CollectStatsQueue) plus two log tables (CollectStatsLog and CollectStatsLogDetails), a Stored Procedure (CollectStatsProc) to process the queue, and a macro (StopQueue) to stop the processing. The syntax is based on a TD13.0 system running in Teradata mode; older releases would require some modifications.

CollectStatsQueue

This is the main driver table for the collection process, a COLLECT STATS is triggered for each inserted row.

QITS

When your current process results in an ordered list, this order of execution is maintained by setting the queue's "Query Insertion Timestamp" column to

CURRENT_TIMESTAMP
 + (ROW_NUMBER() OVER (ORDER BY whatever_determines_the_order) * INTERVAL '0.000001' SECOND)


For a single-row insert, a higher or lower priority might be assigned by simply adjusting this value, e.g.

CURRENT_TIMESTAMP - INTERVAL '10' MINUTE AS QITS -- higher priority
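A complete single-row insert with raised priority might then look like this (the column list matches the queue table described below; the database, table, and column values are made up):

```sql
-- Hypothetical high-priority insert: the backdated QITS sorts this row
-- before anything inserted with the current timestamp.
INSERT INTO CollectStatsQueue
 ( QITS, DatabaseName, TableName, IndexName, ColumnName, CollectLevel, SampleSize)
VALUES
 ( CURRENT_TIMESTAMP - INTERVAL '10' MINUTE
 , 'MyDb', 'MyTable', NULL, 'MyColumn', 'C', NULL);
```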

DatabaseName
TableName

Self-explanatory

IndexName

Optional; when supplied it is used instead of the column list.

ColumnName

Comma-delimited list of column names (for column and index level stats)

CollectLevel

Collect on T (Table), C (Column), or I (Index) level

SampleSize

Uses SAMPLE when set to any non-NULL value.
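Putting the columns together, a queue row maps to a COLLECT STATS statement roughly as follows. This is only a sketch of the generated syntax (TD13-style); the attached procedure may build the statements differently:

```sql
-- CollectLevel 'C' with a column list and a SampleSize:
COLLECT STATISTICS USING SAMPLE ON MyDb.Sales COLUMN (CustId, OrderDate);

-- CollectLevel 'T' simply re-collects everything already defined on the table:
COLLECT STATISTICS ON MyDb.Sales;

-- CollectLevel 'I' with an IndexName uses the index instead of the column list:
COLLECT STATISTICS ON MyDb.Sales INDEX MyIndex;
```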

CollectStatsLog

Records info about every run of the CollectStatsProc including the number of statistics collected and the number of errors during the collect.

CollectStatsLogDetails

Records info for every failing (optionally also successful) COLLECT STATS.

CollectStatsProc

This Stored Procedure actually submits the COLLECT STATS for each row in the queue.

IN MaxTS

The process stops when a COLLECT finishes after MaxTS or when all rows in the queue are processed.

If MaxTS is set to NULL the process will run until the StopQueue macro is executed.

IN LogDetails

If set to 'Y', info about each processed row is written to CollectStatsLogDetails; otherwise only failed collects are logged.

IN MonitorInterval

Updates the CollectStatsLog table every x minutes with info about the number of statistics successfully collected or failed.

OUT RC

SQL error code: the process stops when there's any error outside of the actual COLLECT statement.

OUT Msg

Result of the execution: Success/Warning/Failed plus additional info.

StopQueue

Macro to stop all SPs processing the queue: invoke EXEC StopQueue; from any session.
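One common way such a stop mechanism can be built (this is an assumption for illustration; the attached macro may well be implemented differently) is a sentinel row that jumps to the front of the FIFO, which every consuming procedure checks for before collecting:

```sql
-- Hypothetical sentinel pattern: a far-past QITS sorts the marker row
-- first, so every consumer sees it on its next SELECT AND CONSUME
-- and terminates instead of collecting.
REPLACE MACRO StopQueueDemo AS
 (
   INSERT INTO CollectStatsQueue (QITS, DatabaseName, TableName)
   VALUES (CURRENT_TIMESTAMP - INTERVAL '1' DAY, '*STOP*', '*STOP*');
 );
```

Whatever the internal mechanism, the important property is that any session can trigger the stop without knowing which sessions are currently consuming.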


Usage examples

CALL CollectStatsProc(CURRENT_TIMESTAMP(2)+INTERVAL '1' HOUR,'Y',5, rc, msg);

Run for up to one hour, log all rows in the CollectStatsLogDetails table and update the CollectStatsLog table every 5 minutes.

Can be stopped earlier by simply running EXEC StopQueue; from any session.

CALL CollectStatsProc(NULL,'N',10, rc, msg);

Run until StopQueue is executed, log only failed collects in the CollectStatsLogDetails table, and update the CollectStatsLog table every 10 minutes.

Running in parallel

As the CollectStatsProc will probably be CALLed in a BTEQ script it's easy to run multiple copies:

.SET SESSIONS 2
.REPEAT 2
CALL CollectStatsProc...

Some remarks on running multiple instances of CollectStatsProc:

  • Collecting multiple stats on a small table using one row per index/column might result in a deadlock (when the stats data is merged into the system tables); it's therefore better to collect those stats on table level.
  • As stats collection is quite resource-intensive, you should check whether additional resources are available; multiple instances are mainly suitable for running in a dedicated batch window.
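A complete BTEQ run of two parallel consumers might be sketched like this (the logon string, database name, and parameter values are placeholders, not taken from the attached script):

```sql
.LOGON mytdpid/statsuser,secret
.SET SESSIONS 2
.REPEAT 2
CALL MyDb.CollectStatsProc(CURRENT_TIMESTAMP(2) + INTERVAL '1' HOUR, 'N', 5, rc, msg);
.LOGOFF
.QUIT
```

Each of the two sessions picks its own rows via the queue's atomic consume, so no explicit split of the workload between the instances is needed.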

Sample script

The following example, which recollects all existing stats within a database, is based on my StatsInfo view:

INSERT INTO CollectStatsQueue
 ( QITS, DatabaseName, TableName, IndexName, ColumnName, CollectLevel, SampleSize)
SELECT
   CURRENT_TIMESTAMP
   + (ROW_NUMBER() OVER (ORDER BY databasename, tablename) * INTERVAL '0.000001' SECOND)
   ,dt.*
FROM
 (
   SELECT DISTINCT
      DatabaseName                  
      ,TableName                     
      ,CASE WHEN CollectLevel = 'T' THEN '' ELSE IndexName END AS IndexName
      ,CASE WHEN CollectLevel = 'T' THEN '' ELSE ColumnName END AS ColumnName
      ,CASE -- collect on table level for small tables
          WHEN COUNT(*) OVER (PARTITION BY TableId) > 1
             AND (NumRows < 100000
                  OR SUM(CollectDuration) OVER (PARTITION BY TableId) < INTERVAL '0:10' MINUTE TO SECOND)
             THEN 'T'
          WHEN StatsType IN ('Col', 'MCol', 'Part') THEN 'C'
          ELSE 'I'
       END AS CollectLevel
      ,SampleSize                    
   FROM
     StatsInfo
   WHERE TableType <> 'TempTbl'
   AND MissingStats = 'N'
   AND DatabaseName = ...
 ) AS dt;   

CALL CollectStatsProc(CURRENT_TIMESTAMP(2)+INTERVAL '10' MINUTE,'Y',1, rc, msg);

SELECT * FROM CollectStatsLog ORDER BY StartTS;

SELECT * FROM CollectStatsLogDetails ORDER BY StartTS;

Please provide any kind of feedback (especially about performance when running multiple instances).

Discussion
ulrich 35 comments Joined 09/09
23 Oct 2012

Very nice idea! Thanks for sharing

The beginning of all sciences is the astonishment that things are as they are. - Aristotle

CHERRY_UZZE 1 comment Joined 07/11
25 Nov 2012

Thanks for sharing, but unfortunately not finding any attachment here.

dnoeth 63 comments Joined 11/04
25 Nov 2012

Strange, there should be an attachment area right next to the headline of this blog entry. You could try to empty the browser cache.

Otherwise this is the direct link:
http://developer.teradata.com/sites/all/files/CollectStatsQueue.txt

Dieter

samp6050 1 comment Joined 06/08
05 Dec 2012

Great post Dieter. Thank you for sharing!

Regards,
Sam

Abhishektd 2 comments Joined 06/11
11 Dec 2012

Thanks for sharing.

Thanks,
Abhi

18 Dec 2012

Hi,

I am facing an issue in running a teradata procedure.
Getting the error code as 3807,

This procedure is called via informatica mapping and this is a monthly process, suddenly having issue in inserting to a table through the proc.

Till last month this was running fine.

Please help me in resolving this issue.

Thanks,
Sasikala V

Abhishektd 2 comments Joined 06/11
18 Dec 2012

Verify once, that the procedure and the table are qualified with the databasename. This may be one of the reasons behind the issue.

Thanks,
Abhi

ahnajeeb 2 comments Joined 03/11
07 Feb 2013

Nice idea - we can also use other parameters. I have been able to get very good results considering other factors like:
- Update priority based on frequency of use (partition, PI and PPI can get special attention)
- Consider table/column level stats collection.
- Set a max limit on CPU to be used by the stats process.
- Increment the priority of leftovers to avoid starvation.
- Split the table using the MOD function to run N parallel threads based on the state of the system.

Najeeb

AN

mikesteeves 3 comments Joined 10/11
14 Nov 2013

Excellent demonstration of a practical use of a Queue table, exactly what I've been looking for.
Thanks for the informative post Dieter.
-Mike

david clough 4 comments Joined 12/09
23 Dec 2013

Like the idea of the Stop Queue Macro, Dieter.
Think I'll take your concept and use it with my 'mixed workload' application that I'm currently building. 
Probably what I'll do is allow the Stored Procedure - the one which does the Consuming - to run 'forever',  unless we decide (either by choice, or by periodic necessity) to Stop the Queue.
So, thank you.

Raja_KT 10 comments Joined 07/09
30 Jan 2014

Yes, the Stop Queue Macro!!!!! amazing. I was in a different world before this :).

Raja K Thaw
