How to simplify your statistics collection with a queue table
When you start designing a statistics collection process, two issues quickly surface:
- stats on some tables are more important than others.
- some stats on a single table are more important than others.
Both problems involve some kind of prioritization, which finally leads to a sorted list of COLLECT STATS statements (see Marcio Moura's blog "When is the right time to refresh statistics?").
Processing this list (using a BTEQ script or a cursor in a Stored Procedure) could lead to another problem:
- only a limited time frame for collection is available.
Therefore you might have to stop the process before all those stats are collected.
But now you have to ensure that the remaining stats are collected during the next run (probably with a higher priority).
Introducing the queue table
Queue tables have been available since V2R6, but they are still one of Teradata's least-known features. Their main purpose is to allow asynchronous event processing in first-in, first-out (FIFO) order, and this matches our requirements perfectly:
- When the process stops, the remaining rows are still in the queue and will be processed first when the next run starts.
- More statistics can be collected in the same period when multiple processes consume rows from the queue (this is really hard to implement without a queue - how do you decide which stats should be collected by which process?).
As an additional benefit, ETL jobs requiring some final COLLECT STATS steps might finish earlier when those statements can run asynchronously via the queue.
The attached script creates this queue table (CollectStatsQueue) plus two log tables (CollectStatsLog and CollectStatsLogDetails), a Stored Procedure (CollectStatsProc) to process the queue, and a macro (StopQueue) to stop the processing. The syntax is based on TD13.0 running in Teradata mode; older releases would require some modifications.
This is the main driver table for the collection process; a COLLECT STATS is triggered for each inserted row.
When your current process results in an ordered list, this order of execution is preserved by setting the queue's Query Insertion Timestamp (QITS) column to increasing values, as shown in the example at the end of this article.
IndexName    |Optional; when supplied, used instead of the column list
ColumnName   |Comma-delimited list of column names (for column- and index-level stats)
CollectLevel |Collect on T (Table), C (Column), or I (Index) level
SampleSize   |Uses SAMPLE when set to any non-NULL value
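The authoritative DDL is in the attached script; as a rough sketch (column types and lengths are my assumptions), the table might look like this. Note that a queue table's first column must be the QITS with exactly this type and default:

```sql
-- Sketch only; see the attached script for the real DDL.
-- The first column of a queue table must be the Queue Insertion
-- Timestamp: TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6).
CREATE TABLE CollectStatsQueue, QUEUE
 ( QITS         TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6)
 , DatabaseName VARCHAR(30)  NOT NULL  -- lengths are assumptions
 , TableName    VARCHAR(30)  NOT NULL
 , IndexName    VARCHAR(30)            -- optional, see above
 , ColumnName   VARCHAR(1000)          -- comma-delimited list
 , CollectLevel CHAR(1)      NOT NULL  -- 'T'/'C'/'I'
 , SampleSize   BYTEINT                -- non-NULL => SAMPLE
 );
```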
Records info about every run of CollectStatsProc, including the number of statistics collected and the number of errors during the collection.
Records info for every failing (and optionally also every successful) COLLECT STATS.
This Stored Procedure actually submits the COLLECT STATS for each row in the queue. Its parameters:
- MaxTS: the process stops when a COLLECT finishes after MaxTS or when all rows in the queue are processed. If MaxTS is set to NULL, the process runs until the StopQueue macro is executed.
- Detail logging flag: if set to 'Y', info about each processed row is written to CollectStatsLogDetails; otherwise only failed collects are logged.
- Log interval: the CollectStatsLog table is updated every x minutes with the number of statistics successfully collected or failed.
- Return code (output): the SQL error code; on any error outside of the actual COLLECT statement the process stops.
- Message (output): the result of the execution, Success/Warning/Failed, plus additional info.
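A detail worth knowing: inside such a procedure each row is fetched with Teradata's SELECT AND CONSUME, which atomically returns and deletes the row with the oldest QITS, and blocks when the queue is empty. A minimal sketch of the central step (variable names are mine; the attached CollectStatsProc is the authoritative implementation):

```sql
-- Simplified sketch of the consume step (no logging, no MaxTS
-- check, no stop handling). SELECT AND CONSUME TOP 1 returns the
-- row with the oldest QITS and removes it from the queue in one
-- atomic operation; if the queue is empty, the session waits.
SELECT AND CONSUME TOP 1
       DatabaseName, TableName, IndexName, ColumnName, CollectLevel
INTO   vDb, vTbl, vIdx, vCol, vLvl
FROM   CollectStatsQueue;

-- The matching statement is then built and submitted dynamically,
-- e.g. for column-level stats:
-- CALL DBC.SysExecSQL('COLLECT STATISTICS ON ' || vDb || '.' ||
--                     vTbl || ' COLUMN (' || vCol || ')');
```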
A macro to stop all SPs processing the queue: simply run EXEC StopQueue; from any session.
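The internals of StopQueue are in the attached script; one plausible pattern (purely hypothetical, the attached macro may be implemented differently) is a sentinel row. Because SELECT AND CONSUME blocks on an empty queue, inserting a recognizable stop row both signals the consumers and wakes up one that is waiting:

```sql
-- Hypothetical sketch of a stop macro (NOT necessarily how the
-- attached StopQueue works): insert a sentinel row the consumers
-- recognize. A consumer that receives '#STOP#' would re-insert the
-- sentinel before exiting, so every running instance stops in turn.
REPLACE MACRO StopQueue AS
 ( INSERT INTO CollectStatsQueue (DatabaseName, TableName, CollectLevel)
   VALUES ('#STOP#', '#STOP#', 'T');
 );
```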
CALL CollectStatsProc(CURRENT_TIMESTAMP(2)+INTERVAL '1' HOUR,'Y',5, rc, msg);
Run for up to one hour, log all rows in the CollectStatsLogDetails table and update the CollectStatsLog table every 5 minutes.
Can be stopped earlier by simply running
EXEC StopQueue; from any session.
CALL CollectStatsProc(NULL,'N',10, rc, msg);
Run until StopQueue is executed, log only failed collects in the CollectStatsLogDetails table and update the CollectStatsLog table every 10 minutes.
Running in parallel
As CollectStatsProc will probably be CALLed from a BTEQ script, it's easy to run multiple copies:
.SET SESSIONS 2
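Expanding that snippet into a fuller sketch (tdpid, user, and password are placeholders): .SET SESSIONS allocates the connections, and .REPEAT submits the following request once per copy, so the CALLs run in parallel on separate sessions.

```
.SET SESSIONS 2
.LOGON tdpid/statsuser,password
.REPEAT 2
CALL CollectStatsProc(NULL,'N',10, rc, msg);
.LOGOFF
```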
Some remarks on running multiple instances of CollectStatsProc:
- Collecting multiple stats on a small table using one queue row per index/column might result in a deadlock (when the stats data is merged into the system tables); it's better to collect those stats on table level instead.
- As stats collection is quite resource-intensive, you should check whether additional resources are available; multiple instances are mainly suitable for a dedicated batch window.
The following example, which recollects all existing stats within a database, is based on my StatsInfo view:
INSERT INTO CollectStatsQueue
  ( QITS, DatabaseName, TableName, IndexName, ColumnName, CollectLevel, SampleSize )
SELECT
   CURRENT_TIMESTAMP
     + (ROW_NUMBER() OVER (ORDER BY DatabaseName, TableName) * INTERVAL '0.000001' SECOND)
  ,dt.*
FROM
 (
   SELECT DISTINCT
      DatabaseName
     ,TableName
     ,CASE WHEN CollectLevel = 'T' THEN '' ELSE IndexName END AS IndexName
     ,CASE WHEN CollectLevel = 'T' THEN '' ELSE ColumnName END AS ColumnName
     ,CASE -- collect on table level for small tables
        WHEN COUNT(*) OVER (PARTITION BY TableId) > 1
         AND (NumRows < 100000
              OR SUM(CollectDuration) OVER (PARTITION BY TableId)
                 < INTERVAL '0:10' MINUTE TO SECOND)
        THEN 'T'
        WHEN StatsType IN ('Col', 'MCol', 'Part') THEN 'C'
        ELSE 'I'
      END AS CollectLevel
     ,SampleSize
   FROM StatsInfo
   WHERE TableType <> 'TempTbl'
     AND MissingStats = 'N'
     AND DatabaseName = ...
 ) AS dt;

CALL CollectStatsProc(CURRENT_TIMESTAMP(2) + INTERVAL '10' MINUTE, 'Y', 1, rc, msg);

SELECT * FROM CollectStatsLog ORDER BY StartTS;
SELECT * FROM CollectStatsLogDetails ORDER BY StartTS;
Please provide any kind of feedback (especially about performance when running multiple instances).