Periodic Loading with Teradata Parallel Transporter
While transactional processing through the use of “message queues” is a common approach in ADW today, the file-oriented approach is also begining to find its way into ADW due to its simplicity in nature and ease-of-control. Today, many companies monitor and store thousands--or hundreds of thousands--of transactions per day across their branches and stores. Transactional data is usually collected and stored as files in directories, before being merged into the enterprise-wide data warehouse. In fact, there have been Teradata sites which extract transactional data from message queues, pre-process them, and store them into different directories based on transaction types, in an “active” manner. By “active”, we mean the files are created as transactions are collected.
Unlike the “batch file” processing where files are stationary and usually contain huge number of rows, files that represent transactions in the ADW environment are “dynamic” and relatively small in size (a few hundred rows per file in average). By “dynamic”, we mean files can be created or removed from the directory while the ETL job is running. Since these files represent real-time transactional data flow, they are usually created in short-time duration, made available for updates once they are created, processed in or close to time-sequence order, and committed to the data warehouse in a timely manner.
Using the “directory scan” feature as described in the article “Active and Batch Directory Scan in TPT”, transactional files can be collected periodically from a directory and loaded into Teradata. While the Stream operator is usually used for continuous loading, the Load or Update operator can be used for periodic loading, with which a load job can be repeatedly executed throughout the business day to ensure data is fresh enough to meet business needs.
There are trade-offs between continuous loading and periodic loading in terms of “data freshness”. The active directory scan using the Stream operator provides close to real-time loading while the periodic loading approach with the Update or Load operator could have a slower data refresh rate. This is due to the fact that loading with the Update or Load operator locks the target table until the data is loaded and committed to the target table. However, the mass-transit protocol such as Fastload through the Load operator or Multiload through the Update operator can avoid the transactional overhead of the row-at-a-time insert or update against a table.
Despite the differences in the two approaches, the continuous loading and the periodic loading are not mutually exclusive. In fact, there have been successful Teradata sites which combine the two approaches in their production environment to obtain a good balance of data freshness, better overall throughput performance, and more efficient use of system resources.
Periodic Files Collection and Load
As shown in Figure 1, transactional files can be collected periodically and processed by the Data Connector operator before they are loaded into Teradata tables using the Update operator.
Figure 1: Periodic loading with directory scan
Both active and batch directory scan can be used for periodic loading. For active directory scan, there are multiple scans (based on the VigilWaitTime value) of the directory for new files and the job does not terminate until the VigilElapsedTime or VigilStopTime expires. For batch directory scan (i.e. without using the VigilWaitTime and VigilElapsedTIme attributes), there is only one scan of the directory for files and the job terminates once all the files collected by that scan are processed.
In either case, files in a directory can be divided into multiple small batches of files to be processed based on user-defined requirements. These requirements, which can be passed as parameters to the operator through operator attributes, include but are not limited to the following:
- File name patterns to be collected
- File type
- Time interval for the active directory scan
- Maximum number of files to be processed for each time interval
- Start and stop time for each job execution
For more information on active and batch directory scan attributes, search for “Teradata Parallel Transporter User Guide” release 13.00.00 (product identification B035-2445-088A).
Switching the Load Protocol for Periodic Loading
One of the most distinguishable advantages of using TPT for active or periodic loading is that users can switch load protocol and job parameters without modifying the job script itself. The switching of the load protocol is desirable for the following reasons:
- The current load protocol cannot process the current volume of transactional files fast enough
- The number of concurrent load jobs that require “load slots” has reached a limit imposed by the DBS
- The system cannot sustain the current usage of system resources with the current load strategy
- “Catch up” is required after a system failure or a sudden increase in volume of transactions
The change of protocol is simple with TPT. As shown in Figure 2, if you want to change the load protocol from Stream to Update, or vice versa, you can define the operator type with a job variable name which starts with the @ sign (for example, @LOAD). Any token prefixed with the @ sign would be replaced at run time based on the value defined for the symbol in the TPT Job Variable file, which can be referenced through the “tbuild –v” command option. For more details about how to use job variables, please see the TPT User Guide.
Figure 2: Switching operator using a job variable
Sample Script for Periodic Load
DEFINE JOB PERIODIC_LOAD DESCRIPTION 'PERIODIC LOAD WITH DIRECTORY SCAN' ( /*********************************** Schema Definition ****/ DEFINE SCHEMA PRODUCT_SOURCE_SCHEMA DESCRIPTION 'PRODUCT INFORMATION SCHEMA' ( IN_RECV_DT VARCHAR(10), IN_RTL_CO_NUM VARCHAR( 6), IN_STR_NUM VARCHAR(11), IN_RECV_TMSTMP VARCHAR(23), IN_RECV_TYPE VARCHAR( 4), IN_RECV_DOC VARCHAR(10), IN_USER_ID VARCHAR(11), IN_ITEM_UPC_CD VARCHAR(15), IN_RECV_QTY VARCHAR(11) ); DEFINE OPERATOR UPDATE_OPERATOR() DESCRIPTION 'Teradata Parallel Transporter Update Operator' TYPE @LOAD_OPER SCHEMA * ATTRIBUTES ( INTEGER MaxSessions = 4, INTEGER MinSessions = 1, INTEGER ErrorLimit = 1, VARCHAR PrivateLogName = 'updlog', VARCHAR TdpId = @TdpId, VARCHAR UserName = @UserName, VARCHAR UserPassword = @UserPassword, VARCHAR AccountID, VARCHAR ARRAY TargetTable = ['T1'], VARCHAR ARRAY ErrorTable1 = ['T1_e1'], VARCHAR ARRAY ErrorTable2 = ['T1_e2'], VARCHAR ARRAY WorkTable = ['T1_wt'], VARCHAR LogTable = 'T1_log', VARCHAR WorkingDatabase = @WorkingDatabase, VARCHAR TraceLevel = 'NONE' ); /************************** Reader Operator Definition ****/ DEFINE OPERATOR FILE_READER () DESCRIPTION 'FILE READER' TYPE DATACONNECTOR PRODUCER SCHEMA PRODUCT_SOURCE_SCHEMA ATTRIBUTES ( VARCHAR PrivateLogName = 'dtac_081.dtaclog', VARCHAR FileName = 'INVRCV.*', VARCHAR OpenMode = 'Read', VARCHAR DirectoryPath = 'temp', VARCHAR ArchiveDirectoryPath = 'archive', VARCHAR IndicatorMode, VARCHAR Format = 'Delimited', VARCHAR TraceLevel = 'NONE', INTEGER VigilMaxFiles, INTEGER VigilStartTime, INTEGER VigilStopTime, INTEGER VigilWaitTime = 5, /* scan interval in seconds */ INTEGER VigilElapsedTime = 1 /* Duration of the job in minutes (or use VigilStartTime and VigilStopTime) */ ); /********************************* The APPLY Statement ****/ APPLY ( 'INSERT INTO T1 ( :IN_RECV_DT ,:IN_RTL_CO_NUM ,:IN_STR_NUM ,:IN_RECV_TMSTMP ,:IN_RECV_TYPE ,:IN_RECV_DOC ,:IN_USER_ID ,:IN_ITEM_UPC_CD ,:IN_RECV_QTY );' ) TO OPERATOR ( UPDATE_OPERATOR) SELECT * FROM OPERATOR ( FILE_READER ); );
Job Variable File
TdpId = ‘DBS1’ ,UserName = 'TPT_user' ,UserPassword = 'TPT_user' ,WorkingDatabase = 'TPT_user' LOAD_OPER = Update /* or Stream */