A reliable rebooting mechanism for Data Mover Services
Data Mover is a Teradata application that allows users to copy databases or tables between Teradata systems. It is a JEE-based application composed of three major code components: the Client (command-line interface or Viewpoint portlet), the Daemon, and the Agent. The Daemon is the central piece of the application, and the Agent is a worker unit that does the actual job of moving data. Both the Daemon and the Agent are deployed on a Linux managed server. Two other components that Data Mover depends on are installed on the same box: the repository DBS, which stores Data Mover job data, and the ActiveMQ application, which serves as the messaging service provider.
The four major Data Mover components (Daemon, Agent, repository, and ActiveMQ) are all installed on the same Linux machine. Each component is started by a script in the /etc/init.d/ directory, so when the Linux machine is rebooted for any reason, all four components are started automatically. However, there are dependencies between these components. The Daemon depends on the repository DBS, ActiveMQ, and network ports. The Agent depends on ActiveMQ and network ports. If a resource that the Daemon or Agent depends on is not ready (started), that service will fail to start.
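The dependencies described above can be written down as a small model. The following is only an illustrative sketch in Java: the class, enum, and method names (StartupDependencies, canStart, missing) are hypothetical and are not taken from the Data Mover code base. It simply encodes which resources each service needs, as stated in the text.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the startup dependencies; not Data Mover source code. */
final class StartupDependencies {

    /** Resources that must be ready before a service can start. */
    enum Resource { REPOSITORY_DBS, ACTIVEMQ, NETWORK_PORTS }

    /** The two services and the resources each one depends on. */
    enum Service {
        DAEMON(EnumSet.of(Resource.REPOSITORY_DBS, Resource.ACTIVEMQ, Resource.NETWORK_PORTS)),
        AGENT(EnumSet.of(Resource.ACTIVEMQ, Resource.NETWORK_PORTS));

        private final Set<Resource> required;

        Service(Set<Resource> required) {
            this.required = required;
        }

        /** True only if every required resource is in the ready set. */
        boolean canStart(Set<Resource> ready) {
            return ready.containsAll(required);
        }

        /** Required resources that are not ready yet. */
        Set<Resource> missing(Set<Resource> ready) {
            Set<Resource> result = EnumSet.copyOf(required);
            result.removeAll(ready);
            return result;
        }
    }
}
```

For example, if ActiveMQ and the network ports are ready after a reboot but the repository DBS is still starting, this model reports that the Agent can start while the Daemon cannot.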
There is no guarantee that the Data Mover services will be started and ready for use in the same order every time the server is rebooted. It would therefore be useful to have the Daemon and Agent components keep trying to start while a required resource is unavailable. This solution works if we take care of the following key points. Here I am using the Daemon component as the example.
- Retry the execution if the previous attempt throws any exception related to resource availability.
- Release the Spring context and the communication ports that were grabbed in the last unsuccessful execution.
- Print out a user friendly error message in the log file when the execution fails.
- Have a short break/sleep time between execution failures and the next retry.
- Because the number of retries is unlimited, repeated failure messages could fill up the log file and disk space. To prevent this, raise the Java log level after a certain number of failures to minimize the log lines. Once the service starts successfully, revert to the original log level.
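The key points above can be sketched as a small retry loop. This is a minimal sketch under stated assumptions, not the actual Data Mover implementation: the RetryingStarter class and its Service interface are hypothetical, java.util.logging stands in for whatever logging framework the Daemon really uses, and every exception is treated as a resource-availability problem, whereas a real implementation would probably filter by exception type.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical retrying starter; a sketch, not Data Mover source code. */
final class RetryingStarter {

    /** Hypothetical view of one service (for example, the Daemon). */
    interface Service {
        void start() throws Exception; // throws if a required resource is not ready
        void release();                // frees the context and ports grabbed by a failed start
    }

    private final Logger log;
    private final long sleepMillis;
    private final int quietAfterFailures;

    RetryingStarter(Logger log, long sleepMillis, int quietAfterFailures) {
        this.log = log;
        this.sleepMillis = sleepMillis;
        this.quietAfterFailures = quietAfterFailures;
    }

    /** Retries until start() succeeds and returns the number of failed attempts. */
    int startWithRetry(Service service) {
        Level originalLevel = log.getLevel(); // may be null, meaning "inherit from parent"
        int failures = 0;
        try {
            while (true) {
                try {
                    service.start();
                    return failures;
                } catch (Exception e) {
                    failures++;
                    service.release(); // clean up before the next attempt
                    log.warning("Start attempt " + failures + " failed ("
                            + e.getMessage() + "); retrying in " + sleepMillis + " ms");
                    if (failures == quietAfterFailures) {
                        // Suppress further retry warnings so the log cannot grow without bound.
                        log.setLevel(Level.SEVERE);
                    }
                    try {
                        Thread.sleep(sleepMillis); // short break between attempts
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting to retry", ie);
                    }
                }
            }
        } finally {
            log.setLevel(originalLevel); // revert to the original log level
        }
    }
}
```

In this sketch the log level is restored in a finally block, so the original level comes back whether the service eventually starts or the waiting thread is interrupted. The sleep interval and the failure threshold are left as constructor parameters because the text does not state the values used.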
This solution was implemented successfully in the Data Mover 13.10.00.03 release. With this retry mechanism, the Data Mover Daemon and Agent services wait for the resources they depend on instead of failing at startup. Both services start working properly as soon as the repository DBS and the ActiveMQ service are ready for use.