Techniques for handling failures and restarting jobs efficiently
In large batch processing environments, jobs can fail for reasons such as system failures, resource constraints, or application errors. Rerunning a lengthy job from the beginning wastes system resources and delays processing. JCL provides several mechanisms for restarting a failed job at or near the point of failure instead.
This tutorial covers JCL's restart and recovery facilities, including the RESTART parameter, restart definition (RD) parameter, and checkpoint/restart techniques.
The RESTART parameter on the JOB statement allows you to restart a job at a specific step or checkpoint, bypassing steps that have already completed successfully. This is particularly useful for long-running jobs that fail after completing several steps.
    //jobname JOB accounting-info,programmer-name,RESTART=stepname
    //jobname JOB accounting-info,programmer-name,RESTART=(stepname,checkid)
To restart a job at a specific step, use the RESTART=stepname format:
    //PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
    //             RESTART=STEP040
This causes the system to skip all steps before STEP040 and begin execution at STEP040.
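When the step to restart sits inside a procedure, the RESTART value can qualify the procedure step with the name of the job step that calls the procedure. The sketch below is illustrative only: PAYSTEP (the calling step) and the placement of STEP040 inside a procedure are assumptions, not part of the payroll job shown in this tutorial.

```jcl
//* Hypothetical layout: PAYSTEP is the job step that calls a
//* procedure, and STEP040 is a step inside that procedure.
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//             RESTART=PAYSTEP.STEP040
```

Coding RESTART=* instead names the first job step, which is mainly useful together with a checkid.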
To restart a job at a specific checkpoint within a step, use the RESTART=(stepname,checkid) format:
    //PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
    //             RESTART=(STEP040,CHECK3)
This causes the system to skip all steps before STEP040, begin execution at STEP040, and use the checkpoint data for CHECK3 to position within the step.
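The RD (restart definition) parameter controls whether the system may automatically restart the job after a system failure and whether checkpoints requested by the program are honored. It can be coded on the JOB statement, as shown here, or on individual EXEC statements. It differs from the RESTART parameter, which requests a deferred (manual) restart when a previously failed job is resubmitted.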
    RD=R | RD=NC | RD=NR | RD=RNC
Value | Description |
---|---|
R | Automatic restart allowed: from the last checkpoint if one was taken, otherwise from the beginning of the step |
NC | No automatic restart and no checkpoints (checkpoint requests from the program are suppressed) |
NR | No automatic restart, but checkpoints can still be taken for a later deferred restart |
RNC | Automatic restart from the beginning of the step only; checkpoints are suppressed |
    //PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
    //             RD=R
This job can be automatically restarted from a checkpoint or step beginning if a system failure occurs.
    //PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
    //             RD=NR
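This job will not be automatically restarted after a system failure. Its programs can still take checkpoints, so a deferred restart with the RESTART parameter remains possible.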
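RD can also be coded step by step on EXEC statements, which lets a job treat its steps differently. The fragment below is a sketch that reuses step and program names from this tutorial; the choice of RD values per step is an assumption made for illustration, and the DD statements are omitted.

```jcl
//* No RD on the JOB statement, so each step's own RD value applies.
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A
//*
//* Long-running step: allow automatic restart and checkpoints.
//STEP040  EXEC PGM=BILLPROC,RD=R
//*
//* Short final step: no automatic restart.
//STEP050  EXEC PGM=FINPROC,RD=NR
```

If RD is coded on the JOB statement as well, the JOB-level value applies to every step and overrides the EXEC-level values.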
The checkpoint/restart facility allows programs to establish restart points during execution. If a failure occurs, the job can be restarted from the last checkpoint rather than from the beginning of a step.
The SYSCHK DD statement is required when restarting a job from a checkpoint. It identifies the dataset containing the checkpoint entries and must be placed immediately before the first EXEC statement:
    //PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
    //             RESTART=(STEP040,CHECK3)
    //SYSCHK   DD DSN=CHECKPT.DATASET,DISP=OLD
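Two placement details are easy to get wrong. If the job has a JOBLIB DD statement, SYSCHK follows it. If the checkpoint dataset is not cataloged, the DD statement also needs unit and volume information. The sketch below assumes both cases; the volume serial CHK001 is made up for illustration.

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//             RESTART=(STEP040,CHECK3)
//JOBLIB   DD DSN=APPL.LOADLIB,DISP=SHR
//* SYSCHK comes after JOBLIB and before the first EXEC statement.
//* UNIT and VOL=SER are coded only because this sketch assumes the
//* checkpoint dataset is not cataloged (CHK001 is hypothetical).
//SYSCHK   DD DSN=CHECKPT.DATASET,DISP=OLD,
//             UNIT=SYSDA,VOL=SER=CHK001
//* The job steps follow here, unchanged from the original run.
```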
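Checkpoints are established by the program itself. Assembler programs issue the CHKPT macro; high-level languages reach the same facility through their own interfaces, such as the RERUN clause in COBOL or the PLICKPT routine in PL/I.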
This example shows a job designed with restart capabilities:
    //BILLRUN  JOB (ACCT),'MONTHLY BILLING',CLASS=A,
    //             MSGCLASS=X,MSGLEVEL=(1,1),
    //             NOTIFY=&SYSUID,RD=R
    //*
    //* STEP010: INITIALIZE PROCESSING
    //*
    //STEP010  EXEC PGM=INITPROC
    //SYSOUT   DD SYSOUT=*
    //INFILE   DD DSN=CUST.MASTER,DISP=SHR
    //OUTFILE  DD DSN=&&TEMP1,DISP=(NEW,PASS),
    //             UNIT=SYSDA,SPACE=(CYL,(10,5))
    //*
    //* STEP020: SORT CUSTOMER RECORDS
    //*
    //STEP020  EXEC PGM=SORT
    //SORTIN   DD DSN=&&TEMP1,DISP=(OLD,PASS)
    //SORTOUT  DD DSN=&&TEMP2,DISP=(NEW,PASS),
    //             UNIT=SYSDA,SPACE=(CYL,(10,5))
    //SYSIN    DD *
      SORT FIELDS=(1,10,CH,A)
    /*
    //SYSOUT   DD SYSOUT=*
    //*
    //* STEP030: BACKUP BILLING MASTER
    //*
    //STEP030  EXEC PGM=IDCAMS
    //SYSPRINT DD SYSOUT=*
    //SYSIN    DD *
      REPRO INFILE(BILLMAST) OUTFILE(BACKUP)
    /*
    //BILLMAST DD DSN=BILLING.MASTER,DISP=SHR
    //BACKUP   DD DSN=BILLING.MASTER.BACKUP,DISP=(NEW,CATLG,DELETE),
    //             UNIT=TAPE,VOL=SER=BKUP01
    //*
    //* STEP040: MAIN PROCESSING WITH CHECKPOINTS
    //* (&&TEMP2 IS PASSED ON SO THAT CLEANUP CAN STILL DELETE IT)
    //*
    //STEP040  EXEC PGM=BILLPROC
    //STEPLIB  DD DSN=APPL.LOADLIB,DISP=SHR
    //SYSOUT   DD SYSOUT=*
    //CHKPTDS  DD DSN=BILLING.CHECKPOINT,DISP=(MOD,KEEP,KEEP)
    //INFILE   DD DSN=&&TEMP2,DISP=(OLD,PASS)
    //MASTER   DD DSN=BILLING.MASTER,DISP=OLD
    //REPORT   DD DSN=BILLING.REPORT,DISP=(NEW,CATLG,DELETE),
    //             UNIT=SYSDA,SPACE=(CYL,(5,2))
    //*
    //* STEP050: FINALIZE PROCESSING
    //*
    //STEP050  EXEC PGM=FINPROC
    //SYSOUT   DD SYSOUT=*
    //INFILE   DD DSN=BILLING.REPORT,DISP=SHR
    //OUTFILE  DD SYSOUT=A
    //*
    //* CLEANUP STEP THAT ALWAYS RUNS EVEN IF PREVIOUS STEPS FAIL
    //*
    //CLEANUP  EXEC PGM=IEFBR14,COND=EVEN
    //TEMP1    DD DSN=&&TEMP1,DISP=(OLD,DELETE)
    //TEMP2    DD DSN=&&TEMP2,DISP=(OLD,DELETE)
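One caveat applies when this job is resubmitted with RESTART=STEP040: temporary datasets such as &&TEMP2 do not survive the end of the original job, so the restarted STEP040 would not find its INFILE. A possible variation, sketched here as fragments under the assumption that a cataloged work dataset is acceptable, keeps the sorted file under a permanent name. BILLING.WORK.SORTED is a hypothetical name.

```jcl
//* Original run: STEP020 writes a cataloged work dataset
//* (hypothetical name) instead of the temporary &&TEMP2.
//SORTOUT  DD DSN=BILLING.WORK.SORTED,DISP=(NEW,CATLG,DELETE),
//             UNIT=SYSDA,SPACE=(CYL,(10,5))
//*
//* Resubmission: the JOB statement names the restart step.
//BILLRUN  JOB (ACCT),'MONTHLY BILLING',CLASS=A,
//             MSGCLASS=X,MSGLEVEL=(1,1),
//             NOTIFY=&SYSUID,RD=R,
//             RESTART=STEP040
//*
//* STEP040 reads the cataloged copy, which still exists.
//INFILE   DD DSN=BILLING.WORK.SORTED,DISP=OLD
```

With this change the work dataset has to be deleted explicitly once the job completes, for example in the cleanup step.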
Beyond the basic JCL mechanisms, here are some broader strategies for effective restart and recovery:
Design job steps to be idempotent (can be run multiple times without adverse effects). This makes restart safer and more predictable.
Document dependencies between steps to understand what happens if a job is restarted from the middle.
Include validation steps that can detect data inconsistencies that might occur during restart.
Where possible, use transactional processing techniques to ensure data integrity during failures.
Always back up critical data before making significant updates.
Include cleanup steps that run regardless of job success or failure.
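The idempotency advice above can be sketched in JCL. A step that creates a dataset with DISP=NEW fails on a rerun if an earlier attempt left the dataset behind; a small IEFBR14 step in front of it removes that dependency. The step name PREDEL and DD name OLDRPT are assumptions introduced for this example.

```jcl
//* Delete the report if a previous run left it behind. DISP=MOD
//* allocates the dataset when it does not exist, so this step
//* completes whether or not the report is already there.
//PREDEL   EXEC PGM=IEFBR14
//OLDRPT   DD DSN=BILLING.REPORT,DISP=(MOD,DELETE,DELETE),
//             UNIT=SYSDA,SPACE=(TRK,(1,1))
//*
//* The real step can now always allocate the report as NEW.
//STEP040  EXEC PGM=BILLPROC
//REPORT   DD DSN=BILLING.REPORT,DISP=(NEW,CATLG,DELETE),
//             UNIT=SYSDA,SPACE=(CYL,(5,2))
```

With this layout, a step restart would name PREDEL rather than STEP040 so that the stale report is cleared first.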
1. Which parameter is used to restart a job at a specific job step?
2. What is the correct syntax to restart a job at STEP040?
3. Which parameter controls automatic restart after system failure?
4. What does the RD=R parameter specify?
5. Which DD statement is required for checkpoint/restart facilities?
6. What is the purpose of the CHKPT macro?