MainframeMaster

JCL Tutorial

Restart and Recovery

Techniques for handling failures and restarting jobs efficiently

Introduction to Restart and Recovery

In large batch processing environments, jobs fail for reasons such as system failures, resource constraints, or application errors. Rerunning a lengthy job from the beginning wastes system resources and delays processing. JCL provides several mechanisms for restarting a failed job at or near the point of failure instead.

This tutorial covers JCL's restart and recovery facilities, including the RESTART parameter, restart definition (RD) parameter, and checkpoint/restart techniques.

The RESTART Parameter

The RESTART parameter on the JOB statement allows you to restart a job at a specific step or checkpoint, bypassing steps that have already completed successfully. This is particularly useful for long-running jobs that fail after completing several steps.

RESTART Syntax

```jcl
//jobname JOB accounting-info,programmer-name,RESTART=stepname
//jobname JOB accounting-info,programmer-name,RESTART=(stepname,checkid)
```

Restarting at a Step

To restart a job at a specific step, use the RESTART=stepname format:

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=STEP040
```

This causes the system to skip all steps before STEP040 and begin execution at STEP040.
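When the restart step sits inside a cataloged or in-stream procedure, RESTART names both the job step that calls the procedure and the step inside it, in the form stepname.procstepname. The sketch below assumes a hypothetical calling step RUNPAY and a hypothetical procedure PAYPROC that contains STEP040; neither name comes from the original job.

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=RUNPAY.STEP040
//* RUNPAY (hypothetical) is the job step that invokes the procedure.
//* STEP040 is the step inside PAYPROC where execution resumes.
//RUNPAY   EXEC PAYPROC
```

Coding RESTART=* instead restarts the job at its first step.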

Restarting at a Checkpoint

To restart a job at a specific checkpoint within a step, use the RESTART=(stepname,checkid) format:

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=(STEP040,CHECK3)
```

This causes the system to skip all steps before STEP040, begin execution at STEP040, and use the checkpoint data for CHECK3 to position within the step.

Important Considerations:

  • When restarting at a specific step, all previous steps are bypassed
  • Bypassed steps are not executed, so their DD statements and dispositions are not processed; datasets that later steps need must already exist from the earlier run
  • Temporary datasets created in bypassed steps are not available
  • Ensure the job is designed to be restarted (idempotent operations)
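One common way to make a step safe to rerun is to put a small delete step in front of it and restart at that delete step instead of at the processing step. The following is a minimal sketch under stated assumptions: the step name DEL040, the program PAYCALC, and the dataset PAYROLL.OUTPUT are hypothetical and only illustrate the pattern.

```jcl
//* Restart point: resubmit with RESTART=DEL040, not RESTART=STEP040.
//* MOD allocates the dataset if it does not exist, so this step
//* completes whether or not a failed run left the dataset behind.
//DEL040   EXEC PGM=IEFBR14
//OLDOUT   DD DSN=PAYROLL.OUTPUT,DISP=(MOD,DELETE,DELETE),
//         UNIT=SYSDA,SPACE=(TRK,(1,1))
//* The processing step can now allocate its output as NEW again.
//STEP040  EXEC PGM=PAYCALC
//OUTFILE  DD DSN=PAYROLL.OUTPUT,DISP=(NEW,CATLG,DELETE),
//         UNIT=SYSDA,SPACE=(CYL,(10,5))
```

Pairing the delete with the step it protects keeps the rerun from failing on a duplicate dataset name.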

The RD (Restart Definition) Parameter

The RD parameter, coded on the JOB or EXEC statement, controls whether the system can automatically restart the job after a failure and whether checkpoints are taken. It differs from the RESTART parameter, which requests a deferred restart when you resubmit a previously failed job.

RD Syntax

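```jcl
RD=R | RD=NC | RD=NR | RD=RNC
```

| Value | Description |
|-------|-------------|
| R | Restart: automatic restart is allowed, from the last checkpoint if one was taken, otherwise from the beginning of the step |
| NC | No Checkpoint: no automatic restart, and CHKPT requests are suppressed |
| NR | No automatic Restart: checkpoints can still be taken and used later for a deferred restart with RESTART |
| RNC | Restart, No Checkpoint: automatic restart from the beginning of the step only; CHKPT requests are suppressed |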

Example: Allowing Automatic Restart

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RD=R
```

This job can be automatically restarted from a checkpoint or step beginning if a system failure occurs.

Example: Preventing Automatic Restart

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RD=NR
```

This job will not be automatically restarted after a system failure.
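RD can also be coded on individual EXEC statements when steps need different restart behavior. An RD on the JOB statement overrides any RD on EXEC statements, so the sketch below leaves it off the JOB statement. The program names PAYEDIT and PAYCALC are hypothetical.

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A
//* No RD on the JOB statement, so each step's own RD applies.
//* STEP030 is never restarted automatically.
//STEP030  EXEC PGM=PAYEDIT,RD=NR
//* STEP040 may be restarted automatically.
//STEP040  EXEC PGM=PAYCALC,RD=R
```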

Checkpoint/Restart Facilities

The checkpoint/restart facility allows programs to establish restart points during execution. If a failure occurs, the job can be restarted from the last checkpoint rather than from the beginning of a step.

How Checkpoint/Restart Works

  1. A program issues the CHKPT macro to establish a checkpoint
  2. The system creates a checkpoint record containing:
    • Information about the program's status
    • Contents of main storage
    • Position of datasets
  3. If the job fails, it can be restarted from this checkpoint

SYSCHK DD Statement

The SYSCHK DD statement is required when restarting a job from a checkpoint. It identifies the dataset containing the checkpoint record:

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=(STEP040,CHECK3)
//SYSCHK   DD DSN=CHECKPT.DATASET,DISP=OLD
```
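Placement matters for SYSCHK: it goes after the JOB statement and any JOBLIB DD statement, and before the first EXEC statement. The sketch below also shows the extra parameters for a checkpoint dataset that is not cataloged. The JOBLIB dataset and the volume serial CHK001 are assumptions for illustration.

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=(STEP040,CHECK3)
//JOBLIB   DD DSN=APPL.LOADLIB,DISP=SHR
//* SYSCHK follows JOBLIB and precedes the first EXEC statement.
//* UNIT and VOL=SER are needed only if the dataset is not cataloged.
//SYSCHK   DD DSN=CHECKPT.DATASET,DISP=OLD,
//         UNIT=SYSDA,VOL=SER=CHK001
//STEP010  EXEC PGM=INITPROC
//* Remaining steps follow unchanged from the original job.
```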

Creating Checkpoints in Programs

Checkpoints are established by programs using the CHKPT macro. High-level languages reach the same facility through their own constructs, for example the RERUN clause in COBOL.

Best Practices for Checkpoint/Restart:

  • Place checkpoints at logical processing boundaries
  • Don't place checkpoints too frequently (creates overhead)
  • Don't place checkpoints too infrequently (loses too much work if failure occurs)
  • Ensure checkpoint datasets have adequate space
  • Use a meaningful naming convention for checkpoint IDs

Practical Example: A Restartable Job

This example shows a job designed with restart capabilities:

```jcl
//BILLRUN  JOB (ACCT),'MONTHLY BILLING',CLASS=A,
//         MSGCLASS=X,MSGLEVEL=(1,1),
//         NOTIFY=&SYSUID,RD=R
//*
//* STEP010: INITIALIZE PROCESSING
//*
//STEP010  EXEC PGM=INITPROC
//SYSOUT   DD SYSOUT=*
//INFILE   DD DSN=CUST.MASTER,DISP=SHR
//OUTFILE  DD DSN=&&TEMP1,DISP=(NEW,PASS),
//         UNIT=SYSDA,SPACE=(CYL,(10,5))
//*
//* STEP020: SORT CUSTOMER RECORDS
//*
//STEP020  EXEC PGM=SORT
//SORTIN   DD DSN=&&TEMP1,DISP=(OLD,PASS)
//SORTOUT  DD DSN=&&TEMP2,DISP=(NEW,PASS),
//         UNIT=SYSDA,SPACE=(CYL,(10,5))
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//SYSOUT   DD SYSOUT=*
//*
//* STEP030: BACKUP BILLING MASTER
//*
//STEP030  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  REPRO INFILE(BILLMAST) OUTFILE(BACKUP)
/*
//BILLMAST DD DSN=BILLING.MASTER,DISP=SHR
//BACKUP   DD DSN=BILLING.MASTER.BACKUP,DISP=(NEW,CATLG,DELETE),
//         UNIT=TAPE,VOL=SER=BKUP01
//*
//* STEP040: MAIN PROCESSING WITH CHECKPOINTS
//*
//STEP040  EXEC PGM=BILLPROC
//STEPLIB  DD DSN=APPL.LOADLIB,DISP=SHR
//SYSOUT   DD SYSOUT=*
//CHKPTDS  DD DSN=BILLING.CHECKPOINT,DISP=(MOD,KEEP,KEEP)
//INFILE   DD DSN=&&TEMP2,DISP=(OLD,PASS)
//MASTER   DD DSN=BILLING.MASTER,DISP=OLD
//REPORT   DD DSN=BILLING.REPORT,DISP=(NEW,CATLG,DELETE),
//         UNIT=SYSDA,SPACE=(CYL,(5,2))
//*
//* STEP050: FINALIZE PROCESSING
//*
//STEP050  EXEC PGM=FINPROC
//SYSOUT   DD SYSOUT=*
//INFILE   DD DSN=BILLING.REPORT,DISP=SHR
//OUTFILE  DD SYSOUT=A
//*
//* CLEANUP STEP THAT ALWAYS RUNS EVEN IF PREVIOUS STEPS FAIL
//*
//CLEANUP  EXEC PGM=IEFBR14,COND=EVEN
//TEMP1    DD DSN=&&TEMP1,DISP=(OLD,DELETE)
//TEMP2    DD DSN=&&TEMP2,DISP=(OLD,DELETE)
```

Key Restart Features in this Job:

  • RD=R allows automatic restart after system failure
  • STEP040 uses a checkpoint dataset to enable restart within the step
  • Temporary datasets are passed between steps and cleaned up at the end
  • COND=EVEN ensures cleanup runs regardless of previous step failures
  • Backup step (STEP030) ensures data integrity before main processing
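The temporary datasets limit deferred restarts of this job: &&TEMP2 does not survive the end of a failed run, so a resubmission with RESTART=STEP040 would not find the step's input. One possible variation, sketched here with a hypothetical dataset name BILLING.SORTED, catalogs the sorted file so that a later restart at STEP040 can read it. Only the changed DD statements are shown in full.

```jcl
//STEP020  EXEC PGM=SORT
//SORTIN   DD DSN=&&TEMP1,DISP=(OLD,PASS)
//* BILLING.SORTED (hypothetical) is cataloged, so it is still
//* available when the job is resubmitted with RESTART=STEP040.
//SORTOUT  DD DSN=BILLING.SORTED,DISP=(NEW,CATLG,DELETE),
//         UNIT=SYSDA,SPACE=(CYL,(10,5))
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//SYSOUT   DD SYSOUT=*
//* STEP040 reads the cataloged dataset instead of &&TEMP2.
//* Its other DD statements stay as in the job above.
//STEP040  EXEC PGM=BILLPROC
//INFILE   DD DSN=BILLING.SORTED,DISP=OLD
```

With this change the job also needs a final step that deletes BILLING.SORTED after a successful run, or the next full run would fail when SORTOUT tries to create it as NEW.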

Restart and Recovery Strategies

Beyond the basic JCL mechanisms, here are some broader strategies for effective restart and recovery:

Idempotent Design

Design job steps to be idempotent (can be run multiple times without adverse effects). This makes restart safer and more predictable.

Step Dependencies

Document dependencies between steps to understand what happens if a job is restarted from the middle.

Data Validation

Include validation steps that can detect data inconsistencies that might occur during restart.

Transactional Processing

Where possible, use transactional processing techniques to ensure data integrity during failures.

Backup Before Updates

Always back up critical data before making significant updates.

Cleanup Steps

Include cleanup steps that run regardless of job success or failure.
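COND=EVEN runs a step in every case. When a step should run only after a failure, the IF/THEN/ENDIF construct is an alternative. The sketch below is an assumption-laden illustration: the return-code threshold of 4, the step names, and the choice to delete BILLING.REPORT are hypothetical.

```jcl
//* Run the recovery step only if an earlier step abended or
//* ended with a return code above 4 (threshold is an assumption).
//CHKFAIL  IF (ABEND | RC > 4) THEN
//RECOVER  EXEC PGM=IEFBR14
//PARTIAL  DD DSN=BILLING.REPORT,DISP=(MOD,DELETE,DELETE),
//         UNIT=SYSDA,SPACE=(TRK,(1,1))
//         ENDIF
```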

Test Your Knowledge

1. Which parameter is used to restart a job at a specific job step?

  • RESUME
  • RESTART
  • RERUN
  • RECOVER

2. What is the correct syntax to restart a job at STEP040?

  • RESTART=STEP040
  • RESTART AT STEP040
  • RERUN=STEP040
  • RESUME=STEP040

3. Which parameter controls automatic restart after system failure?

  • AR
  • RD
  • SR
  • AUTO

4. What does the RD=R parameter specify?

  • No automatic restarts permitted
  • Restart at beginning of step only, even if checkpoints exist
  • Restart job at checkpoint or step beginning after system failure
  • Restart at any point

5. Which DD statement is required for checkpoint/restart facilities?

  • SYSOUT DD
  • SYSCHK DD
  • SYSRESTART DD
  • CHECKPT DD

6. What is the purpose of the CHKPT macro?

  • To define DASD space for checkpoints
  • To establish checkpoint records during program execution
  • To record system status
  • To validate checkpoint IDs