JCL Tutorial

Restart and Recovery

Techniques for handling failures and restarting jobs efficiently


Introduction to Restart and Recovery

In large batch processing environments, jobs fail for reasons such as system failures, resource constraints, or application errors. Rerunning a lengthy job from the beginning wastes system resources and delays processing. JCL provides several mechanisms for restarting a failed job at or near the point of failure instead.

This tutorial covers JCL's restart and recovery facilities, including the RESTART parameter, restart definition (RD) parameter, and checkpoint/restart techniques.

The RESTART Parameter

The RESTART parameter on the JOB statement allows you to restart a job at a specific step or checkpoint, bypassing steps that have already completed successfully. This is particularly useful for long-running jobs that fail after completing several steps.

RESTART Syntax

jcl

//jobname JOB accounting-info,programmer-name,RESTART=stepname
//jobname JOB accounting-info,programmer-name,RESTART=(stepname,checkid)

Restarting at a Step

To restart a job at a specific step, use the RESTART=stepname format:

jcl

//PAYJOB JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=STEP040

This causes the system to skip all steps before STEP040 and begin execution at STEP040.
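The step name can also be qualified or replaced by an asterisk. As a sketch, assume a hypothetical job step PAYSTEP that calls a cataloged procedure containing a step named STEP040. The first form restarts at that procedure step; the second uses RESTART=* to restart at the first job step:

```jcl
//PAYJOB JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=PAYSTEP.STEP040
//* Alternative: restart at the first job step instead
//PAYJOB JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=*
```

Only one of the two JOB statements would be used in a real submission; they are shown together for comparison.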

Restarting at a Checkpoint

To restart a job at a specific checkpoint within a step, use the RESTART=(stepname,checkid) format:

jcl

//PAYJOB JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=(STEP040,CHECK3)

This causes the system to skip all steps before STEP040, begin execution at STEP040, and use the checkpoint data for CHECK3 to position within the step.

Important Considerations:

When restarting at a specific step, all previous steps are bypassed
Bypassed steps are not executed, so their DD statements are not processed; any dataset a later step needs must already exist from the original run
Temporary (&&) and passed datasets created in bypassed steps are not available
Ensure the job is designed to be restarted (idempotent operations)
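One way to act on the temporary-dataset point is to write intermediate output to a cataloged dataset instead of an && dataset, so that it still exists when a later step is restarted. A minimal sketch, assuming hypothetical names (PAYROLL.INPUT, PAYROLL.WORK.SORTED, PAYPROC) and a site where UNIT=SYSDA is valid:

```jcl
//* STEP030 writes its output to a cataloged work dataset rather than
//* to an && dataset, so a restart at STEP040 can still find it.
//STEP030  EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=PAYROLL.INPUT,DISP=SHR
//SORTOUT  DD DSN=PAYROLL.WORK.SORTED,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(10,5))
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//* STEP040 reads the cataloged dataset and keeps it on completion.
//STEP040  EXEC PGM=PAYPROC
//SYSOUT   DD SYSOUT=*
//INFILE   DD DSN=PAYROLL.WORK.SORTED,DISP=OLD
```

The trade-off is that the system no longer removes the work dataset at job end, so it has to be deleted explicitly once the job has completed successfully.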

The RD (Restart Definition) Parameter

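The RD parameter, coded on the JOB statement or on an individual EXEC statement, controls whether the system can automatically restart the job after a failure and whether the CHKPT macro is allowed to take checkpoints. It works differently from the RESTART parameter, which is used for a deferred restart, in which a previously failed job is resubmitted.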

RD Syntax

jcl

RD=R | RD=NC | RD=NR | RD=RNC

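Value	Description
R	Automatic restart allowed, at the latest checkpoint or at the beginning of the step
RNC	Automatic restart allowed at the beginning of the step only; the CHKPT macro is suppressed, so no checkpoints are taken
NR	No automatic restart; checkpoints are still taken, so a deferred checkpoint restart (RESTART parameter) remains possible
NC	No automatic restart and no checkpoints; the CHKPT macro is suppressed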

Example: Allowing Automatic Restart

jcl

//PAYJOB JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RD=R

This job can be automatically restarted from a checkpoint or step beginning if a system failure occurs.

Example: Preventing Automatic Restart

jcl

//PAYJOB JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RD=NR

This job will not be automatically restarted after a system failure.
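RD can also be coded on an EXEC statement to set the behavior for a single step; an RD value on the JOB statement takes precedence over any RD on EXEC statements. A sketch, reusing step and program names from this tutorial purely for illustration:

```jcl
//PAYJOB   JOB (ACCT),'MONTHLY PAYROLL',CLASS=A
//* No automatic restart for the short setup step
//STEP010  EXEC PGM=INITPROC,RD=NR
//SYSOUT   DD SYSOUT=*
//* Automatic step or checkpoint restart allowed for the long step
//STEP040  EXEC PGM=BILLPROC,RD=R
//SYSOUT   DD SYSOUT=*
```

Leaving RD off the JOB statement is what lets the per-step values take effect here.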

Checkpoint/Restart Facilities

The checkpoint/restart facility allows programs to establish restart points during execution. If a failure occurs, the job can be restarted from the last checkpoint rather than from the beginning of a step.

How Checkpoint/Restart Works

A program issues the CHKPT macro to establish a checkpoint
The system creates a checkpoint record containing:

Information about the program's status
Contents of main storage
Position of datasets

If the job fails, it can be restarted from this checkpoint

SYSCHK DD Statement

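The SYSCHK DD statement is required when restarting a job from a checkpoint (a deferred checkpoint restart). It identifies the dataset containing the checkpoint record and must be placed immediately before the first EXEC statement, after any JOBLIB DD statement: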

jcl

//PAYJOB JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=(STEP040,CHECK3)
//SYSCHK  DD DSN=CHECKPT.DATASET,DISP=OLD
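If the job has a JOBLIB DD statement, SYSCHK follows it, and if the checkpoint dataset is not cataloged, the unit and volume have to be supplied. A sketch under those assumptions; the volume serial CHK001 and the use of APPL.LOADLIB as a job library are hypothetical:

```jcl
//PAYJOB  JOB (ACCT),'MONTHLY PAYROLL',CLASS=A,
//         RESTART=(STEP040,CHECK3)
//JOBLIB  DD DSN=APPL.LOADLIB,DISP=SHR
//* SYSCHK comes after JOBLIB and before the first EXEC statement
//SYSCHK  DD DSN=CHECKPT.DATASET,DISP=OLD,
//           UNIT=SYSDA,VOL=SER=CHK001
//STEP010 EXEC PGM=INITPROC
```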

Creating Checkpoints in Programs

Programs establish checkpoints by issuing the CHKPT macro, either directly in assembler or through a high-level language facility such as the COBOL RERUN clause.

Best Practices for Checkpoint/Restart:

Place checkpoints at logical processing boundaries
Don't place checkpoints too frequently (creates overhead)
Don't place checkpoints too infrequently (loses too much work if failure occurs)
Ensure checkpoint datasets have adequate space
Use a meaningful naming convention for checkpoint IDs
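For multivolume sequential datasets, checkpoint frequency can also be tied to volume boundaries from JCL alone. The CHKPT=EOV parameter on a DD statement requests a checkpoint at each end-of-volume, written to the dataset named on a SYSCKEOV DD statement in the same step. A minimal sketch, in which the dataset names are hypothetical, BILLING.CKEOV is assumed to be already allocated and cataloged, and the exact SYSCKEOV requirements should be checked against the JCL reference for your system:

```jcl
//STEP040  EXEC PGM=BILLPROC
//SYSOUT   DD SYSOUT=*
//* Receives the end-of-volume checkpoint records
//* (assumed to exist already; MOD appends to it)
//SYSCKEOV DD DSN=BILLING.CKEOV,DISP=MOD
//* Multivolume input: take a checkpoint each time a volume ends
//INFILE   DD DSN=BILLING.TRANS.HISTORY,DISP=OLD,
//            CHKPT=EOV
```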

Practical Example: A Restartable Job

This example shows a job designed with restart capabilities:

jcl

//BILLRUN JOB (ACCT),'MONTHLY BILLING',CLASS=A,
//         MSGCLASS=X,MSGLEVEL=(1,1),
//         NOTIFY=&SYSUID,RD=R
//*
//* STEP010: INITIALIZE PROCESSING
//*
//STEP010  EXEC PGM=INITPROC
//SYSOUT   DD SYSOUT=*
//INFILE   DD DSN=CUST.MASTER,DISP=SHR
//OUTFILE  DD DSN=&&TEMP1,DISP=(NEW,PASS),
//            UNIT=SYSDA,SPACE=(CYL,(10,5))
//*
//* STEP020: SORT CUSTOMER RECORDS
//*
//STEP020  EXEC PGM=SORT
//SORTIN   DD DSN=&&TEMP1,DISP=(OLD,PASS)
//SORTOUT  DD DSN=&&TEMP2,DISP=(NEW,PASS),
//            UNIT=SYSDA,SPACE=(CYL,(10,5))
//SYSIN    DD *
  SORT FIELDS=(1,10,CH,A)
/*
//SYSOUT   DD SYSOUT=*
//*
//* STEP030: BACKUP BILLING MASTER
//*
//STEP030  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  REPRO INFILE(BILLMAST) OUTFILE(BACKUP)
/*
//BILLMAST DD DSN=BILLING.MASTER,DISP=SHR
//BACKUP   DD DSN=BILLING.MASTER.BACKUP,DISP=(NEW,CATLG,DELETE),
//            UNIT=TAPE,VOL=SER=BKUP01
//*
//* STEP040: MAIN PROCESSING WITH CHECKPOINTS
//*
//STEP040  EXEC PGM=BILLPROC
//STEPLIB  DD DSN=APPL.LOADLIB,DISP=SHR
//SYSOUT   DD SYSOUT=*
//CHKPTDS  DD DSN=BILLING.CHECKPOINT,DISP=(MOD,KEEP,KEEP)
//INFILE   DD DSN=&&TEMP2,DISP=(OLD,PASS)
//MASTER   DD DSN=BILLING.MASTER,DISP=OLD
//REPORT   DD DSN=BILLING.REPORT,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(5,2))
//*
//* STEP050: FINALIZE PROCESSING
//*
//STEP050  EXEC PGM=FINPROC
//SYSOUT   DD SYSOUT=*
//INFILE   DD DSN=BILLING.REPORT,DISP=SHR
//OUTFILE  DD SYSOUT=A
//*
//* CLEANUP STEP THAT ALWAYS RUNS EVEN IF PREVIOUS STEPS FAIL
//*
//CLEANUP  EXEC PGM=IEFBR14,COND=EVEN
//TEMP1    DD DSN=&&TEMP1,DISP=(OLD,DELETE)
//TEMP2    DD DSN=&&TEMP2,DISP=(OLD,DELETE)

Key Restart Features in this Job:

RD=R allows automatic restart after system failure
STEP040 uses a checkpoint dataset to enable restart within the step
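Temporary datasets are passed between steps and cleaned up at the end; because they do not survive the job, a deferred restart at STEP040 would need &&TEMP2 replaced by a cataloged dataset
COND=EVEN makes the cleanup step run even if an earlier step terminated abnormally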
Backup step (STEP030) ensures data integrity before main processing

Restart and Recovery Strategies

Beyond the basic JCL mechanisms, here are some broader strategies for effective restart and recovery:

Idempotent Design

Design job steps to be idempotent (can be run multiple times without adverse effects). This makes restart safer and more predictable.
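A common JCL idiom for making a step that creates a dataset rerunnable is to delete any leftover copy first. The sketch below reuses the BILLING.REPORT dataset from this tutorial; the step name DELRPT is hypothetical. DISP=(MOD,DELETE,DELETE) is used so that the delete step succeeds whether or not the dataset already exists:

```jcl
//* Remove BILLING.REPORT if a previous run left it behind.
//* MOD allocates the dataset if it is missing, so the step does not
//* fail when there is nothing to delete; it is deleted either way.
//DELRPT   EXEC PGM=IEFBR14
//REPORT   DD DSN=BILLING.REPORT,DISP=(MOD,DELETE,DELETE),
//            UNIT=SYSDA,SPACE=(TRK,(1,1))
//* The creating step can now always use DISP=(NEW,CATLG,DELETE).
//STEP040  EXEC PGM=BILLPROC
//REPORT   DD DSN=BILLING.REPORT,DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(CYL,(5,2))
```

This suits a restart from the beginning of the step. It would not be appropriate when the step is resumed from a checkpoint and the partially written dataset is still needed.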

Step Dependencies

Document dependencies between steps to understand what happens if a job is restarted from the middle.

Data Validation

Include validation steps that can detect data inconsistencies that might occur during restart.

Transactional Processing

Where possible, use transactional processing techniques to ensure data integrity during failures.

Backup Before Updates

Always back up critical data before making significant updates.

Cleanup Steps

Include cleanup steps that run regardless of job success or failure.

Test Your Knowledge

1. Which parameter is used to restart a job at a specific job step?

RESUME
RESTART
RERUN
RECOVER

2. What is the correct syntax to restart a job at STEP040?

RESTART=STEP040
RESTART AT STEP040
RERUN=STEP040
RESUME=STEP040

3. Which parameter controls automatic restart after system failure?

AR
RD
SR
AUTO

4. What does the RD=R parameter specify?

No automatic restarts permitted
Restart at beginning of step only, even if checkpoints exist
Restart job at checkpoint or step beginning after system failure
Restart at any point

5. Which DD statement is required when restarting a job from a checkpoint?

SYSOUT DD
SYSCHK DD
SYSRESTART DD
CHECKPT DD

6. What is the purpose of the CHKPT macro?

To define DASD space for checkpoints
To establish checkpoint records during program execution
To record system status
To validate checkpoint IDs
