Checkpoint and Restore
Overview
The Checkpoint feature can save the state of a running job to disk, and later restore it to continue execution. This operation captures process memory, open files, and execution state, and is complex to implement.
Important note: The Checkpoint feature is not guaranteed to work for all applications. Success depends on application characteristics, system configuration, and runtime conditions. This feature is currently experimental and must be thoroughly tested before production use.
Supported Versions
fsched supports this feature after version (TBD).
Potential Use Cases
If the Checkpoint feature is reliable and effective for your application, it can be used in the following scenarios:
- Node failure recovery: After a node failure, fsched can automatically restore jobs from the Checkpoint to avoid restarting from scratch (requires application compatibility and successful restore).
- Long-running jobs: Provides some interruption protection for compatible applications.
- Planned maintenance: Pause jobs before maintenance, but recovery is not guaranteed.
- Resource management: If supported by the application, jobs can be moved off busy nodes.
Important Considerations
The feasibility of these scenarios depends entirely on your specific application:
- Application compatibility: Most complex applications have limitations and may be completely unsupported.
- Reliability: Checkpoint or restore operations may fail for various reasons.
- Testing requirements: Each application and workload must be extensively tested.
- No guarantees: Even after testing, Checkpoint/restore can still fail in production.
Do not rely on Checkpoint as a primary fault-tolerance or job-management mechanism without extensive validation.
How It Works
This feature is based on CRIU (Checkpoint/Restore In Userspace). CRIU is a Linux tool that can freeze a running application and save its state to disk. We use a customized version optimized for HPC workloads.
The Checkpoint process includes two steps:
- Freeze: pause the job (duration depends on job size and system load)
- Dump: write the job state to disk
During restore, fsched attempts to reconstruct the job state, including:
- Memory contents
- Open file descriptors
- Network connections
- Process tree structure
- Environment variables
Note: Whether restore succeeds depends on many factors. Complex applications with external dependencies, network state, or hardware access may not restore correctly.
Key Limitations
The Checkpoint feature is complex and prone to failure
Many applications cannot use the Checkpoint feature correctly. Even if basic tests pass, failures may still occur in production.
Known incompatibilities:
- GPU and specialized hardware access (typically unsupported)
- Most MPI applications (unreliable)
- Applications with complex network state
- Real-time or timing-sensitive applications
- Applications that depend on external services
- Complex inter-process communication patterns
Even simple applications can fail:
- File I/O operations in progress during Checkpoint
- Network connections that time out or cannot be re-established
- Temporary files or system resources that change between Checkpoint and restore
- Kernel or library version differences
Performance impact
- Checkpoint interrupts the job (depending on memory size, it may last from seconds to minutes)
- Checkpoint creation generates heavy I/O load
- Storage requirements are equal to or greater than the job's memory size
- Failed Checkpoints waste time and resources
System requirements
- Linux kernel 3.11 or later is required; kernel 4.x or later is strongly recommended
- A shared file system with high I/O performance is required
- Sufficient free disk space (recommend several times the job's memory size)
Mandatory testing: Never assume Checkpoint will work. You must test extensively under real conditions. Even then, failures can occur in production.
Getting Started
Cluster Administrators
To enable Checkpoint on the cluster, configure the following:
# Enable Checkpoint plugin
CheckpointType = checkpoint/criu
This is the minimum configuration. For fsched-specific advanced settings, see the Configuration Options section below.
System requirements: Checkpoint requires Linux kernel 3.11 or later. For full functionality and reliability, Linux kernel 4.x or later is strongly recommended. CentOS/RHEL 7 may not support all features.
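A quick pre-flight check against the kernel versions documented above (3.11 minimum, 4.x or later recommended) can be scripted. The helper below is a hypothetical sketch that only inspects the version string; it does not verify CRIU functionality:

```shell
#!/bin/sh
# Classify a kernel release string against the documented requirements:
# 3.11 is the minimum, 4.x or later is recommended.
kernel_class() {
    # $1 is a kernel release string such as "5.15.0-91-generic"
    major=$(printf '%s' "$1" | cut -d. -f1)
    minor=$(printf '%s' "$1" | cut -d. -f2 | sed 's/[^0-9].*//')
    if [ "$major" -ge 4 ]; then
        echo "recommended"
    elif [ "$major" -eq 3 ] && [ "$minor" -ge 11 ]; then
        echo "minimum"
    else
        echo "unsupported"
    fi
}

# Check the running kernel
kernel_class "$(uname -r)"
```

On a CentOS/RHEL 7 node (kernel 3.10) this reports "unsupported", which matches the caveat above.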
Users
Submit Checkpoint-Enabled Jobs
Use the --checkpoint option when submitting jobs to enable automatic Checkpointing:
sbatch --checkpoint=30:00 my_job_script.sh
This command creates a Checkpoint every 30 minutes. Supported time formats:
- minutes (for example 30)
- minutes:seconds (for example 30:00)
- hours:minutes:seconds (for example 2:30:00)
- days-hours (for example 1-0 for 1 day)
- days-hours:minutes (for example 1-0:30)
- days-hours:minutes:seconds (for example 1-2:30:00)
Suggested starting intervals (adjust based on testing):
- Testing: --checkpoint=5:00 (every 5 minutes)
- After application compatibility is verified: --checkpoint=30:00 to --checkpoint=2:00:00
- Balance Checkpoint overhead against potential lost work
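When scripting around these intervals, the formats above can be normalized to seconds for comparison. The helper below is hypothetical (not part of fsched) and simply mirrors the format rules listed above: with a days- prefix the colon-separated fields are hours:minutes:seconds, without it they are minutes, minutes:seconds, or hours:minutes:seconds.

```shell
#!/bin/sh
# Convert a --checkpoint interval spec to seconds.
# Hypothetical helper for illustration; not an fsched utility.
interval_to_seconds() {
    spec=$1; days=0; has_days=no
    case $spec in
        *-*) days=${spec%%-*}; spec=${spec#*-}; has_days=yes ;;
    esac
    oldIFS=$IFS; IFS=:; set -- $spec; IFS=$oldIFS
    if [ "$has_days" = yes ]; then
        # fields after the days part count down from hours
        case $# in
            1) h=$1; m=0;  s=0  ;;
            2) h=$1; m=$2; s=0  ;;
            3) h=$1; m=$2; s=$3 ;;
        esac
    else
        # without a days part, a single field means minutes
        case $# in
            1) h=0;  m=$1; s=0  ;;
            2) h=0;  m=$1; s=$2 ;;
            3) h=$1; m=$2; s=$3 ;;
        esac
    fi
    echo $(( days*86400 + h*3600 + m*60 + s ))
}

interval_to_seconds 2:30:00   # 2 hours 30 minutes -> 9000
```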
Manual Checkpoint Operations
You can also manually control Checkpoints with scontrol:
Manually create a Checkpoint:
scontrol checkpoint create <job_id>
Creates a Checkpoint for a running job, which may fail depending on application state.
Vacate a job and save its state (release resources):
scontrol checkpoint vacate <job_id>
Checkpoints the job and terminates it. If Checkpoint fails, the job is not terminated.
Restart a Checkpointed job:
scontrol checkpoint restart <job_id>
Restores the job from the last Checkpoint; it may fail for various reasons.
Basic Workflow Example
# 1. Submit a Checkpoint-enabled test job
$ sbatch --checkpoint=1:00:00 test_job.sh
Submitted batch job 12345
# 2. Check job status
$ squeue -j 12345
JOBID PARTITION NAME USER ST TIME NODES
12345 compute test alice R 2:15 1
# 3. Attempt vacate (may fail)
$ scontrol checkpoint vacate 12345
# 4. If successful, the job is checkpointed and can be restarted
$ scontrol checkpoint restart 12345
# Note: verify job output to ensure correct recovery
Configuration Options
You can customize Checkpoint behavior with the following options. All options are optional and have reasonable defaults.
fsched-Specific Features
fsched provides several advanced Checkpoint features not available in standard Slurm:
Automatic recovery on node failure - When a node failure causes a job to be requeued, fsched automatically attempts to restore from the last Checkpoint instead of restarting from scratch. This can provide fault tolerance for compatible applications, but success is not guaranteed. Additional timeout configuration is required (see below).
Load-aware Checkpoint - Skips a scheduled Checkpoint when system load is too high, reducing the impact on other workloads. If load remains high, Checkpoints may be skipped for a long time.
Incremental Checkpoint - Saves only memory changes since the last Checkpoint. This can save time and storage for applications with stable memory access patterns, but it introduces dependencies between Checkpoints and can complicate the restore process.
Pre-dump - Copies memory before the final freeze to reduce pause time. Effectiveness depends on the application's memory access pattern.
Node Failure Recovery Configuration
To enable automatic Checkpoint recovery on node failure, use the following additional configuration:
BatchStartTimeout = 10
SlurmdTimeout = 15
AuthInfo = cred_expire=180,ttl=180,requeue_timeout=30
Do not use these settings in high-load environments.
These timeout settings are aggressive and are intended for fast failure detection and requeue. However, under high node load they can cause false positives and unnecessary job failures:
- Jobs may be incorrectly marked failed when nodes slow down temporarily.
- Normal jobs may be terminated and requeued unnecessarily.
- The system may become unstable at high utilization.
Use these settings only when:
- The cluster typically has ample idle capacity.
- Node failures are frequent and you need fast recovery.
- Occasional false positives are acceptable.
For high-utilization clusters, do not enable automatic node-failure recovery. Use manual Checkpoint/vacate/restart instead.
Basic Options
CRIU_MaxEpochs (default: 1)
- Number of Checkpoint snapshots to keep
- Old Checkpoints are deleted automatically
- Higher values provide more restore options but use more disk space
- Example: CRIU_MaxEpochs = 3 keeps the last 3 Checkpoints
Performance Optimization
CRIU_Incremental (default: NO)
- When enabled, saves only changes since the last Checkpoint
- Can reduce Checkpoint time and storage for some applications
- Drawback: restore requires a complete incremental chain; any file corruption causes restore failure
- Adds system complexity and potential failure points
CRIU_FullEvery (default: 10)
- When using incremental Checkpoint, perform a full Checkpoint every N Checkpoints
- Prevents very long incremental chains that slow restore
- Applies only when CRIU_Incremental = YES
- Example: with CRIU_FullEvery = 10, Checkpoints 1-9 are incremental, 10 is full, then repeat
CRIU_PreDump (default: NO)
- Pre-copies memory before the final freeze to reduce freeze time
- Effectiveness depends on application memory access patterns
- May not significantly reduce freeze time for applications with frequent memory changes
CRIU_PreDumpDelay (default: 5 seconds)
- Wait time between pre-dump and final dump
- Allows the application to stabilize after pre-dump
- Applies only when CRIU_PreDump = YES
Load-Aware Checkpoint
CRIU_LoadAware (default: NO)
- Skips scheduled Checkpoint when system load is too high
- If the system stays busy, Checkpoint may be skipped for a long time
- Checks CPU, I/O, and memory load against thresholds
CRIU_LoadThresholdCPU (default: 0.95)
- Skip Checkpoint if CPU utilization exceeds this ratio (0.0 to 1.0)
- Example: 0.95 means skip if CPU is above 95% utilization
CRIU_LoadThresholdIO (default: 0.50)
- Skip Checkpoint if I/O wait exceeds this ratio (0.0 to 1.0)
- Example: 0.50 means skip if I/O wait exceeds 50%
CRIU_LoadThresholdMemory (default: 0.85)
- Skip Checkpoint if memory usage exceeds this ratio (0.0 to 1.0)
- Example: 0.85 means skip if memory usage exceeds 85%
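To make the thresholds concrete, the snippet below computes the kind of memory-usage ratio that CRIU_LoadThresholdMemory is presumably compared against. This is an illustration of the metric only; the plugin's actual load check is internal to fsched and may differ.

```shell
#!/bin/sh
# Illustrative memory-usage ratio, in the 0.0-1.0 range that the
# CRIU_LoadThreshold* options use. Linux-only (reads /proc/meminfo).
mem_usage_ratio() {
    awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
         END {printf "%.2f\n", (total - avail) / total}' /proc/meminfo
}

ratio=$(mem_usage_ratio)
threshold=0.85   # CRIU_LoadThresholdMemory default
echo "memory usage ratio: $ratio"
# awk handles the floating-point comparison that plain sh cannot
if awk -v r="$ratio" -v t="$threshold" 'BEGIN {exit !(r > t)}'; then
    echo "above threshold: a scheduled Checkpoint would be skipped"
else
    echo "below threshold: Checkpoint would proceed"
fi
```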
Configuration Examples
Minimal configuration (start here):
CheckpointType = checkpoint/criu
With incremental (test thoroughly before use):
CheckpointType = checkpoint/criu
CRIU_Incremental = YES
CRIU_FullEvery = 5
CRIU_MaxEpochs = 2
With load-aware (Checkpoints may be skipped):
CheckpointType = checkpoint/criu
CRIU_LoadAware = YES
CRIU_LoadThresholdCPU = 0.90
CRIU_LoadThresholdIO = 0.60
CRIU_LoadThresholdMemory = 0.80
Note: Start with the minimal configuration. Add features only after verifying they work for your specific application.
Command Reference
User Commands
| Command | Description | When to Use |
|---|---|---|
| sbatch --checkpoint=<interval> script.sh | Submit a job with automatic Checkpoint enabled | Start a new job that needs periodic Checkpointing |
| scontrol checkpoint create <job_id> | Create a Checkpoint immediately | Before risky operations or for manual backup |
| scontrol checkpoint vacate <job_id> | Checkpoint and vacate the job, releasing resources | Planned maintenance or resource rebalancing |
| scontrol checkpoint restart <job_id> | Restore and continue a vacated job | After maintenance or when resources are available |
| scontrol show job <job_id> | View job details including Checkpoint status | Check Checkpoint configuration and status |
Example Commands
Check whether a job has Checkpoint enabled:
scontrol show job 12345 | grep Checkpoint
Attempt to create a Checkpoint:
scontrol checkpoint create 12345
# Check logs to verify success
Attempt to vacate multiple jobs:
for job in 12345 12346 12347; do
scontrol checkpoint vacate $job
# Verify each Checkpoint before relying on it
done
Note: Before assuming a job is recoverable, verify Checkpoint success by checking logs and testing restores.
Best Practices
Choosing a Checkpoint Interval
Checkpoint interval selection depends on many factors and requires repeated testing:
Considerations:
- Checkpoint overhead increases with job memory size
- Storage I/O performance affects Checkpoint duration
- More frequent Checkpoints increase overhead but reduce work lost on failure
- Applications vary in how sensitive they are to Checkpoint interruptions
Testing approach:
- Start with a longer interval (1-2 hours)
- Measure Checkpoint duration and impact on the application
- Adjust based on observed overhead and acceptable risk
- Monitor Checkpoint failures and adjust accordingly
There is no universal "best" interval; determine it through repeated testing for your application and workload.
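As a starting point for testing, the checkpointing literature offers a rule of thumb (Young's approximation): interval ≈ sqrt(2 × checkpoint cost × MTBF). The sketch below computes it; the cost and MTBF values are placeholders, and the result is only a first guess to refine through the testing approach above.

```shell
#!/bin/sh
# Starting-point estimate for the Checkpoint interval using Young's
# approximation: interval ~ sqrt(2 * checkpoint_cost * MTBF).
# Both inputs are placeholders; measure your own values.
checkpoint_cost=60      # seconds to write one Checkpoint (measured)
mtbf=86400              # mean time between node failures, seconds (1 day)

awk -v c="$checkpoint_cost" -v m="$mtbf" \
    'BEGIN {printf "suggested interval: %.0f seconds (~%.0f minutes)\n",
            sqrt(2 * c * m), sqrt(2 * c * m) / 60}'
```

With these placeholder inputs the estimate is roughly 54 minutes, which falls inside the 30-minute to 2-hour range suggested earlier.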
Storage Management
- Monitor Checkpoint storage usage: du -sh /path/to/checkpoint/dir/<job_id>
- Checkpoints are not automatically cleaned up after jobs finish
- For long-running jobs, consider using CRIU_MaxEpochs to limit storage
- Ensure sufficient free space (recommend 2x the job's memory size)
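Since Checkpoints are not cleaned up automatically, sites typically script a periodic cleanup pass. The sketch below is hypothetical: it assumes one subdirectory per job id under a site-specific root, and a 7-day retention; adjust both for your site.

```shell
#!/bin/sh
# List per-job Checkpoint directories not modified within the retention
# window. Review the output before deleting anything.
# Hypothetical helper; directory layout and retention are assumptions.
list_stale_checkpoints() {
    # $1: Checkpoint root directory; $2: retention in days
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2" -print
}

# Example invocation (path is a placeholder):
# list_stale_checkpoints /path/to/checkpoint/dir 7
# To actually delete, pipe the reviewed list to rm -rf.
```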
Application Compatibility
You must test before production use.
Checkpoint operates at the process level, and not all applications are compatible. Applications with complex I/O, network connections, hardware devices, or timing sensitivity may not restore reliably.
Required testing:
- Test with the real application and workload
- Verify output correctness after restore
- Test on the actual hardware and configuration used in production
- Record application-specific limitations or workarounds
Do not assume Checkpoint works without thorough testing.
Test Checklist
Before using Checkpoint for important workloads, complete comprehensive testing:
- Basic functionality test:
  sbatch --checkpoint=5:00 test_job.sh
  # Wait for the first Checkpoint to complete
  scontrol checkpoint vacate <job_id>
  scontrol checkpoint restart <job_id>
  # Verify the job actually restarted (may fail)
- Output verification:
  - Compare output files byte-for-byte with normal runs
  - Check for corrupted, duplicated, or missing data
  - Verify application state consistency after restore
- Measure overhead:
  - Measure Checkpoint duration (may be much longer than expected)
  - Measure impact on total job runtime
  - Check storage usage
- Test multiple scenarios:
  - Checkpoint at different stages of job execution
  - Test restore on different nodes (may fail)
  - Use real workloads, not simple examples
  - Test Checkpoint failure scenarios
- Long-term reliability:
  - Run multiple Checkpoint/restore cycles
  - Monitor for performance degradation or data corruption
  - Test under system load
Important: Passing basic tests does not guarantee production reliability. Continuous monitoring is required after deployment.
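The byte-for-byte output comparison from the checklist above can be scripted. The file names below are placeholders for your application's output files:

```shell
#!/bin/sh
# Compare a restored run's output byte-for-byte against a baseline run.
# File names are placeholders for your application's outputs.
outputs_match() {
    # $1: baseline output file; $2: output file from the restored run
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "DIFFER"
    fi
}

# Example invocation:
# outputs_match baseline_run.out restored_run.out
```

A "DIFFER" result after restore means the Checkpoint is not safe for that workload, regardless of whether the restart command itself reported success.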
FAQ
Q: Can GPU jobs use Checkpoint? A: GPU Checkpointing has severe limitations and is not recommended. CRIU's GPU state support is incomplete, and most CUDA/GPU applications cannot Checkpoint or restore properly. If you must use it, test thoroughly and expect issues.
Q: What happens to Checkpoint files after a job finishes? A: Checkpoint files remain on disk and must be cleaned up manually to free space. Monitor Checkpoint storage usage and delete old Checkpoint directories that are no longer needed.
Q: Can MPI jobs use Checkpoint? A: MPI Checkpointing is complex and usually unreliable. It depends heavily on MPI implementation, network architecture, and process communication patterns. Many MPI jobs cannot be Checkpointed successfully, and even with extensive testing, reliability may be limited.
Q: How much disk space does Checkpoint require?
A: Each Checkpoint is roughly equal in size to the job's memory usage. With CRIU_MaxEpochs = 3, you need at least 3x the memory size. Incremental Checkpoints can save some space but still require significant storage. Use high-performance storage and ensure adequate space.
Q: Can I move Checkpoints to another cluster? A: No. Checkpoints are tightly coupled to a specific system configuration (including kernel version, libraries, file paths, and hardware) and can only be restored on the same cluster where they were created.
Q: Will Checkpoint affect job performance? A: Yes. Jobs are paused during Checkpoint creation (freeze time), which may last from seconds to minutes depending on memory size. Writing Checkpoint data also creates I/O overhead. Pre-dump can mitigate but not eliminate the impact. There is no performance impact between Checkpoints.
Q: What happens if Checkpoint fails? A: The job continues to run normally. Failed Checkpoints are recorded but do not terminate the job. However, you lose that restore point; if the job fails before the next successful Checkpoint, you must restart from an earlier Checkpoint or from scratch.
Q: Can I trigger Checkpoint from within a job script?
A: No. Checkpoint must be controlled via fsched commands (scontrol). Job scripts cannot request Checkpoint directly. If needed, implement an application-level Checkpoint mechanism separately.
Q: How do I confirm Checkpoint success? A: Check fsched logs for Checkpoint success/failure messages; log location depends on configuration. Note: the absence of errors does not mean the Checkpoint is valid—only a successful restore test confirms it.
Q: Can I change the Checkpoint interval after submission? A: No. The Checkpoint interval is fixed at job submission time and cannot be changed while the job is running.
Q: Is Checkpoint the same as fsched job requeue?
A: No. Job requeue (--requeue) terminates the job and starts it over. Checkpoint saves process state and continues from the interruption point. They serve different purposes.