Checkpoint and Restore

Overview

The Checkpoint feature can save the state of a running job to disk, and later restore it to continue execution. This operation captures process memory, open files, and execution state, and is complex to implement.

Important note: The Checkpoint feature is not guaranteed to work for all applications. Success depends on application characteristics, system configuration, and runtime conditions. This feature is currently experimental and must be thoroughly tested before production use.

Supported Versions

fsched supports this feature in versions after (TBD).

Potential Use Cases

If the Checkpoint feature is reliable and effective for your application, it can be used in the following scenarios:

  • Node failure recovery: After a node failure, fsched can automatically restore jobs from the Checkpoint to avoid restarting from scratch (requires application compatibility and successful restore).
  • Long-running jobs: Provides some interruption protection for compatible applications.
  • Planned maintenance: Pause jobs before maintenance, but recovery is not guaranteed.
  • Resource management: If supported by the application, jobs can be moved off busy nodes.

Important Considerations

The feasibility of these scenarios depends entirely on your specific application:

  • Application compatibility: Many complex applications face limitations, and some are entirely unsupported.
  • Reliability: Checkpoint or restore operations may fail for various reasons.
  • Testing requirements: Each application and workload must be extensively tested.
  • No guarantees: Even after testing, Checkpoint/restore can still fail in production.

Do not rely on Checkpoint as a primary fault-tolerance or job-management mechanism without extensive validation.

How It Works

This feature is based on CRIU (Checkpoint/Restore In Userspace). CRIU is a Linux tool that can freeze a running application and save its state to disk. We use a customized version optimized for HPC workloads.

The Checkpoint process includes two steps:

  1. Freeze: pause the job (duration depends on job size and system load)
  2. Dump: write the job state to disk

During restore, fsched attempts to reconstruct the job state, including:

  • Memory contents
  • Open file descriptors
  • Network connections
  • Process tree structure
  • Environment variables

Note: Whether restore succeeds depends on many factors. Complex applications with external dependencies, network state, or hardware access may not restore correctly.
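
For reference, the underlying mechanism resembles the stock CRIU commands below. This is a minimal sketch only: fsched drives its customized CRIU build internally, so users do not run these commands directly, and the exact flags may differ.

# Freeze the process tree rooted at <pid> and dump its state to an image
# directory; --leave-running resumes the job afterwards instead of killing it
sudo criu dump -t <pid> --images-dir /path/to/images --shell-job --leave-running

# Later, reconstruct the process tree from the saved images
sudo criu restore --images-dir /path/to/images --shell-job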

Key Limitations

warning

The Checkpoint feature is complex and prone to failure

Many applications cannot use the Checkpoint feature correctly. Even if basic tests pass, failures may still occur in production.

Known incompatibilities:

  • GPU and specialized hardware access (typically unsupported)
  • Most MPI applications (unreliable)
  • Applications with complex network state
  • Real-time or timing-sensitive applications
  • Applications that depend on external services
  • Complex inter-process communication patterns

Even simple applications can fail due to:

  • File I/O operations in progress during Checkpoint
  • Network connections that time out or cannot be re-established
  • Temporary files or system resources that change between Checkpoint and restore
  • Kernel or library version differences

Performance impact

  • Checkpoint interrupts the job (depending on memory size, it may last from seconds to minutes)
  • Checkpoint creation generates heavy I/O load
  • Storage requirements are equal to or greater than the job's memory size
  • Failed Checkpoints waste time and resources

System requirements

  • Linux kernel 3.11 or later is required; 4.x or later is strongly recommended
  • A shared file system with high I/O performance is required
  • Sufficient free disk space (recommend several times the job's memory size)

Mandatory testing: Never assume Checkpoint will work. You must test extensively under real conditions. Even then, failures can occur in production.

Getting Started

Cluster Administrators

To enable Checkpoint on the cluster, configure the following:

# Enable Checkpoint plugin
CheckpointType = checkpoint/criu

This is the minimum configuration. For fsched-specific advanced settings, see the Configuration Options section below.

warning

System requirements: Checkpoint requires Linux kernel 3.11 or later. For full functionality and reliability, Linux kernel 4.x or later is strongly recommended. CentOS/RHEL 7 may not support all features.
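
To check a node's readiness, you can verify the kernel version and run stock CRIU's self-test (criu check is part of upstream CRIU; the customized fsched build may ship its own variant):

# Kernel version (3.11 minimum; 4.x or later strongly recommended)
uname -r

# Ask CRIU whether the kernel provides the features it needs;
# prints "Looks good." when the basic feature set is available
sudo criu check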

Users

Submit Checkpoint-Enabled Jobs

Use the --checkpoint option when submitting jobs to enable automatic Checkpointing:

sbatch --checkpoint=30:00 my_job_script.sh

This command creates a Checkpoint every 30 minutes. Supported time formats:

  • minutes (for example 30)
  • minutes:seconds (for example 30:00)
  • hours:minutes:seconds (for example 2:30:00)
  • days-hours (for example 1-0 for 1 day)
  • days-hours:minutes (for example 1-0:30)
  • days-hours:minutes:seconds (for example 1-2:30:00)
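
For example, each format maps to an interval as follows:

sbatch --checkpoint=30 my_job_script.sh        # every 30 minutes
sbatch --checkpoint=2:30:00 my_job_script.sh   # every 2 hours 30 minutes
sbatch --checkpoint=1-0 my_job_script.sh       # every 1 day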

Suggested starting intervals (adjust based on testing):

  • Testing: --checkpoint=5:00 (every 5 minutes)
  • After application compatibility is verified: --checkpoint=30:00 to --checkpoint=2:00:00
  • Balance Checkpoint overhead against potential lost work

Manual Checkpoint Operations

You can also manually control Checkpoints with scontrol:

Manually create a Checkpoint:

scontrol checkpoint create <job_id>

Creates a Checkpoint for a running job, which may fail depending on application state.

Vacate a job and save its state (release resources):

scontrol checkpoint vacate <job_id>

Checkpoints the job and terminates it. If Checkpoint fails, the job is not terminated.

Restart a Checkpointed job:

scontrol checkpoint restart <job_id>

Restores the job from the last Checkpoint; it may fail for various reasons.

Basic Workflow Example

# 1. Submit a Checkpoint-enabled test job
$ sbatch --checkpoint=1:00:00 test_job.sh
Submitted batch job 12345

# 2. Check job status
$ squeue -j 12345
JOBID  PARTITION  NAME  USER   ST  TIME  NODES
12345  compute    test  alice  R   2:15  1

# 3. Attempt vacate (may fail)
$ scontrol checkpoint vacate 12345

# 4. If successful, the job is checkpointed and can be restarted
$ scontrol checkpoint restart 12345

# Note: verify job output to ensure correct recovery

Configuration Options

You can customize Checkpoint behavior with the following options. All options are optional and have reasonable defaults.

fsched-Specific Features

fsched provides several advanced Checkpoint features not available in standard Slurm:

Automatic recovery on node failure - When a node failure causes a job to be requeued, fsched automatically attempts to restore from the last Checkpoint instead of restarting from scratch. This can provide fault tolerance for compatible applications, but success is not guaranteed. Additional timeout configuration is required (see below).

Load-aware Checkpoint - Skips Checkpoint when system load is too high to reduce impact on other workloads. If load remains high, Checkpoint may be skipped for a long time.

Incremental Checkpoint - Saves only memory changes since the last Checkpoint. This can save time and storage for applications with stable memory access patterns, but it introduces dependencies between Checkpoints and can complicate the restore process.

Pre-dump - Copies memory before the final freeze to reduce pause time. Effectiveness depends on the application's memory access pattern.

Node Failure Recovery Configuration

To enable automatic Checkpoint recovery on node failure, use the following additional configuration:

BatchStartTimeout = 10
SlurmdTimeout = 15
AuthInfo = cred_expire=180,ttl=180,requeue_timeout=30

warning

Do not use these settings in high-load environments.

These timeout settings are aggressive; they are intended for fast failure detection and requeue. Under high node load, however, they can cause false positives and unnecessary job failures:

  • Jobs may be incorrectly marked failed when nodes slow down temporarily.
  • Normal jobs may be terminated and requeued unnecessarily.
  • The system may become unstable at high utilization.

Use these settings only when:

  • The cluster typically has ample idle capacity.
  • Node failures are frequent and you need fast recovery.
  • Occasional false positives are acceptable.

For high-utilization clusters, do not enable automatic node-failure recovery. Use manual Checkpoint/vacate/restart instead.

Basic Options

CRIU_MaxEpochs (default: 1)

  • Number of Checkpoint snapshots to keep
  • Old Checkpoints are deleted automatically
  • Higher values provide more restore options but use more disk space
  • Example: CRIU_MaxEpochs = 3 keeps the last 3 Checkpoints

Performance Optimization

CRIU_Incremental (default: NO)

  • When enabled, saves only changes since the last Checkpoint
  • Can reduce Checkpoint time and storage for some applications
  • Drawback: restore requires a complete incremental chain; any file corruption causes restore failure
  • Adds system complexity and potential failure points

CRIU_FullEvery (default: 10)

  • When using incremental Checkpoint, perform a full Checkpoint every N Checkpoints
  • Prevents very long incremental chains that slow restore
  • Applies only when CRIU_Incremental = YES
  • Example: with CRIU_FullEvery = 10, every tenth Checkpoint is full and the nine in between are incremental (the first Checkpoint in a chain is necessarily full, since an incremental Checkpoint needs a base), then the cycle repeats

CRIU_PreDump (default: NO)

  • Pre-copies memory before the final freeze to reduce freeze time
  • Effectiveness depends on application memory access patterns
  • May not significantly reduce freeze time for applications with frequent memory changes

CRIU_PreDumpDelay (default: 5 seconds)

  • Wait time between pre-dump and final dump
  • Allows the application to stabilize after pre-dump
  • Applies only when CRIU_PreDump = YES

Load-Aware Checkpoint

CRIU_LoadAware (default: NO)

  • Skips scheduled Checkpoint when system load is too high
  • If the system stays busy, Checkpoint may be skipped for a long time
  • Checks CPU, I/O, and memory load against thresholds

CRIU_LoadThresholdCPU (default: 0.95)

  • Skip Checkpoint if CPU utilization exceeds this ratio (0.0 to 1.0)
  • Example: 0.95 means skip if CPU is above 95% utilization

CRIU_LoadThresholdIO (default: 0.50)

  • Skip Checkpoint if I/O wait exceeds this ratio (0.0 to 1.0)
  • Example: 0.50 means skip if I/O wait exceeds 50%

CRIU_LoadThresholdMemory (default: 0.85)

  • Skip Checkpoint if memory usage exceeds this ratio (0.0 to 1.0)
  • Example: 0.85 means skip if memory usage exceeds 85%

Configuration Examples

Minimal configuration (start here):

CheckpointType = checkpoint/criu

With incremental (test thoroughly before use):

CheckpointType = checkpoint/criu
CRIU_Incremental = YES
CRIU_FullEvery = 5
CRIU_MaxEpochs = 2

With load-aware (Checkpoints may be skipped):

CheckpointType = checkpoint/criu
CRIU_LoadAware = YES
CRIU_LoadThresholdCPU = 0.90
CRIU_LoadThresholdIO = 0.60
CRIU_LoadThresholdMemory = 0.80

Note: Start with the minimal configuration. Add features only after verifying they work for your specific application.

Command Reference

User Commands

Command | Description | When to Use
sbatch --checkpoint=<interval> script.sh | Submit a job with automatic Checkpoint enabled | Start a new job that needs periodic Checkpointing
scontrol checkpoint create <job_id> | Create a Checkpoint immediately | Before risky operations or for manual backup
scontrol checkpoint vacate <job_id> | Checkpoint and vacate the job, releasing resources | Planned maintenance or resource rebalancing
scontrol checkpoint restart <job_id> | Restore and continue a vacated job | After maintenance or when resources are available
scontrol show job <job_id> | View job details including Checkpoint status | Check Checkpoint configuration and status

Example Commands

Check whether a job has Checkpoint enabled:

scontrol show job 12345 | grep Checkpoint

Attempt to create a Checkpoint:

scontrol checkpoint create 12345
# Check logs to verify success

Attempt to vacate multiple jobs:

for job in 12345 12346 12347; do
  scontrol checkpoint vacate $job
  # Verify each Checkpoint before relying on it
done

Note: Before assuming a job is recoverable, verify Checkpoint success by checking logs and testing restores.
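
As a sketch of that verification (this assumes fsched, like Slurm, exposes a JobState field through scontrol show job; field names and states may differ in your deployment):

# Vacate a job, then check what state it actually reached before
# treating its Checkpoint as a usable restore point
scontrol checkpoint vacate 12345
sleep 10   # give the Checkpoint time to complete
scontrol show job 12345 | grep -o 'JobState=[A-Z_]*'
# Only rely on the Checkpoint after a successful test restore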

Best Practices

Choosing a Checkpoint Interval

Checkpoint interval selection depends on many factors and requires repeated testing:

Considerations:

  • Checkpoint overhead increases with job memory size
  • Storage I/O performance affects Checkpoint duration
  • More frequent Checkpoints increase overhead but reduce work lost on failure
  • Applications vary in how sensitive they are to Checkpoint interruptions

Testing approach:

  1. Start with a longer interval (1-2 hours)
  2. Measure Checkpoint duration and impact on the application
  3. Adjust based on observed overhead and acceptable risk
  4. Monitor Checkpoint failures and adjust accordingly

There is no universal "best" interval; determine it through repeated testing for your application and workload.
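
As a back-of-the-envelope illustration of the trade-off (the numbers are hypothetical, not measurements):

C=2    # measured Checkpoint duration in minutes (hypothetical)
T=60   # Checkpoint interval in minutes
# Steady-state overhead is roughly C/T; the expected work lost on a
# failure is roughly half the interval, T/2
echo "overhead: $(echo "scale=3; $C / $T" | bc)"   # prints .033 -> ~3.3%
echo "avg lost work on failure: $((T / 2)) min"    # prints 30 min
# Halving T doubles the overhead but halves the expected lost work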

Storage Management

  • Monitor Checkpoint storage usage: du -sh /path/to/checkpoint/dir/<job_id>
  • Checkpoints are not automatically cleaned up after jobs finish
  • For long-running jobs, consider using CRIU_MaxEpochs to limit storage
  • Ensure sufficient free space (recommend 2x the job's memory size)
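
Because Checkpoint files are not cleaned up automatically, a periodic sweep such as the following can keep storage in check (a sketch only; the directory layout comes from the du example above, and the 30-day retention is an arbitrary placeholder):

# Remove per-job Checkpoint directories untouched for more than 30 days;
# make sure none of them belong to jobs you still intend to restore
find /path/to/checkpoint/dir -mindepth 1 -maxdepth 1 -type d -mtime +30 \
  -exec rm -rf {} +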

Application Compatibility

warning

You must test before production use.

Checkpoint works at the process level and not all applications are compatible. Applications with complex I/O, network connections, hardware devices, or timing sensitivity may not run reliably.

Required testing:

  1. Test with the real application and workload
  2. Verify output correctness after restore
  3. Test on the actual hardware and configuration used in production
  4. Record application-specific limitations or workarounds

Do not assume Checkpoint works without thorough testing.

Test Checklist

Before using Checkpoint for important workloads, complete comprehensive testing:

  1. Basic functionality test:

    sbatch --checkpoint=5:00 test_job.sh
    # Wait for the first Checkpoint to complete
    scontrol checkpoint vacate <job_id>
    scontrol checkpoint restart <job_id>
    # Verify the job actually restarted (may fail)
  2. Output verification:

    • Compare output files byte-for-byte with normal runs (see the sketch after this list)
    • Check for corrupted, duplicated, or missing data
    • Verify application state consistency after restore
  3. Measure overhead:

    • Measure Checkpoint duration (may be much longer than expected)
    • Measure impact on total job runtime
    • Check storage usage
  4. Test multiple scenarios:

    • Checkpoint at different stages of job execution
    • Test restore on different nodes (may fail)
    • Use real workloads, not simple examples
    • Test Checkpoint failure scenarios
  5. Long-term reliability:

    • Run multiple Checkpoint/restore cycles
    • Monitor for performance degradation or data corruption
    • Test under system load
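
A minimal sketch of the byte-for-byte comparison in step 2 (baseline/ and restored/ are hypothetical directories holding the output of a normal run and of a Checkpoint/restore run):

# Compare the restored run's output against a baseline run, byte for byte;
# any difference means the Checkpoint/restore path cannot be trusted
if diff -r baseline/ restored/ > /dev/null; then
  echo "outputs identical"
else
  echo "OUTPUT MISMATCH: investigate before relying on Checkpoint" >&2
fi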

Important: Passing basic tests does not guarantee production reliability. Continuous monitoring is required after deployment.

FAQ

Q: Can GPU jobs use Checkpoint? A: GPU Checkpointing has severe limitations and is not recommended. CRIU's GPU state support is incomplete, and most CUDA/GPU applications cannot Checkpoint or restore properly. If you must use it, test thoroughly and expect issues.

Q: What happens to Checkpoint files after a job finishes? A: Checkpoint files remain on disk and must be cleaned up manually to free space. Monitor Checkpoint storage usage and delete old Checkpoint directories that are no longer needed.

Q: Can MPI jobs use Checkpoint? A: MPI Checkpointing is complex and usually unreliable. It depends heavily on MPI implementation, network architecture, and process communication patterns. Many MPI jobs cannot be Checkpointed successfully, and even with extensive testing, reliability may be limited.

Q: How much disk space does Checkpoint require? A: Each Checkpoint is roughly equal to the job's memory usage. With CRIU_MaxEpochs = 3, you need at least 3x the memory size. Incremental Checkpoints can save some space but still require significant storage. Use high-performance storage and ensure adequate space.

Q: Can I move Checkpoints to another cluster? A: No. Checkpoints are tightly coupled to a specific system configuration (including kernel version, libraries, file paths, and hardware) and can only be restored on the same cluster where they were created.

Q: Will Checkpoint affect job performance? A: Yes. Jobs are paused during Checkpoint creation (freeze time), which may last from seconds to minutes depending on memory size. Writing Checkpoint data also creates I/O overhead. Pre-dump can mitigate but not eliminate the impact. Between Checkpoints there is no performance impact.

Q: What happens if Checkpoint fails? A: The job continues to run normally. Failed Checkpoints are recorded but do not terminate the job. However, you lose that restore point; if the job fails before the next successful Checkpoint, you must restart from an earlier Checkpoint or from scratch.

Q: Can I trigger Checkpoint from within a job script? A: No. Checkpoint must be controlled via fsched commands (scontrol). Job scripts cannot request Checkpoint directly. If needed, implement an application-level Checkpoint mechanism separately.

Q: How do I confirm Checkpoint success? A: Check fsched logs for Checkpoint success/failure messages; log location depends on configuration. Note: the absence of errors does not mean the Checkpoint is valid—only a successful restore test confirms it.

Q: Can I change the Checkpoint interval after submission? A: No. The Checkpoint interval is fixed at job submission time and cannot be changed while the job is running.

Q: Is Checkpoint the same as fsched job requeue? A: No. Job requeue (--requeue) terminates the job and starts it over. Checkpoint saves process state and continues from the interruption point. They serve different purposes.