Checkpoint and Restore
Overview
The Checkpoint feature can save the state of a running job to disk, and later restore it to continue execution. This operation captures process memory, open files, and execution state, and is complex to implement.
Important note: The Checkpoint feature is not guaranteed to work for all applications. Success depends on application characteristics, system configuration, and runtime conditions. This feature is currently experimental and must be thoroughly tested before production use.
Supported Versions
fsched supports this feature after version (TBD).
Potential Use Cases
If the Checkpoint feature is reliable and effective for your application, it can be used in the following scenarios:
- Node failure recovery: After a node failure, fsched can automatically restore jobs from the Checkpoint to avoid restarting from scratch (requires application compatibility and successful restore).
- Long-running jobs: Provides some interruption protection for compatible applications.
- Planned maintenance: Pause jobs before maintenance, but recovery is not guaranteed.
- Resource management: If supported by the application, jobs can be moved off busy nodes.
Important Considerations
The feasibility of these scenarios depends entirely on your specific application:
- Application compatibility: Most complex applications have limitations and may be completely unsupported.
- Reliability: Checkpoint or restore operations may fail for various reasons.
- Testing requirements: Each application and workload must be extensively tested.
- No guarantees: Even after testing, Checkpoint/restore can still fail in production.
Do not rely on Checkpoint as a primary fault-tolerance or job-management mechanism without extensive validation.
How It Works
This feature is based on CRIU (Checkpoint/Restore In Userspace). CRIU is a Linux tool that can freeze a running application and save its state to disk. We use a customized version optimized for HPC workloads.
The Checkpoint process includes two steps:
- Freeze: pause the job (duration depends on job size and system load)
- Dump: write the job state to disk
During restore, fsched attempts to reconstruct the job state, including:
- Memory contents
- Open file descriptors
- Network connections
- Process tree structure
- Environment variables
Note: Whether restore succeeds depends on many factors. Complex applications with external dependencies, network state, or hardware access may not restore correctly.
Key Limitations
The Checkpoint feature is complex and prone to failure
Many applications cannot use the Checkpoint feature correctly. Even if basic tests pass, failures may still occur in production.
Known incompatibilities:
- GPU and specialized hardware access (typically unsupported)
- Most MPI applications (unreliable)
- Applications with complex network state
- Real-time or timing-sensitive applications
- Applications that depend on external services
- Complex inter-process communication patterns
Even simple applications can fail:
- File I/O operations in progress during Checkpoint
- Network connections that time out or cannot be re-established
- Temporary files or system resources that change between Checkpoint and restore
- Kernel or library version differences
Performance impact
- Checkpoint interrupts the job (depending on memory size, it may last from seconds to minutes)
- Checkpoint creation generates heavy I/O load
- Storage requirements are equal to or greater than the job's memory size
- Failed Checkpoints waste time and resources
System requirements
- Linux kernel 3.11 or later is required; kernel 4.x or later is strongly recommended
- A shared file system with high I/O performance is required
- Sufficient free disk space (recommend several times the job's memory size)
Mandatory testing: Never assume Checkpoint will work. You must test extensively under real conditions. Even then, failures can occur in production.
Getting Started
Cluster Administrators
To enable Checkpoint on the cluster, configure the following:
# Enable Checkpoint plugin
CheckpointType = checkpoint/criu
This is the minimum configuration. For fsched-specific advanced settings, see the Configuration Options section below.
System requirements: Checkpoint requires Linux kernel 3.11 or later. For full functionality and reliability, Linux kernel 4.x or later is strongly recommended. CentOS/RHEL 7 may not support all features.
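A quick pre-flight check against the kernel versions documented above (3.11 minimum, 4.x or later recommended) can be scripted. The helper below is a hypothetical sketch that only inspects the version string; it does not verify CRIU functionality:

```shell
#!/bin/sh
# Classify a kernel release string against the documented requirements:
# 3.11 is the minimum, 4.x or later is recommended.
kernel_class() {
    # $1 is a kernel release string such as "5.15.0-91-generic"
    major=$(printf '%s' "$1" | cut -d. -f1)
    minor=$(printf '%s' "$1" | cut -d. -f2 | sed 's/[^0-9].*//')
    if [ "$major" -ge 4 ]; then
        echo "recommended"
    elif [ "$major" -eq 3 ] && [ "$minor" -ge 11 ]; then
        echo "minimum"
    else
        echo "unsupported"
    fi
}

# Check the running kernel
kernel_class "$(uname -r)"
```

On a CentOS/RHEL 7 node (kernel 3.10) this reports "unsupported", which matches the caveat above.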
Users
Submit Checkpoint-Enabled Jobs
Use the --checkpoint option when submitting jobs to enable automatic Checkpointing:
sbatch --checkpoint=30:00 my_job_script.sh
This command creates a Checkpoint every 30 minutes. Supported time formats:
- minutes (for example 30)
- minutes:seconds (for example 30:00)
- hours:minutes:seconds (for example 2:30:00)
- days-hours (for example 1-0 for 1 day)
- days-hours:minutes (for example 1-0:30)
- days-hours:minutes:seconds (for example 1-2:30:00)
Suggested starting intervals (adjust based on testing):
- Testing: --checkpoint=5:00 (every 5 minutes)
- After application compatibility is verified: --checkpoint=30:00 to --checkpoint=2:00:00
- Balance Checkpoint overhead against potential lost work
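When scripting around these intervals, the formats above can be normalized to seconds for comparison. The helper below is hypothetical (not part of fsched) and simply mirrors the format rules listed above: with a days- prefix the colon-separated fields are hours:minutes:seconds, without it they are minutes, minutes:seconds, or hours:minutes:seconds.

```shell
#!/bin/sh
# Convert a --checkpoint interval spec to seconds.
# Hypothetical helper for illustration; not an fsched utility.
interval_to_seconds() {
    spec=$1; days=0; has_days=no
    case $spec in
        *-*) days=${spec%%-*}; spec=${spec#*-}; has_days=yes ;;
    esac
    oldIFS=$IFS; IFS=:; set -- $spec; IFS=$oldIFS
    if [ "$has_days" = yes ]; then
        # fields after the days part count down from hours
        case $# in
            1) h=$1; m=0;  s=0  ;;
            2) h=$1; m=$2; s=0  ;;
            3) h=$1; m=$2; s=$3 ;;
        esac
    else
        # without a days part, a single field means minutes
        case $# in
            1) h=0;  m=$1; s=0  ;;
            2) h=0;  m=$1; s=$2 ;;
            3) h=$1; m=$2; s=$3 ;;
        esac
    fi
    echo $(( days*86400 + h*3600 + m*60 + s ))
}

interval_to_seconds 2:30:00   # 2 hours 30 minutes -> 9000
```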
Manual Checkpoint Operations
You can also manually control Checkpoints with scontrol:
Manually create a Checkpoint:
scontrol checkpoint create <job_id>
Creates a Checkpoint for a running job, which may fail depending on application state.
Vacate a job and save its state (release resources):
scontrol checkpoint vacate <job_id>
Checkpoints the job and terminates it. If Checkpoint fails, the job is not terminated.
Restart a Checkpointed job:
scontrol checkpoint restart <job_id>
Restores the job from the last Checkpoint; it may fail for various reasons.
Basic Workflow Example
# 1. Submit a Checkpoint-enabled test job
$ sbatch --checkpoint=1:00:00 test_job.sh
Submitted batch job 12345
# 2. Check job status
$ squeue -j 12345
JOBID PARTITION NAME USER ST TIME NODES
12345 compute test alice R 2:15 1
# 3. Attempt vacate (may fail)
$ scontrol checkpoint vacate 12345
# 4. If successful, the job is checkpointed and can be restarted
$ scontrol checkpoint restart 12345
# Note: verify job output to ensure correct recovery
Configuration Options
You can customize Checkpoint behavior with the following options. All options are optional and have reasonable defaults.
fsched-Specific Features
fsched provides several advanced Checkpoint features not available in standard Slurm:
Automatic recovery on node failure - When a node failure causes a job to be requeued, fsched automatically attempts to restore from the last Checkpoint instead of restarting from scratch. This can provide fault tolerance for compatible applications, but success is not guaranteed. Additional timeout configuration is required (see below).
Load-aware Checkpoint - Skips a scheduled Checkpoint when system load is too high, reducing the impact on other workloads. If load remains high, Checkpoints may be skipped for a long time.
Incremental Checkpoint - Saves only memory changes since the last Checkpoint. This can save time and storage for applications with stable memory access patterns, but it introduces dependencies between Checkpoints and can complicate the restore process.
Pre-dump - Copies memory before the final freeze to reduce pause time. Effectiveness depends on the application's memory access pattern.
Node Failure Recovery Configuration
To enable automatic Checkpoint recovery on node failure, use the following additional configuration:
BatchStartTimeout = 10
SlurmdTimeout = 15
AuthInfo = cred_expire=180,ttl=180,requeue_timeout=30
Do not use these settings in high-load environments.
These timeout settings are aggressive and are intended for fast failure detection and requeue. However, under high node load they can cause false positives and unnecessary job failures:
- Jobs may be incorrectly marked failed when nodes slow down temporarily.
- Normal jobs may be terminated and requeued unnecessarily.
- The system may become unstable at high utilization.
Use these settings only when:
- The cluster typically has ample idle capacity.
- Node failures are frequent and you need fast recovery.
- Occasional false positives are acceptable.
For high-utilization clusters, do not enable automatic node-failure recovery. Use manual Checkpoint/vacate/restart instead.
Basic Options
CRIU_MaxEpochs (default: 1)
- Number of Checkpoint snapshots to keep
- Old Checkpoints are deleted automatically
- Higher values provide more restore options but use more disk space
- Example: CRIU_MaxEpochs = 3 keeps the last 3 Checkpoints
Performance Optimization
CRIU_Incremental (default: NO)
- When enabled, saves only changes since the last Checkpoint
- Can reduce Checkpoint time and storage for some applications
- Drawback: restore requires a complete incremental chain; any file corruption causes restore failure
- Adds system complexity and potential failure points
CRIU_FullEvery (default: 10)
- When using incremental Checkpoint, perform a full Checkpoint every N Checkpoints
- Prevents very long incremental chains that slow restore
- Applies only when CRIU_Incremental = YES
- Example: with CRIU_FullEvery = 10, Checkpoints 1-9 are incremental, 10 is full, then repeat
CRIU_PreDump (default: NO)
- Pre-copies memory before the final freeze to reduce freeze time
- Effectiveness depends on application memory access patterns
- May not significantly reduce freeze time for applications with frequent memory changes
CRIU_PreDumpDelay (default: 5 seconds)
- Wait time between pre-dump and final dump
- Allows the application to stabilize after pre-dump
- Applies only when CRIU_PreDump = YES
Load-Aware Checkpoint
CRIU_LoadAware (default: NO)
- Skips scheduled Checkpoint when system load is too high
- If the system stays busy, Checkpoint may be skipped for a long time
- Checks CPU, I/O, and memory load against thresholds
CRIU_LoadThresholdCPU (default: 0.95)
- Skip Checkpoint if CPU utilization exceeds this ratio (0.0 to 1.0)
- Example: 0.95 means skip if CPU is above 95% utilization
CRIU_LoadThresholdIO (default: 0.50)
- Skip Checkpoint if I/O wait exceeds this ratio (0.0 to 1.0)
- Example: 0.50 means skip if I/O wait exceeds 50%
CRIU_LoadThresholdMemory (default: 0.85)
- Skip Checkpoint if memory usage exceeds this ratio (0.0 to 1.0)
- Example: 0.85 means skip if memory usage exceeds 85%
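To make the thresholds concrete, the snippet below computes the kind of memory-usage ratio that CRIU_LoadThresholdMemory is presumably compared against. This is an illustration of the metric only; the plugin's actual load check is internal to fsched and may differ.

```shell
#!/bin/sh
# Illustrative memory-usage ratio, in the 0.0-1.0 range that the
# CRIU_LoadThreshold* options use. Linux-only (reads /proc/meminfo).
mem_usage_ratio() {
    awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
         END {printf "%.2f\n", (total - avail) / total}' /proc/meminfo
}

ratio=$(mem_usage_ratio)
threshold=0.85   # CRIU_LoadThresholdMemory default
echo "memory usage ratio: $ratio"
# awk handles the floating-point comparison that plain sh cannot
if awk -v r="$ratio" -v t="$threshold" 'BEGIN {exit !(r > t)}'; then
    echo "above threshold: a scheduled Checkpoint would be skipped"
else
    echo "below threshold: Checkpoint would proceed"
fi
```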
Configuration Examples
Minimal configuration (start here):
CheckpointType = checkpoint/criu
With incremental (test thoroughly before use):
CheckpointType = checkpoint/criu
CRIU_Incremental = YES
CRIU_FullEvery = 5
CRIU_MaxEpochs = 2
With load-aware (Checkpoints may be skipped):
CheckpointType = checkpoint/criu
CRIU_LoadAware = YES
CRIU_LoadThresholdCPU = 0.90
CRIU_LoadThresholdIO = 0.60
CRIU_LoadThresholdMemory = 0.80
Note: Start with the minimal configuration. Add features only after verifying they work for your specific application.
Command Reference
User Commands
| Command | Description | When to Use |
|---|---|---|
| sbatch --checkpoint=<interval> script.sh | Submit a job with automatic Checkpoint enabled | Start a new job that needs periodic Checkpointing |
| scontrol checkpoint create <job_id> | Create a Checkpoint immediately | Before risky operations or for manual backup |
| scontrol checkpoint vacate <job_id> | Checkpoint and vacate the job, releasing resources | Planned maintenance or resource rebalancing |
| scontrol checkpoint restart <job_id> | Restore and continue a vacated job | After maintenance or when resources are available |
| scontrol show job <job_id> | View job details including Checkpoint status | Check Checkpoint configuration and status |
Example Commands
Check whether a job has Checkpoint enabled:
scontrol show job 12345 | grep Checkpoint
Attempt to create a Checkpoint:
scontrol checkpoint create 12345
# Check logs to verify success
Attempt to vacate multiple jobs:
for job in 12345 12346 12347; do
scontrol checkpoint vacate $job
# Verify each Checkpoint before relying on it
done
Note: Before assuming a job is recoverable, verify Checkpoint success by checking logs and testing restores.
Best Practices
Choosing a Checkpoint Interval
Checkpoint interval selection depends on many factors and requires repeated testing:
Considerations:
- Checkpoint overhead increases with job memory size
- Storage I/O performance affects Checkpoint duration
- More frequent Checkpoints increase overhead but reduce work lost on failure
- Applications vary in how sensitive they are to Checkpoint interruptions
Testing approach:
- Start with a longer interval (1-2 hours)
- Measure Checkpoint duration and impact on the application
- Adjust based on observed overhead and acceptable risk
- Monitor Checkpoint failures and adjust accordingly
There is no universal "best" interval; determine it through repeated testing for your application and workload.
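As a starting point for testing, the checkpointing literature offers a rule of thumb (Young's approximation): interval ≈ sqrt(2 × checkpoint cost × MTBF). The sketch below computes it; the cost and MTBF values are placeholders, and the result is only a first guess to refine through the testing approach above.

```shell
#!/bin/sh
# Starting-point estimate for the Checkpoint interval using Young's
# approximation: interval ~ sqrt(2 * checkpoint_cost * MTBF).
# Both inputs are placeholders; measure your own values.
checkpoint_cost=60      # seconds to write one Checkpoint (measured)
mtbf=86400              # mean time between node failures, seconds (1 day)

awk -v c="$checkpoint_cost" -v m="$mtbf" \
    'BEGIN {printf "suggested interval: %.0f seconds (~%.0f minutes)\n",
            sqrt(2 * c * m), sqrt(2 * c * m) / 60}'
```

With these placeholder inputs the estimate is roughly 54 minutes, which falls inside the 30-minute to 2-hour range suggested earlier.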
Storage Management
- Monitor Checkpoint storage usage: du -sh /path/to/checkpoint/dir/<job_id>
- Checkpoints are not automatically cleaned up after jobs finish
- For long-running jobs, consider using CRIU_MaxEpochs to limit storage
- Ensure sufficient free space (recommend 2x the job's memory size)
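Since Checkpoints are not cleaned up automatically, sites typically script a periodic cleanup pass. The sketch below is hypothetical: it assumes one subdirectory per job id under a site-specific root, and a 7-day retention; adjust both for your site.

```shell
#!/bin/sh
# List per-job Checkpoint directories not modified within the retention
# window. Review the output before deleting anything.
# Hypothetical helper; directory layout and retention are assumptions.
list_stale_checkpoints() {
    # $1: Checkpoint root directory; $2: retention in days
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2" -print
}

# Example invocation (path is a placeholder):
# list_stale_checkpoints /path/to/checkpoint/dir 7
# To actually delete, pipe the reviewed list to rm -rf.
```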
Application Compatibility
You must test before production use.
Checkpoint operates at the process level, and not all applications are compatible. Applications with complex I/O, network connections, hardware devices, or timing sensitivity may not restore reliably.
Required testing:
- Test with the real application and workload
- Verify output correctness after restore
- Test on the actual hardware and configuration used in production
- Record application-specific limitations or workarounds
Do not assume Checkpoint works without thorough testing.
Test Checklist
Before using Checkpoint for important workloads, complete comprehensive testing:
- Basic functionality test:
  sbatch --checkpoint=5:00 test_job.sh
  # Wait for the first Checkpoint to complete
  scontrol checkpoint vacate <job_id>
  scontrol checkpoint restart <job_id>
  # Verify the job actually restarted (may fail)
- Output verification:
  - Compare output files byte-for-byte with normal runs
  - Check for corrupted, duplicated, or missing data
  - Verify application state consistency after restore
- Measure overhead:
  - Measure Checkpoint duration (may be much longer than expected)
  - Measure impact on total job runtime
  - Check storage usage
- Test multiple scenarios:
  - Checkpoint at different stages of job execution
  - Test restore on different nodes (may fail)
  - Use real workloads, not simple examples
  - Test Checkpoint failure scenarios
- Long-term reliability:
  - Run multiple Checkpoint/restore cycles
  - Monitor for performance degradation or data corruption
  - Test under system load
Important: Passing basic tests does not guarantee production reliability. Continuous monitoring is required after deployment.
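The byte-for-byte output comparison from the checklist above can be scripted. The file names below are placeholders for your application's output files:

```shell
#!/bin/sh
# Compare a restored run's output byte-for-byte against a baseline run.
# File names are placeholders for your application's outputs.
outputs_match() {
    # $1: baseline output file; $2: output file from the restored run
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "DIFFER"
    fi
}

# Example invocation:
# outputs_match baseline_run.out restored_run.out
```

A "DIFFER" result after restore means the Checkpoint is not safe for that workload, regardless of whether the restart command itself reported success.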
FAQ
Q: Can GPU jobs use Checkpoint? A: GPU Checkpointing has severe limitations and is not recommended. CRIU's GPU state support is incomplete, and most CUDA/GPU applications cannot Checkpoint or restore properly. If you must use it, test thoroughly and expect issues.
Q: What happens to Checkpoint files after a job finishes? A: Checkpoint files remain on disk and must be cleaned up manually to free space. Monitor Checkpoint storage usage and delete old Checkpoint directories that are no longer needed.
Q: Can MPI jobs use Checkpoint? A: MPI Checkpointing is complex and usually unreliable. It depends heavily on MPI implementation, network architecture, and process communication patterns. Many MPI jobs cannot be Checkpointed successfully, and even with extensive testing, reliability may be limited.
Q: How much disk space does Checkpoint require?
A: Each Checkpoint is roughly equal in size to the job's memory usage. With CRIU_MaxEpochs = 3, you need at least 3x the memory size. Incremental Checkpoints can save some space but still require significant storage. Use high-performance storage and ensure adequate space.
Q: Can I move Checkpoints to another cluster? A: No. Checkpoints are tightly coupled to a specific system configuration (including kernel version, libraries, file paths, and hardware) and can only be restored on the same cluster where they were created.
Q: Will Checkpoint affect job performance? A: Yes. Jobs are paused during Checkpoint creation (freeze time), which may last from seconds to minutes depending on memory size. Writing Checkpoint data also creates I/O overhead. Pre-dump can mitigate but not eliminate the impact. There is no performance impact between Checkpoints.
Q: What happens if Checkpoint fails? A: The job continues to run normally. Failed Checkpoints are recorded but do not terminate the job. However, you lose that restore point; if the job fails before the next successful Checkpoint, you must restart from an earlier Checkpoint or from scratch.
Q: Can I trigger Checkpoint from within a job script?
A: No. Checkpoint must be controlled via fsched commands (scontrol). Job scripts cannot request Checkpoint directly. If needed, implement an application-level Checkpoint mechanism separately.
Q: How do I confirm Checkpoint success? A: Check fsched logs for Checkpoint success/failure messages; log location depends on configuration. Note: the absence of errors does not mean the Checkpoint is valid—only a successful restore test confirms it.
Q: Can I change the Checkpoint interval after submission? A: No. The Checkpoint interval is fixed at job submission time and cannot be changed while the job is running.
Q: Is Checkpoint the same as fsched job requeue?
A: No. Job requeue (--requeue) terminates the job and starts it over. Checkpoint saves process state and continues from the interruption point. They serve different purposes.