Version: FCP 25.11

Reduce Node OOM Events by Enabling Load Thresholds for Fsched Clusters

By configuring load thresholds in Fsched, you can have a compute node drained automatically when its available memory or CPU usage crosses the configured threshold. In sinfo, the node state changes to drain, which prevents new jobs from being assigned to that node and reduces the risk of out-of-memory (OOM) events and node outages.
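
As a rough illustration of the kind of check a load threshold implies, the sketch below computes the percentage of memory in use from `/proc/meminfo`-style input and compares it with a threshold. This is not Fsched's implementation, which this page does not describe: the function name `mem_used_pct` and the 90% threshold are assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Illustrative only: mem_used_pct and the 90% threshold are assumptions,
# not Fsched settings or internals.

# Print the percentage of memory in use (truncated to an integer),
# reading /proc/meminfo-format text on stdin.
mem_used_pct() {
  awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

THRESHOLD=90   # hypothetical threshold, in percent

if [ -r /proc/meminfo ]; then
  used=$(mem_used_pct < /proc/meminfo)
  if [ -n "$used" ] && [ "$used" -ge "$THRESHOLD" ]; then
    echo "memory usage ${used}% is at or above ${THRESHOLD}%: node would be drained"
  elif [ -n "$used" ]; then
    echo "memory usage ${used}% is below ${THRESHOLD}%"
  fi
fi
```

Running the script on a compute node prints a single line that shows where the node currently stands relative to the example threshold.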

Configure Load Thresholds

Log in to the platform.
Create a cluster named cluster-loadthreshold with the following settings:
- Cluster type: select Fsched.
- Compute partition > Node configuration: add one node to the compute partition.
- Compute partition > Advanced configuration: enable load thresholds.
- Head partition > Node configuration: add one node to the head partition.

Keep the remaining settings at their default values.
At the bottom of the pinned configuration summary on the right, click Submit.
Wait 5 to 15 minutes. On the cluster management page, check the new cluster status and wait until it reaches the running state.

Submit a job.

# Submit a stress job that uses 1 task (-n1) and 1 CPU core (-c1)
srun -n1 -c1 stress --cpu 1 --timeout 600s

Verify the load-threshold behavior.

Log in to the compute node and use top to observe CPU usage.
- While the stress job is running, sinfo shows the node state as drain. Any other job submitted during this time is not scheduled to that node.
- After no jobs have run on the cluster for several minutes, the compute node returns to the idle state, and new jobs can be scheduled to it again.
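
The two observations above can also be checked from the command line instead of rereading sinfo by hand. The following is a minimal sketch, assuming Fsched's sinfo accepts the Slurm-style options `-h`, `-N`, and `-o '%N %T'`; confirm this on your cluster before relying on it. The helper names `node_state` and `wait_for_state` and the 20-second polling interval are hypothetical.

```shell
#!/usr/bin/env bash
# Assumption: sinfo accepts Slurm-style options (-h no header, -N one line
# per node, -o '%N %T' node name and state). Helper names are hypothetical.

# Read "<node> <state>" lines on stdin and print the state of the given node.
node_state() {
  awk -v node="$1" '$1 == node { print $2; exit }'
}

# Poll sinfo until the node's state starts with the wanted prefix
# (for example "drain" or "idle"). Returns 1 if the tries run out.
wait_for_state() {
  local node="$1" want="$2" tries="${3:-30}" state
  for _ in $(seq "$tries"); do
    state=$(sinfo -h -N -o '%N %T' | node_state "$node")
    echo "$(date +%T) $node: ${state:-unknown}"
    case "$state" in
      "$want"*) return 0 ;;
    esac
    sleep 20
  done
  return 1
}
```

After sourcing the script, `wait_for_state <compute-node-name> drain` can be run while the stress job is active, followed by `wait_for_state <compute-node-name> idle` once it has finished. The state is matched by prefix because the long state format may report variants such as draining or drained rather than the short form drain.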
