Reduce Node OOM Events by Enabling Load Thresholds for Fsched Clusters
By configuring load thresholds in Fsched, you can automatically drain a compute node when its memory or CPU load crosses the configured threshold. In sinfo, the node state changes to drain, which prevents new jobs from being assigned to that node and reduces the risk of OOM events and node outages.
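One way to spot drained nodes from the command line is sketched below. It assumes that Fsched's sinfo accepts the Slurm-style options -N, -h, and -o with the %N (node name) and %T (state) fields, which this page does not confirm. The name drained_nodes is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: read "<node> <state>" lines on stdin and print the
# names of nodes whose state starts with "drain" (drain, drained, draining).
drained_nodes() {
  awk '$2 ~ /^drain/ { print $1 }' | sort -u
}

# Assumed usage (requires a Slurm-compatible sinfo on the PATH):
#   sinfo -N -h -o "%N %T" | drained_nodes
```

Keeping the filter separate from the sinfo call means it can be checked against sample text before it is pointed at a live cluster.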
Configure Load Thresholds
- Log in to the platform.
- Create a cluster named cluster-loadthreshold:
  - Cluster type: select Fsched.
  - Compute partition > Node configuration: add one node to the compute partition.
  - Compute partition > Advanced configuration: enable load thresholds.
  - Head partition > Node configuration: add one node to the head partition.
  - Keep the remaining settings at their default values.
- At the bottom of the pinned configuration summary on the right, click Submit.
- Wait 5 to 15 minutes. On the cluster management page, check the new cluster's status and wait until it reaches the running state.
- Submit a job.

      # Run a stress job as 1 task with 1 CPU core (the task lands on a single node)
      srun -n1 -c1 stress --cpu 1 --timeout 600s

- Verify the load-threshold behavior. Log in to the compute node and use top to observe CPU usage.
  - While the stress job is running, sinfo shows the node state as drain. If you submit another job at this time, it will not be scheduled to that node.
  - After the cluster has had no jobs running for several minutes, the compute node returns to the idle state. New jobs can then be scheduled to that node again.
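The wait for the node to return to idle can be scripted rather than checked by hand. The sketch below is a minimal polling loop, assuming that Fsched's sinfo accepts the Slurm-style options -h, -n, and -o with the %T state field. The names get_state and wait_for_state, along with the default of 30 checks at 20-second intervals, are hypothetical choices for illustration and not part of Fsched.

```shell
# Hypothetical helper: print the current state of one node.
# Assumes a Slurm-compatible sinfo (-h: no header, -n: node filter, -o %T: state).
get_state() {
  sinfo -h -n "$1" -o "%T" | head -n 1
}

# Poll until the node's state starts with the wanted prefix (for example "idle").
# Arguments: node name, wanted state prefix, max checks (default 30),
# seconds between checks (default 20).
wait_for_state() {
  node=$1
  want=$2
  tries=${3:-30}
  delay=${4:-20}
  i=0
  state=unknown
  while [ "$i" -lt "$tries" ]; do
    state=$(get_state "$node")
    case "$state" in
      "$want"*)
        echo "$node is $state"
        return 0
        ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$node is still $state after $tries checks" >&2
  return 1
}

# Assumed usage: wait_for_state <node-name> idle
```

Because the state lookup lives in its own function, it can be swapped for a different query if your sinfo output differs from what is assumed here.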