Version: FCP 25.11

Reduce Node OOM Events by Enabling Load Thresholds for Fsched Clusters

By configuring load thresholds in Fsched, you can automatically drain a compute node when its memory or CPU usage exceeds the configured threshold. The node state reported by sinfo then changes to drain, which prevents new jobs from being assigned to that node and reduces the risk of OOM events and node outages.
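
The drain decision described above can be pictured as a simple comparison. The sketch below is illustrative only and is not Fsched's implementation: the helper name should_drain and the integer-percentage thresholds are assumptions made for this example, not platform settings.

```shell
#!/usr/bin/env bash
# Illustrative only: NOT Fsched's actual implementation.
# should_drain is a hypothetical helper; all arguments are integer percentages.
#   $1 = current CPU usage      $2 = current memory usage
#   $3 = CPU usage threshold    $4 = memory usage threshold
should_drain() {
  local cpu_used="$1" mem_used="$2" cpu_limit="$3" mem_limit="$4"
  # Drain as soon as either resource exceeds its threshold.
  if [ "$cpu_used" -gt "$cpu_limit" ] || [ "$mem_used" -gt "$mem_limit" ]; then
    echo "drain"
  else
    echo "keep"
  fi
}
```

For example, should_drain 95 40 90 90 prints drain because CPU usage is above its threshold, while should_drain 50 40 90 90 prints keep.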

Configure Load Thresholds

  1. Log in to the platform.

  2. Create a cluster named cluster-loadthreshold with the following settings:

    Cluster type: select Fsched

    Compute partition > Node configuration: add one node to the compute partition

    Compute partition > Advanced configuration: enable load thresholds

    Head partition > Node configuration: add one node to the head partition

    Keep the remaining settings at their default values.

  3. At the bottom of the pinned configuration summary on the right, click Submit.

  4. Cluster creation takes 5 to 15 minutes. On the cluster management page, wait until the new cluster reaches the running state.

  5. Submit a job.

    # Run a stress job as 1 task (-n1) with 1 CPU core per task (-c1);
    # stress loads one CPU worker for 600 seconds
    srun -n1 -c1 stress --cpu 1 --timeout 600s

  6. Verify the load-threshold behavior.

    Log in to the compute node and use top to observe CPU usage.

    • While the stress job is running, sinfo shows the node state as drain. If you submit another job at this time, it will not be scheduled to that node.
    • After no jobs have run on the cluster for several minutes, the compute node returns to the idle state, and new jobs can be scheduled to that node again.
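
The state transitions above can also be watched from a script instead of rerunning sinfo by hand. The following is a minimal sketch, assuming that Fsched's sinfo accepts the Slurm-style options -h (no header), -n (filter by node name), and -o '%t' (compact state). The helper names is_drain_state and wait_until_schedulable are hypothetical and exist only for this example.

```shell
#!/usr/bin/env bash
# Assumption: Fsched's sinfo accepts the Slurm-style options
#   -h (no header), -n <node> (filter by node), -o '%t' (compact state).

# Hypothetical helper: succeeds when the state string denotes a
# draining or drained node (for example "drain", "drng", "draining").
is_drain_state() {
  case "$1" in
    drain*|drng*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: polls sinfo until the node leaves the drain state.
#   $1 = node name, $2 = polling interval in seconds (default 30)
wait_until_schedulable() {
  local node="$1" interval="${2:-30}" state
  while :; do
    state="$(sinfo -h -n "$node" -o '%t')"
    # Print a timestamped line for every poll so the transition is visible.
    echo "$(date '+%H:%M:%S') $node $state"
    is_drain_state "$state" || break
    sleep "$interval"
  done
}
```

To use it, source the script on a node where sinfo is available and call wait_until_schedulable with the name of your compute node. The loop prints one line per poll and exits once the node is no longer reported as drain.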