Load Thresholds

fsched supports per-partition load thresholds for memory (in MiB), CPU utilization (in %), and average CPU run-queue length (decimal, no unit). Configure them on the line that starts with PartitionName for the partition in partitions.conf. The options include:

  1. Scheduling memory threshold LoadSchedMem
  2. Stop memory threshold LoadStopMem
  3. Scheduling CPU utilization threshold LoadSchedUt
  4. Stop CPU utilization threshold LoadStopUt
  5. Scheduling 15-second average CPU run-queue length threshold LoadSchedR15s
  6. Stop 15-second average CPU run-queue length threshold LoadStopR15s
  7. Scheduling 1-minute average CPU run-queue length threshold LoadSchedR1m
  8. Stop 1-minute average CPU run-queue length threshold LoadStopR1m
  9. Scheduling 15-minute average CPU run-queue length threshold LoadSchedR15m
  10. Stop 15-minute average CPU run-queue length threshold LoadStopR15m

Tip

In the fsched scheduler configuration, every option above takes a bare numeric value; do not append a unit.

| Option | Description |
| --- | --- |
| LoadSchedMem / LoadStopMem | Integer, default unit is MiB |
| LoadSchedUt / LoadStopUt | Integer, default unit is % |
| LoadSchedR15s / LoadStopR15s | Decimal or integer, no unit |
| LoadSchedR1m / LoadStopR1m | Decimal or integer, no unit |
| LoadSchedR15m / LoadStopR15m | Decimal or integer, no unit |

Tip

  • The current load is checked every 30 seconds. The load metrics are obtained as follows:

| Metric | Data Source | Specific Field | Corresponding top Output | Description |
| --- | --- | --- | --- | --- |
| mem | /proc/meminfo | MemAvailable line, or MemFree + Buffers + Cached | avail Mem field in the top memory line (MB) | Available memory size, updated every 5 seconds |
| ut | /proc/stat | Mainly uses the idle field; all CPU time fields are used to compute total time | EMA-smoothed value of 100% - id% | CPU utilization, EMA smoothing over a 15-second window, updated every 5 seconds |
| r15s | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 15-second EMA-smoothed run-queue length |
| r1m | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 1-minute EMA-smoothed run-queue length |
| r15m | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 15-minute EMA-smoothed run-queue length |

  • Expected time for load changes to trigger loadsched/loadstop actions (estimated; actual values may differ):

| Metric | Expected Time |
| --- | --- |
| mem | 35 seconds |
| ut | 45 seconds |
| r15s | 45 seconds |
| r1m | 1 minute 30 seconds |
| r15m | 15 minutes 30 seconds |

  • If the partition where the node resides is configured with scheduling thresholds LoadSched[XXX]:
    • The node is drained and stops accepting new jobs if any of the following conditions are met:
      • Remaining node memory is below LoadSchedMem.
      • Node CPU utilization is above LoadSchedUt.
      • Node 15-second average CPU run-queue length is greater than LoadSchedR15s.
      • Node 1-minute average CPU run-queue length is greater than LoadSchedR1m.
      • Node 15-minute average CPU run-queue length is greater than LoadSchedR15m.
    • The node is undrained and can accept new jobs if all of the following conditions are met:
      • Remaining node memory is above LoadSchedMem.
      • Node CPU utilization is below LoadSchedUt.
      • Node 15-second average CPU run-queue length is less than LoadSchedR15s.
      • Node 1-minute average CPU run-queue length is less than LoadSchedR1m.
      • Node 15-minute average CPU run-queue length is less than LoadSchedR15m.
  • If the partition where the node resides is configured with stop thresholds LoadStop[XXX] and scheduling thresholds LoadSched[XXX]:
    • If any of the following conditions are met, jobs on the node are STOPped in priority order: lower-priority jobs are stopped first and, among jobs with the same priority, the later-started job is stopped first, until only the last job remains. STOPped jobs do not release their memory:
      • Remaining node memory is below LoadStopMem.
      • Node CPU utilization is above LoadStopUt.
      • Node 15-second average CPU run-queue length is greater than LoadStopR15s.
      • Node 1-minute average CPU run-queue length is greater than LoadStopR1m.
      • Node 15-minute average CPU run-queue length is greater than LoadStopR15m.
    • If all of the following conditions are met, the node is undrained and STOPped jobs are CONTINUEd in priority order. Higher-priority jobs are continued first (for the same priority, the earlier-started job):
      • Remaining node memory is above LoadSchedMem.
      • Node CPU utilization is below LoadSchedUt.
      • Node 15-second average CPU run-queue length is less than LoadSchedR15s.
      • Node 1-minute average CPU run-queue length is less than LoadSchedR1m.
      • Node 15-minute average CPU run-queue length is less than LoadSchedR15m.
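
The drain, stop, and continue rules above can be sketched in Python as follows. This is only an illustration of the documented conditions, not fsched's implementation: the names `Load`, `Thresholds`, `Job`, `decide`, `stop_order`, and `continue_order` are hypothetical, a larger `priority` number is assumed to mean higher priority, and the `"hold"` result for a metric sitting exactly on a threshold is an assumption, since the rules above only describe strictly-above and strictly-below cases.

```python
from dataclasses import dataclass
from typing import List, Optional

RUN_METRICS = ("ut", "r15s", "r1m", "r15m")  # "higher is worse" metrics

@dataclass
class Load:
    mem: float   # available memory in MiB
    ut: float    # CPU utilization in %
    r15s: float  # smoothed run-queue lengths
    r1m: float
    r15m: float

@dataclass
class Thresholds:
    # None means the option is not configured for the partition.
    mem: Optional[float] = None
    ut: Optional[float] = None
    r15s: Optional[float] = None
    r1m: Optional[float] = None
    r15m: Optional[float] = None

def breached(load: Load, th: Thresholds) -> bool:
    """True if ANY configured threshold is crossed."""
    if th.mem is not None and load.mem < th.mem:
        return True
    return any(getattr(th, m) is not None and getattr(load, m) > getattr(th, m)
               for m in RUN_METRICS)

def cleared(load: Load, th: Thresholds) -> bool:
    """True if ALL configured metrics are strictly on the safe side."""
    if th.mem is not None and not load.mem > th.mem:
        return False
    return all(getattr(th, m) is None or getattr(load, m) < getattr(th, m)
               for m in RUN_METRICS)

def decide(load: Load, sched: Thresholds, stop: Thresholds) -> str:
    if breached(load, stop):
        return "stop"    # stop thresholds are stricter, so the node is also drained
    if breached(load, sched):
        return "drain"   # no new jobs; jobs already STOPped stay STOPped
    if cleared(load, sched):
        return "resume"  # undrain the node and CONTINUE stopped jobs
    return "hold"        # assumption: exactly on a threshold changes nothing

@dataclass
class Job:
    job_id: int
    priority: int      # assumption: larger number = higher priority
    start_time: float  # e.g. epoch seconds

def stop_order(running: List[Job]) -> List[Job]:
    """Lowest priority first, later start first on ties; the last job keeps running."""
    ordered = sorted(running, key=lambda j: (j.priority, -j.start_time))
    return ordered[:-1]

def continue_order(stopped: List[Job]) -> List[Job]:
    """Highest priority first, earlier start first on ties."""
    return sorted(stopped, key=lambda j: (-j.priority, j.start_time))
```

Note the hysteresis in `decide`: jobs are stopped at the LoadStop[XXX] level but are only continued once the load is back on the safe side of every configured LoadSched[XXX] threshold.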

Notes:

  1. You can set LoadSched[XXX] or LoadStop[XXX] for just one type of resource; you do not need to configure every resource.
  2. To set LoadStop[XXX] for a resource, you must also set the corresponding LoadSched[XXX], and the values must satisfy the following conditions for that resource type. Otherwise the configuration is invalid: reconfiguration fails, and slurmctld or slurmd fails to restart:
    • The value of LoadStopMem is less than LoadSchedMem.
    • The value of LoadStopUt is greater than LoadSchedUt.
    • The value of LoadStopR15s is greater than LoadSchedR15s.
    • The value of LoadStopR1m is greater than LoadSchedR1m.
    • The value of LoadStopR15m is greater than LoadSchedR15m.
  3. If a node belongs to multiple partitions, the configuration of the last partition takes effect.
  4. Because thresholds are checked by periodic polling, there is no guarantee that node resource usage will never exceed the load thresholds.
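
The consistency rules in note 2 can be checked before reloading the configuration. The following Python sketch is a hypothetical pre-flight check, not a tool shipped with fsched; `validate_thresholds` and its dictionary input are assumptions made for illustration, and the rules it encodes are only those listed in note 2.

```python
from typing import Dict, List

# Resource suffixes from the option list; only Mem uses "lower is worse".
SUFFIXES = ("Mem", "Ut", "R15s", "R1m", "R15m")

def validate_thresholds(opts: Dict[str, float]) -> List[str]:
    """Return a list of problems; an empty list means the rules in note 2 hold."""
    errors = []
    for suffix in SUFFIXES:
        sched = opts.get("LoadSched" + suffix)
        stop = opts.get("LoadStop" + suffix)
        if stop is None:
            continue  # LoadSched alone, or neither option, is allowed
        if sched is None:
            errors.append(f"LoadStop{suffix} is set without LoadSched{suffix}")
        elif suffix == "Mem" and not stop < sched:
            errors.append("LoadStopMem must be less than LoadSchedMem")
        elif suffix != "Mem" and not stop > sched:
            errors.append(f"LoadStop{suffix} must be greater than LoadSched{suffix}")
    return errors
```

For example, `validate_thresholds({"LoadStopUt": 90})` reports one problem, because LoadStopUt is set without LoadSchedUt.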

Example

root@compute1:~# cat /etc/slurm/partitions.conf

#
# PARTITION partition-U7KH5
PartitionName=partition-U7KH5 Nodes=compute1 Default=YES LoadSchedMem=1300 LoadStopMem=1200 LoadSchedUt=80 LoadStopUt=90
# DUMMY

# NODES
NodeName=compute1 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD
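
With this configuration, compute1 is drained when its available memory drops below 1300 MiB or its CPU utilization rises above 80%, and jobs are stopped when available memory drops below 1200 MiB or CPU utilization rises above 90%. As a rough way to inspect such a line, the Python sketch below extracts the load options from a PartitionName line. The helper `parse_load_options` is hypothetical and does simple whitespace splitting; it is not the parser fsched uses.

```python
from typing import Dict

EXAMPLE_LINE = (
    "PartitionName=partition-U7KH5 Nodes=compute1 Default=YES "
    "LoadSchedMem=1300 LoadStopMem=1200 LoadSchedUt=80 LoadStopUt=90"
)

def parse_load_options(line: str) -> Dict[str, float]:
    """Collect LoadSched*/LoadStop* key=value pairs from one partition line."""
    opts = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key.startswith(("LoadSched", "LoadStop")):
            opts[key] = float(value)  # values are bare numbers without units
    return opts

if __name__ == "__main__":
    for name, value in sorted(parse_load_options(EXAMPLE_LINE).items()):
        print(f"{name} = {value:g}")
```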