Load Thresholds
fsched supports per-partition load thresholds for memory (in MiB), CPU utilization (in %), and average CPU run-queue length (decimal, no unit). Configure them on the partition's `PartitionName` line in `partitions.conf`. The options are:
- `LoadSchedMem` - Scheduling memory threshold
- `LoadStopMem` - Stop memory threshold
- `LoadSchedUt` - Scheduling CPU utilization threshold
- `LoadStopUt` - Stop CPU utilization threshold
- `LoadSchedR15s` - Scheduling 15-second average CPU run-queue length threshold
- `LoadStopR15s` - Stop 15-second average CPU run-queue length threshold
- `LoadSchedR1m` - Scheduling 1-minute average CPU run-queue length threshold
- `LoadStopR1m` - Stop 1-minute average CPU run-queue length threshold
- `LoadSchedR15m` - Scheduling 15-minute average CPU run-queue length threshold
- `LoadStopR15m` - Stop 15-minute average CPU run-queue length threshold
Tip: For fsched scheduler configuration, all options above are numeric and must not include units.
| Option | Description |
|---|---|
| LoadSchedMem/LoadStopMem | Integer, default unit is MiB |
| LoadSchedUt/LoadStopUt | Integer, default unit is % |
| LoadSchedR15s/LoadStopR15s | Decimal or integer, no unit |
| LoadSchedR1m/LoadStopR1m | Decimal or integer, no unit |
| LoadSchedR15m/LoadStopR15m | Decimal or integer, no unit |
Tip:
- The current load is checked every 30 seconds. The load metrics are obtained as follows:
| Metric | Data Source | Field(s) Used | Corresponding top Output | Description |
|---|---|---|---|---|
| mem | /proc/meminfo | MemAvailable line or MemFree + Buffers + Cached | avail Mem field in the top memory line (MB) | Available memory size, updated every 5 seconds |
| ut | /proc/stat | Mainly uses the idle field; all CPU time fields are used to compute total time | EMA-smoothed value of 100% - id% | CPU utilization, EMA smoothing over a 15-second window, updated every 5 seconds |
| r15s | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 15-second EMA-smoothed run-queue length |
| r1m | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 1-minute EMA-smoothed run-queue length |
| r15m | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 15-minute EMA-smoothed run-queue length |
- Expected time for load changes to trigger loadsched/loadstop actions (estimated; actual values may differ):
| Metric | Expected Time |
|---|---|
| mem | 35 seconds |
| ut | 45 seconds |
| r15s | 45 seconds |
| r1m | 1 minute 30 seconds |
| r15m | 15 minutes 30 seconds |
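These estimates appear to combine each metric's update or averaging window with the 30-second threshold check interval, e.g. mem: 5 s + 30 s = 35 s; ut and r15s: 15 s + 30 s = 45 s; r1m: 60 s + 30 s = 90 s; r15m: 900 s + 30 s = 930 s. The following sketch is an illustration rather than fsched source code; it shows how the raw inputs for the mem and ut metrics described in the table above can be read from `/proc/meminfo` and `/proc/stat` and EMA-smoothed:

```python
# Illustrative sketch only (not fsched source): reading the raw inputs for the
# mem and ut metrics and applying EMA smoothing, per the table above.

def available_mem_mib():
    """Available memory in MiB: MemAvailable, or MemFree + Buffers + Cached as a fallback."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # /proc/meminfo reports values in kB
    kb = fields.get("MemAvailable",
                    fields["MemFree"] + fields["Buffers"] + fields["Cached"])
    return kb // 1024

def cpu_total_and_idle():
    """Total and idle CPU time in clock ticks, from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        parts = f.readline().split()             # first line: cpu  user nice system idle ...
    times = [int(x) for x in parts[1:]]
    return sum(times), times[3]                  # the 4th time field is idle

def utilization_sample(prev, curr):
    """CPU utilization (%) between two cpu_total_and_idle() readings: 100% - idle%."""
    d_total = curr[0] - prev[0]
    d_idle = curr[1] - prev[1]
    return 100.0 * (1.0 - d_idle / d_total) if d_total else 0.0

def ema(previous, sample, interval_s, window_s):
    """Exponential moving average over a window_s-second window, sampled every interval_s seconds."""
    alpha = interval_s / window_s
    return previous + alpha * (sample - previous)
```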
- If the partition where the node resides is configured with scheduling thresholds `LoadSched[XXX]`:
    - The node is drained and stops accepting new jobs if any of the following conditions are met:
        - Remaining node memory is below `LoadSchedMem`.
        - Node CPU utilization is above `LoadSchedUt`.
        - Node 15-second average CPU run-queue length is greater than `LoadSchedR15s`.
        - Node 1-minute average CPU run-queue length is greater than `LoadSchedR1m`.
        - Node 15-minute average CPU run-queue length is greater than `LoadSchedR15m`.
    - The node is undrained and can accept new jobs if all of the following conditions are met:
        - Remaining node memory is above `LoadSchedMem`.
        - Node CPU utilization is below `LoadSchedUt`.
        - Node 15-second average CPU run-queue length is less than `LoadSchedR15s`.
        - Node 1-minute average CPU run-queue length is less than `LoadSchedR1m`.
        - Node 15-minute average CPU run-queue length is less than `LoadSchedR15m`.
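A minimal sketch of this any/all drain hysteresis, assuming all five scheduling thresholds are configured (unset thresholds would simply be skipped); the type and field names are illustrative, not fsched internals:

```python
# Illustrative sketch only (not fsched source): drain when ANY scheduling
# threshold is crossed, undrain only when ALL metrics are back within limits.
from dataclasses import dataclass

@dataclass
class Thresholds:          # LoadSched* values for one partition (illustrative)
    mem: float
    ut: float
    r15s: float
    r1m: float
    r15m: float

@dataclass
class NodeLoad:            # current node load (illustrative)
    mem_avail_mib: float
    ut_pct: float
    r15s: float
    r1m: float
    r15m: float

def should_drain(load: NodeLoad, sched: Thresholds) -> bool:
    """Drain the node if ANY scheduling threshold is crossed."""
    return (load.mem_avail_mib < sched.mem
            or load.ut_pct > sched.ut
            or load.r15s > sched.r15s
            or load.r1m > sched.r1m
            or load.r15m > sched.r15m)

def should_undrain(load: NodeLoad, sched: Thresholds) -> bool:
    """Undrain only when ALL metrics are back on the safe side of the scheduling thresholds."""
    return (load.mem_avail_mib > sched.mem
            and load.ut_pct < sched.ut
            and load.r15s < sched.r15s
            and load.r1m < sched.r1m
            and load.r15m < sched.r15m)
```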
- If the partition where the node resides is configured with stop thresholds `LoadStop[XXX]` and scheduling thresholds `LoadSched[XXX]`:
    - If any of the following conditions are met, jobs on the node are STOPped in priority order (for the same priority, by start time). STOPped jobs do not release memory. Lower-priority jobs are stopped first (for the same priority, the later-started job), until only the last job remains:
        - Remaining node memory is below `LoadStopMem`.
        - Node CPU utilization is above `LoadStopUt`.
        - Node 15-second average CPU run-queue length is greater than `LoadStopR15s`.
        - Node 1-minute average CPU run-queue length is greater than `LoadStopR1m`.
        - Node 15-minute average CPU run-queue length is greater than `LoadStopR15m`.
    - If all of the following conditions are met, the node is undrained and STOPped jobs are CONTINUEd in priority order. Higher-priority jobs are continued first (for the same priority, the earlier-started job):
        - Remaining node memory is above `LoadSchedMem`.
        - Node CPU utilization is below `LoadSchedUt`.
        - Node 15-second average CPU run-queue length is less than `LoadSchedR15s`.
        - Node 1-minute average CPU run-queue length is less than `LoadSchedR1m`.
        - Node 15-minute average CPU run-queue length is less than `LoadSchedR15m`.
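A minimal sketch of the STOP/CONTINUE ordering described above; the `Job` fields and function names are illustrative assumptions, not fsched internals:

```python
# Illustrative sketch only (not fsched source): ordering of STOP and CONTINUE actions.
from dataclasses import dataclass

@dataclass
class Job:                     # illustrative job record
    job_id: int
    priority: int
    start_time: float          # e.g. Unix timestamp

def stop_order(running_jobs):
    """Order in which jobs would be STOPped: lowest priority first, ties broken
    by the later start time. Jobs are stopped one at a time in this order until
    the load condition clears or only one job remains on the node."""
    ordered = sorted(running_jobs, key=lambda j: (j.priority, -j.start_time))
    return ordered[:-1]        # the last remaining job is never stopped

def continue_order(stopped_jobs):
    """Order in which STOPped jobs would be CONTINUEd: highest priority first,
    ties broken by the earlier start time."""
    return sorted(stopped_jobs, key=lambda j: (-j.priority, j.start_time))
```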
Notes:
- For a resource type, you can set only `LoadSched[XXX]`, without the corresponding `LoadStop[XXX]`.
- If you want to set `LoadStop[XXX]` for a resource, you must also set the corresponding `LoadSched[XXX]`, and the following conditions must be met per resource type. Otherwise the configuration is invalid: reconfiguration will fail, and `slurmctld` or `slurmd` restart will fail.
    - The value of `LoadStopMem` is less than `LoadSchedMem`.
    - The value of `LoadStopUt` is greater than `LoadSchedUt`.
    - The value of `LoadStopR15s` is greater than `LoadSchedR15s`.
    - The value of `LoadStopR1m` is greater than `LoadSchedR1m`.
    - The value of `LoadStopR15m` is greater than `LoadSchedR15m`.
- If a node belongs to multiple partitions, the configuration of the last partition takes effect.
- Because threshold checks rely on periodic polling, there is no guarantee that node resource usage never exceeds the load thresholds.
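A minimal sketch of these validity rules, assuming the thresholds are supplied as a dict keyed by option name (a missing key meaning the option is not configured); this is an illustration, not fsched's actual configuration parser:

```python
# Illustrative sketch only (not fsched source): validity rules for
# LoadStop*/LoadSched* pairs as listed in the notes above.

def thresholds_valid(cfg: dict) -> bool:
    """cfg maps option names (e.g. 'LoadStopMem') to numbers; missing means not configured."""
    rules = [
        # (stop option, sched option, required relation between stop and sched)
        ("LoadStopMem",  "LoadSchedMem",  lambda stop, sched: stop < sched),
        ("LoadStopUt",   "LoadSchedUt",   lambda stop, sched: stop > sched),
        ("LoadStopR15s", "LoadSchedR15s", lambda stop, sched: stop > sched),
        ("LoadStopR1m",  "LoadSchedR1m",  lambda stop, sched: stop > sched),
        ("LoadStopR15m", "LoadSchedR15m", lambda stop, sched: stop > sched),
    ]
    for stop_opt, sched_opt, relation in rules:
        stop = cfg.get(stop_opt)
        if stop is None:
            continue                   # LoadSched alone (or neither) is allowed
        sched = cfg.get(sched_opt)
        if sched is None:
            return False               # LoadStop requires the matching LoadSched
        if not relation(stop, sched):
            return False               # stop/sched relation violated
    return True

# e.g. the example configuration below passes:
# thresholds_valid({"LoadSchedMem": 1300, "LoadStopMem": 1200,
#                   "LoadSchedUt": 80, "LoadStopUt": 90})  -> True
```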
Example
```
root@compute1:~# cat /etc/slurm/partitions.conf
#
# PARTITION partition-U7KH5
PartitionName=partition-U7KH5 Nodes=compute1 Default=YES LoadSchedMem=1300 LoadStopMem=1200 LoadSchedUt=80 LoadStopUt=90
# DUMMY
# NODES
NodeName=compute1 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD
```
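With this configuration, compute1 is drained once its available memory falls below 1300 MiB or its CPU utilization rises above 80%, and is undrained once memory is back above 1300 MiB and utilization back below 80%. Jobs on the node start being STOPped once available memory falls below 1200 MiB or utilization rises above 90%. No run-queue length thresholds are set, so those metrics are not checked.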