Load Thresholds
fsched supports per-partition load thresholds for memory (in MiB), CPU utilization (in %), and average CPU run-queue length (decimal, no unit). Configure them on the partition's `PartitionName` line in `partitions.conf`. The options are:
- `LoadSchedMem` - Scheduling memory threshold
- `LoadStopMem` - Stop memory threshold
- `LoadSchedUt` - Scheduling CPU utilization threshold
- `LoadStopUt` - Stop CPU utilization threshold
- `LoadSchedR15s` - Scheduling 15-second average CPU run-queue length threshold
- `LoadStopR15s` - Stop 15-second average CPU run-queue length threshold
- `LoadSchedR1m` - Scheduling 1-minute average CPU run-queue length threshold
- `LoadStopR1m` - Stop 1-minute average CPU run-queue length threshold
- `LoadSchedR15m` - Scheduling 15-minute average CPU run-queue length threshold
- `LoadStopR15m` - Stop 15-minute average CPU run-queue length threshold
Tip: For fsched scheduler configuration, all options above are numeric and must not include units.
| Option | Description |
|---|---|
| LoadSchedMem/LoadStopMem | Integer, default unit is MiB |
| LoadSchedUt/LoadStopUt | Integer, default unit is % |
| LoadSchedR15s/LoadStopR15s | Decimal or integer, no unit |
| LoadSchedR1m/LoadStopR1m | Decimal or integer, no unit |
| LoadSchedR15m/LoadStopR15m | Decimal or integer, no unit |
Tip:
- The current load is checked every 30 seconds. The load metrics are obtained as follows:
| Metric | Data Source | Field(s) Used | Corresponding top Output | Description |
|---|---|---|---|---|
| mem | /proc/meminfo | MemAvailable line or MemFree + Buffers + Cached | avail Mem field in the top memory line (MB) | Available memory size, updated every 5 seconds |
| ut | /proc/stat | Mainly uses the idle field; all CPU time fields are used to compute total time | EMA-smoothed value of 100% - id% | CPU utilization, EMA smoothing over a 15-second window, updated every 5 seconds |
| r15s | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 15-second EMA-smoothed run-queue length |
| r1m | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 1-minute EMA-smoothed run-queue length |
| r15m | /proc/stat | procs_running line, procs_blocked line | No corresponding output (running count can be seen in the Tasks line) | 15-minute EMA-smoothed run-queue length |
- Expected time for load changes to trigger loadsched/loadstop actions (estimated; actual values may differ):
| Metric | Expected Time |
|---|---|
| mem | 35 seconds |
| ut | 45 seconds |
| r15s | 45 seconds |
| r1m | 1 minute 30 seconds |
| r15m | 15 minutes 30 seconds |
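These estimates appear to combine each metric's update or averaging window with the 30-second threshold check interval, e.g. mem: 5 s + 30 s = 35 s; ut and r15s: 15 s + 30 s = 45 s; r1m: 60 s + 30 s = 90 s; r15m: 900 s + 30 s = 930 s. The following sketch is an illustration rather than fsched source code; it shows how the raw inputs for the mem and ut metrics described in the table above can be read from `/proc/meminfo` and `/proc/stat` and EMA-smoothed:

```python
# Illustrative sketch only (not fsched source): reading the raw inputs for the
# mem and ut metrics and applying EMA smoothing, per the table above.

def available_mem_mib():
    """Available memory in MiB: MemAvailable, or MemFree + Buffers + Cached as a fallback."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # /proc/meminfo reports values in kB
    kb = fields.get("MemAvailable",
                    fields["MemFree"] + fields["Buffers"] + fields["Cached"])
    return kb // 1024

def cpu_total_and_idle():
    """Total and idle CPU time in clock ticks, from the aggregate 'cpu' line of /proc/stat."""
    with open("/proc/stat") as f:
        parts = f.readline().split()             # first line: cpu  user nice system idle ...
    times = [int(x) for x in parts[1:]]
    return sum(times), times[3]                  # the 4th time field is idle

def utilization_sample(prev, curr):
    """CPU utilization (%) between two cpu_total_and_idle() readings: 100% - idle%."""
    d_total = curr[0] - prev[0]
    d_idle = curr[1] - prev[1]
    return 100.0 * (1.0 - d_idle / d_total) if d_total else 0.0

def ema(previous, sample, interval_s, window_s):
    """Exponential moving average over a window_s-second window, sampled every interval_s seconds."""
    alpha = interval_s / window_s
    return previous + alpha * (sample - previous)
```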
- If the partition where the node resides is configured with scheduling thresholds `LoadSched[XXX]`:
    - The node is drained and stops accepting new jobs if any of the following conditions are met:
        - Remaining node memory is below `LoadSchedMem`.
        - Node CPU utilization is above `LoadSchedUt`.
        - Node 15-second average CPU run-queue length is greater than `LoadSchedR15s`.
        - Node 1-minute average CPU run-queue length is greater than `LoadSchedR1m`.
        - Node 15-minute average CPU run-queue length is greater than `LoadSchedR15m`.
    - The node is undrained and can accept new jobs if all of the following conditions are met:
        - Remaining node memory is above `LoadSchedMem`.
        - Node CPU utilization is below `LoadSchedUt`.
        - Node 15-second average CPU run-queue length is less than `LoadSchedR15s`.
        - Node 1-minute average CPU run-queue length is less than `LoadSchedR1m`.
        - Node 15-minute average CPU run-queue length is less than `LoadSchedR15m`.
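A minimal sketch of this any/all drain hysteresis, assuming all five scheduling thresholds are configured (unset thresholds would simply be skipped); the type and field names are illustrative, not fsched internals:

```python
# Illustrative sketch only (not fsched source): drain when ANY scheduling
# threshold is crossed, undrain only when ALL metrics are back within limits.
from dataclasses import dataclass

@dataclass
class Thresholds:          # LoadSched* values for one partition (illustrative)
    mem: float
    ut: float
    r15s: float
    r1m: float
    r15m: float

@dataclass
class NodeLoad:            # current node load (illustrative)
    mem_avail_mib: float
    ut_pct: float
    r15s: float
    r1m: float
    r15m: float

def should_drain(load: NodeLoad, sched: Thresholds) -> bool:
    """Drain the node if ANY scheduling threshold is crossed."""
    return (load.mem_avail_mib < sched.mem
            or load.ut_pct > sched.ut
            or load.r15s > sched.r15s
            or load.r1m > sched.r1m
            or load.r15m > sched.r15m)

def should_undrain(load: NodeLoad, sched: Thresholds) -> bool:
    """Undrain only when ALL metrics are back on the safe side of the scheduling thresholds."""
    return (load.mem_avail_mib > sched.mem
            and load.ut_pct < sched.ut
            and load.r15s < sched.r15s
            and load.r1m < sched.r1m
            and load.r15m < sched.r15m)
```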
- If the partition where the node resides is configured with stop thresholds `LoadStop[XXX]` and scheduling thresholds `LoadSched[XXX]`:
    - If any of the following conditions are met, jobs on the node are STOPped in priority order (for the same priority, by start time). STOPped jobs do not release memory. Lower-priority jobs are stopped first (for the same priority, the later-started job), until only the last job remains:
        - Remaining node memory is below `LoadStopMem`.
        - Node CPU utilization is above `LoadStopUt`.
        - Node 15-second average CPU run-queue length is greater than `LoadStopR15s`.
        - Node 1-minute average CPU run-queue length is greater than `LoadStopR1m`.
        - Node 15-minute average CPU run-queue length is greater than `LoadStopR15m`.
    - If all of the following conditions are met, the node is undrained and STOPped jobs are CONTINUEd in priority order. Higher-priority jobs are continued first (for the same priority, the earlier-started job):
        - Remaining node memory is above `LoadSchedMem`.
        - Node CPU utilization is below `LoadSchedUt`.
        - Node 15-second average CPU run-queue length is less than `LoadSchedR15s`.
        - Node 1-minute average CPU run-queue length is less than `LoadSchedR1m`.
        - Node 15-minute average CPU run-queue length is less than `LoadSchedR15m`.
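A minimal sketch of the STOP/CONTINUE ordering described above; the `Job` fields and function names are illustrative assumptions, not fsched internals:

```python
# Illustrative sketch only (not fsched source): ordering of STOP and CONTINUE actions.
from dataclasses import dataclass

@dataclass
class Job:                     # illustrative job record
    job_id: int
    priority: int
    start_time: float          # e.g. Unix timestamp

def stop_order(running_jobs):
    """Order in which jobs would be STOPped: lowest priority first, ties broken
    by the later start time. Jobs are stopped one at a time in this order until
    the load condition clears or only one job remains on the node."""
    ordered = sorted(running_jobs, key=lambda j: (j.priority, -j.start_time))
    return ordered[:-1]        # the last remaining job is never stopped

def continue_order(stopped_jobs):
    """Order in which STOPped jobs would be CONTINUEd: highest priority first,
    ties broken by the earlier start time."""
    return sorted(stopped_jobs, key=lambda j: (-j.priority, j.start_time))
```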
Notes:
- For a resource type, you can set only `LoadSched[XXX]`, without the corresponding `LoadStop[XXX]`.
- If you want to set `LoadStop[XXX]` for a resource, you must also set the corresponding `LoadSched[XXX]`, and the following conditions must be met per resource type. Otherwise the configuration is invalid: reconfiguration will fail, and `slurmctld` or `slurmd` restart will fail.
    - The value of `LoadStopMem` is less than `LoadSchedMem`.
    - The value of `LoadStopUt` is greater than `LoadSchedUt`.
    - The value of `LoadStopR15s` is greater than `LoadSchedR15s`.
    - The value of `LoadStopR1m` is greater than `LoadSchedR1m`.
    - The value of `LoadStopR15m` is greater than `LoadSchedR15m`.
- If a node belongs to multiple partitions, the configuration of the last partition takes effect.
- Because threshold checks rely on periodic polling, there is no guarantee that node resource usage never exceeds the load thresholds.
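A minimal sketch of these validity rules, assuming the thresholds are supplied as a dict keyed by option name (a missing key meaning the option is not configured); this is an illustration, not fsched's actual configuration parser:

```python
# Illustrative sketch only (not fsched source): validity rules for
# LoadStop*/LoadSched* pairs as listed in the notes above.

def thresholds_valid(cfg: dict) -> bool:
    """cfg maps option names (e.g. 'LoadStopMem') to numbers; missing means not configured."""
    rules = [
        # (stop option, sched option, required relation between stop and sched)
        ("LoadStopMem",  "LoadSchedMem",  lambda stop, sched: stop < sched),
        ("LoadStopUt",   "LoadSchedUt",   lambda stop, sched: stop > sched),
        ("LoadStopR15s", "LoadSchedR15s", lambda stop, sched: stop > sched),
        ("LoadStopR1m",  "LoadSchedR1m",  lambda stop, sched: stop > sched),
        ("LoadStopR15m", "LoadSchedR15m", lambda stop, sched: stop > sched),
    ]
    for stop_opt, sched_opt, relation in rules:
        stop = cfg.get(stop_opt)
        if stop is None:
            continue                   # LoadSched alone (or neither) is allowed
        sched = cfg.get(sched_opt)
        if sched is None:
            return False               # LoadStop requires the matching LoadSched
        if not relation(stop, sched):
            return False               # stop/sched relation violated
    return True

# e.g. the example configuration below passes:
# thresholds_valid({"LoadSchedMem": 1300, "LoadStopMem": 1200,
#                   "LoadSchedUt": 80, "LoadStopUt": 90})  -> True
```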
Example
```
root@compute1:~# cat /etc/slurm/partitions.conf
#
# PARTITION partition-U7KH5
PartitionName=partition-U7KH5 Nodes=compute1 Default=YES LoadSchedMem=1300 LoadStopMem=1200 LoadSchedUt=80 LoadStopUt=90
# DUMMY
# NODES
NodeName=compute1 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD
```
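With this configuration, compute1 is drained once its available memory falls below 1300 MiB or its CPU utilization rises above 80%, and is undrained once memory is back above 1300 MiB and utilization back below 80%. Jobs on the node start being STOPped once available memory falls below 1200 MiB or utilization rises above 90%. No run-queue length thresholds are set, so those metrics are not checked.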