Partition OverMemoryKill
fsched supports a per-partition policy that kills a job when its memory usage exceeds the memory it requested. To enable the policy, set the option on the partition's PartitionName line in partitions.conf. The option is:
- Kill jobs on memory overuse: OverMemoryKill
- If the partition where the job runs is configured with OverMemoryKill=YES, the job is killed when the memory it uses exceeds the memory it requested.
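As a rough illustration of how this setting could be checked, the following Python sketch lists the partitions that have OverMemoryKill=YES. It is not part of fsched: the function names are hypothetical, and it assumes that each partition is defined on a single line of whitespace-separated Key=Value pairs and that the value is compared case-insensitively.

```python
def parse_partition_line(line):
    """Split a 'PartitionName=... Key=Value ...' line into a dict.

    Assumption: options are whitespace-separated Key=Value pairs on one line.
    """
    options = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that contain no '='
            options[key] = value
    return options


def partitions_with_over_memory_kill(conf_text):
    """Return the names of partitions whose line sets OverMemoryKill=YES."""
    names = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and other definitions such as NodeName.
        if not line.startswith("PartitionName="):
            continue
        options = parse_partition_line(line)
        # Assumption: the value is matched case-insensitively.
        if options.get("OverMemoryKill", "").upper() == "YES":
            names.append(options["PartitionName"])
    return names


if __name__ == "__main__":
    sample = (
        "# PARTITION partition-8PSBV\n"
        "PartitionName=partition-8PSBV Nodes=compute1,compute2 "
        "Default=YES OverMemoryKill=YES\n"
        "NodeName=compute1 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD\n"
    )
    print(partitions_with_over_memory_kill(sample))
```

In practice the text would be read from the partitions.conf file on the head node rather than from an inline sample.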
Notes:
- The kill may happen some time after a job exceeds its memory request, so this policy does not guarantee that the system never runs out of memory (OOM).
- If the task plugin is configured as cgroup, cgroup is used to kill jobs on memory overuse; otherwise jobacct is used.
- cgroup currently does not work on Ubuntu 22.
- If no memory request is specified when submitting a job, the default memory is used as the requested memory.
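The rule described above, including the fallback to the default memory, can be sketched as a small decision function. This is only a simplified model, not fsched's implementation: the function and parameter names are hypothetical, and it assumes memory sizes are given in units of 1024 * 1024 bytes, which is consistent with the limit shown in the example log.

```python
MIB = 1024 * 1024  # assumed size of one memory unit, in bytes


def effective_limit_bytes(requested_mb, default_mb):
    """Return the limit in bytes: the job's request, else the default memory."""
    mb = requested_mb if requested_mb is not None else default_mb
    return mb * MIB


def should_kill(used_bytes, requested_mb, default_mb, over_memory_kill):
    """Decide whether a job would be killed under the policy.

    over_memory_kill: True if the partition sets OverMemoryKill=YES.
    requested_mb: the job's memory request, or None if none was given.
    """
    if not over_memory_kill:
        return False
    return used_bytes > effective_limit_bytes(requested_mb, default_mb)


if __name__ == "__main__":
    # A request of 3 units with 7354368 bytes in use exceeds the limit.
    print(should_kill(7354368, 3, 1024, True))
    # With no request, the (hypothetical) default of 1024 units applies.
    print(should_kill(7354368, None, 1024, True))
```

The default of 1024 used here is an arbitrary illustrative value; the real default depends on the cluster configuration.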
Example
root@head1:~# cat /etc/slurm/partitions.conf
#
# PARTITION partition-8PSBV
PartitionName=partition-8PSBV Nodes=compute1,compute2 Default=YES OverMemoryKill=YES
# DUMMY
# NODES
NodeName=compute1 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD
NodeName=compute2 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD
root@head1:~# srun --mem=3 hostname
srun: Exceeded job memory limit
Jul 17 15:43:46.625179 206113 slurmstepd 0x14becbec1b80: E: Step 463.0 exceeded memory limit (7354368 > 3145728), being killed
compute1
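To relate the numbers in this output to the request, the following Python sketch extracts the used and allowed byte counts from the slurmstepd message. The regular expression and function name are assumptions based only on the single log line shown above. The limit of 3145728 bytes equals 3 * 1024 * 1024, which matches the --mem=3 request if the value is interpreted in megabytes.

```python
import re

# Pattern derived from the one log line in the example; other message
# formats are not covered.
KILL_PATTERN = re.compile(
    r"Step (?P<step>\d+\.\d+) exceeded memory limit "
    r"\((?P<used>\d+) > (?P<limit>\d+)\)"
)


def parse_kill_message(line):
    """Return step id, used bytes, and limit bytes, or None if no match."""
    match = KILL_PATTERN.search(line)
    if match is None:
        return None
    return {
        "step": match.group("step"),
        "used_bytes": int(match.group("used")),
        "limit_bytes": int(match.group("limit")),
    }


if __name__ == "__main__":
    line = (
        "Jul 17 15:43:46.625179 206113 slurmstepd 0x14becbec1b80: E: "
        "Step 463.0 exceeded memory limit (7354368 > 3145728), being killed"
    )
    info = parse_kill_message(line)
    print(info)
    # The limit corresponds to the 3 MB requested with --mem=3.
    print(info["limit_bytes"] == 3 * 1024 * 1024)
```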