Partition OverMemoryKill

fsched supports a per-partition policy that kills a job when its memory usage exceeds the memory it requested. To enable the policy, set the option on the partition's PartitionName line in partitions.conf. The available option is:

  1. Kill jobs on memory overuse: OverMemoryKill
  • If the partition where the job runs is configured with OverMemoryKill=YES, the job will be killed when the memory it uses exceeds the memory it requested.
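
To check which partitions have the policy enabled, the PartitionName lines can be scanned for the option. The following is a minimal sketch, not part of fsched: it assumes each partition is defined on a single line of whitespace-separated Key=Value pairs, that the value is compared case-insensitively, and that a partition without the option is treated as not enabled. The helper names are hypothetical.

```python
# Sketch: list partitions whose PartitionName line sets OverMemoryKill=YES.
# Assumptions (not confirmed by the fsched documentation): one partition per
# line, whitespace-separated Key=Value pairs, case-insensitive YES, and an
# absent option meaning "not enabled".

def parse_partition_line(line):
    """Return a dict of Key=Value pairs for a PartitionName line, else None."""
    stripped = line.strip()
    if not stripped.startswith("PartitionName="):
        return None
    fields = {}
    for token in stripped.split():
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed Key=Value tokens
            fields[key] = value
    return fields


def partitions_with_over_memory_kill(conf_text):
    """Return the names of partitions that enable OverMemoryKill."""
    names = []
    for line in conf_text.splitlines():
        fields = parse_partition_line(line)
        if fields and fields.get("OverMemoryKill", "").upper() == "YES":
            names.append(fields["PartitionName"])
    return names


sample = """\
# PARTITION partition-8PSBV
PartitionName=partition-8PSBV Nodes=compute1,compute2 Default=YES OverMemoryKill=YES
NodeName=compute1 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD
"""
print(partitions_with_over_memory_kill(sample))
```

Comment lines and NodeName lines are skipped because they do not start with PartitionName=.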

Notes:

  1. The kill may happen some time after the job exceeds its limit, so this option does not guarantee that the system will never run out of memory (OOM).
  2. If the task plugin is configured as cgroup, cgroup enforces the limit and kills the job; otherwise job accounting (jobacct) is used.
  3. cgroup currently does not work on Ubuntu 22.
  4. If no memory request is specified when submitting a job, the default memory is used as the requested memory.
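
The rules in notes 2 and 4 can be modeled as a small decision function. This is an illustrative sketch only, not fsched's implementation: the function names are hypothetical, the task plugin strings follow Slurm's naming (for example task/cgroup) as an assumption, and the limit is taken to be the requested megabytes times 1024 × 1024 bytes.

```python
# Sketch of the rules in the notes above; not the scheduler's actual code.
# All function names are hypothetical and exist only for illustration.

BYTES_PER_MB = 1024 * 1024  # assumption: 1 MB of requested memory = 1048576 bytes


def enforcement_mechanism(task_plugin):
    """Note 2: cgroup if the task plugin is cgroup, otherwise jobacct."""
    return "cgroup" if "cgroup" in task_plugin.lower() else "jobacct"


def effective_request_mb(requested_mb, default_mb):
    """Note 4: fall back to the default memory when no request was given."""
    return default_mb if requested_mb is None else requested_mb


def should_kill(over_memory_kill, used_bytes, requested_mb, default_mb):
    """A job is eligible to be killed only if the partition enables the
    policy and usage exceeds the effective request."""
    limit_bytes = effective_request_mb(requested_mb, default_mb) * BYTES_PER_MB
    return over_memory_kill and used_bytes > limit_bytes


print(enforcement_mechanism("task/cgroup"))    # cgroup
print(enforcement_mechanism("task/affinity"))  # jobacct
print(should_kill(True, 7354368, 3, 1024))     # True: 7354368 > 3 * 1048576
print(should_kill(True, 7354368, None, 1024))  # False: default of 1024 MB applies
```

The default of 1024 MB in the last two calls is a made-up value used only to show the fallback.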

Example

root@head1:~# cat /etc/slurm/partitions.conf

#
# PARTITION partition-8PSBV
PartitionName=partition-8PSBV Nodes=compute1,compute2 Default=YES OverMemoryKill=YES
# DUMMY

# NODES
NodeName=compute1 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD
NodeName=compute2 CPUs=16 RealMemory=13926 Weight=1 State=CLOUD


root@head1:~# srun --mem=3 hostname
srun: Exceeded job memory limit
Jul 17 15:43:46.625179 206113 slurmstepd 0x14becbec1b80: E: Step 463.0 exceeded memory limit (7354368 > 3145728), being killed
compute1
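
In this run, --mem=3 requests 3 MB, which corresponds to the limit of 3145728 bytes (3 × 1024 × 1024) in the slurmstepd message, while the step had used 7354368 bytes (about 7.01 MiB). A minimal sketch for extracting those numbers from such a message follows; the regular expression is an assumption based only on the single log line shown above, and the function name is hypothetical.

```python
import re

# The slurmstepd message from the example above, copied verbatim.
LOG_LINE = (
    "Jul 17 15:43:46.625179 206113 slurmstepd 0x14becbec1b80: E: "
    "Step 463.0 exceeded memory limit (7354368 > 3145728), being killed"
)

MIB = 1024 * 1024

# Assumption: the message always has the form seen in the example,
# "Step <id> exceeded memory limit (<used> > <limit>)".
PATTERN = re.compile(r"Step (\S+) exceeded memory limit \((\d+) > (\d+)\)")


def parse_exceeded(line):
    """Return (step_id, used_bytes, limit_bytes), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


step, used, limit = parse_exceeded(LOG_LINE)
print(f"step {step}: used {used / MIB:.2f} MiB, limit {limit / MIB:.2f} MiB")
```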