QoS Rejection Policy Flags
fsched supports QoS (Quality of Service) rejection policy flags that control what happens when a job exceeds a resource limit. Each flag targets one limit type, so administrators can decide per limit whether an over-limit job is rejected at submission or left in the queue.
This feature applies to fsched 10.101 and later.
Flag Details
DenyOnMaxPerJob
When DenyOnMaxPerJob is set, if a job requests resources that exceed the QoS per-job maximum resource limit (MaxTRESPerJob), the job is rejected immediately instead of entering the queue.
Use cases:
- Strictly limit per-job resource usage.
- Prevent users from submitting oversized jobs that would otherwise sit in the queue for a long time.
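The per-job rule can be sketched as a small shell function. This is only an illustrative model of the behavior described above, not fsched code; the function name `per_job_decision` and its yes/no flag argument are hypothetical.

```shell
# Hypothetical model of the per-job check; not taken from fsched itself.
# Usage: per_job_decision REQUESTED_CPUS MAX_CPUS_PER_JOB DENY_FLAG
#   DENY_FLAG is "yes" if DenyOnMaxPerJob is set on the QoS, otherwise "no".
per_job_decision() {
    requested=$1
    limit=$2
    deny_flag=$3
    if [ "$requested" -le "$limit" ]; then
        echo "accept"     # within MaxTRESPerJob: queued as usual
    elif [ "$deny_flag" = "yes" ]; then
        echo "reject"     # over the limit and the flag is set: submission fails
    else
        echo "pending"    # over the limit, no flag: queued but held by the limit
    fi
}
```

With a limit of 16 CPUs, a 20-CPU request yields `reject` when the flag is set and `pending` when it is not.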
DenyOnMaxPerUser
When DenyOnMaxPerUser is set, if a job would cause the user's resource usage to exceed the QoS per-user maximum resource limit (MaxTRESPerUser), the job is rejected immediately instead of entering the queue.
Use cases:
- Strictly limit per-user resource usage.
- Prevent a single user from consuming too many cluster resources.
DenyOnGrp
When DenyOnGrp is set, if a job would cause the group's resource usage to exceed the QoS group maximum resource limit (GrpTRES), the job is rejected immediately instead of entering the queue.
Use cases:
- Strictly limit total resource usage for a group.
- Avoid a group's jobs over-consuming cluster resources.
How to Use
View QoS Flags
sacctmgr show qos format=name,flags
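To check for one flag from a script, a helper such as the following may be useful. It is a sketch: `qos_has_flag` is a hypothetical name, and the commented usage assumes that fsched's sacctmgr accepts the `-n` (no header) and `-P` (parsable) options the way Slurm's sacctmgr does.

```shell
# qos_has_flag FLAGS_CSV FLAG
# Returns 0 if FLAG is one of the entries in the comma-separated FLAGS_CSV,
# otherwise 1. Whole entries are compared, so "DenyOnMax" does not match
# "DenyOnMaxPerJob".
qos_has_flag() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Possible use on a live cluster (assumed options, left commented out):
# flags=$(sacctmgr -n -P show qos limited_job format=flags)
# if qos_has_flag "$flags" "DenyOnMaxPerJob"; then
#     echo "limited_job rejects oversized jobs at submission"
# fi
```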
Set Rejection Policy Flags
Use sacctmgr to set QoS flags:
# Set a single flag
sacctmgr modify qos <qos_name> set flags=DenyOnMaxPerJob
# Set multiple flags
sacctmgr modify qos <qos_name> set flags=DenyOnMaxPerJob,DenyOnMaxPerUser
# Add a flag to existing flags
sacctmgr modify qos <qos_name> set flags+=DenyOnGrp
Remove Rejection Policy Flags
# Remove a specific flag
sacctmgr modify qos <qos_name> set flags-=DenyOnMaxPerJob
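The difference between the three operators can be modeled on a plain comma-separated list. The sketch below assumes that `flags=` replaces the whole list, as the "Add a flag to existing flags" comment implies; `apply_flag_op` is a hypothetical helper that does not call sacctmgr.

```shell
# apply_flag_op CURRENT_CSV OP FLAG_CSV
#   OP is "=" (replace the list), "+=" (add flags) or "-=" (remove flags).
# Prints the resulting comma-separated flag list.
apply_flag_op() {
    current=$1
    op=$2
    value=$3
    case "$op" in
        "=")
            result=$value ;;
        "+=")
            result=$current
            for f in $(echo "$value" | tr ',' ' '); do
                case ",$result," in
                    *",$f,"*) ;;                          # already present
                    *) result=${result:+$result,}$f ;;    # append new flag
                esac
            done ;;
        "-=")
            result=
            for f in $(echo "$current" | tr ',' ' '); do
                case ",$value," in
                    *",$f,"*) ;;                          # flag is being removed
                    *) result=${result:+$result,}$f ;;    # keep the others
                esac
            done ;;
        *)
            echo "unknown operator: $op" >&2
            return 1 ;;
    esac
    echo "$result"
}
```

Under this model, starting from `DenyOnMaxPerJob,DenyOnMaxPerUser`, the `+=` operator with `DenyOnGrp` keeps both existing flags, whereas `=` with `DenyOnGrp` would leave only `DenyOnGrp`.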
Configuration Examples
Example 1: Limit Per-Job Resources and Reject Oversized Jobs Immediately
- Create a QoS and set the per-job maximum CPU count to 16.
sacctmgr add qos limited_job
sacctmgr modify qos limited_job set MaxTRESPerJob=cpu=16
- Set the DenyOnMaxPerJob flag.
sacctmgr modify qos limited_job set flags=DenyOnMaxPerJob
- Associate the QoS with a user.
sacctmgr modify user alice account=myaccount set qos=limited_job
- Test: when user alice submits a job requesting more than 16 CPUs, it is rejected immediately.
alice@head:~$ sbatch -n 20 job.sh
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy
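Because the submission now fails outright, submit scripts may want to tell a policy rejection apart from other errors. A minimal sketch, assuming the error text is exactly the one shown above; `is_qos_rejection` is a hypothetical helper name.

```shell
# is_qos_rejection MESSAGE
# Returns 0 if MESSAGE contains the policy-rejection text printed by sbatch
# in the example above, otherwise 1.
is_qos_rejection() {
    case "$1" in
        *"Job violates accounting/QOS policy"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use in a submit wrapper (needs a live cluster, left commented out):
# if ! out=$(sbatch -n 20 job.sh 2>&1); then
#     if is_qos_rejection "$out"; then
#         echo "Rejected by QoS policy: lower the request instead of retrying" >&2
#     fi
# fi
```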
Example 2: Limit Total User Resources and Reject Oversized Jobs Immediately
- Create a QoS and set the per-user maximum CPU count to 32.
sacctmgr add qos limited_user
sacctmgr modify qos limited_user set MaxTRESPerUser=cpu=32
- Set the DenyOnMaxPerUser flag.
sacctmgr modify qos limited_user set flags=DenyOnMaxPerUser
- Associate the QoS with a user.
sacctmgr modify user bob account=myaccount set qos=limited_user
- Test: when user bob already has jobs using 24 CPUs, submitting another job that requests 16 CPUs is rejected immediately, because 24 + 16 = 40 exceeds the limit of 32.
bob@head:~$ squeue -u bob
JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
101    partition  job1  bob   R   0:30  3      compute[1-3]
bob@head:~$ sbatch -n 16 job2.sh
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy
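The per-user check in this example comes down to one comparison: current usage plus the new request against the limit. The sketch below illustrates that arithmetic under the assumption that "exceed" means strictly greater than the limit; `would_exceed` is a hypothetical name, not an fsched command.

```shell
# would_exceed CURRENT_USAGE REQUEST LIMIT
# Returns 0 (true) if CURRENT_USAGE + REQUEST is greater than LIMIT.
would_exceed() {
    [ $(( $1 + $2 )) -gt "$3" ]
}

# Example 2: bob already uses 24 CPUs, asks for 16 more, the limit is 32.
if would_exceed 24 16 32; then
    echo "24 + 16 = $((24 + 16)) > 32: reject"
fi
```

Under the same assumption, a request for 8 CPUs would be accepted, since 24 + 8 = 32 does not exceed the limit.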
Example 3: Combine Multiple Flags
- Create a QoS and set multiple limits.
sacctmgr add qos strict_qos
sacctmgr modify qos strict_qos set MaxTRESPerJob=cpu=16
sacctmgr modify qos strict_qos set MaxTRESPerUser=cpu=32
sacctmgr modify qos strict_qos set GrpTRES=cpu=64
- Set all three rejection policy flags at the same time.
sacctmgr modify qos strict_qos set flags=DenyOnMaxPerJob,DenyOnMaxPerUser,DenyOnGrp
With this configuration:
- A single job requesting more than 16 CPUs will be rejected.
- A new job will be rejected if it would push the user's total usage above 32 CPUs.
- A new job will be rejected if it would push the group's total usage above 64 CPUs.
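The three rules can be combined into one illustrative check. This is a sketch only: the order in which the limits are tested is an assumption, and `strict_qos_check` is a hypothetical function that reports which flag would reject a request.

```shell
# Limits from Example 3, in CPUs.
MAX_PER_JOB=16
MAX_PER_USER=32
MAX_GRP=64

# strict_qos_check REQUEST USER_USAGE GROUP_USAGE
# Prints the first flag that would reject the job, or "accept".
# The evaluation order (job, then user, then group) is assumed.
strict_qos_check() {
    if [ "$1" -gt "$MAX_PER_JOB" ]; then
        echo "DenyOnMaxPerJob"
    elif [ $(( $2 + $1 )) -gt "$MAX_PER_USER" ]; then
        echo "DenyOnMaxPerUser"
    elif [ $(( $3 + $1 )) -gt "$MAX_GRP" ]; then
        echo "DenyOnGrp"
    else
        echo "accept"
    fi
}
```

For instance, `strict_qos_check 16 8 56` prints `DenyOnGrp`: the job fits the per-job limit and the user stays at 24 CPUs, but the group total would reach 72.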
Behavior Comparison
Without Rejection Policy Flags (Default Behavior)
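When a job exceeds a resource limit, it is still accepted into the queue. It enters the PENDING state, with the limit shown as the reason, and waits until it can run within the limit (for example, after the user's running jobs release resources).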
bob@head:~$ squeue
JOBID  PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
102    partition  job2  bob   PD  0:00  1      (MaxCpuPerUser)
101    partition  job1  bob   R   0:30  3      compute[1-3]
With Rejection Policy Flags Enabled
When a job exceeds resource limits, it is rejected immediately and the submission fails.
bob@head:~$ sbatch job2.sh
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy
Notes
- Rejection policy flags cause job submission to fail instead of waiting. Enable them only where an immediate failure is preferable to queuing, since users and submit scripts must then handle the failed submission.
- These flags affect only newly submitted jobs that exceed limits, and do not affect jobs that are already running.
- Before enabling rejection policies, test the resource limits without the flags first to confirm they are reasonable.
- Rejection policy flags are different from the DenyOnLimit flag: DenyOnLimit applies to all limit types, while these three flags can target different limit types separately.
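When trialing limits without the flags, it can help to count how many queued jobs a given limit is holding back, since those are the submissions that would fail once the flag is enabled. The sketch below is one way to do that; `count_pending_with_reason` is a hypothetical helper, and the commented squeue call assumes Slurm-compatible options (`-h`, `-t PD`, and `-o` with `%i` and `%r`).

```shell
# count_pending_with_reason PATTERN
# Reads "JOBID REASON" lines on stdin and prints how many reasons contain
# PATTERN as a plain substring.
count_pending_with_reason() {
    awk -v pat="$1" 'index($2, pat) > 0 { n++ } END { print n + 0 }'
}

# Possible use on a live cluster (assumed squeue options, left commented out):
# squeue -h -t PD -o "%i %r" | count_pending_with_reason "MaxCpuPerUser"
```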