
QoS New Policy Parameters

fsched (fsched-10.61 and later) supports two new QoS policy parameters:

  1. MaxTRESRunMinsPerAccount
    • Applies per account: for each account associated with this QoS, the maximum total resource minutes (for example, CPU-minutes) that the account's running jobs can use combined.
  2. MaxTRESRunMinsPerUser
    • Applies per user: for each user associated with this QoS, the maximum total resource minutes (for example, CPU-minutes) that the user's running jobs can use combined.
Warning
  • Manually modifying Slurm accounting associations and QoS settings conflicts with the cluster quota settings in the UI. Do not make manual changes while the UI cluster quota feature is in use.
  • In addition to updating fsched, you must also update the slurmdbd container image. Otherwise, setting these two parameters triggers a bug: if the value being set for MaxTRESRunMinsPerAccount equals the current value of MaxTRESRunMinsPerUser, the setting fails silently, with no error reported.
  • If a job is submitted without -t, its time_limit defaults to unlimited. When MaxTRESRunMinsPerAccount or MaxTRESRunMinsPerUser is set, such a job stays pending indefinitely because its resource minutes always exceed the limit.
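
One way to avoid the last pitfall is to check for an explicit time limit before submitting. The following is a minimal sketch, not part of fsched: the helper names has_time_limit and srun_limited are hypothetical, and the check assumes the time limit is passed as -t or --time, as in the examples on this page. It is a rough check that looks at every argument, including those of the command being run.

```shell
# Hypothetical helper: succeeds only if the arguments contain an explicit
# time limit in one of the forms: -t 2, -t2, --time 2, --time=2.
# Note: --time-min is a different option and is deliberately not matched.
has_time_limit() {
    for arg in "$@"; do
        case "$arg" in
            -t|-t?*|--time|--time=?*) return 0 ;;
        esac
    done
    return 1
}

# Hypothetical wrapper: refuse to submit when no time limit is given,
# since such a job would stay pending under a MaxTRESRunMins* limit.
srun_limited() {
    if has_time_limit "$@"; then
        srun "$@"
    else
        echo "refusing to submit: pass -t or --time" >&2
        return 1
    fi
}
```

With this in place, srun_limited -t 2 sleep 100 is passed through to srun, while srun_limited sleep 100 is rejected before it reaches the scheduler.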

Example

  1. Assume users alice and bob exist, both with primary group jjtest.

    root@head1:~# id alice
    uid=2006(alice) gid=2013(jjtest) groups=2001(fsadmin),2003(defaultGroup),2013(jjtest)
    root@head1:~# id bob
    uid=2007(bob) gid=2013(jjtest) groups=2001(fsadmin),2003(defaultGroup),2013(jjtest)
  2. Create the account jjtest on the current cluster (fastone-18 in this example), then create associations for users alice and bob.

    sacctmgr add account jjtest cluster=fastone-18
    sacctmgr add user alice account=jjtest cluster=fastone-18
    sacctmgr add user bob account=jjtest cluster=fastone-18
  3. Create a QoS and set MaxTRESRunMinsPerAccount and MaxTRESRunMinsPerUser.

    sacctmgr add qos test_tres
    sacctmgr modify qos test_tres set MaxTRESRunMinsPerAccount=cpu=3
    sacctmgr modify qos test_tres set MaxTRESRunMinsPerUser=cpu=2
  4. Associate the QoS with the associations of users alice and bob.

    sacctmgr modify user alice account=jjtest cluster=fastone-18 set qos=test_tres
    sacctmgr modify user bob account=jjtest cluster=fastone-18 set qos=test_tres
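  5. User alice submits a job with a time_limit of 2 minutes, then a second job with a time_limit of 1 minute. Assuming each job uses one CPU, the first job already accounts for 2 CPU-minutes, which equals the MaxTRESRunMinsPerUser limit of cpu=2, so the second job stays pending with reason MaxCpuRunMinsPerUser.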

    alice@head1:~$ srun -t 2 sleep 100&
    alice@head1:~$ srun -t 1 sleep 100&
    alice@head1:~$ squeue
    JOBID  PARTITION  NAME   USER   ST  TIME  NODES  NODELIST(REASON)
    40     partition  sleep  alice  PD  0:00  1      (MaxCpuRunMinsPerUser)
    39     partition  sleep  alice  R   0:07  1      compute1
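  6. Before the jobs in step 5 finish, user bob submits a job with a time_limit of 2 minutes. Assuming one CPU per job, alice's running job already accounts for 2 of the 3 CPU-minutes allowed by MaxTRESRunMinsPerAccount, so bob's job would raise the account total to 4 and stays pending with reason MaxCpuRunMinsPerAccount.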

    bob@head1:~$ srun -t 2 sleep 600&
    bob@head1:~$ squeue
    JOBID  PARTITION  NAME   USER   ST  TIME  NODES  NODELIST(REASON)
    42     partition  sleep  alice  PD  0:00  1      (MaxCpuRunMinsPerUser)
    43     partition  sleep  bob    PD  0:00  1      (MaxCpuRunMinsPerAccount)
    41     partition  sleep  alice  R   0:21  1      compute1
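
The arithmetic behind steps 5 and 6 can be replayed with a small shell sketch. This is an illustration only, not the scheduler's implementation: the function check_job is hypothetical, each job is assumed to use one CPU, running jobs are charged their full time limit (elapsed time is ignored), and checking the per-user limit before the per-account limit is an assumption made to match the reasons shown above.

```shell
# Limits from step 3, in CPU-minutes.
per_user_limit=2      # MaxTRESRunMinsPerUser=cpu=2
per_account_limit=3   # MaxTRESRunMinsPerAccount=cpu=3

# Hypothetical check: print the expected state of a newly submitted job.
#   $1 = CPU-minutes already held by the user's running jobs
#   $2 = CPU-minutes already held by the account's running jobs
#   $3 = CPUs requested
#   $4 = time limit in minutes
check_job() {
    need=$(( $3 * $4 ))
    if [ $(( $1 + need )) -gt "$per_user_limit" ]; then
        echo "PD (MaxCpuRunMinsPerUser)"
    elif [ $(( $2 + need )) -gt "$per_account_limit" ]; then
        echo "PD (MaxCpuRunMinsPerAccount)"
    else
        echo "R"
    fi
}

check_job 0 0 1 2   # alice, first job (-t 2):  0 + 2 <= 2 and 0 + 2 <= 3
check_job 2 2 1 1   # alice, second job (-t 1): 2 + 1 > 2 for the user
check_job 0 2 1 2   # bob (-t 2): 0 + 2 <= 2 for the user, 2 + 2 > 3 for the account
```

Under these assumptions the three calls print R, PD (MaxCpuRunMinsPerUser), and PD (MaxCpuRunMinsPerAccount), matching the squeue output in steps 5 and 6.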