Fairshare Scheduling Policy

Overview

Slurm implements multifactor job prioritization through the priority/multifactor plugin, and the Fairshare policy is its core mechanism for fair resource allocation. Configured properly, Fairshare improves the fairness and efficiency of cluster resource utilization by keeping the resources each user receives in line with their share. The Fairshare policy is built on two key factors:

Fairshare Factor (resource allocation fairness factor)

  • Purpose: considers the difference between committed resources and actually allocated resources for users/accounts.
  • Principle: based on the resource allocation share (Share) for users/accounts and the actual allocated resources they receive.
  • Effect: users who receive allocations beyond their committed share get lower priority.

FairshareUsed Factor (resource usage fairness factor)

  • Purpose: considers the difference between committed resources and actually consumed resources for users/accounts.
  • Principle: based on the resource allocation share (Share) for users/accounts and their actual resource consumption.
  • Effect: users who consume more than their committed share get lower priority.
  • Supported version: FairshareUsed Factor is supported starting from fsched-10.87.

Functional Details

Fairshare Factor mechanism

  • Tracks resource allocation for users/accounts.
  • Calculates a fairness index for resource allocation.
  • Affects scheduling priority for new jobs.
  • Ensures that, over the long term, each user/account receives resources consistent with their share.

FairshareUsed Factor mechanism

  • Tracks actual resource consumption for users/accounts (CPU hours, memory usage, etc.).
  • Calculates a fairness index for resource usage.
  • Affects scheduling priority for new jobs.
  • Discourages inefficient resource usage by lowering the priority of users who consume more than their share.

Configuration Guide

Configure Association

  • Configure the cluster-account-user associations in the accounting system.
  • For the Fsched SE version: enable the "per-user resource limit" feature in the cluster management UI and set cluster resource quotas for each user.

Core Parameter Configuration

Add the following parameters to the slurm.conf configuration file:

# Enable the multifactor priority plugin
PriorityType=priority/multifactor

# Set fair scheduling weights (example values)
PriorityWeightFairshare=30 # Resource allocation fairness factor weight
PriorityWeightFairshareUsed=3000 # Resource usage fairness factor weight

Parameter Notes:

In this example, PriorityWeightFairshareUsed is set much higher (3000) than PriorityWeightFairshare (30), so actual resource consumption has a far greater impact on job priority than allocation does. Tune both values to your cluster's needs.
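To make the effect of these weights concrete, the sketch below computes each factor's contribution as weight × factor. It assumes that both factors are normalized to the range 0.0 to 1.0 and enter the job priority as a weighted sum, which is how the standard multifactor plugin treats its factors; whether FairshareUsed follows exactly the same rule is an assumption here. The `contribution` helper and the factor value 0.50 are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: contribution of one priority factor = weight * factor.
# Assumption: the factor is a normalized value between 0.0 and 1.0.
contribution() {
    # $1 = weight from slurm.conf, $2 = factor value (hypothetical)
    # Adding 0.5 before the integer conversion rounds to the nearest integer.
    awk -v w="$1" -v f="$2" 'BEGIN { printf "%d\n", w * f + 0.5 }'
}

# Weights from the example configuration; the factor values are made up.
fairshare=$(contribution 30 0.50)
fairshare_used=$(contribution 3000 0.50)

echo "Fairshare contribution:     $fairshare"
echo "FairshareUsed contribution: $fairshare_used"
```

With identical factor values, the FairshareUsed term is 100 times larger than the Fairshare term, so under these assumptions it dominates the fairshare portion of the priority.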

Weight Configuration Recommendations

  • Adjust weights to match the cluster's characteristics. In production, review resource allocation and usage fairness on a regular schedule (for example, monthly) and adjust share allocations according to observed usage.
  • In resource-constrained environments, increase the FairshareUsed weight.
  • If you want to prioritize allocation fairness, increase the Fairshare weight.

Monitoring Tools

Use sshare to Monitor Resource Shares

View detailed resource share information:

sshare -a --ext

Field descriptions:

  • RawShares: raw share value
  • NormShares: normalized share
  • RawUsage: raw resource usage
  • EffectvUsage: effective resource usage
  • FairShare: Fairshare Factor value
  • FairShareU: FairshareUsed Factor value
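For a quick ranking of users, the fields above can be post-processed with standard tools. The following is a minimal sketch, assuming pipe-delimited output as produced by the standard `sshare -P` option; whether `-P` can be combined with `--ext` on your fsched version is an assumption to verify. The helper name `fairshare_by_user` is hypothetical. Columns are located by header name, so extra columns do not break the parsing.

```shell
#!/usr/bin/env bash
# Sketch: list user associations sorted by FairShare, lowest value first.
# Reads pipe-delimited sshare output (header row included) on stdin.
fairshare_by_user() {
    awk -F'|' '
        NR == 1 {
            # Locate columns by header name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Account-level rows have an empty User field; skip them.
        $(col["User"]) != "" { print $(col["FairShare"]), $(col["User"]) }
    ' | LC_ALL=C sort -n
}

# Usage on a live cluster:
#   sshare -a -P | fairshare_by_user
```

Users at the top of the list have the lowest FairShare value and therefore receive the smallest fairshare contribution to their job priority.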

Use sprio to Analyze Job Priority

View job priority composition:

sprio --ext -l

Field descriptions:

  • PRIORITY: total priority
  • FAIRSHARE: Fairshare Factor contribution
  • FAIRSHAREU: FairshareUsed Factor contribution
  • Other standard priority factors

Usage Examples

Basic Environment Check

  1. Confirm association configuration:

    sacctmgr list assoc cluster=<cluster_name>
  2. Check the current scheduling configuration:

    scontrol show config | grep Priority
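The configuration check can also be scripted. The sketch below reads the output of `scontrol show config` on stdin and verifies two things: that PriorityType is priority/multifactor, and that PriorityFlags does not contain NO_FAIR_TREE, since FairshareUsed requires the Fair Tree algorithm. It assumes the usual `Key = Value` line layout and that NO_FAIR_TREE is the flag that selects the classic algorithm, as in upstream Slurm; the helper name `check_fairshare_config` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: validate the priority settings printed by "scontrol show config".
# Reads the config dump on stdin; returns non-zero if a check fails.
check_fairshare_config() {
    awk -F' *= *' '
        $1 == "PriorityType"  { type = $2; sub(/[ \t]+$/, "", type) }
        $1 == "PriorityFlags" { flags = $2 }
        END {
            if (type != "priority/multifactor") {
                print "PriorityType is not priority/multifactor"; bad = 1
            }
            # Assumption: NO_FAIR_TREE selects the classic fairshare algorithm.
            if (flags ~ /NO_FAIR_TREE/) {
                print "PriorityFlags contains NO_FAIR_TREE"; bad = 1
            }
            exit bad + 0
        }
    '
}

# Usage on a live cluster:
#   scontrol show config | check_fairshare_config && echo "config OK"
```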

Fairness Tests

Test scenario 1: verify Fairshare Factor

  • User alice submits multiple jobs that consume a lot of resources:

    alice@ubuntu22-4c-1:~$ for i in $(seq 100); do srun -c1 --exclusive sleep 3600 & done
  • Observe priority changes:

    Before and after the jobs complete, use sshare --ext and sprio --ext to view resource share and priority changes.

Test scenario 2: verify FairshareUsed Factor

  • User alice submits jobs with low resource consumption:

    alice@ubuntu22-4c-1:~$ srun -n4 sleep 100&
    [1] 2512319
    alice@ubuntu22-4c-1:~$ srun -n4 sleep 100&
    [2] 2512372
  • User charlie submits jobs with high resource consumption:

    charlie@ubuntu22-4c-2:~$ srun stress-ng --cpu 1 --cpu-load 100 -t 100s &
    [1] 164380
    charlie@ubuntu22-4c-2:~$ srun stress-ng --cpu 1 --cpu-load 100 -t 100s &
    [2] 164387
  • Observe priority changes:

    Before and after the jobs complete, use sshare --ext and sprio --ext to view resource share and priority changes.

Notes

  • Version compatibility: If you downgrade from this version to an older fsched release, the slurmctld service fails to start because the state file version is incompatible. Example error: "Can not recover assoc_usage state, incompatible version". To recover, manually delete the assoc_usage file under the cluster state directory, then restart slurmctld.

  • Algorithm limitation: FairshareUsed supports only the Fair Tree fairshare algorithm, and does not support the classic fairshare algorithm.

  • Weight impact: A very high FairshareUsed weight may give overly high priority to users with low resource usage. Determine the optimal weight ratio through testing.

  • Data accuracy: Ensure accounting data collection is working properly, and periodically verify the accuracy of resource usage statistics.