Fairshare Scheduling Policy
Overview
Slurm implements multifactor job scheduling through the priority/multifactor plugin, and the Fairshare policy is its core mechanism for fair resource allocation. Configured properly, Fairshare steers cluster resources toward each user in proportion to their share, which improves both the fairness and the efficiency of resource utilization. The Fairshare policy includes two key factors:
Fairshare Factor (resource allocation fairness factor)
- Purpose: considers the difference between committed resources and actually allocated resources for users/accounts.
- Principle: based on the resource allocation share (Share) for users/accounts and the actual allocated resources they receive.
- Effect: users who receive allocations beyond their committed share get lower priority.
FairshareUsed Factor (resource usage fairness factor)
- Purpose: considers the difference between committed resources and actually consumed resources for users/accounts.
- Principle: based on the resource allocation share (Share) for users/accounts and their actual resource consumption.
- Effect: users who consume more than their committed share get lower priority.
- Supported version: FairshareUsed Factor is supported starting from fsched-10.87.
Functional Details
Fairshare Factor mechanism
- Tracks resource allocation for users/accounts.
- Calculates a fairness index for resource allocation.
- Affects scheduling priority for new jobs.
- Ensures that, over the long term, each user/account receives resources consistent with their share.
FairshareUsed Factor mechanism
- Tracks actual resource consumption for users/accounts (CPU hours, memory usage, etc.).
- Calculates a fairness index for resource usage.
- Affects scheduling priority for new jobs.
- Prevents inefficient resource usage.
Configuration Guide
Configure Association
- Configure the cluster-account-user associations in the accounting system.
- For the Fsched SE version: enable the "per-user resource limit" feature in the cluster management UI and set cluster resource quotas for each user.
Core Parameter Configuration
Add the following parameters to the slurm.conf configuration file:
# Enable the multifactor priority plugin
PriorityType=priority/multifactor
# Set fair scheduling weights (example values)
PriorityWeightFairshare=30 # Resource allocation fairness factor weight
PriorityWeightFairshareUsed=3000 # Resource usage fairness factor weight
Parameter Notes:
In this example, PriorityWeightFairshareUsed is set to a higher value (3000), which makes actual resource consumption have a greater impact on job priority. Adjust the actual value based on your cluster's needs.
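To see why the larger weight dominates, recall that in the standard priority/multifactor plugin each factor is a value between 0.0 and 1.0 that is multiplied by its weight before the terms are summed. The sketch below assumes the FairshareUsed term is combined in the same way, which is not confirmed here. The function name weighted_term and the factor value 0.5 are made up for illustration.

```shell
#!/usr/bin/env bash
# weighted_term WEIGHT FACTOR
# Prints WEIGHT * FACTOR truncated to an integer.
# Assumption: each priority term is simply weight * factor.
weighted_term() {
    awk -v w="$1" -v f="$2" 'BEGIN { printf "%d\n", w * f }'
}

# Hypothetical factor value of 0.5 for both factors,
# combined with the example weights from slurm.conf above.
echo "Fairshare term:     $(weighted_term 30 0.5)"
echo "FairshareUsed term: $(weighted_term 3000 0.5)"
```

With equal factor values, the FairshareUsed term is 100 times larger than the Fairshare term, so it would decide the ordering in almost every case.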
Weight Configuration Recommendations
- Adjust weights based on cluster characteristics. In production, it is recommended to establish a periodic review mechanism, analyze resource allocation/usage fairness monthly, and adjust share allocations based on usage.
- In resource-constrained environments, increase the FairshareUsed weight.
- If you want to prioritize allocation fairness, increase the Fairshare weight.
Monitoring Tools
Use sshare to Monitor Resource Shares
View detailed resource share information:
sshare -a --ext
Field descriptions:
- RawShares: raw share value
- NormShares: normalized share
- RawUsage: raw resource usage
- EffectvUsage: effective resource usage
- FairShare: Fairshare Factor value
- FairShareU: FairshareUsed Factor value
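When comparing the two factors per user, it can help to reduce the output to just the relevant columns. The following is a minimal sketch that reads pipe-delimited text and locates columns by header name. It assumes the FairShareU column also appears when sshare is run with its -P (parsable) option alongside --ext, which is not confirmed here. The function name and the sample rows are made up and are not real cluster output.

```shell
#!/usr/bin/env bash
# fairshare_columns: read pipe-delimited sshare output on stdin and print
# "user FairShare FairShareU" for every row that has a user name.
# Columns are located by header name, so their order does not matter.
fairshare_columns() {
    awk -F'|' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("User" in col) || !("FairShare" in col) || !("FairShareU" in col)) {
                print "missing expected column" > "/dev/stderr"
                exit 1
            }
            next
        }
        $col["User"] != "" {
            print $col["User"], $col["FairShare"], $col["FairShareU"]
        }
    '
}

# Made-up sample in the assumed layout; not real cluster output.
sample='Account|User|RawShares|NormShares|RawUsage|EffectvUsage|FairShare|FairShareU
root||||0|0.000000||
demo|alice|1|0.500000|1200|0.600000|0.250000|0.750000
demo|charlie|1|0.500000|800|0.400000|0.750000|0.250000'

printf '%s\n' "$sample" | fairshare_columns
```

On a live cluster the function might be fed from sshare -a --ext -P instead of the sample text, provided the assumption about the parsable output holds.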
Use sprio to Analyze Job Priority
View job priority composition:
sprio --ext -l
Field descriptions:
- PRIORITY: total priority
- FAIRSHARE: Fairshare Factor contribution
- FAIRSHAREU: FairshareUsed Factor contribution
- Other standard priority factors
Usage Examples
Basic Environment Check
- Confirm association configuration:
sacctmgr list assoc cluster=<cluster_name>
- Check the current scheduling configuration:
scontrol show config | grep Priority
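The configuration check can also be scripted. The sketch below reads scontrol show config style text from standard input and verifies two things: that PriorityType is priority/multifactor, and that PriorityFlags does not contain NO_FAIR_TREE (the standard Slurm flag that selects the classic algorithm, which FairshareUsed does not support according to the Notes section). The function name and the sample input are illustrative assumptions.

```shell
#!/usr/bin/env bash
# check_priority_config: read `scontrol show config` style text on stdin.
# Returns 0 and prints "ok" when PriorityType is priority/multifactor and
# PriorityFlags does not contain NO_FAIR_TREE; otherwise prints a reason
# and returns 1.
check_priority_config() {
    awk -F'=' '
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)              # strip padding around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)    # trim the value
        }
        key == "PriorityType"  { type = val }
        key == "PriorityFlags" { flags = val }
        END {
            if (type != "priority/multifactor") {
                print "PriorityType is not priority/multifactor"; exit 1
            }
            if (flags ~ /NO_FAIR_TREE/) {
                print "classic fairshare is enabled (NO_FAIR_TREE)"; exit 1
            }
            print "ok"
        }
    '
}

# Made-up sample input; on a live cluster use:
#   scontrol show config | check_priority_config
printf 'PriorityType            = priority/multifactor\nPriorityFlags           = (null)\n' \
    | check_priority_config
```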
Fairness Tests
Test scenario 1: verify Fairshare Factor
- User alice submits many long-running jobs that together hold a large resource allocation:
alice@ubuntu22-4c-1:~$ for i in $(seq 100); do srun -c1 --exclusive sleep 3600 & done
- Observe priority changes:
Before and after the jobs complete, use sshare --ext and sprio --ext to view resource share and priority changes.
Test scenario 2: verify FairshareUsed Factor
- User alice submits jobs with low resource consumption:
alice@ubuntu22-4c-1:~$ srun -n4 sleep 100&
[1] 2512319
alice@ubuntu22-4c-1:~$ srun -n4 sleep 100&
[2] 2512372
- User charlie submits jobs with high resource consumption:
charlie@ubuntu22-4c-2:~$ srun stress-ng --cpu 1 --cpu-load 100 -t 100s &
[1] 164380
charlie@ubuntu22-4c-2:~$ srun stress-ng --cpu 1 --cpu-load 100 -t 100s &
[2] 164387
- Observe priority changes:
Before and after the jobs complete, use sshare --ext and sprio --ext to view resource share and priority changes.
Notes
- Version compatibility: If you downgrade from this version to an older fsched, the slurmctld service will fail to start due to an incompatible state file version. Example error: Can not recover assoc_usage state, incompatible version. Solution: manually delete the assoc_usage file under the cluster state directory, then restart slurmctld.
- Algorithm limitation: FairshareUsed supports only the Fair Tree fairshare algorithm, and does not support the classic fairshare algorithm.
- Weight impact: A very high FairshareUsed weight may give overly high priority to users with low resource usage. Determine the optimal weight ratio through testing.
- Data accuracy: Ensure accounting data collection is working properly, and periodically verify the accuracy of resource usage statistics.
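For the version compatibility note, the recovery step can be sketched as a small helper. This version renames the assoc_usage file instead of deleting it, a deliberately more cautious variant of the documented solution that keeps a copy for later inspection. The function name is hypothetical, and the state directory is assumed to be the path configured as StateSaveLocation in slurm.conf.

```shell
#!/usr/bin/env bash
# set_aside_assoc_usage STATE_DIR
# Renames STATE_DIR/assoc_usage to assoc_usage.bak.<timestamp> so that
# slurmctld no longer finds it, while keeping a copy on disk.
# Returns 1 if the file does not exist.
set_aside_assoc_usage() {
    local state_dir="$1"
    local target="$state_dir/assoc_usage"
    if [ ! -f "$target" ]; then
        echo "no assoc_usage file in $state_dir" >&2
        return 1
    fi
    mv -- "$target" "$target.bak.$(date +%Y%m%d%H%M%S)"
}

# Example call (the path is a placeholder; use your StateSaveLocation):
#   set_aside_assoc_usage /path/to/state/dir
# Then restart slurmctld using your site's service manager.
```

Once the downgraded slurmctld starts cleanly, the renamed copy can be removed.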