Inside Fsched
Fsched-Related Processes
Fsched is derived from the open-source Slurm scheduler. Based on Slurm 19.05, Fsched adds many improvements in functionality, performance, and stability. The daemons in an Fsched cluster include:
- slurmctld: the Fsched controller daemon, running on the head node. It provides all scheduler management functions: job queue management, scheduling, and monitoring of the overall cluster state. It interacts with the other daemons to allocate resources, start jobs, monitor job execution, and manage job dependencies.
- slurmd: the Fsched compute node daemon, running on all compute nodes. It manages local resource allocation, job execution, and communication with slurmctld. It ensures jobs run correctly on compute nodes, reports job status to slurmctld, and applies resource limits according to policy.
- slurmdbd: the Fsched database daemon, responsible for maintaining job accounting records in the database. It stores information about job submission, completion, failure, and resource usage. It supports queries and reporting, allowing users and administrators to track cluster usage and job history. In FCP or FCC-E environments, slurmdbd and its corresponding database service run on the FCP/FCC-E management node.
- munged: the authentication service used to create and validate credentials. It runs on all nodes in the Fsched cluster, including head nodes, login nodes, and compute nodes. It ensures that communication between Slurm daemons and nodes is secure and trusted.
- statesvc: the Fsched state service, which provides Fsched state and job information to the monitoring and analysis modules of FCP/FCC-E.
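As a rough illustration of the daemon placement described above, the following Python sketch maps node roles to the daemons expected on them and reports any that are missing. The role names and the `missing_daemons` helper are hypothetical and not part of Fsched. statesvc is left out because the text does not say where it runs, and slurmdbd is assigned to a separate "management" role, as in FCP/FCC-E deployments.

```python
# Expected daemons per node role, taken from the daemon list above.
# Role names are illustrative labels, not Fsched configuration values.
EXPECTED_DAEMONS = {
    "head": {"slurmctld", "munged"},
    "compute": {"slurmd", "munged"},
    "login": {"munged"},
    "management": {"slurmdbd"},  # FCP/FCC-E management node
}


def missing_daemons(role, running):
    """Return the sorted names of expected daemons that are not in `running`."""
    return sorted(EXPECTED_DAEMONS[role] - set(running))


if __name__ == "__main__":
    # Example: a compute node where munged is up but slurmd is not.
    print(missing_daemons("compute", ["munged", "sshd"]))
```

In practice the `running` list could come from a process listing on each node; how that list is collected is outside the scope of this sketch.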
How Fsched-Related Processes Communicate
- Registration and heartbeat: when slurmd starts on a compute node, it sends registration information to slurmctld, announcing its presence and reporting the node's resource status. After that, slurmctld periodically sends heartbeat signals to each slurmd to confirm its state and availability.
- Request and response: when a user submits a job, slurmctld receives the request and schedules it based on the current resource state. Once the scheduling decision is made, slurmctld sends the job information to the corresponding slurmd, which executes the job on the local node.
- Monitoring and reporting: slurmd periodically reports node resource usage and job execution status to slurmctld, so that the controller has an up-to-date view of the overall cluster state.
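The registration and heartbeat pattern can be illustrated with a toy in-memory model. This is only a sketch of the idea, not Fsched's wire protocol: the `Controller` class, its method names, and the timeout value are all assumptions made for illustration.

```python
import time


class Controller:
    """Toy stand-in for slurmctld: tracks registered nodes and last contact."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.nodes = {}  # node name -> {"cpus": int, "last_seen": float}

    def register(self, name, cpus, now=None):
        # Registration: the node announces itself and its resources.
        now = time.monotonic() if now is None else now
        self.nodes[name] = {"cpus": cpus, "last_seen": now}

    def heartbeat_ok(self, name, now=None):
        # Record that the node answered the controller's heartbeat.
        now = time.monotonic() if now is None else now
        self.nodes[name]["last_seen"] = now

    def down_nodes(self, now=None):
        # Nodes that have not answered within the timeout are treated as down.
        now = time.monotonic() if now is None else now
        return sorted(
            name for name, state in self.nodes.items()
            if now - state["last_seen"] > self.timeout_s
        )
```

The `now` parameter exists only so that the model can be driven with fixed timestamps; a controller that finds a node in `down_nodes` would stop scheduling work onto it.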
High Availability Design
Fsched's High Availability (HA) mechanism is designed to ensure that the cluster can continue operating when critical components fail, minimizing downtime and potential waste of computing resources. Fsched implements high availability in the following ways:
1. slurmctld failover
- Dual-controller configuration: Fsched supports running slurmctld, the control daemon, in a dual-controller mode. The cluster can be configured with one primary slurmctld and one backup slurmctld, usually running on different physical nodes.
- Automatic failover: When the primary slurmctld fails or becomes unavailable, the backup slurmctld automatically takes over and continues managing the job queue and cluster resources. This automatic failover mechanism ensures continuous cluster availability.
- Data synchronization: The primary slurmctld periodically synchronizes current cluster state information to the backup slurmctld, ensuring that the backup node has the latest state information during failover.
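In upstream Slurm 19.05, primary and backup controllers are declared with repeated `SlurmctldHost` lines in slurm.conf, the first line naming the primary. Assuming Fsched keeps these keys unchanged (this document does not confirm it), a small Python sketch can check whether a configuration declares a backup controller. The host names, addresses, and path in the sample are hypothetical.

```python
def controller_hosts(conf_text):
    """Return SlurmctldHost names in file order (first entry is the primary)."""
    hosts = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip().lower() == "slurmctldhost":
            # The value may look like "name(address)"; keep only the name.
            hosts.append(value.strip().split("(", 1)[0])
    return hosts


# Hypothetical configuration fragment used only for this example.
SAMPLE = """
# controllers
SlurmctldHost=head1(10.0.0.1)
SlurmctldHost=head2(10.0.0.2)
StateSaveLocation=/shared/fsched/state
"""

if __name__ == "__main__":
    primary, *backups = controller_hosts(SAMPLE)
    print("primary:", primary, "backups:", backups)
```

A configuration with only one `SlurmctldHost` entry yields an empty `backups` list, which indicates that no failover target is declared.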
2. Job fault tolerance
Fsched is fault-tolerant and can continue operating in the event of node failures or network interruptions. By monitoring node health, Fsched can automatically detect failures and reschedule affected jobs, minimizing downtime. In extreme cases where the head node becomes completely unavailable, Fsched is designed so that jobs on unaffected compute nodes can continue running for up to 8 hours, maximizing business continuity in HPC environments.
3. slurmdbd high availability
- Dual-node configuration: Like slurmctld, slurmdbd, the Slurm database daemon, can also be configured in a primary-backup mode so that cluster job accounting data is not lost if the primary daemon fails.
- Database backup: Using a high-availability database management system, such as MySQL or MariaDB with primary-replica replication, keeps Slurm accounting data persistently available and avoids a single point of failure.
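The primary-backup arrangement implies that a client tries the primary endpoint first and falls back to the backup. The sketch below illustrates that ordering with a plain TCP reachability probe. It is not how Fsched or Slurm clients are implemented; the helper names and host names are hypothetical, and port 6819 is the upstream Slurm default for slurmdbd, assumed here to apply to Fsched as well.

```python
import socket


def tcp_probe(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def first_reachable(endpoints, probe=tcp_probe):
    """Return the first (host, port) that answers, tried in the order given."""
    for host, port in endpoints:
        if probe(host, port):
            return (host, port)
    return None


if __name__ == "__main__":
    # Hypothetical primary and backup slurmdbd hosts.
    print(first_reachable([("dbd-primary", 6819), ("dbd-backup", 6819)]))
```

The `probe` argument is injectable so that the selection logic can be exercised without real network access.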
4. Use of shared storage
- Shared file system: Fsched relies on a shared file system, such as NFS or Lustre, to store configuration files, state files, job scripts, and similar data. Using a shared file system ensures that when slurmctld fails over to a backup node, all nodes can access the same files and data.
- Cluster logs and state files: By storing logs and state files in shared storage, Fsched daemons can maintain consistency and availability across nodes.
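One way to sanity-check this requirement is to confirm that the state directory actually sits on a shared file system. The following sketch parses text in the Linux `/proc/mounts` format and reports the file system type behind a path. The helper name, the sample mount table, and the `/shared/fsched/state` path are assumptions for illustration.

```python
SHARED_FS_TYPES = {"nfs", "nfs4", "lustre"}


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the longest mount point containing `path`."""
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        inside = path == mount_point or path.startswith(prefix)
        if inside and len(mount_point) > len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Hypothetical mount table in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw 0 0
nas:/export /shared nfs4 rw 0 0
"""

if __name__ == "__main__":
    fs_type = fs_type_for("/shared/fsched/state", SAMPLE_MOUNTS)
    print(fs_type, fs_type in SHARED_FS_TYPES)
```

On a real node the mount table could be read from `/proc/mounts` instead of the sample string.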
User Authentication
Within an Fsched cluster, the best practice is to use an external authentication system, such as LDAP or NIS, to provide unified identity authentication for all nodes in the cluster. Fsched clusters use standard Linux users and groups.
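Because jobs run under standard Linux accounts, a user's UID and GID should resolve identically on every node. The sketch below compares passwd-format entries (as printed by `getent passwd <user>`) collected from several nodes and lists the nodes that disagree with the majority. The helper name, node names, and sample entries are hypothetical; how the entries are gathered is left open.

```python
from collections import Counter


def inconsistent_nodes(passwd_by_node):
    """Return nodes whose (uid, gid) differs from the most common pair."""
    ids = {}
    for node, line in passwd_by_node.items():
        # passwd format: name:password:uid:gid:gecos:home:shell
        fields = line.strip().split(":")
        ids[node] = (int(fields[2]), int(fields[3]))
    reference, _ = Counter(ids.values()).most_common(1)[0]
    return sorted(node for node, pair in ids.items() if pair != reference)


if __name__ == "__main__":
    entries = {
        "node1": "alice:x:1001:1001::/home/alice:/bin/bash",
        "node2": "alice:x:1001:1001::/home/alice:/bin/bash",
        "node3": "alice:x:2001:1001::/home/alice:/bin/bash",
    }
    print(inconsistent_nodes(entries))
```

With a central directory such as LDAP or NIS in place, this list is expected to be empty.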