
Inside Fsched

Fsched is derived from the open-source Slurm scheduler. Based on Slurm 19.05, Fsched adds many improvements in functionality, performance, and stability. The daemons in an Fsched cluster include:

  • slurmctld: the Fsched controller daemon, running on the head node. All scheduler management functions are provided by slurmctld. It is responsible for job queue management, scheduling, and monitoring the overall cluster state. It interacts with other daemons to allocate resources, start jobs, monitor job execution, and manage job dependencies.
  • slurmd: the Fsched compute node daemon, running on all compute nodes. It manages local resource allocation, job execution, and communication with slurmctld. It ensures jobs run correctly on compute nodes, reports job status to slurmctld, and applies resource limits according to policy.
  • slurmdbd: the Fsched database daemon, responsible for maintaining job accounting records in the database. It stores information about job submission, completion, failure, and resource usage. It supports queries and reporting, allowing users and administrators to track cluster usage and job history. In FCP or FCC-E environments, slurmdbd and its corresponding database service run on the FCP/FCC-E management node.
  • munged: the authentication service used to create and validate credentials. It runs on all nodes in the Fsched cluster, including head nodes, login nodes, and compute nodes. It ensures communication between SLURM daemons and nodes is secure and trusted.
  • statesvc: the Fsched state service, which provides Fsched state and job information to the monitoring and analysis modules of FCP/FCC-E.

These daemons cooperate through the following interaction patterns:

  • Registration and heartbeat: when slurmd starts on a compute node, it sends registration information to slurmctld, announcing its presence and reporting the node's resource status. Thereafter, slurmctld periodically sends heartbeat signals to each slurmd to confirm its state and availability.
  • Request and response: when a user submits a job, slurmctld receives the request and schedules it based on the current resource state. Once a scheduling decision is made, slurmctld sends the job information to the corresponding slurmd, which executes the job on its local node.
  • Monitoring and reporting: slurmd periodically reports node resource usage and job execution status to slurmctld, giving the controller real-time visibility into the overall cluster state.
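The registration and heartbeat flow above can be sketched as a toy model. This is purely an illustration of the control flow, not Slurm's actual RPC protocol; the `Controller` and `ComputeNode` classes are invented for this sketch:

```python
class ComputeNode:
    """Toy stand-in for slurmd: registers itself and answers heartbeats."""
    def __init__(self, name, cpus):
        self.name, self.cpus, self.alive = name, cpus, True

    def heartbeat(self):
        # A real slurmd would report resource usage and job status here.
        return self.alive

class Controller:
    """Toy stand-in for slurmctld: tracks registered nodes."""
    def __init__(self):
        self.nodes = {}

    def register(self, node):
        # slurmd announces its presence and resources on startup.
        self.nodes[node.name] = node

    def poll(self):
        # slurmctld periodically confirms each node's availability.
        return {name: node.heartbeat() for name, node in self.nodes.items()}

ctl = Controller()
ctl.register(ComputeNode("cn01", cpus=64))
ctl.register(ComputeNode("cn02", cpus=64))
ctl.nodes["cn02"].alive = False          # simulate a node failure
print(ctl.poll())                        # {'cn01': True, 'cn02': False}
```

In the real system the poll result drives scheduling decisions: nodes that stop answering are marked down and their jobs become candidates for rescheduling.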

High Availability Design

Fsched's High Availability (HA) mechanism is designed to ensure that the cluster can continue operating when critical components fail, minimizing downtime and potential waste of computing resources. Fsched implements high availability in the following ways:

1. slurmctld failover

  • Dual-controller configuration: Fsched supports running slurmctld, the control daemon, in a dual-controller mode. The cluster can be configured with one primary slurmctld and one backup slurmctld, usually running on different physical nodes.
  • Automatic failover: When the primary slurmctld fails or becomes unavailable, the backup slurmctld automatically takes over and continues managing the job queue and cluster resources. This automatic failover mechanism ensures continuous cluster availability.
  • Data synchronization: The primary slurmctld periodically synchronizes current cluster state information to the backup slurmctld, ensuring that the backup node has the latest state information during failover.
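In stock Slurm (and therefore, presumably, in Fsched), a primary/backup controller pair is declared with two SlurmctldHost lines in slurm.conf; the hostnames and paths below are placeholders:

```ini
# slurm.conf (excerpt) -- first SlurmctldHost is the primary, second the backup
SlurmctldHost=head1
SlurmctldHost=head2
# Controller state must live on storage both hosts can reach for failover to work
StateSaveLocation=/shared/slurm/state
# Seconds without contact before the backup assumes control
SlurmctldTimeout=120
```

With this in place, the standard commands `scontrol ping` (report which controller is responding) and `scontrol takeover` (force the backup to assume control) can be used to verify and exercise failover.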

2. Job fault tolerance

Fsched is fault-tolerant and can continue operating in the event of node failures or network interruptions. By monitoring node health, Fsched can automatically detect failures and reschedule affected jobs, minimizing downtime. In extreme cases where the head node becomes completely unavailable, Fsched is designed so that jobs on unaffected compute nodes can continue running for up to 8 hours, maximizing business continuity in HPC environments.
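Stock Slurm exposes the knobs that govern this rescheduling behavior, which a Slurm 19.05 derivative would inherit; the values below are illustrative defaults, not Fsched-specific recommendations:

```ini
# slurm.conf (excerpt)
JobRequeue=1          # batch jobs on failed nodes are requeued by default
ReturnToService=1     # a DOWN node that registers again becomes usable
SlurmdTimeout=300     # seconds unresponsive before a node is marked DOWN
```

Individual jobs can override the requeue policy at submission time with `sbatch --requeue` or `sbatch --no-requeue`.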

3. slurmdbd high availability

  • Dual-node configuration: like slurmctld, the SLURM database daemon slurmdbd can be configured in a primary-backup mode, ensuring cluster job accounting data is not lost if the primary daemon fails.
  • Database backup: Using a high-availability database management system, such as MySQL or MariaDB with primary-replica replication, can ensure that SLURM accounting data remains persistently available and avoid single points of failure.
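In stock Slurm terms, the slurmdbd pairing is declared on both sides: slurmctld is told about both accounting hosts, and slurmdbd itself names its backup. Hostnames below are placeholders:

```ini
# slurm.conf (excerpt) -- where slurmctld sends accounting records
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd1
AccountingStorageBackupHost=dbd2

# slurmdbd.conf (excerpt) -- the daemon pair and its MySQL/MariaDB backend
DbdHost=dbd1
DbdBackupHost=dbd2
StorageType=accounting_storage/mysql
StorageHost=db-primary
```

Database-level replication (the primary-replica setup mentioned above) is configured in MySQL/MariaDB itself and is independent of these daemon settings.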

4. Use of shared storage

  • Shared file system: Fsched relies on a shared file system, such as NFS or Lustre, to store configuration files, state files, job scripts, and similar data. Using a shared file system ensures that when slurmctld fails over to a backup node, all nodes can access the same files and data.
  • Cluster logs and state files: By storing logs and state files in shared storage, Fsched daemons can maintain consistency and availability across nodes.
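Assuming NFS, the same export is mounted at an identical path on both head nodes so that whichever controller is active sees identical state and log files; the server name and paths below are placeholders:

```
# /etc/fstab on head1 and head2
nfs-server:/export/slurm   /shared/slurm   nfs   defaults,_netdev   0 0
```

Slurm path settings such as StateSaveLocation and SlurmctldLogFile would then point under /shared/slurm so the backup controller picks up exactly where the primary left off.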

User Authentication

Fsched clusters use standard Linux users and groups. Within an Fsched cluster, the best practice is therefore to use an external authentication system, such as LDAP or NIS, to provide unified identity authentication for all nodes, so that every node resolves the same account to the same UID and GID.
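Because the scheduler identifies users by numeric UID/GID, an account must resolve to the same IDs on every node; with local /etc/passwd files this can silently drift, which is why centralized LDAP/NIS is recommended. A quick per-node consistency check can be sketched with Python's standard pwd module (run it on each node and compare the output):

```python
import pwd

def identity(username):
    """Return (uid, gid) for a user as this node resolves it."""
    entry = pwd.getpwnam(username)
    return entry.pw_uid, entry.pw_gid

# On a healthy cluster this tuple is identical on head, login, and
# compute nodes; LDAP/NIS guarantees that, local passwd files may not.
print(identity("root"))  # root is uid 0, gid 0 on Linux
```

If the tuples differ between nodes for any cluster user, jobs submitted by that user may run with the wrong ownership on compute nodes.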