Version: FCP 25.11

FCP-Suite Monitoring Architecture

Monitoring Architecture

image-20240514131210874

Data Flow

  • fs-states-svc: periodically caches jobs from Slurm according to defined rules.
  • urd: synchronizes jobs from the fs-states-svc service on the head node of each cluster and stores them in the database.
  • prometheus:
    • periodically scrapes data from urd
    • periodically scrapes data from prometheus-slurm-exporter
  • grafana: connects to prometheus and the urd database to display data.

fs-states-svc

It queries the Slurm API to collect data equivalent to the output of:

  • squeue, scontrol
  • sacct

Urd

Tables

  • clusters: cluster table, synchronized from odin
  • head_nodes: head node table, synchronized from odin
  • jobs: job table, synchronized from fs-states-svc
  • user_usages: usage table for clusters, partitions, and users. This stores periodically aggregated data calculated from jobs.
    • usage: actual CPU usage
    • req_usage: requested CPU usage
    • Daily growth may reach max(partitions) * max(users), so monthly partitioning is planned.

user_usages

Data Tables

time_records: time-slice table
  • start_time: slice start time
  • end_time: slice end time
  • state: slice status

img_16.png

user_usages: user usage table
  • cluster: cluster name
  • partition: partition name
  • username: username
  • usage: actual CPU usage
  • req_usage: requested CPU usage
  • start_time: start time
  • end_time: end time

img_17.png
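
For reference, the following is a minimal sketch of what the two tables could look like in PostgreSQL, reconstructed from the fields above. Column names follow the tables; the types, constraints, and the id column are assumptions rather than the actual urd schema, and user_usages is shown with the monthly range partitioning that is planned for it.

-- Sketch only: types and constraints are illustrative, not the real urd DDL
CREATE TABLE time_records (
    id         BIGSERIAL PRIMARY KEY,
    start_time TIMESTAMPTZ NOT NULL,        -- slice start time
    end_time   TIMESTAMPTZ NOT NULL,        -- slice end time
    state      TEXT        NOT NULL         -- PENDING or COMPLETED
);

CREATE TABLE user_usages (
    cluster    TEXT             NOT NULL,
    partition  TEXT             NOT NULL,
    username   TEXT             NOT NULL,
    usage      DOUBLE PRECISION NOT NULL,   -- actual CPU usage
    req_usage  DOUBLE PRECISION NOT NULL,   -- requested CPU usage
    start_time TIMESTAMPTZ      NOT NULL,
    end_time   TIMESTAMPTZ      NOT NULL
) PARTITION BY RANGE (end_time);            -- planned monthly partitioning

-- One partition per month, e.g. for May 2024
CREATE TABLE user_usages_202405 PARTITION OF user_usages
    FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');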

Implementation Details

Periodically Generate Time Slices
urd periodically generates time-slice records in the time_records table; each record describes a time range to be aggregated and its status.
start_time is the slice start time, end_time is the slice end time, and state is the slice status:
PENDING means aggregation has not started, and COMPLETED means aggregation is finished.

In the current production environment, the generated time-slice interval is one day.
In the test environment, setting the environment variable TIME_SPLIT_BY_FIVE_MINUTES to true changes the interval to 5 minutes.
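
As an illustration of the slice generator, the daily case can be pictured as the SQL below. This is a hedged sketch of the idea only, not necessarily how urd creates the records.

-- Sketch: create yesterday's daily slice if it does not exist yet (illustrative only)
INSERT INTO time_records (start_time, end_time, state)
SELECT date_trunc('day', now()) - INTERVAL '1 day',
       date_trunc('day', now()),
       'PENDING'
WHERE NOT EXISTS (
    SELECT 1 FROM time_records
    WHERE start_time = date_trunc('day', now()) - INTERVAL '1 day'
);
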
Periodically Aggregate User Usage Based on Time Slices
urd periodically queries PENDING time ranges in the time_records table that are earlier than the current day,
and calculates user usage for completed jobs within that time range.

Because job collection is delayed, aggregating usage immediately at the end of the day may miss jobs completed in the last few minutes.
Therefore, user usage is aggregated after a delay. For example, yesterday's user usage is aggregated at 2:00 AM.
In the test environment, setting the environment variable TIME_SPLIT_BY_FIVE_MINUTES to true changes the delay to 30 minutes.
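
In other words, each aggregation pass starts by picking up slices like the following. This is only a sketch (the id column comes from the table sketch above), not the exact query urd runs.

-- Sketch: find PENDING slices whose range ended before today
SELECT id, start_time, end_time
FROM time_records
WHERE state = 'PENDING'
  AND end_time < date_trunc('day', now());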

The jobs table is very large. Assuming an average of 5 million records per day retained for 90 days, the total reaches 450 million records.
Querying all records at once would put too much load on the database and could exhaust memory, affecting normal operation.
Because the jobs table is partitioned by job submission time, each query is restricted to a single partition and the aggregation is run once per partition, which keeps each pass small and improves query efficiency.
Tests on a table with 5 million records show that a single query takes about 20 seconds, and querying 90 days of data takes about 30 minutes.
SELECT
    username, cluster, queue AS partition,
    SUM(cur_cpu)::FLOAT AS usage,
    SUM(exec_dur * req_cpu)::FLOAT AS req_usage
FROM jobs
WHERE status = ?
  -- Limit the query range to a single partition table
  AND submit_at > ? AND submit_at < ?
  AND end_at > ? AND end_at < ?
GROUP BY username, cluster, partition
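
Once the per-partition results have been summed, they are written to user_usages and the slice is marked as finished. Below is a sketch of that final step with placeholder parameters; how urd actually persists the rows is an implementation detail and may differ.

-- Sketch: persist one aggregated row and close the slice (illustrative only)
INSERT INTO user_usages (cluster, partition, username, usage, req_usage, start_time, end_time)
VALUES (?, ?, ?, ?, ?, ?, ?);

UPDATE time_records SET state = 'COMPLETED' WHERE id = ?;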

jobs

  • Stored in Table Partition format
    • A partition is created per day
    • By default, data is retained for 90 days, and data older than 90 days is detached from the parent table
    • A detached partition becomes an independent, standalone table, for example jobs_20231231 (a maintenance sketch follows the field list below)
  • Fields
    • job_id: job ID
    • job_name: job name
    • cluster: cluster name, for example fastone-1
    • uid: UID of the user who submitted the job
    • username: username of the user who submitted the job
    • group_id: group ID of the user who submitted the job
    • priority: job priority
    • account: job account
    • qos: QOS used
    • dependency: job dependency
    • queue: job queue or partition
    • status: job status
    • command: submitted command, including command and parameters
    • submit_host: submission host
    • exec_host: execution host, comma-separated when multiple hosts are used
    • cwd: current working directory
    • submit_at: job submission time
    • start_at: job start time
    • end_at: job end time
    • wait_dur: job wait duration in seconds
    • exec_dur: job execution duration in seconds
    • exit_code: exit code
    • exit_sig: exit signal
    • reason: reason for pending, suspended, or exit
    • wckey: wckey
    • req_mem: requested memory in bytes
    • req_node: requested number of nodes
    • req_cpu: requested number of CPUs
    • run_time: elapsed runtime in seconds
    • time_limit: maximum runtime in seconds
    • eligible_time: eligible time
    • accrue_time: time when priority accumulation starts
    • deadline: configured deadline
    • req_nodes: requested node list
    • exc_nodes: excluded node list
    • max_cpus: maximum available CPUs for the job
    • max_nodes: maximum available nodes for the job
    • num_tasks: requested task count
    • tres_req_str: requested resources, such as cpu=1,mem=1G,node=1
    • tres_alloc_str: allocated resources, such as cpu=1,mem=1G,node=1
    • pn_min_cpus: minimum CPU count per node
    • pn_min_momory: minimum memory per node in bytes
    • liceses: requested licenses
    • std_err: stderr file path
    • std_in: stdin file path
    • std_out: stdout file path
    • alloc_cpu: allocated CPU count
    • cur_node: current number of nodes used by the job
    • peak_mem_per_task: peak memory per task during execution in bytes
    • avg_mem: average memory during execution in bytes
    • cur_cpu: CPU time used by the job in seconds
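
A sketch of the daily partition maintenance described at the top of this section, assuming PostgreSQL declarative range partitioning on submit_at; the partition names and the exact mechanism used to create and detach partitions are assumptions.

-- Sketch: jobs is assumed to be declared as
--   CREATE TABLE jobs (...) PARTITION BY RANGE (submit_at);

-- Create the partition for a single day
CREATE TABLE jobs_20240515 PARTITION OF jobs
    FOR VALUES FROM ('2024-05-15') TO ('2024-05-16');

-- After 90 days, detach it; it lives on as an independent table
ALTER TABLE jobs DETACH PARTITION jobs_20231231;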

Data Sources for UI-Grafana

Cluster Monitoring

Cluster Overview

Number of Jobs by Cluster Job Status

img_12.png

sum(slurm_partition_jobs_pending{clusterId='$clusterId'})
sum(slurm_partition_jobs_running{clusterId='$clusterId'})
sum(slurm_partition_jobs_completed{clusterId='$clusterId'})

prometheus-slurm-exporter periodically queries cluster job states and counts them. These are instantaneous values.

Partition List: Average Wait Time of Pending Jobs

img_13.png

label_replace(avg by(partition)(slurm_partition_jobs_pending_wait_time{clusterId=~"$clusterId",partition=~"$partitionName",nodeId=~"$nodeId"}),"partitionName","$1","partition","(.*)")

prometheus-slurm-exporter periodically queries pending jobs in the cluster and counts their wait time. These are instantaneous values.

Number of Pending Jobs per Partition and Number of Running Jobs per Partition

img_14.png

sum by (partition)(slurm_partition_jobs_pending{clusterId="$clusterId"})
sum by (partition)(slurm_partition_jobs_running{clusterId="$clusterId"})

prometheus-slurm-exporter periodically queries cluster job states and counts them. These are instantaneous values.

Cluster Analytics

Execution Time of Completed Jobs in the Last 10 Minutes

img.png

SELECT
queue as "Partition",
DATE_TRUNC('second', AVG(end_at - start_at)) as "Average Execution Time (HH:mm:ss)",
DATE_TRUNC('second', AVG(start_at - submit_at)) as "Average Wait Time (HH:mm:ss)"
from
jobs
where submit_at > (CURRENT_TIMESTAMP - INTERVAL '30 days')
and status = 'COMPLETED'
and cluster = 'fastone-$ClusterId'
and end_at > (CURRENT_TIMESTAMP - INTERVAL '600 seconds')
GROUP BY queue
;

Pending and Running Jobs

img_1.png

Pending jobs

select
job_id as "Job ID",
job_name as "Job Name",
username as "User",
queue as "Partition",
req_cpu as "Requested CPU",
req_mem as "Requested Memory",
DATE_TRUNC('second', CURRENT_TIMESTAMP - submit_at) as "Wait Time (HH:mm:ss)",
submit_at as "Submission Time"
FROM
jobs
where
$__timeFilter(submit_at)
and status = 'PENDING'
and cluster = 'fastone-$ClusterId'
and ('$JobId' = '' or job_id = '$JobId')
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
order by job_id::int desc
limit 20000

Running jobs

select
job_id as "Job ID",
job_name as "Job Name",
username as "User",
exec_host as "Execution Host",
queue as "Partition",
req_cpu as "Requested CPU",
req_mem as "Requested Memory",
start_at - submit_at as "Wait Time (HH:mm:ss)",
submit_at as "Submission Time",
start_at as "Start Time",
DATE_TRUNC('second', CURRENT_TIMESTAMP - start_at) as "Elapsed Runtime (HH:mm:ss)"
FROM
jobs
where
$__timeFilter(submit_at)
and status = 'RUNNING'
and cluster = 'fastone-$ClusterId'
and ('$JobId' = '' or job_id = '$JobId')
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
order by job_id::int desc
limit 20000

Completed Jobs

img_2.png

select
job_id as "Job ID",
job_name as "Job Name",
username as "User",
exec_host as "Execution Host",
queue as "Partition",
req_cpu as "Requested CPU",
req_mem as "Requested Memory",
peak_mem_per_task as "Peak Memory",
cur_cpu as "CPU Time Used",
exec_dur as "Execution Duration",
cur_cpu::float / NULLIF(exec_dur * req_cpu, 0) as "CPU Used / Requested",
submit_at as "Submission Time",
start_at as "Start Time",
end_at as "End Time",
status as "Status"
from jobs
where
$__timeFilter(submit_at)
and cluster = 'fastone-$ClusterId'
and status not in ('PENDING', 'RUNNING', 'SUSPENDED')
AND ('$JobId' = '' OR job_id = '$JobId')
and ('$JobName' = '' or job_name = '$JobName')
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
and ('$Partition' = '' or queue = '$Partition')
and ('$ExecHost' = '' or exec_host = '$ExecHost')
order by job_id::int desc
limit 10000;
select
job_id as "Job ID",
job_name as "Job Name",
username as "User",
exec_host as "Execution Host",
queue as "Partition",
req_cpu as "Requested CPU",
req_mem as "Requested Memory",
peak_mem_per_task as "Peak Memory",
cur_cpu as "CPU Time Used",
exec_dur as "Execution Duration",
cur_cpu::float / NULLIF(exec_dur * req_cpu, 0) as "CPU Used / Requested",
submit_at as "Submission Time",
start_at as "Start Time",
end_at as "End Time",
status as "Status"
from jobs
where
$__timeFilter(submit_at)
and cluster = 'fastone-$ClusterId'
and status not in ('PENDING', 'RUNNING', 'SUSPENDED')
AND ('$JobId' = '' OR job_id = '$JobId')
and ('$JobName' = '' or job_name = '$JobName')
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
and ('$Partition' = '' or queue = '$Partition')
and ('$ExecHost' = '' or exec_host = '$ExecHost')
order by job_id::int desc
limit $LimitNum;

User Job Status Query

img_3.png

sum by (username) (urd_job_state{cluster='$ClusterId', state='PENDING'})
sum by (username) (urd_job_state{cluster='$ClusterId', state='RUNNING'})
sum by (username) (urd_job_state{cluster='$ClusterId', state='COMPLETED'})
sum by (username) (urd_job_state{cluster='$ClusterId', state='CANCELLED'})
sum by (username) (urd_job_state{cluster='$ClusterId', state='FAILED'})

urd periodically queries jobs in each cluster and counts their states. These are instantaneous values.
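
The per-user counts behind urd_job_state can be pictured as a grouping query over the jobs table, roughly like the sketch below. This is an illustration of the idea only, not necessarily how urd computes the gauge; the cluster value fastone-1 is just an example.

-- Sketch: count jobs per user and state for one cluster (illustrative only)
SELECT username, status AS state, COUNT(*) AS jobs
FROM jobs
WHERE cluster = 'fastone-1'
GROUP BY username, status;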

Job List

img_4.png

select
job_id as "Job ID",
job_name as "Job Name",
username as "User",
status as "Status",
exec_host as "Execution Host",
queue as "Partition",
req_node as "Requested Nodes",
req_mem as "Requested Memory",
req_cpu as "Requested CPU",
wckey as "Tag (wcKey)",
cur_cpu as "CPU Time Used",
exec_dur as "Execution Duration",
cur_cpu::float / NULLIF(exec_dur * req_cpu, 0) as "CPU Used / Requested",
wait_dur as "Wait Duration",
submit_at as "Submission Time",
start_at as "Start Time",
end_at as "End Time",
exit_code as "Exit Code",
exit_sig as "Exit Signal"
from jobs
where
$__timeFilter(submit_at)
and cluster = 'fastone-$ClusterId'
AND ('$JobId' = '' OR job_id = '$JobId')
and ('$JobName' = '' or job_name = '$JobName')
and (('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username')) and ('$UsernameQuery' = '' or username = '$UsernameQuery'))
and ('$Partition' = '' or queue = '$Partition')
and ('$ExecHost' = '' or exec_host = '$ExecHost')
and ('$StateEn' = '' or status = '$StateEn')
and ('$SubmitAt' = '' OR submit_at < COALESCE(NULLIF('$SubmitAt', '')::timestamp, '1970-01-01 00:00:00'))
and ('$StartAt' = '' OR start_at < COALESCE(NULLIF('$StartAt', '')::timestamp, '1970-01-01 00:00:00'))
and ('$EndAt' = '' OR end_at < COALESCE(NULLIF('$EndAt', '')::timestamp, '1970-01-01 00:00:00'))
order by job_id::int desc
limit 10000;

Jobs with Unreasonable Memory Requests

img_5.png

SELECT
job_id as "Job ID",
username as "User",
job_name as "Job Name",
req_mem as "Requested Memory",
peak_mem_per_task as "Peak Memory",
peak_mem_per_task::float / req_mem * 100 as "Peak Memory / Requested Memory"
FROM
jobs
where
$__timeFilter(submit_at)
and cluster = 'fastone-$ClusterId'
and status = 'COMPLETED'
and peak_mem_per_task::float / req_mem * 100 > CAST(SUBSTRING('$MemDiffPercent' FROM 1 FOR LENGTH('$MemDiffPercent') - 1) AS NUMERIC)
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
order by "Peak Memory / Requested Memory" desc, job_id::int desc
limit 10000
SELECT
job_id as "Job ID",
username as "User",
job_name as "Job Name",
req_mem as "Requested Memory",
peak_mem_per_task as "Peak Memory",
peak_mem_per_task - req_mem as "Peak Memory Minus Requested Memory"
FROM
jobs
where
$__timeFilter(submit_at)
and cluster = 'fastone-$ClusterId'
and status = 'COMPLETED'
and peak_mem_per_task - req_mem >= CAST('$MemDiff' as FLOAT) * 1024 * 1024 * 1024
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
order by "Peak Memory Minus Requested Memory" desc, job_id::int desc
limit 10000

Jobs with Unreasonable CPU Requests

img_6.png

select
job_id as "Job ID",
username as "User",
job_name as "Job Name",
req_cpu as "Requested CPU Cores",
exec_dur * req_cpu as "Total Requested CPU Time",
cur_cpu as "CPU Time Used",
cur_cpu / NULLIF(exec_dur * req_cpu::float, 0) as "CPU Used / Requested"
from
jobs
where
$__timeFilter(submit_at)
and status = 'COMPLETED'
and cluster = 'fastone-$ClusterId'
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
order by "CPU Used / Requested" desc NULLS LAST, job_id desc
limit 10000;

User Usage Statistics

img_7.png

SELECT
username as "User",
sum(usage) as "CPU Time Used",
sum(req_usage) as "Requested CPU Time",
sum(usage) / NULLIF(sum(req_usage), 0) as "CPU Used / Requested"
from user_usages
where
$__timeFilter(end_time)
and cluster = 'fastone-$ClusterId'
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
group by username
order by "CPU Time Used" desc

Jobs That Exited Abnormally

img_8.png

select
job_id as "Job ID",
job_name as "Job Name",
username as "User",
exec_host as "Execution Host",
queue as "Partition",
req_cpu as "Requested CPU",
req_mem as "Requested Memory",
exit_code as "Exit Status Code",
exit_sig as "Exit Signal",
status as "Status",
submit_at as "Submission Time"
from jobs
where
$__timeFilter(submit_at)
and (exit_code != 0 or exit_sig != 0)
and cluster = 'fastone-$ClusterId'
and ('$IsAdmin' = 'true' or ('$Username' = '' or username = '$Username'))
order by job_id::int desc
limit 10000;

Average Wait Time of Pending Jobs in the Cluster

img_9.png

slurm_partition_jobs_pending_wait_time_total{csClusterId='$ClusterId'}

prometheus-slurm-exporter periodically queries pending jobs in the cluster and counts wait time. These are instantaneous values.

Average Wait Time of Pending Jobs by Partition

img_10.png

avg by (partition)(slurm_partition_jobs_pending_wait_time{csClusterId='$ClusterId'})

prometheus-slurm-exporter periodically queries pending jobs in the cluster and counts wait time. These are instantaneous values.

Number of Completed Jobs by Partition

img_11.png

sum by (partition) (urd_job_state{cluster="$ClusterId", state='COMPLETED'})

urd periodically queries jobs in each cluster and counts their states. These are instantaneous values.

Operations Overview

User Dimension

User Usage in Cluster Mode

img_15.png

SELECT
username,
sum(usage) as usage
from user_usages
where end_time > ? and end_time < ? and (? = '' or cluster = ?)
group by username
order by usage desc