Manage Compute Jobs and Resources
squeue View queued or running jobs
Use the squeue command to check the current status of jobs. If a job does not appear in the squeue output, it has already exited.
In the output, the ST column is the job status. The status codes mean:
- R: Running
- PD: Pending
- CG: Completing
- S: Suspended
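When scripting around squeue, it can help to expand these short codes into readable text. The following is a minimal sketch covering only the four codes listed above; `describe_state` is a hypothetical helper name, not an Fsched command.

```shell
# Expand a squeue ST code into the description listed above.
# describe_state is a hypothetical helper, not part of Fsched.
describe_state() {
    case "$1" in
        R)  echo "Running" ;;
        PD) echo "Pending" ;;
        CG) echo "Completing" ;;
        S)  echo "Suspended" ;;
        *)  echo "Unknown ($1)" ;;   # any code not listed in this section
    esac
}

describe_state PD
```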
Common squeue command and option combinations are as follows:
| Function | Command Example |
|---|---|
| Show the status of all jobs in the queue | squeue |
| View job information for job ID 11 | squeue -j 11 |
| View job information for user user1 | squeue -u user1 |
| View jobs submitted to partition01 | squeue -p partition01 |
| View jobs using node compute01 | squeue -w compute01 |
| View jobs in Pending state | squeue --state=PENDING |
| View detailed info for a job with custom output | squeue -j 11 -o "%.18i %.9P %.8j %.8u %.2t %20V %.10M %.6D %R %Z" |
| View detailed info for a partition with custom output | squeue -p partition01 -o "%.18i %.9P %.8j %.8u %.2t %20V %.10M %.6D %R %Z" |
Other options can be viewed with squeue --help.
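For a quick summary of the queue, the state column can be tallied with standard tools. The sketch below assumes that squeue accepts `-h` to suppress the header line, as the Slurm tool of the same name does; confirm this with squeue --help before relying on it. `count_states` is a hypothetical helper, and the sample input stands in for real squeue output.

```shell
# Count jobs per state code; expects one state code per line on stdin.
# count_states is a hypothetical helper, not part of Fsched.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Hypothetical sample input standing in for: squeue -h -o "%t"
printf 'R\nPD\nR\nCG\n' | count_states
```

On a cluster, the same function would be fed directly, for example `squeue -h -u user1 -o "%t" | count_states`.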
sinfo View partition information
sinfo shows status information for partitions and nodes. Common command and option combinations are as follows:
| Function | Command Example |
|---|---|
| Show status of all partitions in the cluster | sinfo -Nl |
| Show usage of a specified partition | sinfo -p partition01 |
| Show detailed usage of a specified partition | sinfo -p partition01 -N -o "%20N %15C %.5a %.6t" |
Node status meanings in sinfo output:
- alloc: Node is allocated
- drain: Node is drained/unresponsive; no new jobs will be assigned in this state
- idle: Node is idle
- mix: Node has partial resources allocated
- comp: Node is releasing resources
Nodes in any other state are unavailable.
Example
[root@login1 ~]# sinfo -N -o "%20N %15C %.5a %.6t"
NODELIST CPUS(A/I/O/T) AVAIL STATE
compute01 0/4/0/4 up idle
compute02 0/4/0/4 up idle
compute03 0/4/0/4 up idle
In the second column, CPUS(A/I/O/T), A = CPUs allocated to jobs, I = idle CPUs, O = CPUs in other states, and T = total CPUs on the node.
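The A/I/O/T column is easy to total across nodes with awk. The following sketch uses the example output above as sample input; `sum_cpus` is a hypothetical helper and assumes the CPU column is the second whitespace-separated field, as in that format string.

```shell
# Sum the A/I/O/T CPU counts over all nodes.
# Expects the output of: sinfo -N -o "%20N %15C %.5a %.6t"
# sum_cpus is a hypothetical helper, not part of Fsched.
sum_cpus() {
    awk 'NR > 1 && split($2, c, "/") == 4 {      # NR > 1 skips the header line
             a += c[1]; i += c[2]; o += c[3]; t += c[4]
         }
         END { printf "allocated=%d idle=%d other=%d total=%d\n", a, i, o, t }'
}

# Sample copied from the example output above.
sum_cpus <<'EOF'
NODELIST CPUS(A/I/O/T) AVAIL STATE
compute01 0/4/0/4 up idle
compute02 0/4/0/4 up idle
compute03 0/4/0/4 up idle
EOF
```

For the sample, this prints `allocated=0 idle=12 other=0 total=12`.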
Common sinfo options
--help # Show help for the sinfo command;
-d # Show non-responsive nodes in the cluster;
-i <seconds> # Refresh partition/node output every N seconds
-n <name_list> # Show specified node(s); separate multiple nodes with commas;
-N # Display one line per node;
-p <partition> # Show specified partition(s); separate multiple partitions with commas;
-r # Show only responsive nodes;
-R # Show reasons for node issues;
-o <output_format> # Show output in the specified format. The format is %[[.]size]type: "." means right alignment (left alignment if omitted), size is the field width, and type is the item to display. Common items include:
%a Availability state
%A Show node counts as "allocated/idle"; do not use with "%t" or "%T"
%c Number of cores per node
%C Total cores as "allocated/idle/other/total"
%D Total number of nodes
%E Reason a node is unavailable
%m Memory per node (in MB)
%N Node name
%O CPU load
%P Partition name; the default partition is marked with "*"
%r Only root can submit jobs (yes/no)
%R Partition name
%t Node state (compact form)
%T Node state (extended form)
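To make the %[[.]size]type rule concrete, the sketch below splits a single field specifier into its parts using only shell parameter expansion. `explain_spec` is a hypothetical helper written for illustration; it implements only the rule as described above and does not call sinfo.

```shell
# Break a specifier such as "%.6t" into type, width, and alignment,
# following the %[[.]size]type rule described above.
# explain_spec is a hypothetical helper, not part of Fsched.
explain_spec() {
    spec=${1#%}                  # drop the leading "%"
    body=${spec%?}               # everything except the last character
    type=${spec#"$body"}         # the last character is the item type
    case "$body" in
        .*) align=right; width=${body#.} ;;   # leading "." means right alignment
        *)  align=left;  width=$body ;;
    esac
    echo "type=$type width=${width:-auto} align=$align"
}

explain_spec "%.6t"
explain_spec "%20N"
```

The two calls print `type=t width=6 align=right` and `type=N width=20 align=left`, matching the columns used in the example format string "%20N %15C %.5a %.6t".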
scancel Cancel running or queued jobs
The scancel command can cancel running or pending jobs in the queue.
Common commands and parameter examples:
| Function | Command Example |
|---|---|
| Cancel job ID 11 | scancel 11 |
| Cancel job named test-001 | scancel -n test-001 |
| Cancel jobs submitted to partition01 | scancel -p partition01 |
| Cancel pending jobs | scancel -t PENDING |
| Cancel jobs running on node compute01 | scancel -w compute01 -t RUNNING |
Other parameter options can be viewed with scancel --help.
Common scancel options:
--help # Show help for the scancel command;
-A <account> # Cancel jobs for the specified account; if no job_id is given, all jobs for that account are cancelled;
-n <job_name> # Cancel jobs with the specified job name;
-p <partition_name> # Cancel jobs in the specified partition;
-q <qos> # Cancel jobs with the specified qos;
-t <job_state_name> # Cancel jobs in the specified state, "PENDING", "RUNNING" or "SUSPENDED";
-u <user_name> # Cancel jobs for the specified user;
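Because filter options such as -u, -p, and -t can match many jobs at once, it may be worth printing the command for review before running it. The following is a minimal sketch; `preview_scancel` is a hypothetical helper that only echoes a command line built from the options listed above and cancels nothing.

```shell
# Print the scancel command line for review instead of executing it.
# preview_scancel is a hypothetical helper, not part of Fsched.
preview_scancel() {
    user=$1
    partition=$2
    state=$3
    echo "scancel -u $user -p $partition -t $state"
}

preview_scancel user1 partition01 PENDING
```

Once the printed command looks right, it can be copied and run as-is.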
sacct View historical job information
The sacct command shows historical job information, including start and end time, end status, job ID, job name, number of nodes used, node list, and runtime.
Example
View runtime information for a job:
sacct -j 9
The output includes: job ID, job name, partition, billing account, allocated CPU count, status, and exit code.
[root@head ~]# sacct -j 9
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
9 sleep partition+ _fsched_a+ 1 COMPLETED 0:0
9.extern extern _fsched_a+ 1 COMPLETED 0:0
9.0 sleep _fsched_a+ 1 COMPLETED 0:0
You can add output parameters to view detailed job information, for example:
[root@head ~]# sacct -j 9 -X -o jobid,jobname%50,user,group,partition,submit,start,end,state,alloccpus,reqmem,elapsed,exitcode,workdir%300
JobID JobName User Group Partition Submit Start End State AllocCPUS ReqMem Elapsed ExitCode WorkDir
------------ -------------------------------------------------- --------- --------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ---------- ---------- -------- -----------------------------------------------------
9 sleep cyan cyan partition+ 2024-09-03T00:25:02 2024-09-03T00:25:02 2024-09-03T00:26:02 COMPLETED 1 1Mn 00:01:00 0:0 /fastone/users/cyan
View job history and runtime information since a specific time:
[root@head ~]# sacct -X -T -S2024-08-10-11:00:00 -o jobid,jobname,user,partition,submit,start,end,state,alloccpus,reqmem,elapsed,exitcode,workdir
JobID JobName User Partition Submit Start End State AllocCPUS ReqMem Elapsed ExitCode WorkDir
------------ ---------- --------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ---------- ---------- -------- --------------------
2 hostname shaobing+ partition+ 2024-07-31T23:13:41 2024-07-31T23:13:41 2024-07-31T23:13:42 COMPLETED 1 1Mn 00:00:01 0:0 /fastone/users/shao+
3 big_task1 cadservi+ partition+ 2024-08-01T06:47:18 2024-08-01T06:47:19 2024-08-01T06:49:29 CANCELLED+ 6 1Mn 00:02:10 0:0 /fastone/users/cads+
4 big_task1 cadservi+ partition+ 2024-08-01T06:53:05 2024-08-01T06:53:05 2024-08-01T06:53:45 CANCELLED+ 6 1Mn 00:00:40 0:0 /fastone/users/cads+
5 big_task1 cadservi+ partition+ 2024-08-01T22:04:58 2024-08-01T22:04:59 2024-08-01T22:06:09 CANCELLED+ 6 1Mn 00:01:10 0:0 /fastone/users/cads+
6 big_task1 cadservi+ partition+ 2024-08-01T22:08:04 2024-08-01T22:08:05 2024-08-01T22:08:19 CANCELLED+ 2 1Mn 00:00:14 0:0 /fastone/users/cads+
7 big_task1 cadservi+ partition+ 2024-08-01T22:11:50 2024-08-01T22:11:51 2024-08-01T22:11:51 FAILED 2 1Mn 00:00:00 127:0 /fastone/users/cads+
8 Fano-slot shaobing+ partition+ 2024-08-01T22:50:31 2024-08-01T22:50:32 2024-08-01T22:57:46 FAILED 4 1Mn 00:07:14 127:0 /fastone/users/shao+
9 sleep cyan partition+ 2024-09-03T00:25:02 2024-09-03T00:25:02 2024-09-03T00:26:02 COMPLETED 1 1Mn 00:01:00 0:0 /fastone/users/cyan
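The ExitCode column has the form code:signal. Assuming Fsched follows the Slurm convention, the number before the colon is the exit code of the job and the number after it is the signal that terminated it, if any. The sketch below splits such a value; `split_exit_code` is a hypothetical helper.

```shell
# Split a sacct ExitCode value such as "127:0" into its two parts.
# Assumption: the format is <exit code>:<signal>, as in Slurm.
# split_exit_code is a hypothetical helper, not part of Fsched.
split_exit_code() {
    code=${1%%:*}      # text before the colon
    signal=${1#*:}     # text after the colon
    echo "exit=$code signal=$signal"
}

split_exit_code 127:0
```

For the FAILED jobs in the listing above this prints `exit=127 signal=0`. In POSIX shells an exit status of 127 conventionally means the command was not found, which is a useful first thing to check in the job script.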
For more output fields, see sacct --help.
# For sacct output, -o can include the following fields:
--format=jobid,jobname,partition,maxvmsize,maxvmsizenode,
maxvmsizetask,avevmsize,maxrss,maxrssnode,
maxrsstask,averss,maxpages,maxpagesnode,
maxpagestask,avepages,mincpu,mincpunode,
mincputask,avecpu,ntasks,alloccpus,elapsed,
state,exitcode,avecpufreq,reqcpufreqmin,
reqcpufreqmax,reqcpufreqgov,consumedenergy,
maxdiskread,maxdiskreadnode,maxdiskreadtask,
avediskread,maxdiskwrite,maxdiskwritenode,
maxdiskwritetask,avediskwrite,allocgres,reqgres
# If output is truncated, add "%field_length" after a format item to show more, for example "workdir%300"
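Long format strings with per-field widths are tedious to type by hand. The sketch below assembles one from name or name:width arguments; `build_format` is a hypothetical helper, and the field names in the example are taken from the commands shown earlier in this section.

```shell
# Build a sacct -o/--format string from "name" or "name:width" arguments.
# build_format is a hypothetical helper, not part of Fsched.
build_format() {
    out=
    for item in "$@"; do
        name=${item%%:*}
        case "$item" in
            *:*) field="$name%${item#*:}" ;;   # name:width becomes name%width
            *)   field=$name ;;
        esac
        out=${out:+$out,}$field                # join fields with commas
    done
    echo "$out"
}

build_format jobid jobname:50 state workdir:300
```

This prints `jobid,jobname%50,state,workdir%300`, which could then be passed on as `sacct -j 9 -X -o "$(build_format jobid jobname:50 state workdir:300)"`.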
scontrol View Fsched configuration and status
scontrol is used to view or modify Fsched configuration, including jobs, job steps, nodes, partitions, reservations, and overall system configuration. Regular users can use scontrol to query and display many Fsched status details, while most modification commands can only be executed by the root user or administrators.
Common command and parameter examples for regular users:
| Function | Command Example |
|---|---|
| View details of job ID 9 | scontrol show job 9 |
| View details of all running, queued, and just completed jobs | scontrol show job |
| View details of node compute03 | scontrol show node compute03 |
| View details of all nodes | scontrol show node |
| View details of all partitions | scontrol show partition |
| View Fsched configuration information | scontrol show config |
Other options can be viewed with scontrol --help.
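The scontrol show commands are convenient to read but awkward to use in scripts. The sketch below extracts one value, assuming the output consists of whitespace-separated Key=Value pairs as in Slurm; the sample text is an abbreviated, hypothetical layout and not captured from a real cluster. `get_field` is a hypothetical helper and does not handle values that contain spaces.

```shell
# Extract the value of one Key=Value pair from scontrol show output on stdin.
# Assumption: pairs are separated by whitespace, as in Slurm.
# get_field is a hypothetical helper, not part of Fsched.
get_field() {
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Hypothetical, abbreviated sample standing in for: scontrol show job 9
get_field JobState <<'EOF'
JobId=9 JobName=sleep
   JobState=COMPLETED Reason=None
EOF
```

On a cluster, the equivalent call would be `scontrol show job 9 | get_field JobState`.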