Monitoring Service
- To ensure responsiveness, line charts display data for only the top 30 nodes.
- File system monitoring and base node monitoring are available only when Hybrid Cloud is enabled in FCP-Suite.
Charts in Monitoring Service can be dragged and zoomed in/out for easier inspection. Refreshing the page resets charts to their initial state.
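The top-30 limit on line charts can be pictured with a small sketch. The ranking criterion is not stated here, so ranking by each node's most recent metric value is an assumption, and `top_nodes` and `latest_by_node` are hypothetical names, not part of the product.

```python
def top_nodes(latest_by_node, limit=30):
    """Return up to `limit` node names, highest latest value first.

    `latest_by_node` maps node name -> most recent metric value.
    Ranking by latest value is an assumption made for illustration only.
    Ties are broken by node name so the result is deterministic.
    """
    ranked = sorted(latest_by_node.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _value in ranked[:limit]]


if __name__ == "__main__":
    sample = {"node-a": 12.5, "node-b": 91.0, "node-c": 47.3}
    print(top_nodes(sample, limit=2))
```

With a cap like this, the number of plotted series stays bounded no matter how many nodes the cluster has.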
Cluster Monitoring
Cluster monitoring includes multiple views: cluster overview, compute partition monitoring, node list monitoring, node monitoring, GPU monitoring, service status monitoring, and scheduler monitoring.
Cluster overview
Real-time metrics at the cluster level include: compute node count, compute partition count, total CPU cores, CPU utilization, and average wait time for queued jobs.
Charts: adjust the time range at the top-right to view monitoring data for the desired period.
- Compute node CPU utilization (line chart)
- Cluster job status distribution (pie chart): counts of Queued/Running/Completed jobs from the Fsched scheduler in-memory statistics
- Running CPU cores (line chart)
- Queued CPU cores (line chart)
- Average wait time for queued jobs (line chart)
- Cluster job status counts (stacked chart): counts of Queued/Running/Completed jobs from Fsched in-memory statistics
- Compute node count (line chart)
Compute partition monitoring
Real-time metrics at the partition level include: average wait time for queued jobs, node count, CPU cores, total scheduler CPU, free CPU, running CPU, queued CPU, CPU utilization, and memory utilization.
Charts: adjust the time range at the top-right to view monitoring data for the desired period.
- Partition CPU utilization (line chart)
- Partition running CPU utilization percent (line chart)
- Partition memory utilization (line chart)
- Partition running CPU cores (line chart)
- Partition CPU cores (line chart)
- Partition average wait time for queued jobs (line chart)
- Partition total memory and allocated memory (line chart)
- Partition allocated memory percent (line chart)
- Partition queued job count (line chart)
- Partition running job count (line chart)
- Partition compute node count (line chart)
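The percent charts above (running CPU utilization percent, allocated memory percent) are presumably ratios of a used quantity to a total. The exact formulas are not documented here, so the following is only a sketch under that assumption; `percent` and the sample figures are hypothetical.

```python
def percent(part, total):
    """Return `part` as a percentage of `total`, rounded to two decimals.

    Returns 0.0 when `total` is zero or negative (for example, a partition
    with no nodes), to avoid a division error.
    """
    if total <= 0:
        return 0.0
    return round(100.0 * part / total, 2)


if __name__ == "__main__":
    # Hypothetical partition snapshot: 48 running cores out of 64 in total,
    # and 96 GiB of memory allocated out of 256 GiB.
    print(percent(48, 64))
    print(percent(96, 256))
```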
Node list
Real-time fields include: node name, node ID, cluster ID, partition, uptime, CPU count, total memory, root partition, CPU utilization, memory utilization, root partition utilization, swap utilization, scheduler node status, session count, session user count, running job count, total scheduler CPU, free CPU, and running CPU.
Node monitoring
Real-time metrics at the node level include: uptime, CPU count, CPU iowait, total memory, total file descriptors, total CPU utilization, memory utilization, and swap utilization.
Charts: adjust the time range at the top-right to view monitoring data for the desired period.
- CPU utilization (line chart)
- Swap (line chart)
- Memory (line chart)
- 5-minute network traffic (stacked chart)
- System load average (line chart)
- Disk read/write bytes per second (line chart)
- Network bandwidth per second (line chart)
- Disk IOPS (line chart)
- Open file descriptors (left axis) / context switches per second (right axis) (line and scatter chart)
- Disk utilization (line chart)
- Network socket connections (line chart)
- I/O time breakdown within 1 second (line chart)
- Per-I/O latency (reference: < 100 ms) (beta) (line chart)
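As a rough illustration of how a per-I/O latency figure relates to the I/O time and IOPS charts, the sketch below divides the time spent on I/O during an interval by the number of I/Os completed in it, then compares the result with the 100 ms reference. Whether the product computes the metric this way is an assumption; the function names are hypothetical.

```python
LATENCY_REFERENCE_MS = 100.0  # reference value quoted in the chart title


def per_io_latency_ms(io_time_ms, io_count):
    """Average latency per I/O over one sampling interval, in milliseconds.

    `io_time_ms` is the total time spent on I/O in the interval and
    `io_count` is the number of I/Os completed in the same interval.
    """
    if io_count <= 0:
        return 0.0  # no I/O completed, so there is no latency to report
    return io_time_ms / io_count


def exceeds_reference(latency_ms, reference_ms=LATENCY_REFERENCE_MS):
    """True when the latency is at or above the reference value."""
    return latency_ms >= reference_ms


if __name__ == "__main__":
    latency = per_io_latency_ms(io_time_ms=450.0, io_count=30)
    print(latency, exceeds_reference(latency))
```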
GPU monitoring
Real-time metrics include: GPU count, warnings, GPU utilization, and GPU memory utilization.
Charts: adjust the time range at the top-right to view monitoring data for the desired period.
- GPU utilization (detail) (line chart)
- GPU memory utilization (detail) (line chart)
- GPU frequency (line chart)
- Power (line chart)
- Memory frequency (line chart)
- GPU temperature (line chart)
- Memory temperature (line chart)
- Memory used (frame buffer) (line chart)
- Memory free (frame buffer) (line chart)
Note: CentOS 6.x does not support GPU monitoring.
Service monitoring
Shows service status for each node in the cluster.
Scheduler monitoring
Shows node states at the scheduler level for Fsched clusters.
- Fully allocated: alloc (blue)
- Partially allocated: mix (light blue)
- Idle: idle (green)
- Unavailable: drain + resv + maint + completing (gray); the first three states are marked unavailable by administrators
- Fault: down + fail + error (red)
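The grouping of scheduler states into display categories can be written as a lookup table. This is a minimal sketch built only from the states and colors listed above; `STATE_GROUPS` and `group_node_states` are hypothetical names, and the "Unknown" fallback is an assumption for states not listed here.

```python
from collections import Counter

# Scheduler state -> (display group, color), taken from the list above.
STATE_GROUPS = {
    "alloc": ("Fully allocated", "blue"),
    "mix": ("Partially allocated", "light blue"),
    "idle": ("Idle", "green"),
    "drain": ("Unavailable", "gray"),
    "resv": ("Unavailable", "gray"),
    "maint": ("Unavailable", "gray"),
    "completing": ("Unavailable", "gray"),
    "down": ("Fault", "red"),
    "fail": ("Fault", "red"),
    "error": ("Fault", "red"),
}


def group_node_states(states):
    """Count nodes per display group.

    States not present in STATE_GROUPS are counted under "Unknown"
    (an assumption; the handling of other states is not documented here).
    """
    counts = Counter()
    for state in states:
        group, _color = STATE_GROUPS.get(state.lower(), ("Unknown", "n/a"))
        counts[group] += 1
    return dict(counts)


if __name__ == "__main__":
    print(group_node_states(["alloc", "mix", "idle", "drain", "resv", "down"]))
```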
Desktop Monitoring
Node monitoring
Shows hardware resource information (CPU, memory, storage, etc.) for the selected desktop and node.
GPU monitoring
When the node has GPU devices, shows GPU-related metrics.
Note: CentOS 6.x does not support GPU monitoring.
Service monitoring
Shows runtime status of desktop-related services for the selected desktop and node.
File System Monitoring
Node monitoring
Shows hardware resource monitoring for the file system, including CPU, memory, storage, and more.
Service monitoring
Shows runtime status of file-system-related services.
Performance monitoring
Shows file system performance metrics, including IOPS, throughput, latency, and capacity information (available and total).
Management Node Monitoring
Node monitoring
Shows hardware resource monitoring for management nodes, including CPU, memory, storage, and more.
Service monitoring
Shows runtime status of system services on the selected management node.
Base Node Monitoring
Node monitoring
Shows hardware resource monitoring for base nodes on the platform, including CPU, memory, storage, and more.
FAQ
- Removing or releasing all compute nodes in a cluster is not a valid operation. In that state, the monitoring system cannot collect valid node metrics, so any abnormal data shown should be disregarded.
- If a chart line color is too light to distinguish, click the color block in the legend to switch it to a more vivid color.