Impact of Restarting or Shutting Down the Platform and Related Nodes
The overall FCP platform consists of three major parts:
- Platform management nodes
  - Core node
  - Monitor node (optional)
- Cluster nodes
  - Head node
  - Compute node
  - Login node
  - Desktop node
- External supporting service nodes
  - Authentication service (optional)
  - NTP service
  - Storage service
If the nodes above are shut down, the impact is as follows:
| Node Type | In-Cluster (Fsched) Jobs | Task Mode | Cluster Management | Cluster Monitoring | User Management | Data Access | Remote Access |
|---|---|---|---|---|---|---|---|
| Management node | Long downtime may make task accounting information inaccurate; short downtime has no effect | Task submission is unavailable | Cluster management is unavailable | Cluster monitoring is unavailable | User management is unavailable | Data access is unavailable | Remote access is unavailable |
| Monitor node | None | None | None | Cluster monitoring is unavailable | None | None | None |
| Head node | New jobs cannot be submitted. Running jobs continue until completion, but resources cannot be released afterward | Tasks fail | Cluster management is unavailable | Some monitoring information cannot be collected | None | None | None |
| Compute node | Jobs running on the node fail | Tasks running on the node fail | Cluster management is unavailable | Monitoring information for that node cannot be collected | None | None | None |
| Login node | Interactive jobs running on the node fail | None | Cluster management is unavailable | Monitoring information for that node cannot be collected | None | None | None |
| Desktop node | Jobs running on the node fail | None | Cluster management is unavailable | Monitoring information for that node cannot be collected | None | None | None |
| Authentication service | Long downtime (> 1 minute) prevents task submission because the submitter's identity cannot be verified; short downtime has no effect | Long downtime (> 1 minute) prevents task submission because the submitter's identity cannot be verified | Users cannot log in | None | User management is unavailable | User identity cannot be verified | User identity cannot be verified |
| NTP service | Long failures cause time drift, which breaks node-to-node validation and prevents jobs from running; short failures have no effect | Long failures cause time drift, which breaks node-to-node validation and prevents jobs from running | None | None | None | None | None |
| Storage service | Task execution may fail, depending on the application | Task submission is unavailable | Cluster management is unavailable and management operations may block | None | None | Data access is unavailable | If the user home directory is on shared storage, users cannot log in |
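
When planning a maintenance window, it can help to ask which functional areas are touched if several node types go down together. The following is a minimal sketch, not part of FCP itself: it transcribes the table above into a Python lookup, where `AREAS`, `IMPACT`, and `affected_areas` are hypothetical names chosen for illustration. A cell is marked `True` whenever the table lists any impact, including conditional ones (for example, only after a long downtime), and `False` where the table says "None".

```python
# Hypothetical helper: the names below are illustrative and not an FCP API.
# Column order follows the impact table.
AREAS = (
    "fsched_jobs",
    "task_mode",
    "cluster_management",
    "cluster_monitoring",
    "user_management",
    "data_access",
    "remote_access",
)

# True = the table lists some impact (possibly conditional); False = "None".
IMPACT = {
    "management node":        (True, True, True, True, True, True, True),
    "monitor node":           (False, False, False, True, False, False, False),
    "head node":              (True, True, True, True, False, False, False),
    "compute node":           (True, True, True, True, False, False, False),
    "login node":             (True, False, True, True, False, False, False),
    "desktop node":           (True, False, True, True, False, False, False),
    "authentication service": (True, True, True, False, True, True, True),
    "ntp service":            (True, True, False, False, False, False, False),
    "storage service":        (True, True, True, False, False, True, True),
}


def affected_areas(node_types):
    """Return the areas impacted by shutting down the given node types.

    The result keeps the column order of the table. Node type names are
    matched case-insensitively; an unknown name raises KeyError.
    """
    hit = set()
    for node in node_types:
        flags = IMPACT[node.lower()]
        hit.update(area for area, flag in zip(AREAS, flags) if flag)
    return [area for area in AREAS if area in hit]
```

For example, `affected_areas(["Monitor node"])` returns only `["cluster_monitoring"]`, which matches the table: the monitor node is the one component whose shutdown leaves jobs, management, and access untouched. The lookup deliberately drops the conditions in each cell, so the table remains the authoritative description of what actually happens.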