Platform Shutdown and Startup Procedure
Introduction
You may occasionally need to shut down or restart the entire platform. This document gives the shutdown and restart steps, in order, so that you can do so safely.
If you are not shutting down the entire platform, see Impact of Restarting or Shutting Down the Platform and Related Nodes.
Cluster Shutdown Procedure
Preparation
Before shutting down the cluster, make sure you complete the following preparation steps:
- Confirm task status: Make sure all running tasks have completed or been canceled to avoid data loss or state inconsistency.
- Finish or cancel UI-submitted tasks: Confirm that all tasks submitted through the UI have been completed or canceled.
- Check cluster changes: Make sure no cluster change operation is currently in progress, to avoid inconsistent state.
- Block new tasks: Make sure users stop creating new tasks or making cluster changes during shutdown.
Shut down the cluster
Because there are dependencies between platform nodes, shut down the cluster in the following order:
- Shut down compute nodes and submit nodes: Shut these down first to avoid dependency issues in later steps.
- Shut down head nodes: Once the compute and submit nodes are off, shut down the cluster head nodes (the control nodes).
- Shut down the monitor node, if present: If the platform includes a Monitor node, shut it down after the head nodes.
- Shut down management nodes: Finally, shut down the platform management nodes to complete the cluster shutdown.
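The order above can be sketched as a small shell helper. This is only an illustration, not part of the platform tooling: the host names, the use of SSH, and the `sudo shutdown -h now` command are all assumptions, so substitute your own inventory and power-off method. By default the helper only prints what it would do.

```shell
# Hypothetical inventory: replace with your own host names.
COMPUTE_NODES="compute01 compute02"
SUBMIT_NODES="submit01"
HEAD_NODES="head01"
MONITOR_NODES=""            # leave empty if the platform has no Monitor node
MANAGEMENT_NODES="mgmt01"

# Print one "group host" line per node, in the documented shutdown order.
# The procedure does not order compute and submit nodes relative to each
# other; listing compute nodes first is an arbitrary choice.
shutdown_plan() {
  for host in $COMPUTE_NODES;    do echo "compute $host";    done
  for host in $SUBMIT_NODES;     do echo "submit $host";     done
  for host in $HEAD_NODES;       do echo "head $host";       done
  for host in $MONITOR_NODES;    do echo "monitor $host";    done
  for host in $MANAGEMENT_NODES; do echo "management $host"; done
}

# Walk the plan. With DRY_RUN=1 (the default) nothing is shut down.
run_shutdown() {
  shutdown_plan | while read -r group host; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would shut down $host ($group)"
    else
      # -n keeps ssh from consuming the rest of the plan on stdin.
      ssh -n "$host" 'sudo shutdown -h now'
    fi
  done
}
```

Run `run_shutdown` first to review the plan, then `DRY_RUN=0 run_shutdown` once the preparation checks have passed.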
Final confirmation
- Check node status: Confirm that all nodes have shut down safely.
- Record the process: Record the shutdown process and node states for later inspection and recovery.
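One way to keep the record described above is a timestamped log file. The following is a minimal sketch: the log path and node names in the usage comment are placeholders, and how you determine each node's state (console, BMC, or similar) depends on your environment.

```shell
# Append one tab-separated line (UTC timestamp, node, state) to a log file.
# Usage: record_state LOGFILE NODE STATE
record_state() {
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" >> "$1"
}

# Example (hypothetical path and host names):
#   record_state /var/tmp/platform-shutdown.log head01 powered-off
#   record_state /var/tmp/platform-shutdown.log mgmt01 powered-off
```

Keeping the log on a machine that stays powered on makes it available when you later start the platform again.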
Cluster Startup Procedure
Preparation
Before restarting the platform, complete the following preparation:
- Confirm storage system status: Make sure all external storage systems, such as NFS servers, have been started and are accessible.
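The storage check can be scripted before any node is started. The sketch below assumes a Linux host with `mountpoint` (util-linux) and `timeout` (coreutils) available; the mount path in the usage comment is hypothetical and should be replaced with the directory your platform actually uses.

```shell
# Return 0 only if DIR is a mount point and can be listed within 10 seconds.
# `timeout` guards against an `ls` that hangs on an unresponsive NFS server.
storage_ready() {
  dir=$1
  mountpoint -q "$dir" 2>/dev/null || return 1
  timeout 10 ls "$dir" >/dev/null 2>&1
}

# Example (hypothetical mount path):
#   storage_ready /mnt/shared || { echo "storage not ready" >&2; exit 1; }
```

A plain directory that exists but is not mounted fails the check, which catches the case where the NFS export was never attached.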
Start management nodes
Note: Before starting management nodes, make sure storage nodes are fully started.
- Start management nodes: Run the following commands to start the management node and reconfigure services:
    cd "$(dirname "$(sudo docker container inspect fastone-api | jq -r '.[0].Config.Labels["com.docker.compose.project.working_dir"]')")"
    sudo ymir down
    sudo ymir up
- Start the monitor node, if present: Use similar commands to start Monitor node services:
    cd "$(dirname "$(sudo docker container inspect fastone-api | jq -r '.[0].Config.Labels["com.docker.compose.project.working_dir"]')")"
    sudo ymir down
    sudo ymir up
Start cluster nodes
- Start head nodes first.
- Start compute nodes next.
- Start submit nodes last.
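To enforce this order in a script, each group can be gated on the previous one becoming reachable. The helper below is a generic retry loop, offered as a sketch; using SSH reachability as the readiness signal, and the host name in the usage comment, are assumptions rather than something the platform prescribes.

```shell
# Retry a command until it succeeds or TIMEOUT seconds of waiting have passed.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    elapsed=$((elapsed + interval_s))
    if [ "$elapsed" -gt "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
  done
}

# Example (hypothetical host name): wait up to 5 minutes for a head node
# to accept SSH before powering on compute nodes.
#   wait_until 300 5 ssh -n -o BatchMode=yes -o ConnectTimeout=5 head01 true \
#     || { echo "head01 did not come up" >&2; exit 1; }
```

Reachability over SSH only shows that the operating system is up; the node's health in the management platform still needs to be confirmed as described in the final confirmation step.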
Final confirmation
- Check node status: Confirm that all nodes have started successfully and are in a healthy state.
- Reconfigure the cluster:
  - In the management platform, click Reconfigure to force configuration redistribution to the cluster.
  - Wait until cluster configuration completes and the cluster status changes to Running.
FAQ
What happens if tasks are running or queued when the platform is shut down?
- Task interruption: If tasks are still running or queued when the platform is shut down, they are forcibly stopped and their status becomes Failed.
What happens if cluster changes are running or queued when the platform is shut down?
- Change processing: Cluster changes continue attempting to run until they complete or are marked as failed. If the change becomes invalid or is no longer needed during platform downtime, you can adjust it in the UI after the platform restarts.
What happens if storage is not ready when reconfiguring management nodes?
- Management platform startup failure: If storage is not ready, the management platform may fail to start or fail to access the target directory. Make sure storage is ready before running the reconfiguration command.
Closing Notes
- Important reminder: Watch the state of each step closely during the operation and follow the procedure carefully to avoid potential issues.
- Feedback and support: If you encounter any issues or need further help, contact technical support or refer to the related support documentation.