
Core Node Post-Restore Procedure

Assume the Core node is backed up regularly. When a Core node fails, the backup data is restored to a new Core node. A backup, however, is only a point-in-time snapshot: after it was taken, the cluster may have changed in ways the backup does not capture. After restoring the data, therefore, additional steps are required to bring the cluster back to normal operation. Note: reconciling differences between the backup and the live nodes is risky; avoid doing so while jobs are running on the affected nodes.

This document describes how to handle changes that occurred after the backup point:

Pre-Restore Preparation

  1. If the Core node can be temporarily disconnected from the network, disconnect it before the restore to avoid data conflicts.
  2. Because access to the management console is still required during the restore steps, keep the minimum network access needed to reach the console.
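One way to satisfy both points is a host firewall on the Core node that drops everything except console traffic. The sketch below is an assumption-laden example using nftables: it assumes the console listens on TCP 443 and administrators come from 10.0.0.0/24 — adjust both to your environment, and load the ruleset with `nft -f <file>`.

```
# Minimal isolation ruleset (assumptions: console on TCP 443,
# admin workstations on 10.0.0.0/24). Load with: nft -f core-isolation.nft
table inet core_isolation {
    chain input {
        type filter hook input priority 0; policy drop;
        iif "lo" accept                             # local traffic
        ct state established,related accept          # replies to our own sessions
        ip saddr 10.0.0.0/24 tcp dport 443 accept    # console access only
    }
    chain output {
        type filter hook output priority 0; policy drop;
        oif "lo" accept
        ct state established,related accept          # answer console sessions
    }
}
```

The drop-by-default output chain prevents the restored Core from reaching compute nodes until you deliberately restore connectivity.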

Nodes Removed After the Backup

If nodes were removed from the cluster after the backup was taken, the restored Core node will still consider them part of the cluster. If such a node has since been repurposed, the cluster configuration process will try to re-manage it, which may disrupt the workloads already running on it. Therefore, before restoring network connectivity, remove those nodes from the restore target cluster (the cluster being restored):

  • Select these nodes in the UI and remove them.
  • Because the network is disconnected and cleanup cannot complete, choose "Force Remove."
  • Wait for node removal to complete.
  • After finishing the backup restore and resolving cluster node differences, you can restore the Core node network.
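To see exactly which nodes differ, compare the node inventory known to the restored Core against the live cluster. The sketch below seeds two example inventories for illustration; in practice you would export these lists yourself, one node name per line (for a Slurm cluster, e.g. `sinfo -N -h -o '%N' | sort -u`). The node names and file names are hypothetical.

```shell
# Illustration with example inventories; in practice export these lists
# from the restored Core and from the live cluster. Inputs must be sorted.
workdir=$(mktemp -d)
printf 'cn01\ncn02\ncn03\n' > "$workdir/backup_nodes.txt"   # known to the backup
printf 'cn02\ncn03\ncn04\n' > "$workdir/live_nodes.txt"     # present now

# In the backup but gone from the live cluster -> removed after the backup;
# these are the nodes to force-remove before reconnecting the network.
comm -23 "$workdir/backup_nodes.txt" "$workdir/live_nodes.txt"   # cn01

# In the live cluster but unknown to the backup -> added after the backup;
# these are the nodes to drain and re-add (see the next section).
comm -13 "$workdir/backup_nodes.txt" "$workdir/live_nodes.txt"   # cn04
```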

Nodes Added After the Backup

If nodes were added to the cluster after the backup was taken, they will be missing from the management platform once the Core node is restored. Reconfiguring the cluster in this state would remove those nodes, which can disrupt the workloads running on them. Therefore, after the restore, re-add the new nodes to the restore target cluster (the cluster being restored):

  • Use the following command to put each newly added node into the DRAIN state. Use the same reason string every time; if the reason differs, the node state may be reset to IDLE.
    scontrol update NodeName=<node_name> State=DRAIN Reason=MAINT:RECOVER
  • After jobs finish, restore the Core node network and re-add the nodes to the cluster.
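The drain-and-wait step above can be scripted. The sketch below assumes a Slurm cluster; `drain_and_wait` is a hypothetical helper (not part of Slurm), and the 30-second polling interval and the reason string are examples.

```shell
# Drain a node and block until it has no jobs left on it (assumes Slurm).
# Usage: drain_and_wait <node_name>
drain_and_wait() {
    node="$1"
    # Keep the reason string identical across calls; a differing reason
    # can cause the state to be reset (see the note above).
    scontrol update NodeName="$node" State=DRAIN Reason="MAINT:RECOVER"
    # squeue -h prints nothing once no jobs remain on the node.
    while [ -n "$(squeue -h -w "$node" -o '%i')" ]; do
        sleep 30
    done
}
```

Once the Core node network is restored and the node has been re-added, the drain can be lifted with `scontrol update NodeName=<node_name> State=RESUME`.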