Product Feature FAQs
Scheduler
1. Why does setting two nodes in different partitions to the same hostname cause tasks to fail in a Slurm cluster?
Partitions do not isolate hostnames, and the automatic node-naming rules do not prevent user-created conflicts. Slurm does not know that the two nodes belong to different partitions; it identifies a node by its hostname alone. Because naming rules that prevent such conflicts are not currently enforced, if two partitions contain nodes with the same hostname, Slurm may treat them as a single node, and tasks fail as a result.
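A quick way to spot such a collision (assuming you can run Slurm client commands on the head node) is to list the node records Slurm actually holds; two physical machines sharing a hostname show up as a single record:

```
# List every node Slurm knows about, with its partition and state.
# Two machines configured with the same hostname appear as one record here.
sinfo -N -o "%N %P %T"
```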
2. In on-prem/local environments with head-node HA enabled, slurmctld does not start automatically
After enabling head-node HA, slurmctld will not start automatically in order to keep state consistent during recovery.
For local nodes, power state is not managed by the system, so the system cannot know their exact power status. Therefore, unlike cloud nodes, the system will not automatically perform full cluster configuration updates after boot.
For these reasons, in local environments with head-node HA enabled, you need to start slurmctld manually.
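For example, assuming slurmctld is managed by systemd in your deployment, you can start it on the active head node like this:

```
# Start the Slurm controller manually and confirm that it is running.
systemctl start slurmctld
systemctl status slurmctld --no-pager
```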
3. Slurm does not fail when submitting an unsatisfiable GPU task
Slurm has a known bug in GPU request validation: a task whose GPU requirements cannot be satisfied is accepted rather than rejected at submission time. Make sure the GPU parameters you provide are correct.
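As a pre-submission sanity check (assuming sinfo is available on the submit node), you can compare your request against the GPUs each node actually reports:

```
# Show the GRES (GPU) configuration of every node so the requested GPU
# count and type can be checked against what really exists.
sinfo -N -o "%N %G"
```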
4. After changing a node hostname to match another node, the node cannot execute tasks (Fsched)
Fsched schedules tasks based on hostname. If two nodes in the cluster share the same hostname, Fsched may produce scheduling errors.
5. srun -n3 -G2 uses three CPU nodes (expected two)
CPU requirements and GPU requirements are calculated separately:
- CPU task count = num_tasks / min_node = 3 / 1 = 3
- GPU task count = gpu / min_gpu = 2 / 1 = 2
The scheduler takes the maximum task count (3), which requires 3 nodes. Combined with manually selected nodes, it may scale up additional nodes.
This scenario is special and uncommon, and the fix logic is complex, so it is not currently addressed.
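If keeping the job on two nodes matters, one possible workaround is to request the node count explicitly; whether the scheduler honors this in an autoscaling setup is an assumption to verify for your environment:

```
# Ask Slurm for exactly 2 nodes while keeping 3 tasks and 2 GPUs in total.
srun -n3 -G2 -N2 hostname
```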
6. In an Fsched cluster, a hostname such as "a" followed by 62 digits displays inconsistently in sinfo -Nel
Slurm folds hostnames that end in numeric suffixes into ranges. The folded value is stored as a uint64_t, so only numeric suffixes of roughly 9 digits are handled; beyond that, folding can produce incorrect results.
Hostnames with overly long numeric suffixes (more than 8 digits) are therefore not supported in Fsched.
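Illustrative naming examples (all hostnames below are hypothetical):

```
# Problematic: numeric suffix longer than 8 digits
#   a12345678901234567890
# Safe: short numeric suffixes, or letters mixed into the name
#   gpu-a-0001, cpu-b-042
```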
7. [Fsched cluster] Renaming a node may show slurmd is not running
A long run of consecutive digits in the new hostname can prevent slurmd from recognizing it. Use letters instead of long digit sequences.
8. Slurm interactive tasks (srun) are not terminated when the application cannot detect a node failure
Slurm waits for node timeout to detect failures, which typically takes around 20 minutes. Only after the timeout will the task fail.
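You can inspect the timeout values configured for your cluster with a read-only check; which parameters are in effect depends on your Slurm version and site configuration:

```
# Print the timeout-related settings from the running configuration.
scontrol show config | grep -i timeout
```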
9. Interactive (srun) tasks may hit communication errors in autoscale scenarios
Interactive tasks require direct connectivity from the submit node to compute nodes, but during autoscale provisioning some services on compute nodes may not be fully started yet, leading to communication failures.
Cluster Management
1. /etc/hosts contains stale entries
The cluster does not clean up entries for nodes that have been removed, because /etc/hosts may also contain user-managed content.
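If a stale entry needs to go, it can be deleted from /etc/hosts by hand; the entry below is purely hypothetical:

```
# /etc/hosts excerpt -- the host was removed from the cluster, so this line
# can be deleted manually if nothing else relies on it.
10.0.1.23   compute-0003
```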
2. Single node in multiple partitions: vCPU Multiplier statistics are incorrect
If different partitions have different vCPU Multiplier values, the effective vCPU Multiplier for a node that belongs to multiple partitions is undefined.
3. Newly added static nodes still show up as selectable when reopening the Add Nodes list
Node state synchronization is delayed. Wait a short while and the list will display correctly.
4. Cluster configuration errors after changing node instance types
Cloud providers may fail when calling the power-on API after an instance type change. Wait briefly and the odin service will retry automatically until the cluster configuration succeeds.
5. FCP-OnPrem or hybrid cloud: the common node shows ntp st=16 and time sync becomes abnormal
Cause: the common node's NTP service reports stratum 16 (st=16), which means it is not synchronized with any upstream time source.
Workaround: set custom_ntp_server to 127.0.0.1 in the deployment configuration file of the common node and redeploy the environment; the change takes effect for newly created clusters.
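To confirm the symptom on the common node (assuming the classic ntpd tooling is installed), check the reported stratum; st=16 means the daemon is not synchronized to any upstream source:

```
# Show NTP peers and the stratum (st) column.
ntpq -p
```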
6. Head-node HA: if the head node disk is full, cluster stays in "updating"
If the head node disk is full, fs-scale cannot write to the database and exits, causing cluster configuration to fail.
Reasons why the system did not fail over to a new head node:
- The fs-scale version did not change, so file upload was not triggered.
- Network connectivity was normal.
- Due to the special role of the head node, the system does not switch head nodes unless cluster configuration is affected.
This scenario should be covered by node health monitoring.
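A quick diagnostic on the head node (a simple read-only check) is to look for a full filesystem:

```
# A filesystem at 100% here prevents fs-scale from writing its database.
df -h
```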
Monitoring and Alerts
1. Email notifications cannot be received in on-prem/local environments
Email notifications are configured to use Fastone's mail server by default. In internal networks, Fastone's mail server may not be reachable, causing emails to fail. See the FCP-OnPrem deployment documentation for custom SMTP requirements.
2. [CentOS 6] dcgm-exporter fails when viewing GPU clusters in monitoring
CentOS 6 does not support GPUs and has no GPU drivers.
3. The Y-axis in Grafana resource monitoring shows repeated labels
The chart always draws at least 6 Y-axis ticks. When the maximum data value is small (for example, 1), the ticks round to the same values and the labels repeat. Once the maximum data value is >= 6, the labels return to normal.
4. After updating an alert policy and adding WeCom (WeChat Work), recovery alerts are sent
Alert policy changes must be pushed to Grafana. Since Grafana does not provide a PATCH interface, the system deletes and recreates the original policy, and during recreation the resource state can be treated as recovered, which triggers recovery notifications.
Other
1. On Linux, downloading DataManager from the web fails to invoke the fastone:// protocol
- Check whether DataManager is installed on the host.
- Check whether xdg-open is configured correctly (see the check below).
- Delete ~/.config/data-manager.
Note: deleting ~/.config/data-manager can risk losing files. This directory is used by DataManager to store temporary files. Back up any personal files if needed.
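To verify the protocol association (assuming a desktop environment with the xdg utilities installed), you can query which handler is registered for the fastone:// scheme; an empty result means the association is missing:

```
# Show the .desktop handler registered for the fastone:// URL scheme.
xdg-mime query default x-scheme-handler/fastone
```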
2. When mount points include a parent directory and its subdirectory, the parent appears under autofs but is not actually mounted
With nested mounts, mount the parent directory first, then mount the subdirectory.
However, autofs can make the final mounted target uncertain.
This is not a recommended usage pattern.
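If you do need nested mounts, the sketch below shows the intended order; the server name and paths are hypothetical:

```
# Mount the parent directory first, then the subdirectory beneath it.
mount -t nfs fileserver:/export/data     /mnt/data
mount -t nfs fileserver:/export/data/sub /mnt/data/sub
```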
3. After converting a global mount to a compute-partition mount, login nodes can still access the mount
After the mount is unbound, nodes that have not yet been recycled ("expired") are not guaranteed to lose access to the mount point immediately.
4. Image sharing stays in "waiting" and never completes
Tencent Cloud's image sharing API has a bug that can cause sharing to remain in waiting state. Retry until it succeeds.
5. Rules for shared directory authorization addresses
- Do not provide duplicate authorization addresses.
- Authorization addresses can be an IP, a domain name, a wildcard domain, a CIDR block, or * (see the examples below).
- CIDR blocks cannot overlap; overlapping CIDRs will cause an error.
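Illustrative examples of valid authorization addresses (all values below are placeholders):

```
192.168.1.10      # single IP
nfs.example.com   # domain name
*.example.com     # wildcard domain
10.0.0.0/24       # CIDR block; must not overlap other CIDR entries
*                 # allow all
```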
6. AD users cannot log in to Fastone UI using domain\\username
Only the bare username is currently supported; the domain\username format is not.
7. Different users connecting to the same Windows desktop (single-node) will close the previous RDP session
Windows nodes allow only one active remote session. Remote connections to the Windows node use the same username, so a new connection replaces the old one.
8. Creating a cluster or task fails when using a newly created subnet
Cause: if your environment uses external authentication components or external storage, firewall rules on those external components may block access when the new subnet's CIDR is not within the allowed range.
Solution: update the firewall rules on the external components to allow access from nodes within the new subnet's CIDR range.
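For example, if the external storage is an NFS server controlled through /etc/exports (an assumption about your environment), the new subnet's CIDR has to be added there:

```
# /etc/exports sketch: allow both the original and the newly created subnet,
# then re-export with: exportfs -ra
/data  10.0.0.0/16(rw,sync,no_subtree_check)  10.1.0.0/16(rw,sync,no_subtree_check)
```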
Configuration Management
Shared storage configuration fails: [ERROR]: spec/scripts/nfs-lock-check.lua
This issue is usually related to the underlying NFS mount. Try the following:
- If the mount is already present on the Core node, try unmounting it with umount (see the example below).
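For example (the mount path below is hypothetical):

```
# Unmount the stale NFS mount on the Core node; add -l for a lazy unmount
# if the filesystem is reported as busy.
umount /mnt/shared
```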
If the issue persists, contact Fastone support.