Product Feature FAQs
Scheduler
1. Why does setting two nodes in different partitions to the same hostname cause tasks to fail in a Slurm cluster?
Partitions do not isolate hostnames, and the automatic node-naming rules do not prevent user-created conflicts. Slurm does not know that the two nodes belong to different partitions; it identifies a node by its hostname alone. Because naming rules that prevent such conflicts are not currently enforced, if two partitions contain nodes with the same hostname, Slurm may treat them as a single node, and tasks fail as a result.
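A quick way to spot such a collision (assuming you can run Slurm client commands on the head node) is to list the node records Slurm actually holds; two physical machines sharing a hostname show up as a single record:

```
# List every node Slurm knows about, with its partition and state.
# Two machines configured with the same hostname appear as one record here.
sinfo -N -o "%N %P %T"
```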
2. In on-prem/local environments with head-node HA enabled, slurmctld does not start automatically
After enabling head-node HA, slurmctld will not start automatically in order to keep state consistent during recovery.
For local nodes, power state is not managed by the system, so the system cannot know their exact power status. Therefore, unlike cloud nodes, the system will not automatically perform full cluster configuration updates after boot.
For these reasons, in local environments with head-node HA enabled, you need to start slurmctld manually.
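For example, assuming slurmctld is managed by systemd in your deployment, you can start it on the active head node like this:

```
# Start the Slurm controller manually and confirm that it is running.
systemctl start slurmctld
systemctl status slurmctld --no-pager
```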
3. Slurm does not fail when submitting an unsatisfiable GPU task
Slurm has a known bug in GPU request validation: a task whose GPU requirements cannot be satisfied is accepted rather than rejected at submission time. Make sure the GPU parameters you provide are correct.
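As a pre-submission sanity check (assuming sinfo is available on the submit node), you can compare your request against the GPUs each node actually reports:

```
# Show the GRES (GPU) configuration of every node so the requested GPU
# count and type can be checked against what really exists.
sinfo -N -o "%N %G"
```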
4. After changing a node hostname to match another node, the node cannot execute tasks (Fsched)
Fsched schedules tasks based on hostname. If two nodes in the cluster share the same hostname, Fsched may produce scheduling errors.
5. srun -n3 -G2 uses three CPU nodes (expected two)
CPU requirements and GPU requirements are calculated separately:
- CPU task count = num_tasks / min_node = 3 / 1 = 3
- GPU task count = gpu / min_gpu = 2 / 1 = 2
The scheduler takes the maximum task count (3), which requires 3 nodes. Combined with manually selected nodes, it may scale up additional nodes.
This scenario is special and uncommon, and the fix logic is complex, so it is not currently addressed.
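If keeping the job on two nodes matters, one possible workaround is to request the node count explicitly; whether the scheduler honors this in an autoscaling setup is an assumption to verify for your environment:

```
# Ask Slurm for exactly 2 nodes while keeping 3 tasks and 2 GPUs in total.
srun -n3 -G2 -N2 hostname
```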
6. In an Fsched cluster, a hostname such as "a" followed by 62 digits displays inconsistently in sinfo -Nel
Slurm folds hostnames that end in numeric suffixes into ranges. The folded value is stored as a uint64_t, so only numeric suffixes of roughly 9 digits are handled; beyond that, folding can produce incorrect results.
Hostnames with overly long numeric suffixes (more than 8 digits) are therefore not supported in Fsched.
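Illustrative naming examples (all hostnames below are hypothetical):

```
# Problematic: numeric suffix longer than 8 digits
#   a12345678901234567890
# Safe: short numeric suffixes, or letters mixed into the name
#   gpu-a-0001, cpu-b-042
```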
7. [Fsched cluster] Renaming a node may show slurmd is not running
A long run of consecutive digits in the new hostname can prevent slurmd from recognizing it. Use letters instead of long digit sequences.
8. Slurm interactive tasks (srun) are not terminated when the application cannot detect a node failure
Slurm waits for node timeout to detect failures, which typically takes around 20 minutes. Only after the timeout will the task fail.
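You can inspect the timeout values configured for your cluster with a read-only check; which parameters are in effect depends on your Slurm version and site configuration:

```
# Print the timeout-related settings from the running configuration.
scontrol show config | grep -i timeout
```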
9. Interactive (srun) tasks may hit communication errors in autoscale scenarios
Interactive tasks require direct connectivity from the submit node to compute nodes, but during autoscale provisioning some services on compute nodes may not be fully started yet, leading to communication failures.
Cluster Management
1. /etc/hosts contains stale entries
The cluster does not clean up entries for nodes that have been removed, because /etc/hosts may also contain user-managed content.
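If a stale entry needs to go, it can be deleted from /etc/hosts by hand; the entry below is purely hypothetical:

```
# /etc/hosts excerpt -- the host was removed from the cluster, so this line
# can be deleted manually if nothing else relies on it.
10.0.1.23   compute-0003
```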
2. Single node in multiple partitions: vCPU Multiplier statistics are incorrect
If different partitions have different vCPU Multiplier values, the effective vCPU Multiplier for a node that belongs to multiple partitions is undefined.
3. Newly added static nodes still show up as selectable when reopening the Add Nodes list
Node state synchronization is delayed. Wait a short while and the list will display correctly.
4. Cluster configuration errors after changing node instance types
Cloud providers may fail when calling the power-on API after an instance type change. Wait briefly and the odin service will retry automatically until the cluster configuration succeeds.
5. FCP-OnPrem or hybrid cloud: the common node shows ntp st=16 and time sync becomes abnormal
Cause: the common node's NTP service reports stratum 16 (st=16), which means it is not synchronized with any upstream time source.
Workaround: set custom_ntp_server to 127.0.0.1 in the deployment configuration file of the common node and redeploy the environment; the change takes effect for newly created clusters.
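To confirm the symptom on the common node (assuming the classic ntpd tooling is installed), check the reported stratum; st=16 means the daemon is not synchronized to any upstream source:

```
# Show NTP peers and the stratum (st) column.
ntpq -p
```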
6. Head-node HA: if the head node disk is full, cluster stays in "updating"
If the head node disk is full, fs-scale cannot write to the database and exits, causing cluster configuration to fail.
Reasons why the system did not fail over to a new head node:
- The fs-scale version did not change, so file upload was not triggered.
- Network connectivity was normal.
- Due to the special role of the head node, the system does not switch head nodes unless cluster configuration is affected.
This scenario should be covered by node health monitoring.
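A quick diagnostic on the head node (a simple read-only check) is to look for a full filesystem:

```
# A filesystem at 100% here prevents fs-scale from writing its database.
df -h
```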
Monitoring and Alerts
1. Email notifications cannot be received in on-prem/local environments
Email notifications are configured to use Fastone's mail server by default. In internal networks, Fastone's mail server may not be reachable, causing emails to fail. See the FCP-OnPrem deployment documentation for custom SMTP requirements.
2. [CentOS 6] dcgm-exporter fails when viewing GPU clusters in monitoring
CentOS 6 does not support GPUs and has no GPU drivers.
3. The Y-axis in Grafana resource monitoring shows repeated labels
The chart always draws at least 6 Y-axis ticks. When the maximum data value is small (for example, 1), the ticks round to the same values and the labels repeat. Once the maximum data value is >= 6, the labels return to normal.
4. After updating an alert policy and adding WeCom (WeChat Work), recovery alerts are sent
Alert policy changes must be pushed to Grafana. Since Grafana does not provide a PATCH interface, the system deletes and recreates the original policy, and during recreation the resource state can be treated as recovered, which triggers recovery notifications.
Other
1. On Linux, downloading DataManager from the web fails to invoke the fastone:// protocol
- Check whether DataManager is installed on the host.
- Check whether xdg-open is configured correctly (see the check below).
- Delete ~/.config/data-manager.
Note: deleting ~/.config/data-manager can risk losing files. This directory is used by DataManager to store temporary files. Back up any personal files if needed.
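To verify the protocol association (assuming a desktop environment with the xdg utilities installed), you can query which handler is registered for the fastone:// scheme; an empty result means the association is missing:

```
# Show the .desktop handler registered for the fastone:// URL scheme.
xdg-mime query default x-scheme-handler/fastone
```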
2. When mount points include a parent directory and its subdirectory, the parent appears under autofs but is not actually mounted
With nested mounts, mount the parent directory first, then mount the subdirectory.
However, autofs can make the final mounted target uncertain.
This is not a recommended usage pattern.
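If you do need nested mounts, the sketch below shows the intended order; the server name and paths are hypothetical:

```
# Mount the parent directory first, then the subdirectory beneath it.
mount -t nfs fileserver:/export/data     /mnt/data
mount -t nfs fileserver:/export/data/sub /mnt/data/sub
```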
3. After converting a global mount to a compute-partition mount, login nodes can still access the mount
After the mount is unbound, nodes that have not yet been recycled ("expired") are not guaranteed to lose access to the mount point immediately.
4. Image sharing stays in "waiting" and never completes
Tencent Cloud's image sharing API has a bug that can cause sharing to remain in waiting state. Retry until it succeeds.
5. Rules for shared directory authorization addresses
- Do not provide duplicate authorization addresses.
- Authorization addresses can be an IP, a domain name, a wildcard domain, a CIDR block, or * (see the examples below).
- CIDR blocks cannot overlap; overlapping CIDRs will cause an error.
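Illustrative examples of valid authorization addresses (all values below are placeholders):

```
192.168.1.10      # single IP
nfs.example.com   # domain name
*.example.com     # wildcard domain
10.0.0.0/24       # CIDR block; must not overlap other CIDR entries
*                 # allow all
```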
6. AD users cannot log in to Fastone UI using domain\\username
Only the bare username is currently supported; the domain\username format is not.
7. Different users connecting to the same Windows desktop (single-node) will close the previous RDP session
Windows nodes allow only one active remote session. Remote connections to the Windows node use the same username, so a new connection replaces the old one.
8. Creating a cluster or task fails when using a newly created subnet
Cause: if your environment uses external authentication components or external storage, firewall rules on those external components may block access when the new subnet's CIDR is not within the allowed range.
Solution: update the firewall rules on the external components to allow access from nodes within the new subnet's CIDR range.
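For example, if the external storage is an NFS server controlled through /etc/exports (an assumption about your environment), the new subnet's CIDR has to be added there:

```
# /etc/exports sketch: allow both the original and the newly created subnet,
# then re-export with: exportfs -ra
/data  10.0.0.0/16(rw,sync,no_subtree_check)  10.1.0.0/16(rw,sync,no_subtree_check)
```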
Configuration Management
Shared storage configuration fails: [ERROR]: spec/scripts/nfs-lock-check.lua
This issue is usually related to the underlying NFS mount. Try the following:
- If the mount is already present on the Core node, try unmounting it with umount (see the example below).
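For example (the mount path below is hypothetical):

```
# Unmount the stale NFS mount on the Core node; add -l for a lazy unmount
# if the filesystem is reported as busy.
umount /mnt/shared
```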
If the issue persists, contact Fastone support.