Version: FCP 25.11

Create a Cluster

Tip
  1. When creating a cluster, the maximum total number of nodes in a single cluster is 200.
  2. Pay-as-you-go and prepaid nodes can be created only after Hybrid Cloud is enabled in FCP-Suite.
  3. Cost estimation is available only after Hybrid Cloud is enabled in FCP-Suite.

The Create a cluster feature lets you quickly build a high-performance computing environment on the platform. With template-based guidance and visual configuration, you can package complex hardware/software/scheduling settings into a cluster that can be deployed with one click, significantly simplifying HPC environment initialization.

Cluster Types

The platform supports the following cluster types to meet different business and technical requirements:

  • Fsched cluster

    • The most commonly used cluster type. Uses Fsched as the core job scheduler.
    • Suitable for general-purpose HPC scenarios that require advanced features such as complex scheduling, queue management, resource quota control, and priority scheduling.
    • Users submit and manage jobs via scheduler commands (for example srun, sbatch).
  • None-Linux cluster

    • Runs Linux, but does not include a built-in scheduler (such as Fsched).
    • Suitable for scenarios that do not require complex job scheduling, where users run tasks directly after SSH login, or where you use third-party cluster management tools (for example Kubernetes, Slurm).
    • Provides basic node management and network/storage integration.
  • None-Windows cluster

    • Runs Windows and does not include a built-in scheduler.
    • Suitable for Windows-centric applications (for example some commercial EDA software or Windows scientific computing software), and for users who distribute workloads via remote desktop or specific management tools.

Cost Estimation

When configuring a cluster, the system provides cost estimation on a 31-day (about one month) cycle to help you plan your budget.

  • What is estimated
    • Dynamic nodes: the manual node count you set in the configuration.
    • Autoscaling nodes: estimated based on your scaling max and an expected load model.
  • Cycle and billing model
    • Cost estimation is calculated based on pay-as-you-go (postpaid) pricing.
    • By default, the estimate assumes the selected node specification runs 24 hours a day for 31 days, producing a reference upper bound.
    • For autoscaling nodes, the system computes a lower and an upper bound from your scaling min and scaling max, assuming 24x31 operation at each bound, and shows the estimate as a range.
  • Prepaid discount hint
    • If your workload is stable long-term, prepaid (monthly/yearly) typically reduces cost significantly.
    • The estimation area shows a comparison against pay-as-you-go and the potential savings percentage.
    • Recommendation: for production-critical nodes that need to run continuously for more than one month, consider prepaid billing first.

Note: Cost estimation is for reference only. Actual cost is based on the final bill. Autoscaling costs depend on job load and idle-release strategy.
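To make the "24 hours a day for 31 days" assumption concrete, here is a rough back-of-the-envelope calculation. The hourly price used below is purely hypothetical and only illustrates the arithmetic; the estimation area in the UI uses the platform's actual rates.

  # 31-day upper-bound estimate for one pay-as-you-go node
  # at a hypothetical price of 0.50 per node-hour
  hours=$((24 * 31))                        # 744 node-hours per node per cycle
  echo "scale=2; $hours * 0.50" | bc        # 372.00 for one node

  # Autoscaling range example with scaling min = 2 and scaling max = 5
  echo "scale=2; 2 * $hours * 0.50" | bc    # 744.00  (lower bound)
  echo "scale=2; 5 * $hours * 0.50" | bc    # 1860.00 (upper bound)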

Prerequisites

Before you start, make sure you have the required permissions:

  • System permission to Create cluster.
  • If you need to use a specific cluster template: permission to use that template.
  • If you need to associate specific static nodes or network resources: access permission to the corresponding resources.

For a full permission matrix, see Permissions. For terminology used in cluster creation, see Glossary.

Create a Cluster Workflow

1. Select a cluster template

The first step is selecting a suitable template as the baseline.

  • UI behavior: On entering Create a cluster, the system lists all cluster templates you are authorized to use, including template name, description, and cluster type.
  • Template status
    • Available: Click to enter the detailed configuration page.
    • Unavailable: If required parameters are missing, the template shows a message like "This template is missing required parameters. Contact an administrator to edit the template." and cannot be selected.
  • No templates: If you are not authorized for any templates, the UI shows a message like "No available templates. Contact an administrator to grant access."

2. Configure the cluster

After selecting a template, configure the cluster's core settings. These settings become the defaults for the entire cluster.

Field descriptions

  • Cluster name: Auto-generated; you can change it. The name must be 3 to 62 characters, start with a letter, and contain only letters, numbers, and hyphens (-). (A format-check example follows this list.)
  • Per-user resource restriction: When enabled, users cannot submit jobs to the cluster by default. You must later allocate resources explicitly via Cluster quotas. (Fsched clusters only)
  • SSH login restriction: When enabled, users cannot bypass the scheduler by SSH-ing directly into compute nodes, ensuring all compute work goes through unified scheduling and management. (Fsched clusters only)
  • Alert service: When enabled, the system creates default alert policies during cluster creation. See Alert service.
  • Release protection: When enabled, no one (including admins) can release this cluster, preventing accidental operations.
  • Mount configuration: Configure shared storage mounts for the entire cluster. See Mount configuration.
  • Custom parameters: Cluster-level advanced parameters for the Fsched scheduler. Configure only if you fully understand the meaning and impact. See Custom parameters. (Fsched clusters only)
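The cluster name rule above can be expressed as a simple pattern check. This is only a local illustration of the format; the platform performs its own validation when you submit the configuration.

  # Valid names: 3 to 62 characters, start with a letter, letters/digits/- only
  name="hpc-cluster-01"
  if [[ "$name" =~ ^[A-Za-z][A-Za-z0-9-]{2,61}$ ]]; then
    echo "name format OK"
  else
    echo "name format invalid"
  fi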

3. Configure compute partitions

Partitions are the core unit of cluster resource management and scheduling, used to satisfy different business scenarios.

Note: For None-Linux and None-Windows clusters, partitions serve mainly for resource grouping and do not provide Fsched scheduling features; you can create only one partition.

Common settings (applies to all node types)

  • Partition name: Auto-generated (for example partition-XXXX). You can modify it. Partition names must be unique within the cluster.
  • Default partition: If a task is submitted without specifying a partition, it is scheduled to this partition. The first created partition is the default. You can modify it during or after creation. (Fsched clusters only)
  • Enable hyper-threading: When enabled, the vCPU count is twice the physical CPU core count; when disabled, vCPUs map 1:1 to physical cores. Enabled by default. (A quick way to verify this on a node is shown after this list.)
  • Swap configuration: Configure swap space for Linux nodes. See Swap configuration.
  • Tags: Tag nodes in the partition for identification, cost allocation, and management.
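To verify how the hyper-threading setting took effect on a running Linux node, the standard lscpu output shows the thread-to-core mapping; this is a generic Linux check, not a platform-specific command.

  # Thread(s) per core: 2 -> hyper-threading enabled (vCPU = 2 x physical cores)
  # Thread(s) per core: 1 -> hyper-threading disabled (vCPU = physical cores)
  lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|^CPU\(s\)'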

Advanced settings (scheduling policy)

  • Allowed groups: Fine-grained control of which user groups can submit jobs to this partition. (Fsched clusters only)
  • Allowed users: Fine-grained control of which users can submit jobs to this partition. (Fsched clusters only)
  • Max job runtime: Jobs running longer than this setting are terminated automatically. (Fsched clusters only)
  • Max CPU usage: Total CPUs available to all jobs in this partition. This limits the number of CPUs configured in the scheduler, which is not necessarily the number of physical CPUs on the nodes. (Fsched clusters only)
  • CPU oversubscription ratio: The ratio of scheduler-allocatable virtual CPUs to physical CPUs. (Fsched clusters only)
  • Load threshold: Configure CPU/memory utilization thresholds. If exceeded, the node is marked drain and stops accepting new jobs to reduce risk. (Fsched clusters only)
  • Custom parameters: Partition-level advanced Fsched scheduler parameters. Configure only if you fully understand the meaning and impact. See Custom parameters. (Fsched clusters only)

Node sources

Each partition can mix dynamic nodes and static nodes.

Dynamic node configuration

  • Instance type: Defines vCPU and memory specification. You can configure fallback instance types; if the preferred type is unavailable, the system tries the next option until nodes are started successfully.
  • Image: Node operating system. An image is a pre-configured package containing the OS plus preinstalled software and settings. After selection, the system installs the image onto the node automatically.
  • System volume: System disk size.
  • Subnet: Network subnet of the nodes.
  • Manual node count: Number of nodes that are started immediately during cluster creation (0 to 999). After you configure it, the UI shows cost estimation on the right.
  • Autoscaling (auto nodes): (Fsched clusters only)
    • Function: Automatically scales nodes up/down based on job queue conditions. Nodes are reclaimed and released automatically after being idle for a configured duration. See Autoscaling.
    • Key parameters: scaling min/max, idle time (minutes), expiration days, reserved nodes. After you set scaling max, the UI shows cost estimation based on that max value.

Static node configuration

  • Select static nodes to add to the partition.
  • Note: Nodes already used as login nodes or head nodes cannot be selected as compute nodes. A compute node can be shared across multiple partitions.

4. Configure login and head partitions (Fsched clusters only)

Login and head partition settings are largely the same as compute partition settings. The key differences are their roles and a few limits:

  • Login partition: The partition that contains entry nodes users connect to via SSH/VNC.
  • Head partition: The partition that contains the primary node running Fsched scheduler management services.
    • Key requirement: An Fsched cluster must have at least one head node in Running or Updating state.
    • Important limit: If a node is prepaid (monthly/yearly) and its expiration policy is Auto release, it cannot be used as a head node.

Core Parameter Details

Mount configuration

  • You can select multiple mount records; mounts take effect for the entire cluster.
  • Partition-level mounts can be configured after the cluster is created.
  • You cannot select two mount records with the same mount point within the same cluster.
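Once the cluster is running, you can confirm on any Linux node that a cluster-level mount is active with standard commands; /shared below is a hypothetical mount point, so substitute your own.

  # Verify that shared storage is mounted at the expected mount point
  df -h /shared
  mount | grep ' /shared '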

Custom parameters

  • These are advanced scheduler settings. Configure only if you fully understand the configuration and impact.
  • Reference: Fsched Scheduler Documentation.

Alert service

When creating a cluster, you can optionally enable default alert policies for the cluster. Custom templates are not supported. Alert service is disabled by default. When enabled, you can choose the following monitoring items:

  • Cluster node running status abnormal

    • Policy name: Auto-generated. Includes cluster name, cluster ID, and a system-generated policy identifier to ensure global uniqueness.
    • Object: This cluster.
    • Type: Host.
    • Nodes: All nodes.
    • Severity: Notification.
    • Sampling interval: 2 minutes.
    • Consecutive periods: 3.
    • Silence period: 24 hours.
    • Rule: Node running status = Abnormal.
    • Send notifications: Yes.
    • Email: Cluster creator.
  • Service abnormal

    • Policy name: Auto-generated. Includes cluster name, cluster ID, and a system-generated policy identifier to ensure global uniqueness.
    • Object: This cluster.
    • Type: Service.
    • Severity: Notification.
    • Sampling interval: 2 minutes.
    • Consecutive periods: 3.
    • Silence period: 24 hours.
    • Rule: Service abnormal.
    • Send notifications: Yes.
    • Email: Cluster creator.

Swap configuration

  • Disabled by default. When enabled, you can configure swap space. Minimum value: 1. Maximum: no limit.
  • Limits:
    • Swap configuration is supported only for Linux nodes. Windows nodes do not support swap.
    • Swap is configured per node (swap is a node attribute, not a partition attribute).
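To confirm that swap was set up on a Linux node after creation, the standard tools below are sufficient; these are generic Linux commands, not platform-specific ones.

  # List active swap devices/files and their sizes
  swapon --show
  free -h    # the Swap: row should be non-zero when swap is enabled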

Autoscaling (auto nodes)

Autoscaling (AutoScale) dynamically scales compute node count based on job size and queue conditions during scheduling.

  • When enabled, the system requests compute resources as needed for submitted jobs. After jobs complete, nodes are reclaimed and released automatically after being idle for a configurable duration.
  • When autoscaling is disabled, the cluster does not grow on demand, and jobs can run only on nodes that were powered on manually.
  • Autoscaling instance types match the partition instance type. Autoscaling nodes and manual nodes do not affect each other.

Parameters after enabling autoscaling:

  • Scaling min: Minimum number of dynamic nodes after scale-in (default: 0).
  • Scaling max: Maximum number of dynamic nodes that can be started when running jobs (default: 10).
  • Idle time (min): How long to wait after jobs complete before releasing nodes. Range: 5 to 1440 minutes. Default: 10 minutes.
  • Expiration days: Maximum lifetime for auto nodes. Once an auto node reaches this lifetime, it no longer accepts new jobs.
  • Reserved nodes: Number of auto nodes that are always kept running and idle in the partition so new jobs can start immediately.

Autoscaling capacity calculation

  • Partition max node count = static node count in the partition + manual node count in the partition + autoscaling max.
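For example (illustrative numbers only): a partition with 3 static nodes, a manual node count of 2, and a scaling max of 5 can grow to at most 3 + 2 + 5 = 10 nodes.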

Examples

  • Example 1: manual node count = 2, autoscaling range = [2, 5], idle time = 5 minutes.
    • Start 2 manual nodes. After the cluster is running, 2 auto nodes are started.
    • Run srun -N4 -n4 hostname: prints hostnames from 2 manual nodes and 2 auto nodes. After completion, keep 2 manual nodes and 2 auto nodes.
    • Run srun -N5 -n5 hostname: autoscaling starts 1 more node. After completion, wait 5 minutes and release 1 node; keep 2 manual nodes and 2 auto nodes.
    • Run srun -N7 -n7 hostname: autoscaling starts 3 more nodes. After completion, wait 5 minutes and release 3 nodes; keep 2 manual nodes and 2 auto nodes.
  • Example 2: static node count = 2, autoscaling range = [0, 5], idle time = 5 minutes.
    • Start 2 static nodes. After the cluster is running, no auto nodes are started.
    • Run srun -N2 -n2 hostname: prints hostnames from 2 static nodes. After completion, keep 2 static nodes.
    • Run srun -N4 -n4 hostname: autoscaling starts 2 auto nodes. After completion, wait 5 minutes and release 2 nodes; keep 2 static nodes.
    • Run srun -N7 -n7 hostname: autoscaling starts 5 auto nodes. After completion, wait 5 minutes and release 5 nodes; keep 2 static nodes.

Note: These examples illustrate how autoscaling starts and releases nodes. srun submits an interactive job; while instances are being created and initialized, network connectivity may be unstable, which can cause submission errors. To avoid this issue entirely, we strongly recommend submitting jobs with sbatch (a minimal script sketch follows).
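A minimal batch script might look like the sketch below. The #SBATCH directives assume Fsched accepts Slurm-style options, as the srun/sbatch commands in this document suggest; the job name and partition are placeholders.

  #!/bin/bash
  #SBATCH --job-name=hostname-test      # placeholder job name
  #SBATCH --partition=partition-0001    # placeholder partition name
  #SBATCH -N 4                          # number of nodes
  #SBATCH -n 4                          # number of tasks

  srun hostname

  # Submit the script with:
  #   sbatch hostname-test.sh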

Usage Notes

  1. Plan first: Choose the cluster type based on workload, OS (Linux/Windows), and whether scheduling is required. Plan partitions, node specs, and counts based on concurrency.
  2. Use templates: Save common configurations as templates to standardize within a team and enable fast duplication.
  3. Optimize cost:
    • Use cost estimation and prefer prepaid billing for long-running stable nodes.
    • For fluctuating workloads, use autoscaling to avoid idle resources.
  4. Watch limits: Note the 200-node-per-cluster limit and head-node requirements.
  5. Security and governance: For Fsched clusters, use per-user resource restriction, SSH login restriction, and release protection to strengthen compliance and security.

After you complete all configurations, submit the creation request. The system automatically allocates resources, deploys software, and initializes the scheduler. You can monitor creation progress in the cluster list.