Version: FCP 25.11

Alert Service

Tip: The Idle shutdown automation is available only after Hybrid Cloud is enabled in FCP-Suite.

Alert Policies

Limit: You can create up to 1000 alert policies.

Field descriptions

  • Policy name: Required. 1 to 40 characters. Must start with a letter. Can include numbers, _, and -.
  • Object: Required. Select a cluster or the system platform.
    • Regular users can create alerts only for clusters visible to them (clusters they created or clusters shared with them).
    • Only running clusters can be selected. Non-running clusters may be shown in the list but cannot be selected.
  • Type: Required. Host, Service, or Scheduler (default: Host).
    • Host: Select one or more nodes for a cluster/system platform. For file systems, no node selection is required.
    • Service: No specific node selection is required; the system monitors services on all nodes for the selected object.
    • Scheduler: Available only when the object is an Fsched cluster. Scheduler-type policies apply to the Fsched scheduler.
  • Nodes: Required.
    • If the object is a cluster:
      • All nodes: Default. When selected, you cannot select other nodes.
      • Or select one or more specific nodes (head, login, compute).
    • If the object is the system platform:
      • All nodes: Default. When selected, you cannot select other nodes.
      • Or select one or more specific nodes (for example, all-in-one is one node; all-in-two is two nodes). Only running nodes can be selected.
  • Partitions: Required only for Scheduler type. Select one, multiple, or all partitions in the cluster.
  • Severity: Required. Notification, Warning, or Critical.
  • Sampling interval: Required. The interval at which the system samples data and computes the average for that interval. Unit: minutes. Range: 1 to 1,000,000.
  • Consecutive periods: Required. How many consecutive periods must meet the rule condition before an alert is triggered. Unit: times. Range: 1 to 1,000,000.
  • Silence period: Required. How often notifications are resent while the alert remains unrecovered. Default: 24 hours. Options: 5/15/30 minutes, 1/3/6/12/24 hours.
  • Status:
    • Enabled: Default. Policy is effective, notifications are sent, and alert records are generated.
    • Disabled: No notifications and no alert records.
  • User: User who created the policy.
  • Actions:
    • Delete: Available in any status. Requires confirmation. Deleting a policy also deletes all alert records generated by it.
    • Edit: All fields are editable except policy name, object, and nodes.
    • Enable/Disable: Toggle policy status.
    • Bulk actions: Delete, Enable, Disable.
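
The field constraints above can be checked before a policy is submitted. The following is a minimal sketch, not the platform's own validation: the function name `validate_policy` and its parameters are hypothetical, and it assumes that "letter" means an ASCII letter.

```python
import re

# Policy name: 1 to 40 characters, starts with a letter, then letters, digits, "_" or "-".
# Assumption: "letter" means ASCII A-Z / a-z.
POLICY_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,39}")

SEVERITIES = {"Notification", "Warning", "Critical"}

# Silence period options from the field list, expressed in minutes
# (5/15/30 minutes, 1/3/6/12/24 hours).
SILENCE_PERIOD_MINUTES = {5, 15, 30, 60, 180, 360, 720, 1440}


def validate_policy(name, severity, sampling_interval_min,
                    consecutive_periods, silence_period_min=1440):
    """Return a list of problems; an empty list means the fields look valid."""
    problems = []
    if not POLICY_NAME_RE.fullmatch(name):
        problems.append("policy name: 1-40 chars, starts with a letter, only letters/digits/_/-")
    if severity not in SEVERITIES:
        problems.append("severity: must be Notification, Warning, or Critical")
    if not 1 <= sampling_interval_min <= 1_000_000:
        problems.append("sampling interval: must be 1 to 1,000,000 minutes")
    if not 1 <= consecutive_periods <= 1_000_000:
        problems.append("consecutive periods: must be 1 to 1,000,000")
    if silence_period_min not in SILENCE_PERIOD_MINUTES:
        problems.append("silence period: not one of the listed options")
    return problems
```

For example, `validate_policy("gpu-idle_01", "Warning", 5, 3)` returns an empty list, while a name that starts with a digit is reported as a problem.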

Notes

  • Releasing a cluster automatically disables all alert policies associated with that cluster.
  • If a partition is released or nodes are removed/powered off, related alert checks may produce no data.
    • If the policy does not include a node running status rule, the missing data does not generate notifications or alert records.
    • If the policy includes a node running status rule, the missing data generates notifications and alert records as usual.

Alert behavior

  • Send notifications: Yes/No.
  • Notification list:
    • Email: Shows email address and username.
    • WeCom: Shows the WeCom robot ID and remarks.
  • Automation:
    • Idle shutdown: When enabled, the system performs an automatic shutdown when the rule is triggered.
      • This automation is available for clusters.
      • To use it, the policy's alert rule must use the CPU usage metric with the < condition.
      • After configuration: if CPU usage stays below the threshold for N minutes, the system shuts down the target.
      • N = sampling interval (minutes) × consecutive periods (times).
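
As a worked example of the formula for N, the sketch below computes the idle window from the two policy fields. The function name is hypothetical and used only for illustration.

```python
def idle_shutdown_window_minutes(sampling_interval_min: int, consecutive_periods: int) -> int:
    """N = sampling interval (minutes) x consecutive periods (times)."""
    return sampling_interval_min * consecutive_periods


# A 5-minute sampling interval with 6 consecutive periods:
# CPU usage must stay below the threshold for 30 minutes before shutdown.
print(idle_shutdown_window_minutes(5, 6))  # 30
```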

Alert Rules

When any rule matches, the policy is considered triggered.

Limits:

  1. You cannot add two identical monitoring items.
  2. You can add up to 8 monitoring items.
  3. There is always one default rule and it cannot be deleted.
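
The three limits can be expressed as a small check over the metrics in a policy. This is a sketch under one assumption that the page does not confirm: two monitoring items count as "identical" when they use the same metric name. The function name `check_rule_set` is hypothetical.

```python
MAX_MONITORING_ITEMS = 8


def check_rule_set(metrics):
    """metrics: list of metric names, one per rule. Returns a list of problems."""
    problems = []
    if len(metrics) < 1:
        # The default rule cannot be deleted, so a policy always has at least one rule.
        problems.append("a policy needs at least one rule")
    if len(metrics) > MAX_MONITORING_ITEMS:
        problems.append("a policy can have at most 8 monitoring items")
    # Assumption: "identical" means the same metric name appears more than once.
    duplicates = sorted({m for m in metrics if metrics.count(m) > 1})
    if duplicates:
        problems.append("duplicate monitoring items: " + ", ".join(duplicates))
    return problems
```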

Monitoring items (Host)

| Metric | Condition | Threshold | Unit |
| --- | --- | --- | --- |
| CPU usage | >, >=, <, <=, =, != | 1 to 100 | % |
| Memory usage | >, >=, <, <=, =, != | 1 to 100 | % |
| Node running status | = | Normal / Abnormal | - |
| Disk usage | >, >=, <, <=, =, != | 1 to 100 | % |
| Inbound traffic | >, >=, <, <=, =, != | 1 to 100000000 | kb/s |
| Outbound traffic | >, >=, <, <=, =, != | 1 to 100000000 | kb/s |
| Disk I/O write | >, >=, <, <=, =, != | 1 to 100000000 | kb/s |
| Disk I/O read | >, >=, <, <=, =, != | 1 to 100000000 | kb/s |
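
To illustrate how a condition, a threshold, and the Consecutive periods field combine, here is a minimal sketch of the evaluation. It is not the platform's implementation: the function names and the data layout (one list of per-interval averages per metric, oldest first) are assumptions.

```python
import operator

# The six conditions available for numeric Host metrics.
OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq, "!=": operator.ne,
}


def rule_matches(averages, condition, threshold, consecutive_periods):
    """True if the most recent `consecutive_periods` averages all meet the condition."""
    if len(averages) < consecutive_periods:
        return False  # not enough sampling periods collected yet
    compare = OPS[condition]
    recent = averages[-consecutive_periods:]
    return all(compare(value, threshold) for value in recent)


def policy_triggered(rules, samples, consecutive_periods):
    """rules: list of (metric, condition, threshold); samples: metric -> averages.

    The policy is triggered when any one rule matches.
    """
    return any(
        rule_matches(samples.get(metric, []), condition, threshold, consecutive_periods)
        for metric, condition, threshold in rules
    )
```

With `samples = {"CPU usage": [40, 3, 2, 4]}` and the rule `("CPU usage", "<", 5)`, the policy is triggered for 3 consecutive periods but not for 4, because the oldest average (40) does not meet the condition.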

Monitoring items (Service)

Service monitoring checks whether any service component on the selected cluster/system platform is abnormal. If any service becomes abnormal, an alert is triggered.

Monitoring items (Scheduler)

| Metric | Condition | Threshold | Unit |
| --- | --- | --- | --- |
| Scheduler node status | = | Unavailable / Down (default: Down; multi-select supported) | - |
| Job status | = | Running | - |

Metric notes

Scheduler node status metrics are sourced from Cluster Monitoring > Scheduler Monitoring > Node View.

Scheduler node status definitions

The states listed below (alloc, mix, and so on) are scheduler-level node states reported by sinfo.

  • Available = alloc + mix + idle + completing
  • Unavailable (marked unavailable by admin) = drain + resv + maint
  • Down = down + fail + error
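
The grouping above amounts to a lookup table. The sketch below maps a state name to its group using only the states listed here; the function name is hypothetical, and any state not in the list (including states with suffix flags) is returned as "Unknown" rather than guessed.

```python
# Groups exactly as defined in the list above.
NODE_STATUS_GROUPS = {
    "Available": {"alloc", "mix", "idle", "completing"},
    "Unavailable": {"drain", "resv", "maint"},
    "Down": {"down", "fail", "error"},
}


def classify_node_state(sinfo_state: str) -> str:
    """Return the alert-status group for a scheduler node state name."""
    state = sinfo_state.strip().lower()
    for group, states in NODE_STATUS_GROUPS.items():
        if state in states:
            return group
    # Assumption: states not named in the documentation are left unclassified.
    return "Unknown"
```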

Alert Notifications

  • Send notifications: Required. Yes/No.
    • If Yes, email/WeCom/Feishu settings are shown.
    • If No, no notifications are sent. An alert record is still created.

Email

  • Shows a user list. Select one or more users. Users that are already selected are grayed out and cannot be selected again.
  • User list scope:
    • Administrators can see all users and can configure email notifications for any user.
    • Regular users can see only themselves and can configure notifications only for themselves.

Test

Sends a test message to the configured email address or WeCom/Feishu destination.

WeCom

  • Provide the WeCom robot webhook URL and a remark.
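
Before entering a webhook URL, it can help to check it outside the platform. The sketch below builds, but does not send, a text message request, assuming the standard WeCom group-robot text format (`msgtype` plus `text.content`). The function name and the placeholder URL are hypothetical; how the platform itself formats its notifications is not described here.

```python
import json
import urllib.request


def build_wecom_text_request(webhook_url: str, content: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a WeCom text message."""
    payload = {"msgtype": "text", "text": {"content": content}}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical placeholder URL; use the webhook URL issued for your own robot.
request = build_wecom_text_request(
    "https://example.invalid/webhook/send?key=PLACEHOLDER",
    "FCP alert webhook check",
)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```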

Feishu

  • For Feishu robot configuration, see Configure Feishu Robot.
  • Feishu notifications can also be added under Alert behavior.

Alert notification groups

You can create notification groups and bind WeCom/Feishu destinations to the group.

  • Create group:
    • Group name: Required. Globally unique.
    • Description: Optional.
    • Members / WeCom / Feishu: Optional.
    • On creation:
      1. The platform verifies global uniqueness and generates a group ID.
      2. A group record is added to the group list.
      3. The group is mapped to a Linux user group on cluster nodes and membership is synced to cluster nodes.
      4. Group notification methods include member emails and all bound WeCom/Feishu destinations.
  • Group list shows group ID, name, description, user count, bound WeCom/Feishu counts, and creation time.
  • Actions:
    • Edit: Edit description, add users, add WeCom, add Feishu.
    • Delete: Delete the group.

When creating or editing an alert policy, you can select a group. If selected, alerts notify all member emails and all bound WeCom/Feishu destinations.
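
The fan-out from a group to its destinations can be pictured as a union of the member emails and the bound robot destinations. This is an illustrative sketch only: the `NotificationGroup` class, its field names, and the de-duplication step are assumptions, not the platform's data model.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationGroup:
    name: str
    member_emails: list = field(default_factory=list)
    wecom_destinations: list = field(default_factory=list)
    feishu_destinations: list = field(default_factory=list)


def group_destinations(group: NotificationGroup):
    """Return (channel, address) pairs for a group, de-duplicated, order preserved."""
    seen = set()
    result = []
    channels = (
        ("email", group.member_emails),
        ("wecom", group.wecom_destinations),
        ("feishu", group.feishu_destinations),
    )
    for channel, addresses in channels:
        for address in addresses:
            key = (channel, address)
            if key not in seen:
                seen.add(key)
                result.append(key)
    return result
```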