Threshold Detection

info

Threshold detection alert rules continuously monitor metric data. When a metric value exceeds a configured threshold, the rule automatically triggers an alert and notifies the relevant personnel, enabling proactive operations and maintenance.

Quick Start

Step 1: Enter the Creation Page

Go to Smart Alert → Alert Rules, click New Alert Rule, and select the Threshold Detection type to open the configuration page.

Step 2: Configure Detection Rules

  1. Select the Effective Scope (a resource domain). This field is required and empty by default; choose a resource domain that the current user has permission for.
  2. In Metric Selection, select the metric to monitor (such as CPU usage), set the aggregation method (such as latest value), and add a grouping dimension (such as host).
  3. Set the Detection Interval (such as 5 minutes, meaning each detection queries 5 minutes of metric data).
  4. Configure Trigger Conditions: enter the number of consecutive times the result must exceed the threshold, and enter a threshold for each alert level (Critical / Error / Warning).
  5. Configure the Data Gap and Data Delay strategies as needed.

Step 3: Fill in Alert Content and Save

  1. Fill in the Alert Title, which supports variables, for example: Host Name: ${host.customizedName}, IP Address: ${host.ipv4Address} ${metric} too high
  2. Fill in Notification Content as needed (supports rich text and variables)
  3. Select Notification Strategy
  4. Set Effective Time (All Time / Periodic Time / Custom Time)
  5. Click Save to complete creation

Feature Description

Detection Rules

Basic Configuration

| Field | Required | Description |
| --- | --- | --- |
| Effective Scope | Yes | Select the resource domain to which the alert rule belongs; used to isolate the alert configurations of different resource domains |
| Metric Selection | Yes | Supports two methods: Select Query and PQL Query |
| Aggregation Method | Yes | Aggregates data within the detection interval, such as latest value, average value, etc. |
| Grouping Dimension | No | Queries and calculates separately by the specified dimensions (such as host) |
| Detection Interval | Yes | Time window length for each data query; default 5 minutes |

tip

The detection interval can be set from 1 to 30 minutes, either by choosing one of the quick options or by entering a value manually.
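
To illustrate how the aggregation method and grouping dimension interact within one detection window, here is a minimal sketch. The sample format `(timestamp, host, value)`, the function name `aggregate_window`, and the method names `"latest"` and `"average"` are assumptions made for illustration; they are not the product's API.

```python
from collections import defaultdict
from statistics import mean


def aggregate_window(samples, window_start, window_end, method="latest"):
    """Aggregate samples that fall in [window_start, window_end), one result per host.

    `samples` is a hypothetical list of (timestamp_seconds, host, value) tuples.
    """
    by_host = defaultdict(list)
    for ts, host, value in samples:
        if window_start <= ts < window_end:
            by_host[host].append((ts, value))

    result = {}
    for host, points in by_host.items():
        points.sort()  # order by timestamp so the last entry is the latest
        values = [v for _, v in points]
        if method == "latest":
            result[host] = values[-1]
        elif method == "average":
            result[host] = mean(values)
        else:
            raise ValueError(f"unsupported aggregation method: {method}")
    return result


samples = [
    (0, "host-a", 40.0), (60, "host-a", 95.0), (120, "host-a", 85.0),
    (30, "host-b", 20.0), (400, "host-b", 99.0),  # 400 s is outside the window
]
# A 5-minute window is 300 seconds.
latest = aggregate_window(samples, 0, 300, "latest")
average = aggregate_window(samples, 0, 300, "average")
```

Each host gets its own aggregated value, which is then compared against the thresholds independently.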

Trigger Conditions

| Field | Required | Description |
| --- | --- | --- |
| Continuous Trigger Count | Yes | Trigger an alert only when the detection result exceeds the threshold N consecutive times, avoiding false alarms caused by occasional glitches; default is 1 |
| Comparison Method | Yes | Supports operators such as >, >=, <, <=, = |
| Critical Threshold | No | A metric value exceeding this threshold triggers a critical-level alert |
| Error Threshold | No | A metric value exceeding this threshold triggers an error-level alert |
| Warning Threshold | No | A metric value exceeding this threshold triggers a warning-level alert |
| Medium Threshold | No | A metric value exceeding this threshold triggers a medium-level alert. This level is not displayed by default and can be added |
| Info Threshold | No | A metric value exceeding this threshold triggers an info-level alert. This level is not displayed by default and can be added |
| Normal Recovery Count | Yes | When no events are generated for N consecutive detections, the alert status returns to normal; default is 3 |
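
The interaction between Continuous Trigger Count and Normal Recovery Count can be sketched as a small state machine. This is only an illustration under stated assumptions: the class `ThresholdDetector`, the `"recovered"` marker, and the choice to report the level of the most recent breach are hypothetical, and the product's actual evaluation logic may differ.

```python
import operator

OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
}
LEVEL_ORDER = ["critical", "error", "warning"]  # highest severity first


def match_level(value, comparison, thresholds):
    """Return the most severe level whose threshold condition holds, or None."""
    compare = OPERATORS[comparison]
    for level in LEVEL_ORDER:
        threshold = thresholds.get(level)
        if threshold is not None and compare(value, threshold):
            return level
    return None


class ThresholdDetector:
    def __init__(self, comparison, thresholds, trigger_count=1, recovery_count=3):
        self.comparison = comparison
        self.thresholds = thresholds
        self.trigger_count = trigger_count
        self.recovery_count = recovery_count
        self.breaches = 0   # consecutive detections that matched a level
        self.quiet = 0      # consecutive detections without an event
        self.alerting = False

    def observe(self, value):
        """Process one detection result; return a level, "recovered", or None."""
        level = match_level(value, self.comparison, self.thresholds)
        self.breaches = self.breaches + 1 if level else 0
        if level and self.breaches >= self.trigger_count:
            self.alerting = True
            self.quiet = 0
            return level  # assumption: report the level of the latest breach
        if self.alerting:
            self.quiet += 1
            if self.quiet >= self.recovery_count:
                self.alerting = False
                self.quiet = 0
                return "recovered"
        return None


detector = ThresholdDetector(">", {"critical": 90, "error": 80},
                             trigger_count=3, recovery_count=3)
events = [detector.observe(v) for v in [85, 92, 95, 50, 60, 70, 40]]
```

With a trigger count of 3, the first two breaches produce no event; the third raises the alert, and three quiet detections later the status returns to normal.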

Advanced Configuration

| Field | Required | Description |
| --- | --- | --- |
| Data Gap | No | When enabled, if no data is reported within the specified time, the metric result is treated as 0 and takes part in threshold judgment, preventing missed alerts caused by collection interruptions. Default: Off. When enabled, a data gap can also be set to trigger an alert of a specified level |
| Data Delay | No | When enabled, the query time window is shifted earlier by the specified duration to avoid missed alerts caused by latency in long data pipelines. Default: Enabled, with an offset of 1 minute |
| Aggregation Rule | No | When there are multiple time series under the grouping dimension, defines the aggregation granularity. By default, aggregation is by host and each host alerts independently; with aggregation by network area, each network area generates one alert covering multiple hosts |
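
As a rough illustration of Data Delay and Data Gap, the sketch below shifts the query window earlier by the delay offset and substitutes 0 when a window returns no data. The function names `query_window` and `apply_data_gap` are hypothetical, and the exact window boundaries used by the product are an assumption.

```python
from datetime import datetime, timedelta


def query_window(now, interval_minutes=5, delay_minutes=1):
    """Return (start, end) of the query window, shifted earlier by the delay."""
    end = now - timedelta(minutes=delay_minutes)
    start = end - timedelta(minutes=interval_minutes)
    return start, end


def apply_data_gap(result, gap_enabled):
    """Return the value used for threshold judgment.

    `result` is None when the window contained no data. With Data Gap
    enabled the missing result is treated as 0; otherwise it stays None
    and the detection is skipped (assumed behaviour).
    """
    if result is None and gap_enabled:
        return 0.0
    return result


start, end = query_window(datetime(2024, 1, 1, 12, 0))
gap_value = apply_data_gap(None, gap_enabled=True)
```

Note that a substituted 0 satisfies a condition such as `< 10` but not `> 90`, so whether a gap leads to an alert depends on the comparison configured for the rule.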

Alert Content

| Field | Required | Description |
| --- | --- | --- |
| Alert Title | Yes | The title of the alert event. Supports variables such as ${host.customizedName} and ${metric}; including the object and metric name is recommended for quick identification |
| Notification Content | No | The body of the alert notification. Supports rich text editing and variable interpolation |
| Notification Strategy | No | Select the notification channel and recipient configuration used after the alert is triggered. If none exists, click Create Notification Strategy to create one |
| Labels | No | Tag alert rules for easier filtering and classification |

info

It is recommended to use variables instead of fixed text for alert titles to quickly locate problems in the alert list.

When notification content is left blank, the system default template is used, which includes basic information such as alert ID, time, status, and level, meeting the needs of most scenarios.
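
To show what variable interpolation in a title amounts to, here is a minimal sketch that resolves `${...}` placeholders against a nested dictionary. The `render` function and the context layout are assumptions for illustration; only the variable names `host.customizedName`, `host.ipv4Address`, and `metric` come from this page.

```python
import re

# Matches ${name} and dotted paths such as ${host.customizedName}.
VARIABLE = re.compile(r"\$\{([A-Za-z_][\w.]*)\}")


def render(template, context):
    """Replace ${a.b} placeholders with values looked up in nested dicts."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                return match.group(0)  # leave unknown variables untouched
            value = value[part]
        return str(value)

    return VARIABLE.sub(lookup, template)


title = render(
    "Host Name: ${host.customizedName}, IP Address: ${host.ipv4Address} ${metric} too high",
    {
        "host": {"customizedName": "web-01", "ipv4Address": "10.0.0.5"},
        "metric": "CPU usage",
    },
)
```

In this sketch an unresolved placeholder is left as-is, which makes a misspelled variable name easy to spot in the rendered title.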

Status & Effective Time

| Field | Required | Description |
| --- | --- | --- |
| Effective Time | Yes | All Time (7×24 hours): always effective; Periodic Time: set by workday/weekend cycle; Custom Time: effective only during the specified time periods |
| Start/Stop Status | Yes | Controls whether the alert rule is running. When stopped, the rule pauses detection and generates no alert events. Default: Enabled |
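
A check for whether a rule is effective at a given moment might look like the sketch below. The mode names `"all"` and `"window"`, the function `is_effective`, and the half-open 9:00-18:00 interval are assumptions chosen for illustration rather than the product's configuration model.

```python
from datetime import datetime, time

WORKDAYS = frozenset({0, 1, 2, 3, 4})  # Monday=0 ... Friday=4, per datetime.weekday()


def is_effective(moment, mode="all", days=WORKDAYS,
                 start=time(9, 0), end=time(18, 0)):
    """Return True if the rule should run detection at `moment`."""
    if mode == "all":
        return True  # 7x24: always effective
    if mode == "window":
        return moment.weekday() in days and start <= moment.time() < end
    raise ValueError(f"unknown mode: {mode}")


monday_morning = is_effective(datetime(2024, 1, 1, 10, 0), "window")
monday_night = is_effective(datetime(2024, 1, 1, 20, 0), "window")
saturday = is_effective(datetime(2024, 1, 6, 10, 0), "window")
```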

Common Scenarios

Scenario: Monitoring sustained high CPU load on production servers

Select the CPU usage metric, set the grouping dimension to host, set the critical threshold to 90% and the error threshold to 80%, and set the continuous trigger count to 3 to avoid false alarms from short-lived peaks.

Scenario: Reducing alert sensitivity during non-working hours

Select Custom Time for the effective time and configure the rule to be effective from 9:00 to 18:00 on workdays. No notifications are sent at other times, which reduces disturbances at night.

Scenario: Alerting even when host collection is interrupted

Enable Data Gap so that gap results are treated as 0. Combined with threshold detection, this ensures an alert can still be triggered when the host Agent is abnormal, instead of the rule staying silent.

Notes

warning

Modifying the Detection Interval or the Continuous Trigger Count affects alert response delay: the longer the interval and the higher the count, the longer it takes from the onset of an anomaly to the receipt of a notification. Set both according to how much delay the business can tolerate.
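
A back-of-the-envelope estimate of this delay can be sketched as follows. It assumes that detection runs once per detection interval and that the data delay offset adds directly to the total; neither assumption is confirmed by this page, and notification delivery time is ignored.

```python
def estimated_alert_delay_minutes(interval_minutes, trigger_count, data_delay_minutes=1):
    """Rough upper estimate of minutes from anomaly onset to alert.

    Assumes one detection per interval: the anomaly may start just after a
    detection, so up to `trigger_count` full intervals can pass before the
    required number of consecutive breaches is observed.
    """
    return interval_minutes * trigger_count + data_delay_minutes


default_like = estimated_alert_delay_minutes(5, 1)   # 5-minute interval, count 1
cpu_scenario = estimated_alert_delay_minutes(5, 3)   # 5-minute interval, count 3
```

Under these assumptions, raising the count from 1 to 3 at a 5-minute interval moves the estimate from about 6 minutes to about 16 minutes.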