Monitor Principles

Overview

A monitor is an active alert mechanism based on periodic data queries. It regularly pulls metrics, logs, or other types of time-series data from data storage according to the user-configured detection interval, and determines whether to generate alert events based on preset trigger conditions.

The working process of the monitor is divided into three core phases:

  • Data Query Phase: Execute metrics or PQL queries at regular intervals according to the detection interval to obtain aggregated numerical results;
  • Event Judgment Phase: Compare the query results with the thresholds in the trigger conditions to determine whether the current cycle meets the alert conditions;
  • Alert Notification Phase: After determining the event level, send notifications to specified channels based on the alert content template and notification strategy.

Detection Rules

Detection rules are the core configuration part of the monitor, defining the data source, query method, detection frequency, and alert trigger logic.

Effective Scope

The effective scope limits the range of resource objects that the monitor watches. In essence, it attaches filtering conditions to the query statement so that detection acts only on the target resources, avoiding the noise and performance overhead of scanning all data.
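The filtering idea can be sketched as follows. This is a minimal illustration, assuming data points are dictionaries carrying a `tags` mapping and that the scope is a set of exact-match tag conditions; the names `in_scope` and `scoped_points` are hypothetical and not part of the product.

```python
def in_scope(tags: dict, scope: dict) -> bool:
    """Return True when every scope condition matches the point's tags exactly."""
    return all(tags.get(key) == value for key, value in scope.items())


def scoped_points(points: list, scope: dict) -> list:
    """Keep only the points whose tags satisfy the effective scope."""
    return [p for p in points if in_scope(p["tags"], scope)]


points = [
    {"tags": {"host": "web-1", "env": "prod"}, "value": 91.0},
    {"tags": {"host": "web-2", "env": "test"}, "value": 97.0},
]
# Only the prod point is evaluated; the test host never reaches the threshold check.
prod_only = scoped_points(points, {"env": "prod"})
```

In a real system the filter would be pushed down into the query itself rather than applied after the data is fetched, which is where the performance benefit comes from.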

Detection Interval

The detection interval determines the data time window covered by each query. For example, if the detection interval is set to 5 minutes, each execution will query data from the past 5 minutes.

Internal Principle: The system does not trigger detection exactly on the hour. Instead, it assigns a random offset (jitter) when the service starts, which spreads out the execution of different monitors and keeps concentrated queries from overwhelming the storage layer.
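A minimal sketch of the jitter idea, assuming the offset is drawn uniformly from one detection period. Seeding the generator with the monitor id is an assumption made only so the example is reproducible; the function names are hypothetical.

```python
import random


def assign_offset(monitor_id: str, period_s: int) -> int:
    """Pick a pseudo-random offset in [0, period_s) for one monitor."""
    rng = random.Random(monitor_id)  # assumption: seeded by id for reproducibility
    return rng.randrange(period_s)


def next_run_times(start_s: int, offset_s: int, period_s: int, count: int) -> list:
    """First `count` execution times: start + offset, then one run per period."""
    first = start_s + offset_s
    return [first + i * period_s for i in range(count)]


# Two monitors with the same 5-minute period generally get different offsets,
# so their queries do not all land on the same second.
offset_a = assign_offset("monitor-a", 300)
offset_b = assign_offset("monitor-b", 300)
runs_a = next_run_times(0, offset_a, 300, 3)
```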

Trigger Conditions

Trigger conditions determine when to generate alert events, including the following core elements:

Continuous Trigger Count

Configuration item: Trigger when the result data > threshold for N consecutive times.

The system executes a query in each detection cycle and compares the result with the threshold. Only when the results of N consecutive cycles meet the conditions will an alert event be truly generated. This design can effectively reduce false alarms caused by data jitter or short-term peaks.

  • N = 1: Alert immediately as long as the current cycle meets the conditions, suitable for scenarios with high real-time requirements;
  • N > 1: Alert only when multiple consecutive cycles meet the conditions, suitable for continuous anomaly detection, which can filter out transient jitter.
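The consecutive-count rule can be sketched as a small streak counter. This is an illustrative model under the assumption that one value is produced per detection cycle; the class name `ConsecutiveTrigger` is hypothetical.

```python
class ConsecutiveTrigger:
    """Fire only after `n` consecutive cycles with value > threshold."""

    def __init__(self, threshold: float, n: int) -> None:
        if n < 1:
            raise ValueError("n must be at least 1")
        self.threshold = threshold
        self.n = n
        self.streak = 0

    def observe(self, value: float) -> bool:
        """Feed one cycle's result; return True when the alert condition holds."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # any non-breaching cycle resets the streak
        return self.streak >= self.n


trigger = ConsecutiveTrigger(threshold=80, n=3)
results = [trigger.observe(v) for v in [90, 95, 70, 85, 90, 99]]
# The dip to 70 resets the streak, so only the final cycle fires.
```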

Event Level Thresholds

Level | Description
Critical | Highest severity level, usually corresponding to complete service unavailability or extremely abnormal data indicators, requiring immediate response.
Error | Service degradation or core indicators exceeding dangerous thresholds, requiring prompt handling.
Warning | Indicators deviate from normal range but have not yet caused serious impact, requiring attention.
Medium | Slight abnormalities in indicators or potential risks, with no obvious impact on business, can be followed up in normal workflow.
Info | Informational notification, used to record system status changes or expected trigger events, no immediate handling required, only for reference and record keeping.
Ok | System returns to normal. When no alert events are generated for N consecutive detections, the status automatically returns to normal.

Multi-level Threshold Execution Logic: The system evaluates levels in the priority order "Critical → Error → Warning". Once the conditions for a level are met, only the event for that level is generated, even if lower-level conditions are also met, so no duplicate triggers occur.
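A minimal sketch of this first-match evaluation, assuming each level has a simple "value > threshold" condition. The function name and the dictionary layout are illustrative assumptions.

```python
from typing import Optional

# Priority order as described in the text: highest severity first.
PRIORITY = ["critical", "error", "warning"]


def evaluate_level(value: float, thresholds: dict) -> Optional[str]:
    """Return the highest-priority level whose threshold is exceeded, or None."""
    for level in PRIORITY:
        limit = thresholds.get(level)
        if limit is not None and value > limit:
            return level  # stop at the first match; lower levels are not emitted
    return None


thresholds = {"critical": 95, "error": 90, "warning": 80}
level = evaluate_level(97, thresholds)  # exceeds all three, only "critical" is returned
```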

Normal (Ok) Recovery Logic: When no alert conditions are triggered for N consecutive detections, the system automatically generates a "Normal" event, representing that the monitoring item has recovered from the abnormal state, which can be used to trigger recovery notifications.
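The recovery rule can be sketched as a counter of quiet cycles. This is an illustrative model, assuming a "Normal" event is only meaningful after an alert has occurred; the class name `RecoveryTracker` is hypothetical.

```python
from typing import Optional


class RecoveryTracker:
    """Emit "ok" after `n` consecutive cycles without an alert condition."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.alerting = False
        self.quiet = 0

    def observe(self, condition_met: bool) -> Optional[str]:
        if condition_met:
            self.alerting = True
            self.quiet = 0  # a new breach restarts the recovery count
            return None
        if not self.alerting:
            return None  # nothing to recover from
        self.quiet += 1
        if self.quiet >= self.n:
            self.alerting = False
            self.quiet = 0
            return "ok"
        return None


tracker = RecoveryTracker(n=2)
events = [tracker.observe(c) for c in [True, False, True, False, False, False]]
```

Here the single quiet cycle after the first breach is not enough to recover; the "ok" event appears only after two quiet cycles in a row.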

Data Gap

The Data Gap (No Data) switch is used to handle situations where query results are empty due to interrupted data collection or reporting delays.

Status | Description | Remarks
Off (Default) | When query results are empty, no events are generated, and the monitor remains silent. | Default behavior
On | If no data is received within the specified detection interval (empty query results), generate a "Data Gap" event and trigger alert notification. It also supports setting gap data to 0 for threshold comparison. | Used to monitor collection link health

Data Gap Judgment Logic

The system judges a data gap on one core basis: within a complete detection interval, no data points matching the query conditions have been written to the database.

  • If data delay compensation is enabled, the system first shifts the intended query time window earlier by the configured number of minutes, and then checks whether that shifted window contains data.
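The switch behavior can be sketched as a single per-cycle decision. This is a simplified model: averaging the points stands in for whatever aggregation the query uses, and the function and parameter names are assumptions made for illustration.

```python
from typing import Optional, Sequence


def judge_cycle(points: Sequence[float], threshold: float,
                gap_enabled: bool = False, fill_zero: bool = False) -> Optional[str]:
    """Decide the outcome of one detection cycle, including the empty-result case."""
    if not points:
        if not gap_enabled:
            return None        # switch off: stay silent on empty results
        if not fill_zero:
            return "data_gap"  # switch on: report the gap itself
        points = [0.0]         # switch on with zero fill: compare 0 to the threshold
    value = sum(points) / len(points)  # assumption: mean as the aggregated result
    return "alert" if value > threshold else None


silent = judge_cycle([], threshold=10)
gap = judge_cycle([], threshold=10, gap_enabled=True)
filled = judge_cycle([], threshold=10, gap_enabled=True, fill_zero=True)
```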

Data Delay

When there is a delay in data reporting from the collection end to the storage layer (such as Agent batch reporting, network jitter, Pipeline processing time, etc.), directly querying the current time window may lead to incomplete data, which in turn may falsely trigger "Data Gap" events or cause inaccurate threshold judgments.

After "Data Delay" is enabled, the system shifts the query time window earlier by a specified number of minutes (such as 1 minute), so that the window falls within a historical interval where data has been stably stored, thereby avoiding false alarms caused by data delay.

Example: The detection interval is 5 minutes and delay compensation is set to 1 minute. Instead of querying data from "the past 5 minutes", the monitor queries data from "6 minutes ago to 1 minute ago", which ensures the data in this window has been completely written.
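The window arithmetic in this example can be written out as a short sketch. The function name `query_window` and its parameters are illustrative assumptions.

```python
from datetime import datetime, timedelta


def query_window(now: datetime, interval_min: int, delay_min: int = 0):
    """Return (start, end) of the query window, shifted earlier by delay_min."""
    end = now - timedelta(minutes=delay_min)
    start = end - timedelta(minutes=interval_min)
    return start, end


now = datetime(2024, 1, 1, 12, 0)
# 5-minute interval with a 1-minute delay: 11:54 to 11:59 instead of 11:55 to 12:00.
start, end = query_window(now, interval_min=5, delay_min=1)
```

The window length stays equal to the detection interval; only its position moves.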

Aggregation Rules

Aggregation rules define how to merge multiple alert events generated by the same monitor within the same time period into a single notification to reduce alert noise. For example: when multiple interfaces on the same service are abnormal at the same time, they can be aggregated into a single alert notification.
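A minimal sketch of this kind of merging, assuming events are grouped by a shared field (here `service`) and by fixed-size time buckets. The field names, the bucketing scheme, and the function name are assumptions made for illustration, not the product's actual aggregation rules.

```python
from collections import defaultdict


def aggregate(events: list, window_s: int, key: str = "service") -> list:
    """Merge events that share `key` and fall into the same time bucket."""
    groups = defaultdict(list)
    for event in events:
        bucket = event["ts"] // window_s  # index of the fixed-size window
        groups[(event[key], bucket)].append(event)
    return [
        {"key": k, "bucket_start": b * window_s, "count": len(groups[(k, b)])}
        for (k, b) in sorted(groups)
    ]


events = [
    {"ts": 10, "service": "api", "interface": "/a"},
    {"ts": 20, "service": "api", "interface": "/b"},
    {"ts": 30, "service": "db", "interface": "query"},
    {"ts": 70, "service": "api", "interface": "/a"},
]
notifications = aggregate(events, window_s=60)
```

The two `api` events in the first minute collapse into one notification, while the later `api` event and the `db` event each produce their own.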

Detailed Explanation of Built-in Execution Principles

Detection Trigger Time

Monitors are not triggered strictly on the hour. Instead, a random offset (jitter) is calculated for each monitor during service initialization, and the monitor then executes on a fixed cycle from that point. This mechanism spreads out the execution times of a large number of monitors and avoids flooding the database with concurrent queries at the same moment.

Detection Range Calibration (Data Integrity Guarantee)

Time-series data usually has a delay of several seconds to several minutes from generation to writing to the database (collection, transmission, Pipeline processing). If the monitor executes the query just before the data is written, it may read incomplete data, leading to misjudgment.

Current Solution

The system automatically moves the query window back by a fixed safety offset (usually 1-2 minutes) so that the query falls within a historical period where data has been stably written. For example, when the detection interval is 5 minutes, the query actually covers "6 minutes ago to 1 minute ago" relative to the current time, not the latest 5 minutes.

The system enables a 1-minute offset by default. If the user manually turns off "Data Delay", the offset will be ignored.

Data Gap and Data Recovery Events

Data gap and data recovery are two special types of monitoring events used to monitor the health status of data reporting links:

Event Type | Trigger Condition | Remarks
Data Gap Event | When the query result is empty within a certain detection window (i.e., no data is written during that time period), and the data gap switch is enabled, the system generates this event. It means there may be a failure in the data collection or reporting link. | Data gap switch needs to be enabled
Data Recovery Event | After a data gap occurs, once valid data is queried in the next detection cycle (i.e., data starts reporting again), the system automatically generates this event, indicating that the link has returned to normal. | Automatically triggered

Note: When the "Data Gap" switch is not enabled, an empty query result will not trigger any event, and the monitor will silently skip this cycle.
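The relationship between the two event types can be sketched as a small state tracker. This is an illustrative model: it emits a gap event for every empty window, which is an assumption, since the text does not say whether consecutive empty windows produce repeated events. The class name `GapTracker` is hypothetical.

```python
from typing import Optional


class GapTracker:
    """Track data gap and data recovery events across detection cycles."""

    def __init__(self, gap_enabled: bool) -> None:
        self.gap_enabled = gap_enabled
        self.in_gap = False

    def observe(self, has_data: bool) -> Optional[str]:
        if not self.gap_enabled:
            return None  # switch off: empty results are silently skipped
        if not has_data:
            self.in_gap = True
            return "data_gap"  # assumption: one event per empty window
        if self.in_gap:
            self.in_gap = False
            return "data_recovery"  # first cycle with data after a gap
        return None


tracker = GapTracker(gap_enabled=True)
timeline = [tracker.observe(d) for d in [True, False, False, True, True]]
```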

Common Questions

Q1: Why is it that after configuring rules, sometimes it takes a long time to receive alerts?

The reasons may be as follows:

  • The trigger condition is set to alert only after N consecutive times (e.g., N=3), so the condition must hold for 3 consecutive cycles (for example, 3 minutes when the detection interval is 1 minute) before an alert is generated;
  • The data delay offset is large, so each detection actually evaluates historical data from several minutes ago.
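The combined effect of these two factors can be estimated with simple arithmetic. This is a rough sketch that assumes one check per detection interval and ignores jitter, query time, and notification delivery time; the function name is hypothetical.

```python
def estimated_alert_delay_min(n: int, interval_min: float, delay_min: float = 0.0) -> float:
    """Rough time from the start of an anomaly to the alert event, in minutes.

    n cycles are needed to accumulate the consecutive count, and every cycle
    looks at data that is already delay_min old.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * interval_min + delay_min


# N=3 with a 5-minute interval and a 1-minute delay offset: about 16 minutes.
estimate = estimated_alert_delay_min(n=3, interval_min=5, delay_min=1)
```

Lowering N or the detection interval shortens this estimate, at the cost of more sensitivity to transient jitter.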

Q2: The data clearly exists, why is the "Data Gap" alert triggered?

Common reasons:

  • Data collection has a large delay, so the data in the window has not yet been written when the query runs. It is recommended to enable "Data Delay" and appropriately increase the offset;
  • The query conditions (effective scope / filtering tags) are too strict, and the actual data does not meet the filtering conditions.

Q3: When multiple alerts are triggered simultaneously, how many notifications will be received?

It depends on the aggregation rule configuration of the notification strategy. When aggregation is not enabled, each event independently triggers a notification; when aggregation is enabled, multiple events within the aggregation window are merged into one or several notifications and sent.

Q4: How to avoid repeated alerts for the same problem (alert fatigue)?

The following configurations are recommended:

  • Set "Repeat Notification Interval" in the notification strategy so that, for example, the same alert is notified at most 3 times every 30 minutes;
  • Set a reasonable consecutive trigger count (N) in the trigger condition to filter out transient jitter;
  • Enable aggregation for low-priority indicators;
  • Use the "Blocking Strategy" feature to suspend notifications for specific monitors during maintenance windows.