
Overview

Modern system stability management faces two core challenges: how to discover problems faster and how to ensure problems are handled by the right people in a timely manner. The alert platform is designed to solve these two problems.

Why Do We Need Smart Alerts

As system scale expands, relying solely on manual inspections or simple monitoring dashboards is no longer sufficient:

  • With a large number of metrics, manual monitoring is inefficient and anomalies are easily overlooked
  • Fixed thresholds cannot adapt to natural business fluctuations, leading to frequent false alarms and alert fatigue
  • After an alert is triggered, the notification chain is unclear, responsible parties are not identified, and responses are delayed
  • Alert information is scattered, making it difficult to quickly locate root causes

The alert platform integrates detection, notification, and processing into one, helping teams shift from reactive response to proactive discovery.

What Smart Alerts Can Do

Cover Multiple Anomaly Detection Scenarios

The platform provides multiple detection methods to meet different data types and monitoring needs:

Fixed Threshold Detection suits metrics with clear upper or lower limits (such as CPU usage, memory usage, and error rate). Its rules are simple and direct, and it responds quickly.
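A minimal sketch of how a fixed-threshold rule could be evaluated. The `ThresholdRule` class, its field names, and the "N consecutive breaches" condition (a common way to damp one-off spikes) are illustrative assumptions, not the platform's actual rule schema.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may use (hypothetical rule vocabulary).
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


@dataclass
class ThresholdRule:
    """Hypothetical fixed-threshold rule; field names are illustrative only."""
    metric: str
    op: str               # one of ">", ">=", "<", "<="
    threshold: float
    consecutive: int = 1  # assumed: breaches in a row required before firing

    def evaluate(self, values: list[float]) -> bool:
        """Return True if the last `consecutive` samples all breach the threshold."""
        if len(values) < self.consecutive:
            return False
        compare = _OPS[self.op]
        recent = values[-self.consecutive:]
        return all(compare(v, self.threshold) for v in recent)


rule = ThresholdRule(metric="cpu_usage_percent", op=">", threshold=90.0, consecutive=3)
print(rule.evaluate([85.0, 92.5, 95.1, 91.0]))  # True: last three samples are all > 90
print(rule.evaluate([92.0, 95.0, 88.0]))        # False: 88.0 breaks the streak
```

Requiring several consecutive breaches trades a little detection latency for fewer alerts on momentary spikes; with `consecutive=1` the rule fires on the first breaching sample.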

Log Detection and Event Detection extend alerting to unstructured logs and business event streams. Whether the signal is a sudden surge in error logs or an abnormal key business event, it can be brought into the same unified monitoring system.
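One common shape for a log rule is "count the lines matching a pattern within a time window and alert above a limit". The sketch below assumes that shape; `LogLine`, `log_rule_fires`, and its parameters are hypothetical names, not the platform's query language.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogLine:
    """A single log record (hypothetical shape: timestamp + raw message)."""
    timestamp: datetime
    message: str


def log_rule_fires(lines: list[LogLine], pattern: str, window: timedelta,
                   max_count: int, now: datetime) -> bool:
    """Fire when more than `max_count` lines matching `pattern` fall in (now - window, now]."""
    regex = re.compile(pattern)
    start = now - window
    hits = sum(
        1
        for line in lines
        if start < line.timestamp <= now and regex.search(line.message)
    )
    return hits > max_count


now = datetime(2024, 1, 1, 12, 0)
logs = [LogLine(now - timedelta(seconds=s), "ERROR db timeout") for s in (10, 20, 30)]
logs.append(LogLine(now - timedelta(minutes=10), "ERROR stale, outside the window"))
logs.append(LogLine(now - timedelta(seconds=5), "INFO request ok"))

print(log_rule_fires(logs, r"\bERROR\b", timedelta(minutes=5), max_count=2, now=now))  # True
print(log_rule_fires(logs, r"\bERROR\b", timedelta(minutes=5), max_count=3, now=now))  # False
```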

AI Adaptive Detection automatically learns the historical patterns of metrics, establishes dynamic baselines, and identifies anomalies that truly deviate from normal patterns, significantly reducing false alarm noise caused by business fluctuations.
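The platform's actual AI model is not described here, so the sketch below substitutes a plain statistical baseline to illustrate the idea of a dynamic baseline: each point is compared with the mean ± k standard deviations of the preceding window, so the acceptable band shifts and widens with normal fluctuation. The function name and the `window` and `k` defaults are hypothetical choices.

```python
import statistics


def baseline_anomalies(series: list[float], window: int = 24, k: float = 3.0) -> list[int]:
    """Return indices whose value lies outside mean ± k*stdev of the preceding `window` points.

    A rolling statistical band used as a simple stand-in for a learned baseline.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        std = statistics.pstdev(history)
        # Guard against a perfectly flat history (std == 0): any deviation counts.
        band = k * std if std > 0 else 1e-9
        if abs(series[i] - mean) > band:
            anomalies.append(i)
    return anomalies


# A metric that normally oscillates between 10 and 12, then spikes to 30.
series = [10.0, 12.0] * 12 + [11.0, 30.0]
print(baseline_anomalies(series))  # [25]
```

A fixed threshold of, say, 11 would fire on every other sample of this series; the rolling band treats the 10/12 oscillation as normal and flags only the spike.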

AI Forecast Detection uses trend prediction to warn before a problem actually occurs. This leaves a sufficient response window for capacity expansion and fault prevention, turning reactive response into proactive intervention.
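To illustrate how a trend forecast yields an early warning, the sketch below fits a straight line by ordinary least squares and estimates how many sampling steps remain before the metric reaches a limit. The linear model, the evenly spaced samples, and the name `steps_until_breach` are assumptions; the platform's forecasting model is not specified here.

```python
from typing import Optional


def steps_until_breach(series: list[float], limit: float) -> Optional[float]:
    """Fit y = a + b*t by least squares over samples t = 0..n-1 and return how many
    steps after the last sample the fitted line reaches `limit`.
    Returns 0.0 if the fitted value is already at or above `limit`, and None if the
    trend is flat or falling, or there are fewer than two samples."""
    n = len(series)
    if n < 2:
        return None
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    fitted_last = intercept + slope * (n - 1)
    if fitted_last >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - fitted_last) / slope


# Disk usage (%) sampled once per hour, climbing 2 points per hour.
disk_usage = [50.0, 52.0, 54.0, 56.0, 58.0]
print(steps_until_breach(disk_usage, limit=90.0))  # 16.0
```

A forecast rule could then raise an early warning whenever the returned value falls inside a chosen horizon (for example, 24 steps); that horizon, like the linear model, is an assumption for illustration.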

Ensure Alerts Are Delivered and Handled

Whether alerts can be seen by the right people in a timely manner directly determines the Mean Time To Recovery (MTTR). The platform ensures this through flexible notification strategies:

  • Pushes to multiple channels (email, DingTalk, WeChat Work, etc.) in parallel to reduce the risk of missed notifications
  • Routes notifications to different teams by alert level or type, so unrelated personnel are not repeatedly disturbed
  • Escalation automatically notifies superiors when an alert is not responded to in time, providing a fallback guarantee
  • Repeat reminders keep notifying while an alert remains unrecovered, so it is not forgotten
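The four mechanisms above can be combined into a single schedule. The sketch below computes which notifications an open alert would have produced so far; `NotificationPolicy`, its fields, the `P1`/`P3` levels, and the recipient names are all hypothetical and do not reflect the platform's real configuration schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotificationPolicy:
    """Hypothetical policy shape; field names are illustrative, not the platform's schema."""
    routes: dict[str, list[str]]   # alert level -> recipients (level-based routing)
    channels: list[str]            # channels pushed in parallel
    escalate_after_min: int        # escalate if unacknowledged by this minute
    escalate_to: list[str]         # fallback recipients (e.g. superiors)
    repeat_every_min: int          # repeat reminder interval while unrecovered (> 0)


def notification_schedule(policy: NotificationPolicy, level: str, minutes_open: int,
                          acked_at_min: Optional[int]) -> list[tuple[int, str, str]]:
    """List (minute, recipient, channel) sends for an alert unrecovered for `minutes_open` minutes."""
    sends = []
    recipients = policy.routes.get(level, [])
    # Initial push at minute 0, then a repeat reminder every interval until recovery.
    for minute in range(0, minutes_open + 1, policy.repeat_every_min):
        for person in recipients:
            for channel in policy.channels:
                sends.append((minute, person, channel))
    # One-time escalation if nobody acknowledged before the deadline.
    deadline = policy.escalate_after_min
    unacked = acked_at_min is None or acked_at_min > deadline
    if unacked and minutes_open >= deadline:
        for person in policy.escalate_to:
            for channel in policy.channels:
                sends.append((deadline, person, channel))
    return sorted(sends)


policy = NotificationPolicy(
    routes={"P1": ["oncall-sre"], "P3": ["app-team"]},
    channels=["email", "dingtalk"],
    escalate_after_min=15,
    escalate_to=["sre-manager"],
    repeat_every_min=10,
)
schedule = notification_schedule(policy, "P1", minutes_open=25, acked_at_min=None)
print(len(schedule))  # 8
```

In this sketch, acknowledgement suppresses escalation but not repeat reminders, consistent with reminders continuing until the alert recovers.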

Support Full Lifecycle Management of Alerts

From alert generation to final closure, the platform provides complete visibility: the alert list aggregates all alerts and supports multi-dimensional filtering and status tracking; the details page brings together detection data, status changes, and trigger events to help locate root causes quickly; and notification records provide a complete delivery audit trail for review and accountability.

Core Modules

| Module | Function |
|---|---|
| Alert Rules | Define detection logic and trigger conditions, supporting five types: threshold, log, event, AI adaptive, and AI forecast |
| Notification Strategy | Define notification channels, recipients, and escalation paths after alerts are triggered |
| Alert List | Aggregate all alerts, providing a unified workbench for viewing, analyzing, and processing |

Document Index

| Document | Description |
|---|---|
| Threshold Detection Alert Rules | Configure alert rules for metric data based on fixed thresholds |
| Log Alert Rules | Configure alert rules based on log queries and statistical results |
| Event Alert Rules | Configure alert rules based on structured event data |
| AI Adaptive Alert Rules | Detect metric anomalies based on AI baselines |
| AI Forecast Alert Rules | Discover future risks in advance based on AI predictions |
| Alert List | View, analyze, and process triggered alerts |
| Notification Strategy | Configure alert notification channels, recipients, and frequency |