Overview
Modern system stability management faces two core challenges: how to discover problems faster and how to ensure problems are handled by the right people in a timely manner. The alert platform is designed to solve these two problems.
Why Do We Need Smart Alerts
As system scale expands, relying solely on manual inspections or simple monitoring dashboards is no longer sufficient:
- With a large number of metrics, manual monitoring is inefficient and anomalies are easily overlooked
- Fixed thresholds cannot adapt to natural business fluctuations, leading to frequent false alarms and alert fatigue
- After an alert is triggered, the notification chain is unclear, responsible parties are not identified, and responses are delayed
- Alert information is scattered, making it difficult to quickly locate root causes
The alert platform combines detection, notification, and handling in a single workflow, helping teams shift from reactive response to proactive discovery.
What Smart Alerts Can Do
Cover Multiple Anomaly Detection Scenarios
The platform provides multiple detection methods to meet different data types and monitoring needs:
Fixed Threshold Detection is suitable for metrics with clear upper and lower limits (such as CPU, memory, error rate). Its rules are simple and direct, and alerts fire as soon as a limit is crossed.
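As an illustration of the idea only, a fixed-threshold check can be sketched as follows. The `ThresholdRule` class, its fields, and the metric name are hypothetical and do not represent the platform's actual rule schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    """Hypothetical rule: a metric name plus optional upper and lower limits."""
    metric: str
    upper: Optional[float] = None
    lower: Optional[float] = None

    def is_breached(self, value: float) -> bool:
        # A value breaches the rule when it falls outside either configured limit.
        if self.upper is not None and value > self.upper:
            return True
        if self.lower is not None and value < self.lower:
            return True
        return False


# Example: alert when CPU usage rises above 90 percent (illustrative values).
cpu_rule = ThresholdRule(metric="cpu_usage_percent", upper=90.0)
cpu_alert = cpu_rule.is_breached(95.0)
```

Because the limits are static, such a rule is easy to reason about but cannot follow daily or weekly traffic patterns.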
Log Detection and Event Detection extend alert capabilities to unstructured logs and business event streams. Whether the signal is a sudden increase in error logs or an abnormal key business event, it can be incorporated into a unified monitoring system.
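A minimal sketch of how a "sudden increase in error logs" might be detected, assuming logs arrive as plain text lines grouped into consecutive time windows. The function names, the `ERROR` keyword, and the default `ratio` and `min_count` values are illustrative assumptions, not platform settings.

```python
from typing import Iterable


def count_errors(lines: Iterable[str], keyword: str = "ERROR") -> int:
    """Count the log lines that contain the keyword."""
    return sum(1 for line in lines if keyword in line)


def error_spike(previous: Iterable[str], current: Iterable[str],
                ratio: float = 2.0, min_count: int = 5) -> bool:
    """Flag a spike when the current window has at least min_count errors
    and more than ratio times the errors of the previous window."""
    prev_n = count_errors(previous)
    curr_n = count_errors(current)
    return curr_n >= min_count and curr_n > ratio * prev_n


# Illustrative windows: one error before, six errors now.
previous_window = ["INFO request ok", "ERROR timeout"]
current_window = ["ERROR timeout"] * 6
spike = error_spike(previous_window, current_window)
```

The `min_count` floor keeps a jump from zero to one error from being reported as a spike.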
AI Adaptive Detection automatically learns the historical patterns of a metric, establishes a dynamic baseline, and flags only the values that truly deviate from normal behavior, which reduces the false alarms caused by natural business fluctuations.
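The platform's actual model is not described here. As a simple statistical stand-in for the dynamic-baseline idea, the sketch below derives a baseline from recent history (mean and standard deviation) and flags values that fall too far from it. The function name and the `k` multiplier are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """Return True when value lies more than k standard deviations
    from the mean of the recent history."""
    if len(history) < 2:
        # Not enough data to estimate a baseline.
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change is a deviation.
        return value != baseline
    return abs(value - baseline) > k * spread


# Illustrative history of a metric hovering around 100.
recent = [100.0, 102.0, 98.0, 101.0, 99.0]
normal_point = is_anomalous(recent, 103.0)
outlier_point = is_anomalous(recent, 120.0)
```

Unlike a fixed threshold, the acceptable band here moves with the data, so a metric that naturally runs higher at peak hours is judged against its own recent behavior.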
AI Forecast Detection uses trend prediction to raise an early warning before a problem actually occurs, leaving a sufficient response window for capacity expansion and fault prevention and turning reactive response into proactive intervention.
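To make the forecasting idea concrete, the sketch below fits a least-squares line to equally spaced samples and estimates how many sampling steps remain before the trend reaches an upper limit. This linear model is an assumption chosen for simplicity; the platform's prediction method is not specified in this document, and `steps_until_breach` is a hypothetical helper.

```python
from typing import Optional, Sequence


def steps_until_breach(samples: Sequence[float], limit: float) -> Optional[float]:
    """Estimate the number of steps after the last sample at which a fitted
    linear trend reaches limit. Returns None if the trend is not rising."""
    n = len(samples)
    if n < 2:
        return None
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    # Ordinary least-squares slope and intercept for x = 0, 1, ..., n - 1.
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    x_cross = (limit - intercept) / slope
    # Clamp to zero when the fitted line has already reached the limit.
    return max(0.0, x_cross - (n - 1))


# Illustrative disk usage (percent) growing by about 10 points per step.
remaining = steps_until_breach([50.0, 60.0, 70.0, 80.0], limit=100.0)
```

An alert rule built on this idea would fire when the estimated remaining time drops below the lead time needed to add capacity.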
Ensure Alerts Are Delivered and Handled
Whether alerts can be seen by the right people in a timely manner directly determines the Mean Time To Recovery (MTTR). The platform ensures this through flexible notification strategies:
- Pushes notifications over multiple channels in parallel (email, DingTalk, WeChat Work, etc.) to reduce the risk of missed notifications
- Routes notifications to different teams based on alert level or type, so unrelated personnel are not disturbed
- Escalates automatically: when an alert is not responded to in time, the next level of responders is notified as a fallback
- Sends repeat reminders for as long as an alert remains unrecovered, so that open alerts are not forgotten
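The routing, escalation, and repeat-reminder behaviors above can be sketched as a small policy object. Everything in this example is hypothetical: the `NotificationPolicy` class, its field names, the severity labels, and the recipient names are illustrative and do not reflect the platform's configuration format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NotificationPolicy:
    """Hypothetical policy covering routing, escalation, and repeat reminders."""
    routes: Dict[str, List[str]]   # alert level -> recipients
    channels: List[str]            # channels pushed to in parallel
    escalate_after_min: int        # minutes without response before escalating
    escalation_contact: str        # who is added on escalation
    repeat_every_min: int          # reminder interval while unrecovered

    def recipients(self, level: str, minutes_unanswered: int) -> List[str]:
        # Route by alert level, then add the escalation contact if overdue.
        targets = list(self.routes.get(level, []))
        if minutes_unanswered >= self.escalate_after_min:
            targets.append(self.escalation_contact)
        return targets

    def should_remind(self, minutes_open: int) -> bool:
        # Remind at every multiple of the repeat interval while the alert is open.
        return minutes_open > 0 and minutes_open % self.repeat_every_min == 0


policy = NotificationPolicy(
    routes={"critical": ["oncall-sre"], "warning": ["service-owners"]},
    channels=["email", "DingTalk", "WeChat Work"],
    escalate_after_min=15,
    escalation_contact="team-lead",
    repeat_every_min=30,
)
first_wave = policy.recipients("critical", minutes_unanswered=0)
escalated = policy.recipients("critical", minutes_unanswered=20)
```

Keeping routing and escalation in one policy makes it possible to answer "who should know about this alert right now" with a single lookup.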
Support Full Lifecycle Management of Alerts
From alert generation to final closure, the platform provides complete visibility. The alert list aggregates all alerts and supports multi-dimensional filtering and status tracking. The details page brings together detection data, status changes, and trigger events to help locate root causes quickly. Notification records provide a complete delivery audit trail for review and accountability.
Core Modules
| Module | Function |
|---|---|
| Alert Rules | Define detection logic and trigger conditions, supporting five types: threshold, log, event, AI adaptive, and AI forecast |
| Notification Strategy | Define notification channels, recipients, and escalation paths after alerts are triggered |
| Alert List | Aggregate all alerts, providing a unified workbench for viewing, analyzing, and processing |
Document Index
| Document | Description |
|---|---|
| Threshold Detection Alert Rules | Configure alert rules for metric data based on fixed thresholds |
| Log Alert Rules | Configure alert rules based on log queries and statistical results |
| Event Alert Rules | Configure alert rules based on structured event data |
| AI Adaptive Alert Rules | Detect metric anomalies based on AI baselines |
| AI Forecast Alert Rules | Discover future risks in advance based on AI predictions |
| Alert List | View, analyze, and process triggered alerts |
| Notification Strategy | Configure alert notification channels, recipients, and frequency |