Custom failure alert mapping
Failure alerts indicate that the system has entered an invalid, undesired, or inconsistent state. Unlike saturation or error alerts, which report operational symptoms, failure alerts describe incorrect configuration or topology, such as:
- Mismatched replica counts
- Incorrect leader/master assignment
- Missing nodes
- Resource configuration violations
- Broken invariants or cluster state inconsistencies
Failure alerts contribute directly to entity health scoring and appear in RCA workbench timelines.
When to create a failure alert
Create a failure alert when:
- Desired and actual state must match (for example, replicas, scaling targets, node roles)
- A known invariant is violated
- A configuration setting makes the system functionally incorrect
- A system component is missing or in the wrong state
- A resource is used incorrectly relative to its design (not merely exhausted)
Required labels
A failure alert must include the following labels:
| Label | Purpose |
|---|---|
| `asserts_alert_category=failure` | Identifies the alert as a system-state failure |
| `asserts_entity_type` | Identifies the type of entity receiving the insight |
| `asserts_severity` | Indicates the impact level (info, warning, critical) |
Recommended:
| Label | Purpose |
|---|---|
| `asserts_env` | Enables accurate entity resolution across environments |
| `asserts_site` | Identifies region or cluster alignment |
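A minimal sketch of a rule carrying both the required and recommended labels; the alert name and expression are illustrative placeholders, not part of any shipped ruleset:

```yaml
# Sketch only: shows label placement, not a real invariant check.
- alert: ExampleFailureAlert
  expr: example_desired_state != example_actual_state
  labels:
    asserts_alert_category: failure   # required
    asserts_entity_type: Service      # required
    asserts_severity: warning         # required: info, warning, or critical
    asserts_env: prod                 # recommended
    asserts_site: us-east-1           # recommended
```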
Best practices to write failure alerts
Use the following best practices to help you write custom failure alerts.
Compare desired vs actual state

```promql
desired_replicas - actual_replicas > 0
```

Use `for:` to reduce flapping

```yaml
for: 2m
```

Preserve scoping labels to aggregate

Failure alerts must retain entity-identifying labels.
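The practices above can be combined in one rule. A sketch assuming kube-state-metrics metric names (`kube_statefulset_replicas`, `kube_statefulset_status_replicas_ready`); the grouping labels follow the required/recommended label set from this page:

```yaml
# Sketch: desired-vs-actual comparison, a `for:` window to reduce flapping,
# and entity-identifying labels preserved in the aggregation.
- alert: StatefulSetReplicaMismatch
  expr: |
    sum by (namespace, statefulset, asserts_env, asserts_site) (
      kube_statefulset_replicas
    )
    != sum by (namespace, statefulset, asserts_env, asserts_site) (
      kube_statefulset_status_replicas_ready
    )
  for: 2m
  labels:
    asserts_alert_category: failure
    asserts_entity_type: Service
    asserts_severity: warning
```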
Handle missing data explicitly
- Use `absent()` when metric disappearance is a failure
- Combine with `up{}` when metric disappearance should be ignored
- Avoid firing solely due to scrape failures
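One way to combine these guidelines is to gate `absent()` on the exporter being reachable, so a scrape failure alone does not fire the alert. The metric name `my_leader_info` and the job value are hypothetical:

```promql
# Sketch: fires only when the target is up but the expected metric is gone.
absent(my_leader_info{job="my-service"})
  and on() (count(up{job="my-service"} == 1) > 0)
```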
Example: Redis master missing
```yaml
# Redis Master Missing
# Note this covers both cluster mode and HA mode, thus we are counting by redis_mode
- alert: RedisMissingMaster
  expr: |-
    count by (job, service, redis_mode, namespace, asserts_env, asserts_site) (
      redis_instance_info{role="master"}
    ) == 0
  for: 1m
  labels:
    asserts_severity: critical
    asserts_entity_type: Service
    asserts_alert_category: failure
```

Example: Replica mismatch
```yaml
alert: DeploymentReplicaMismatch
expr: |
  kube_deployment_spec_replicas{deployment="checkout"}
  != kube_deployment_status_replicas{deployment="checkout"}
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: warning
  asserts_env: prod
annotations:
  summary: 'Replica count mismatch'
  description: 'The checkout deployment has mismatched desired/actual replicas.'
```

Example: Incorrect database connection configuration
```yaml
alert: PostgreSQLHighConnectionsConfigFailure
expr: |
  sum(pg_stat_activity_count{asserts_env!=""}) by (asserts_env, namespace, service)
  > (
    avg(pg_settings_max_connections{asserts_env!=""})
    - avg(pg_settings_superuser_reserved_connections{asserts_env!=""})
  ) * 0.7
labels:
  asserts_alert_category: failure
  asserts_entity_type: Service
  asserts_severity: critical
annotations:
  summary: 'PostgreSQL configuration failure'
  description: 'Active connections are nearing max minus reserved admin slots.'
```

How failure alerts appear in the knowledge graph
When a failure alert fires:
- The affected entity shows a critical or degraded health state
- The alert appears in RCA workbench timeline as a failure insight
- Clearing the condition returns the entity to a healthy state
Failure alerts combine with saturation, anomaly, and error insights to create a full picture of system behavior.
Next steps
- To learn how to create alerts, refer to Configure alert rules
- To learn how to import a YAML file for alert creation, refer to Import to Grafana-managed rules