Prometheus Alerting Rules: Stop Wasting Your Sleep on Bad Alerts

Let’s talk about Prometheus alerting rules.

They look simple enough — just a PromQL expression, right? But in production, bad alerting rules are worse than no alerts at all. Getting woken up at 3 AM by a false alarm? Yeah, we’ve all been there.

I once inherited a system with 7 different CPUUsageHigh alert rules. They overlapped, contradicted each other, and had thresholds pulled out of thin air. I deleted every single one and rewrote them from scratch. Here’s what I learned.

The Golden Rule: Alerts ≠ Monitoring

This is the biggest mistake I see. People treat Prometheus like a monitoring dashboard and slap an alert on every metric.

Stop it.

The entire point of an alert is to trigger human action. If there’s nothing to do when the alert fires, it’s noise. Pure noise.

My team has a hard rule: every alert rule MUST have a runbook_url annotation pointing to actual operational docs. If you can’t write a runbook for it, you can’t ship the alert.

Anatomy of a Good Alert Rule

Here’s what NOT to do:

groups:
- name: bad_examples
  rules:
  - alert: HighCPU
    expr: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) > 0.9
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "CPU is high"

Problems with this rule:

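Hardcoded threshold with no context
Only counts mode="user" and averages across every CPU on every machine, so system, iowait, and steal time are ignored and you can’t tell which node is hot
A 1-minute `for` is way too short: transient spikes will page you
The annotation tells you nothing useful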

Here’s the fix:

groups:
- name: node_alerts
  rules:
  - alert: NodeCPUUsageHigh
    expr: |
      (
        1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
      ) > 0.9
    for: 5m
    labels:
      severity: warning
      team: infra
    annotations:
      summary: "Node {{ $labels.instance }} CPU usage > 90%"
      description: "CPU usage on {{ $labels.instance }} has been at {{ $value | humanizePercentage }} for more than 5 minutes."
      runbook_url: "https://wiki.ourteam.com/runbooks/node-high-cpu"

What changed:

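by (instance) preserves the actual machine name
Measuring 1 - idle counts all non-idle time (system, iowait, steal), not just user time
for: 5m filters out noise
severity is warning, not critical: high CPU on its own is degraded service, not an outage
Annotations include instance name and exact value
Added team label for Alertmanager routing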
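
That team label only pays off if Alertmanager actually routes on it. Here’s a minimal sketch of what that could look like. The receiver names (default, infra-oncall) are made up for illustration, and the matchers syntax assumes Alertmanager 0.22 or newer:

```yaml
# alertmanager.yml (sketch; receiver names are hypothetical)
route:
  receiver: default            # fallback for anything no child route matches
  routes:
    - matchers:
        - team = "infra"       # matches the team label set on the rule
      receiver: infra-oncall

receivers:
  - name: default
  - name: infra-oncall
    # add whatever notification config your team actually uses here (email, webhook, ...)
```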

Naming Conventions That Don’t Suck

I’ve seen rules named Alert_1, Alert_2… That’s a signal that nobody cares about the system.

Here’s the naming convention we use:

Component	Pattern	Example
Node	`Node<Metric><Condition>`	`NodeDiskSpaceFull`
Container	`Container<Metric><Condition>`	`ContainerOOMKilled`
Application	`<AppName><Metric><Condition>`	`APIServerLatencyHigh`
Business	`Business<Metric><Condition>`	`BusinessOrderFailureSpike`

Three things your alert name must include: what component + what metric + what’s wrong.
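
To make the pattern concrete, here’s how a rule named NodeDiskSpaceFull might look. This is a sketch, not our production rule: it assumes node_exporter’s filesystem metrics, and the 5% threshold, the fstype filter, and the runbook URL are placeholders.

```yaml
groups:
- name: node_alerts
  rules:
  - alert: NodeDiskSpaceFull          # Node + DiskSpace + Full
    expr: |
      (
        node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
        /
        node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
      ) < 0.05
    for: 10m
    labels:
      severity: critical
      team: infra
    annotations:
      summary: "Disk on {{ $labels.instance }} ({{ $labels.mountpoint }}) is almost full"
      runbook_url: "https://wiki.example.com/runbooks/node-disk-space-full"  # placeholder URL
```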

Severity Levels: Don’t Cry Wolf

If everything is critical, nothing is critical. I’ve seen teams where 90% of alerts were critical. Result? Nobody paid attention, and real criticals got buried.

We use three levels:

Level	Meaning	Response Time	Example
`critical`	Service down or data loss	Immediate, within 15 min	Disk full, API completely dead
`warning`	Degraded quality, still serving	Within 1 hour during work hours	P99 latency doubled but within SLA
`info`	Needs attention, not urgent	None	Certificate expires in 30 days

Key point: Always pair critical with a reasonable `for`. I’ve seen `for: 0s` on critical alerts, where one network hiccup wakes up the entire on-call rotation.
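
Here’s a rough sketch of how the two ends of that scale can map onto rules. It assumes a scrape job called node and the Blackbox exporter’s probe_ssl_earliest_cert_expiry metric. Both are assumptions about your setup, and the rule names and durations are illustrative.

```yaml
groups:
- name: severity_examples
  rules:
  - alert: NodeDown
    expr: up{job="node"} == 0
    for: 3m            # rides out a single failed scrape or network blip
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} has been unreachable for 3 minutes"
  - alert: CertificateExpiresSoon
    # seconds until expiry, compared against 30 days
    expr: probe_ssl_earliest_cert_expiry - time() < 30 * 24 * 3600
    for: 1h
    labels:
      severity: info
    annotations:
      summary: "Certificate for {{ $labels.instance }} expires in under 30 days"
```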

Common Pitfalls (I’ve Hit All of These)

1. Wrong Aggregation Granularity

# Bad: averages the raw counter across the whole cluster, so the value means nothing and you won't know which node
expr: avg(node_cpu_seconds_total) > 0.9

# Good: Aggregate by instance
expr: avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1

2. Forgetting About Counter Resets

Counter metrics (like http_requests_total) reset on process restart. Using rate or increase handles this automatically. But delta? Not so much: delta is meant for gauges, so it reads a counter reset as a huge drop.

# Bad: Process restart triggers false alert
expr: delta(http_requests_total[5m]) < 0

# Good: rate handles resets correctly
expr: rate(http_requests_total[5m]) < 0.01
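
When what you care about is failures rather than raw traffic, a ratio of two rates is usually a better signal, and it stays reset-safe. A sketch, assuming your http_requests_total carries a status label with the HTTP code (not every client library names it that way) and that 5% is a sensible threshold for your service:

```yaml
groups:
- name: api_alerts
  rules:
  - alert: APIServerErrorRateHigh
    expr: |
      sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum by (job) (rate(http_requests_total[5m]))
      > 0.05
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.job }} 5xx ratio is {{ $value | humanizePercentage }}"
```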

3. Alert Fatigue

This is the killer. Too many alerts → nobody reads them → real incidents get missed → system goes down.

Fix: Do regular alert audits. Every quarter, I pull the firing history for every rule. If a rule fires frequently but nobody responds to it, either delete it or downgrade its severity.
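
Prometheus exposes every pending and firing alert as the synthetic ALERTS series, so the firing history is queryable. One way to feed the audit is a recording rule like this sketch. The rule name is my own invention, and it counts evaluation samples rather than distinct incidents, so treat it as a rough "time spent firing" ranking:

```yaml
groups:
- name: alert_audit
  rules:
  # Number of samples in the last day where each alert was firing,
  # summed over all of that alert's label sets.
  - record: alertname:alerts_firing:count_over_time1d
    expr: sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[1d]))
```

Rules that sit at the top of that ranking but never led to any action are the first candidates to delete or downgrade.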

Pro Tip: Precompute with Recording Rules

If your alert rules involve complex aggregations, don’t repeat the expensive expression in every alert and dashboard that needs it. Use Recording Rules to compute it once per evaluation cycle and reuse the result:

groups:
- name: recording_rules
  rules:
  - record: instance:node_cpu_utilisation:avg_rate5m
    # Fraction of non-idle CPU time per instance (0 to 1)
    expr: 1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

- name: alert_rules
  rules:
  - alert: NodeCPUUsageHigh
    expr: instance:node_cpu_utilisation:avg_rate5m > 0.9
    for: 5m

Benefits:

Reduces Prometheus CPU load
Cleaner alert rules
Reusable in Grafana dashboards

Test Your Rules. Seriously.

Don’t just write rules and push them. promtool, which ships with Prometheus, has been able to unit-test rules since version 2.5:

# test_rule.yaml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
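      - series: 'node_cpu_seconds_total{instance="node1",mode="idle"}'
        values: '0+0x15'  # idle counter flat for 15 minutes = 100% busy
    alert_rule_test:
      # The rule has for: 5m, so it is still pending at 5m; check once it is firing
      - eval_time: 10m
        alertname: NodeCPUUsageHigh
        exp_alerts:
          # promtool compares the full label and annotation sets
          - exp_labels:
              severity: warning
              team: infra
              instance: node1
            exp_annotations:
              summary: "Node node1 CPU usage > 90%"
              description: "CPU usage on node1 has been at 100% for more than 5 minutes."
              runbook_url: "https://wiki.ourteam.com/runbooks/node-high-cpu"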

Run it:

promtool test rules test_rule.yaml
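
It’s worth testing the quiet path too. This sketch reuses the NodeCPUUsageHigh rule from earlier (assumed to live in alerts.yml) and checks that a three-minute burst does not fire. The file name is hypothetical:

```yaml
# test_no_alert.yaml (hypothetical file name)
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Idle counter: flat for the first 3 minutes (fully busy),
      # then gaining 60s of idle time per minute (fully idle).
      - series: 'node_cpu_seconds_total{instance="node1",mode="idle"}'
        values: '0 0 0 0 60+60x10'
    alert_rule_test:
      # No exp_alerts listed: the test passes only if nothing is firing at 10m.
      - eval_time: 10m
        alertname: NodeCPUUsageHigh
```

Run it the same way, with promtool test rules test_no_alert.yaml. If someone later shortens the for duration to something spike-happy, this is the test that catches it.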

99% of teams skip this step. I guarantee you — untested alert rules will blow up in production.

Best Practices Summary

Practice	Description	Priority
Every alert needs a runbook	Operational docs or don’t ship	⭐⭐⭐⭐⭐
Use `for` wisely	At least 2-5 minutes to filter noise	⭐⭐⭐⭐⭐
Standardize naming	Component + Metric + Condition	⭐⭐⭐⭐
Tier your severity	Don’t make everything critical	⭐⭐⭐⭐⭐
Precompute with Recording Rules	Reduce Prometheus load	⭐⭐⭐⭐
Audit alerts regularly	Prune useless rules	⭐⭐⭐⭐⭐
Test alert rules	Use promtool before deploying	⭐⭐⭐⭐
Never leave annotations empty	Include instance and values	⭐⭐⭐⭐⭐

FAQ

Q: What if I have too many alert rules?

A: Consolidate. Merge similar alerts — for example, combine DiskUsageHigh and InodeUsageHigh into StorageUsageHigh with a description that distinguishes the two. My team went from 120 rules to 45, and alert noise dropped 70%.

Q: How long should the `for` parameter be?

A: Depends on context. Infrastructure alerts (CPU, memory) — 5-10 minutes. Application-level (error rate, latency) — 2-3 minutes. Core principle: Better to be late than wrong.

Q: How do I handle duplicate alerts?

A: Use Alertmanager’s group_by and repeat_interval. Group similar alerts together, and don’t repeat the same alert too often. I set repeat_interval: 4h — if the alert isn’t resolved, remind me again in 4 hours.
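
Roughly what that looks like in alertmanager.yml. Only repeat_interval: 4h is the value mentioned above; the group_by labels, the other timings, and the receiver name are assumptions to tune for your environment:

```yaml
route:
  receiver: default                  # hypothetical receiver name
  group_by: ['alertname', 'team']    # one notification per alert name per team, however many instances fire
  group_wait: 30s        # wait to batch alerts that start together
  group_interval: 5m     # wait before notifying about new alerts joining an existing group
  repeat_interval: 4h    # re-send a still-firing alert every 4 hours

receivers:
  - name: default
```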

Q: How do I set sensible thresholds?

A: Don’t guess. Collect two weeks of baseline data first. Look at P99 and P95 values under normal conditions. Set thresholds based on that data. The worst threshold I ever saw was 5% CPU — the machine was designed to run at 80% load.
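
One way to pull that baseline straight out of Prometheus is a subquery you run ad hoc in the expression browser. This sketch assumes you have at least 14 days of retention, and the 5m subquery step is an arbitrary choice. It’s written as an expr fragment to match the other examples; paste only the expression itself:

```yaml
# P99 of per-instance CPU usage over the last two weeks (swap 0.99 for 0.95 to get P95)
expr: |
  quantile_over_time(
    0.99,
    (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))[14d:5m]
  )
```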

Good alerting rules mean good sleep. Bad ones mean 3 AM pages.

Get them right. Your future self will thank you.

Questions? Drop them in the comments.
