# Alerting & On-Call
Configure effective alerting and on-call management for production systems.
## When to Use This Skill
Use this skill when:
- Setting up alerting rules and thresholds
- Configuring on-call rotations and schedules
- Implementing alert routing and escalation
- Reducing alert fatigue
- Managing incident response workflows
## Prerequisites
- Monitoring system (Prometheus, Datadog, etc.)
- On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
- Communication channels (Slack, email)
## Alerting Best Practices
### Alert Categories
```yaml
# Severity levels
critical:
  examples:
    - Service completely down
    - Data loss imminent
    - Security breach
  response: Immediate page, wake people up

high:
  examples:
    - Service degraded significantly
    - Error rate above SLO
    - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  examples:
    - Performance degradation
    - Non-critical component failure
    - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  examples:
    - Informational alerts
    - Capacity planning triggers
    - Routine maintenance needed
  response: Email notification, weekly review
```
### Alert Design Principles
```yaml
# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps

  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior

  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance

  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```
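The "symptoms, not causes" principle is easiest to see as a pair of Prometheus rules. The metric names and thresholds below are illustrative, not taken from a real exporter: prefer the first alert, which fires on the error ratio users actually experience, over the second, which fires on an internal cause that may never affect anyone.

```yaml
# Prefer this: a symptom users actually feel
- alert: CheckoutErrorBudgetBurn
  expr: |
    sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.02
  for: 10m

# Over this: a cause that may never become user-visible
- alert: HighCPU
  expr: avg(node_cpu_utilization) > 0.9
  for: 10m
```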
## Prometheus Alerting
### Alert Rules
```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```
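Alert expressions are easy to get subtly wrong, so it pays to unit-test them with `promtool test rules`. A minimal test for the `ServiceDown` rule above might look like this; the file layout and label values are assumptions:

```yaml
# prometheus/rules/alerts_test.yml
# Run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # myapp reports down for three consecutive scrapes
    input_series:
      - series: 'up{job="myapp", instance="web-1:9090"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: myapp
              instance: web-1:9090
```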
### Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```
## PagerDuty

### Service Configuration
```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400 # 4 hours
  acknowledgement_timeout = 600   # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```
### Schedule Configuration
```hcl
resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800 # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "weekly_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 7 # Sunday (ISO 8601)
    }
  }
}
```
## Grafana OnCall
### Integration Setup
```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```
### Escalation Chain
```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0

      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m

      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m

      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```
## Alert Templates
### Slack Alert Template
```go
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
```
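Defining templates is not enough on its own: a receiver in `alertmanager.yml` has to reference them by name before they are rendered. The receiver name and channel below are placeholders:

```yaml
receivers:
  - name: 'slack-templated'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
```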
### PagerDuty Alert Template

```go
{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
```
## On-Call Best Practices
### Rotation Guidelines
```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"

  responsibilities:
    - Monitor alerts during shift
    - Respond within SLA (critical: 5min, high: 15min)
    - Document incidents
    - Hand off unresolved issues

  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents

  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```
### Runbook Template
````markdown
# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps
1. Check service logs: `kubectl logs -l app=myapp -n production`
2. Review recent deployments: `kubectl rollout history deployment/myapp`
3. Check database connectivity: `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in APM dashboard

## Remediation
### If caused by recent deployment:
```bash
kubectl rollout undo deployment/myapp -n production
```

### If caused by database issues:
```bash
kubectl delete pod -l app=postgres -n production
```

## Escalation
If not resolved within 15 minutes, escalate to:
- Database team: @db-oncall
- Platform team: @platform-oncall
````
## Alert Fatigue Reduction
### Strategies
```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation

  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows

  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling

  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```
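The auto-remediation strategy can be sketched as a webhook receiver: Alertmanager POSTs a JSON payload of firing and resolved alerts to any URL configured under `webhook_configs`, and a small service can map alert names to remediation actions. This is a minimal sketch of the payload handling only; the `REMEDIATIONS` map, its alert name, and the pod label are hypothetical:

```python
import json

# Map alertname -> remediation action. These entries are illustrative only;
# real actions would call kubectl, an API, or a job runner.
REMEDIATIONS = {
    "HighMemoryUsage": lambda labels: f"restart pod {labels.get('pod', 'unknown')}",
}

def handle_webhook(payload: str) -> list:
    """Parse an Alertmanager webhook payload and return the remediation
    actions that would run for currently firing alerts."""
    body = json.loads(payload)
    actions = []
    for alert in body.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore resolved notifications
        labels = alert.get("labels", {})
        remediation = REMEDIATIONS.get(labels.get("alertname"))
        if remediation is not None:
            actions.append(remediation(labels))
    return actions
```

Keep auto-remediation for well-understood, repetitive failures only; anything novel should still page a human.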
## Common Issues
### Issue: Alert Storm
**Problem:** Too many alerts firing simultaneously
**Solution:** Implement proper grouping and inhibition rules

### Issue: Missed Alerts
**Problem:** Critical alerts not reaching on-call
**Solution:** Test escalation policies, verify contact methods

### Issue: False Positives
**Problem:** Alerts firing without actual issues
**Solution:** Tune thresholds, increase evaluation windows
## Best Practices
- Define clear severity levels
- Every alert needs a runbook
- Test on-call notifications regularly
- Review and tune alerts weekly
- Implement proper escalation paths
- Use alert grouping and inhibition
- Track alert metrics (MTTR, frequency)
- Practice incident response regularly
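Tracking alert metrics such as MTTR can start very simply: export triggered/resolved timestamps from the on-call platform and average the resolution time. A minimal sketch, assuming incidents arrive as `(triggered_at, resolved_at)` pairs:

```python
from datetime import datetime, timedelta

def mean_time_to_resolve(incidents):
    """MTTR = average of (resolved_at - triggered_at) over a set of incidents.

    incidents: iterable of (triggered_at, resolved_at) datetime pairs.
    """
    incidents = list(incidents)
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((resolved - triggered for triggered, resolved in incidents),
                timedelta())
    return total / len(incidents)
```

Trend this weekly alongside alert frequency per rule to spot both slow responses and noisy alerts.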