# Alerting & On-Call
Configure effective alerting and on-call management for production systems.
## When to Use This Skill
Use this skill when:
- Setting up alerting rules and thresholds
- Configuring on-call rotations and schedules
- Implementing alert routing and escalation
- Reducing alert fatigue
- Managing incident response workflows
## Prerequisites
- Monitoring system (Prometheus, Datadog, etc.)
- On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
- Communication channels (Slack, email)
## Alerting Best Practices
### Alert Categories
```yaml
# Severity levels
critical:
  examples:
    - Service completely down
    - Data loss imminent
    - Security breach
  response: Immediate page, wake people up

high:
  examples:
    - Service degraded significantly
    - Error rate above SLO
    - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  examples:
    - Performance degradation
    - Non-critical component failure
    - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  examples:
    - Informational alerts
    - Capacity planning triggers
    - Routine maintenance needed
  response: Email notification, weekly review
```
### Alert Design Principles
```yaml
# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps

  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior

  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance

  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```
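The "symptoms, not causes" principle is easiest to see as a pair of Prometheus rules. The metric names and thresholds below are illustrative, not taken from a real exporter: prefer the first alert, which fires on the error ratio users actually experience, over the second, which fires on an internal cause that may never affect anyone.

```yaml
# Prefer this: a symptom users actually feel
- alert: CheckoutErrorBudgetBurn
  expr: |
    sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
      / sum(rate(http_requests_total{service="checkout"}[5m])) > 0.02
  for: 10m

# Over this: a cause that may never become user-visible
- alert: HighCPU
  expr: avg(node_cpu_utilization) > 0.9
  for: 10m
```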
## Prometheus Alerting
### Alert Rules
```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```
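Alert expressions are easy to get subtly wrong, so it pays to unit-test them with `promtool test rules`. A minimal test for the `ServiceDown` rule above might look like this; the file layout and label values are assumptions:

```yaml
# prometheus/rules/alerts_test.yml
# Run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # myapp reports down for three consecutive scrapes
    input_series:
      - series: 'up{job="myapp", instance="web-1:9090"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: myapp
              instance: web-1:9090
```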
### Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```
## PagerDuty

### Service Configuration
```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400 # 4 hours
  acknowledgement_timeout = 600   # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```
### Schedule Configuration
```hcl
resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800 # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "weekly_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 7 # Sunday (ISO 8601)
    }
  }
}
```
## Grafana OnCall
### Integration Setup
```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```
### Escalation Chain
```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0

      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m

      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m

      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```
## Alert Templates
### Slack Alert Template
```go
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
```
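Defining templates is not enough on its own: a receiver in `alertmanager.yml` has to reference them by name before they are rendered. The receiver name and channel below are placeholders:

```yaml
receivers:
  - name: 'slack-templated'
    slack_configs:
      - channel: '#alerts'
        title: '{{ template "slack.title" . }}'
        text: '{{ template "slack.text" . }}'
```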
### PagerDuty Alert Template

```go
{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
```
## On-Call Best Practices
### Rotation Guidelines
```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"

  responsibilities:
    - Monitor alerts during shift
    - Respond within SLA (critical: 5min, high: 15min)
    - Document incidents
    - Hand off unresolved issues

  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents

  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```
### Runbook Template
````markdown
# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps
1. Check service logs: `kubectl logs -l app=myapp -n production`
2. Review recent deployments: `kubectl rollout history deployment/myapp`
3. Check database connectivity: `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in APM dashboard

## Remediation
### If caused by recent deployment:
```bash
kubectl rollout undo deployment/myapp -n production
```

### If caused by database issues:
```bash
kubectl delete pod -l app=postgres -n production
```

## Escalation
If not resolved within 15 minutes, escalate to:
- Database team: @db-oncall
- Platform team: @platform-oncall
````
## Alert Fatigue Reduction
### Strategies
```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation

  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows

  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling

  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```
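The auto-remediation strategy can be sketched as a webhook receiver: Alertmanager POSTs a JSON payload of firing and resolved alerts to any URL configured under `webhook_configs`, and a small service can map alert names to remediation actions. This is a minimal sketch of the payload handling only; the `REMEDIATIONS` map, its alert name, and the pod label are hypothetical:

```python
import json

# Map alertname -> remediation action. These entries are illustrative only;
# real actions would call kubectl, an API, or a job runner.
REMEDIATIONS = {
    "HighMemoryUsage": lambda labels: f"restart pod {labels.get('pod', 'unknown')}",
}

def handle_webhook(payload: str) -> list:
    """Parse an Alertmanager webhook payload and return the remediation
    actions that would run for currently firing alerts."""
    body = json.loads(payload)
    actions = []
    for alert in body.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore resolved notifications
        labels = alert.get("labels", {})
        remediation = REMEDIATIONS.get(labels.get("alertname"))
        if remediation is not None:
            actions.append(remediation(labels))
    return actions
```

Keep auto-remediation for well-understood, repetitive failures only; anything novel should still page a human.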
## Common Issues
### Issue: Alert Storm
**Problem:** Too many alerts firing simultaneously
**Solution:** Implement proper grouping and inhibition rules

### Issue: Missed Alerts
**Problem:** Critical alerts not reaching on-call
**Solution:** Test escalation policies, verify contact methods

### Issue: False Positives
**Problem:** Alerts firing without actual issues
**Solution:** Tune thresholds, increase evaluation windows
## Best Practices
- Define clear severity levels
- Every alert needs a runbook
- Test on-call notifications regularly
- Review and tune alerts weekly
- Implement proper escalation paths
- Use alert grouping and inhibition
- Track alert metrics (MTTR, frequency)
- Practice incident response regularly
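Tracking alert metrics such as MTTR can start very simply: export triggered/resolved timestamps from the on-call platform and average the resolution time. A minimal sketch, assuming incidents arrive as `(triggered_at, resolved_at)` pairs:

```python
from datetime import datetime, timedelta

def mean_time_to_resolve(incidents):
    """MTTR = average of (resolved_at - triggered_at) over a set of incidents.

    incidents: iterable of (triggered_at, resolved_at) datetime pairs.
    """
    incidents = list(incidents)
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((resolved - triggered for triggered, resolved in incidents),
                timedelta())
    return total / len(incidents)
```

Trend this weekly alongside alert frequency per rule to spot both slow responses and noisy alerts.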