alerting-oncall — configuring alerting rules with Prometheus

v1.0
GitHub

About this Skill

Alerting-OnCall is a skill that enables effective alerting and on-call management for production systems, reducing alert fatigue and streamlining incident response workflows. It is perfect for Operations Agents that need advanced alerting and on-call management using Prometheus and PagerDuty.

Features

Configures alerting rules and thresholds for monitoring systems like Prometheus and Datadog
Sets up on-call rotations and schedules using platforms like PagerDuty and Opsgenie
Implements alert routing and escalation to reduce alert fatigue
Manages incident response workflows using communication channels like Slack
Integrates with on-call platforms like Grafana OnCall for seamless incident management

allthingslinux · Updated: 3/8/2026

Agent Capability Analysis

The alerting-oncall skill by allthingslinux is an open-source community AI agent skill for Claude Code and other IDE workflows, helping agents execute tasks with better context, repeatability, and domain-specific guidance. It is optimized for configuring Prometheus alerting rules, managing on-call rotations, and reducing alert fatigue.

Ideal Agent Persona

Perfect for Operations Agents needing advanced alerting and on-call management for production systems using Prometheus and PagerDuty.

Core Value

Empowers agents to configure effective alerting rules and thresholds, manage on-call rotations, and implement incident response workflows using protocols like alert routing and escalation, reducing alert fatigue by integrating with communication channels like Slack.

Capabilities Granted for alerting-oncall

Configuring alerting rules and thresholds for production systems
Automating on-call rotations and schedules using PagerDuty
Implementing alert routing and escalation workflows
Optimizing incident response workflows to reduce downtime

Prerequisites & Limits

  • Requires a monitoring system like Prometheus or Datadog
  • Needs an on-call platform such as PagerDuty or Opsgenie
  • Depends on communication channels like Slack for alert notifications (a minimal wiring sketch follows)
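
A minimal sketch of how these prerequisites fit together, assuming a docker-compose setup; service names, paths, and ports here are illustrative, and the PagerDuty/Slack side is configured in `alertmanager.yml` as shown in the SKILL.md below.

```yaml
# Illustrative wiring only: Prometheus evaluates alert rules and forwards
# firing alerts to Alertmanager, which fans out to PagerDuty/Slack receivers.
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus:/etc/prometheus   # prometheus.yml + rules/alerts.yml
    ports:
      - "9090:9090"

  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
```

For this to page anyone, `prometheus.yml` also needs an `alerting.alertmanagers` target pointing at `alertmanager:9093`.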

alerting-oncall

Install alerting-oncall, an AI agent skill for AI agent workflows and automation. Works with Claude Code, Cursor, and Windsurf with one-command setup.

SKILL.md

Alerting & On-Call

Configure effective alerting and on-call management for production systems.

When to Use This Skill

Use this skill when:

  • Setting up alerting rules and thresholds
  • Configuring on-call rotations and schedules
  • Implementing alert routing and escalation
  • Reducing alert fatigue
  • Managing incident response workflows

Prerequisites

  • Monitoring system (Prometheus, Datadog, etc.)
  • On-call platform (PagerDuty, Opsgenie, Grafana OnCall)
  • Communication channels (Slack, email)

Alerting Best Practices

Alert Categories

```yaml
# Severity levels
critical:
  conditions:
    - Service completely down
    - Data loss imminent
    - Security breach
  response: Immediate page, wake people up

high:
  conditions:
    - Service degraded significantly
    - Error rate above SLO
    - Capacity near limit
  response: Page during business hours, notify after hours

medium:
  conditions:
    - Performance degradation
    - Non-critical component failure
    - Warning thresholds exceeded
  response: Notify via Slack, review next business day

low:
  conditions:
    - Informational alerts
    - Capacity planning triggers
    - Routine maintenance needed
  response: Email notification, weekly review
```

Alert Design Principles

```yaml
# Good alert characteristics
alerts:
  actionable:
    - Every alert should require human action
    - Include runbook links
    - Clear remediation steps

  relevant:
    - Alert on symptoms, not causes
    - Focus on user impact
    - Avoid alerting on expected behavior

  timely:
    - Appropriate thresholds
    - Suitable evaluation windows
    - Account for normal variance

  unique:
    - No duplicate alerts
    - Proper alert grouping
    - Clear ownership
```
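
To make "alert on symptoms, not causes" concrete, here is a hedged sketch of a multi-window error-budget burn-rate alert in the style popularized by the Google SRE workbook. It assumes a 99.9% availability SLO and the `http_requests_total` metric used elsewhere on this page; the 14.4x factor and window sizes are commonly cited defaults, not values prescribed by this skill.

```yaml
# Pages only when the error budget is burning fast on BOTH a short and a
# long window, which filters out brief spikes while still catching real
# user-facing breakage quickly.
- alert: ErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    )
  labels:
    severity: critical
  annotations:
    summary: "Fast error-budget burn (14.4x) against the availability SLO"
```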

Prometheus Alerting

Alert Rules

```yaml
# prometheus/rules/alerts.yml
groups:
  - name: service_alerts
    rules:
      # High-level service health
      - alert: ServiceDown
        expr: up{job="myapp"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute."
          runbook_url: "https://wiki.example.com/runbooks/service-down"

      # Error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for the last 5 minutes"

      # Latency alert (SLO-based)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 0.5
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "P95 latency above 500ms for {{ $labels.service }}"
```
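
Rules like these can be unit-tested before deployment with promtool's rule-test format. A minimal sketch, assuming the file layout above; run it with `promtool test rules tests/alerts_test.yml`.

```yaml
# tests/alerts_test.yml — exercises the ServiceDown rule with a synthetic
# series that is down for three consecutive scrapes.
rule_files:
  - ../prometheus/rules/alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="myapp", instance="myapp-1"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: myapp
              instance: myapp-1
            exp_annotations:
              summary: "Service myapp-1 is down"
              description: "myapp on myapp-1 has been down for more than 1 minute."
              runbook_url: "https://wiki.example.com/runbooks/service-down"
```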

Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 0s
      repeat_interval: 1h

    # High severity during business hours
    - match:
        severity: high
      receiver: 'slack-high'
      active_time_intervals:
        - business-hours

    # Route by team
    - match_re:
        team: platform.*
      receiver: 'platform-team'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ template "pagerduty.firing" . }}'

  - name: 'slack-high'
    slack_configs:
      - channel: '#alerts-high'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook_url }}'
          - type: button
            text: 'Dashboard'
            url: '{{ .CommonAnnotations.dashboard_url }}'

  - name: 'platform-team'
    slack_configs:
      - channel: '#platform-alerts'

time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: high
    equal: ['service']
```
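
Both files above can be validated in CI before rollout using the stock `promtool` and `amtool` binaries. A hypothetical GitHub Actions sketch; the workflow name and file paths match the examples on this page but are otherwise assumptions.

```yaml
# .github/workflows/validate-alerting.yml — hypothetical CI job
name: validate-alerting
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # promtool ships in the prom/prometheus image; override the entrypoint.
      - name: Check Prometheus alert rules
        run: |
          docker run --rm --entrypoint promtool \
            -v "$PWD:/work" prom/prometheus \
            check rules /work/prometheus/rules/alerts.yml

      # amtool ships in the prom/alertmanager image.
      - name: Check Alertmanager config
        run: |
          docker run --rm --entrypoint amtool \
            -v "$PWD:/work" prom/alertmanager \
            check-config /work/alertmanager.yml
```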

PagerDuty Integration

Service Configuration

```hcl
# Terraform example
resource "pagerduty_service" "myapp" {
  name                    = "MyApp Production"
  description             = "Production application service"
  escalation_policy       = pagerduty_escalation_policy.default.id
  alert_creation          = "create_alerts_and_incidents"
  auto_resolve_timeout    = 14400 # 4 hours
  acknowledgement_timeout = 600   # 10 minutes

  incident_urgency_rule {
    type = "use_support_hours"

    during_support_hours {
      type    = "constant"
      urgency = "high"
    }

    outside_support_hours {
      type    = "constant"
      urgency = "low"
    }
  }
}

resource "pagerduty_escalation_policy" "default" {
  name      = "Default Escalation"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.manager.id
    }
  }
}
```

Schedule Configuration

```hcl
resource "pagerduty_schedule" "primary" {
  name      = "Primary On-Call"
  time_zone = "America/New_York"

  layer {
    name                         = "Weekly Rotation"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 604800 # 1 week
    users                        = [for user in pagerduty_user.oncall : user.id]
  }

  # Override layer for holidays
  layer {
    name                         = "Holiday Coverage"
    start                        = "2024-01-01T00:00:00-05:00"
    rotation_virtual_start       = "2024-01-01T00:00:00-05:00"
    rotation_turn_length_seconds = 86400
    users                        = [pagerduty_user.holiday_coverage.id]

    restriction {
      type              = "daily_restriction"
      start_time_of_day = "00:00:00"
      duration_seconds  = 86400
      start_day_of_week = 0 # Sunday
    }
  }
}
```

Grafana OnCall

Integration Setup

```yaml
# docker-compose.yml addition
services:
  oncall:
    image: grafana/oncall
    environment:
      - SECRET_KEY=your-secret-key
      - BASE_URL=http://oncall:8080
      - GRAFANA_API_URL=http://grafana:3000
    ports:
      - "8080:8080"
```
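
Alertmanager can then forward alerts into OnCall through a webhook receiver. A hedged sketch; the integration URL below is a placeholder for the one OnCall generates when you create an Alertmanager-type integration in its UI.

```yaml
# Alertmanager receiver posting to Grafana OnCall. The token in the URL is
# a placeholder; OnCall generates the real integration URL in its UI.
receivers:
  - name: 'grafana-oncall'
    webhook_configs:
      - url: 'http://oncall:8080/integrations/v1/alertmanager/REPLACE_WITH_TOKEN/'
        send_resolved: true
```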

Escalation Chain

```yaml
# Example escalation chain structure
escalation_chains:
  - name: "Production Critical"
    steps:
      - step: 1
        type: notify
        persons:
          - "@oncall-primary"
        wait_delay: 0

      - step: 2
        type: notify
        persons:
          - "@oncall-secondary"
        wait_delay: 5m

      - step: 3
        type: notify
        persons:
          - "@engineering-manager"
        wait_delay: 10m

      - step: 4
        type: trigger_action
        action: "escalate_to_incident_commander"
        wait_delay: 15m
```

Alert Templates

Slack Alert Template

```go
{{ define "slack.title" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
{{ end }}

{{ define "slack.text" }}
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Severity:* {{ .Labels.severity }}
*Description:* {{ .Annotations.description }}
*Runbook:* {{ .Annotations.runbook_url }}
{{ end }}
{{ end }}
```

PagerDuty Details Template

```go
{{ define "pagerduty.firing" }}
{{ range .Alerts.Firing }}
Alert: {{ .Labels.alertname }}
Service: {{ .Labels.service }}
Instance: {{ .Labels.instance }}
Value: {{ .Annotations.value }}
Started: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
{{ end }}
```

On-Call Best Practices

Rotation Guidelines

```yaml
on_call_guidelines:
  rotation_length: 1 week
  handoff_time: "10:00 AM Monday"

  responsibilities:
    - Monitor alerts during shift
    - "Respond within SLA (critical: 5min, high: 15min)"
    - Document incidents
    - Hand off unresolved issues

  support:
    - Secondary on-call for backup
    - Clear escalation path
    - Manager availability for major incidents

  wellness:
    - Maximum 1 week on-call per month
    - Comp time after high-alert periods
    - No-interrupt recovery day after shift
```

Runbook Template

````markdown
# Alert: High Error Rate

## Summary
Error rate has exceeded the threshold of 5% for the service.

## Impact
Users may experience errors when accessing the application.

## Investigation Steps
1. Check service logs: `kubectl logs -l app=myapp -n production`
2. Review recent deployments: `kubectl rollout history deployment/myapp`
3. Check database connectivity: `kubectl exec -it myapp -- nc -zv postgres 5432`
4. Review error traces in APM dashboard

## Remediation

### If caused by recent deployment:
```bash
kubectl rollout undo deployment/myapp -n production
```

### If caused by database connectivity issues:
```bash
kubectl delete pod -l app=postgres -n production
```

## Escalation
If not resolved within 15 minutes, escalate to:
- Database team: @db-oncall
- Platform team: @platform-oncall
````

Alert Fatigue Reduction

Strategies

```yaml
fatigue_reduction:
  aggregate_alerts:
    - Group related alerts
    - Use inhibit rules
    - Implement alert correlation
    
  tune_thresholds:
    - Base on SLOs, not arbitrary values
    - Account for normal variance
    - Use appropriate evaluation windows
    
  automate_responses:
    - Auto-remediation for known issues
    - Self-healing infrastructure
    - Automated scaling
    
  regular_review:
    - Weekly alert review
    - Remove unused alerts
    - Update thresholds based on data
```
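
As an example of the `automate_responses` idea, a well-understood, machine-fixable alert can be routed to automation instead of a person. A hypothetical Alertmanager fragment; the alert name and remediation endpoint are placeholders for whatever automation you run.

```yaml
# Hypothetical fragment: route one known-fixable alert to a remediation
# webhook instead of paging a human. Names and URL are placeholders.
route:
  routes:
    - match:
        alertname: DiskWillFillSoon
      receiver: 'auto-remediation'

receivers:
  - name: 'auto-remediation'
    webhook_configs:
      - url: 'http://remediation-bot.internal:9000/hooks/expand-disk'
```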

Common Issues

Issue: Alert Storm

Problem: Too many alerts firing simultaneously
Solution: Implement proper grouping and inhibition rules
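
For example, a cluster-level alert can inhibit the per-instance alerts it implies, collapsing a storm into a single page. A sketch assuming a hypothetical ClusterDown alert and a shared `cluster` label:

```yaml
# While ClusterDown fires for a cluster, suppress the per-instance
# ServiceDown alerts from that same cluster.
inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match:
      alertname: ServiceDown
    equal: ['cluster']
```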

Issue: Missed Alerts

Problem: Critical alerts not reaching on-call
Solution: Test escalation policies, verify contact methods

Issue: False Positives

Problem: Alerts firing without actual issues
Solution: Tune thresholds, increase evaluation windows
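
As a sketch, the HighErrorRate rule from earlier can be made less twitchy by widening both the rate window and the `for:` clause; the exact values below are a judgment call, not a recommendation.

```yaml
# Same alert, slower to fire: 15m windows instead of 5m. Detection latency
# goes up; false positives from short spikes go down.
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[15m])) by (service)
      / sum(rate(http_requests_total[15m])) by (service) > 0.05
  for: 15m
  labels:
    severity: critical
```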

Best Practices

  • Define clear severity levels
  • Every alert needs a runbook
  • Test on-call notifications regularly
  • Review and tune alerts weekly
  • Implement proper escalation paths
  • Use alert grouping and inhibition
  • Track alert metrics such as MTTR and firing frequency (see the sketch after this list)
  • Practice incident response regularly
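
For the alert-metrics item above, Prometheus's built-in ALERTS series can feed the weekly review. A minimal sketch; the group and rule names are illustrative.

```yaml
# Count currently-firing alerts per alertname using the built-in ALERTS
# series; graphed over time, this shows which alerts dominate the pager.
groups:
  - name: alert_hygiene
    rules:
      - record: alerts:firing:count
        expr: count(ALERTS{alertstate="firing"}) by (alertname)
```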

FAQ & Installation Steps

These questions and steps mirror the structured data on this page for better search understanding.

Frequently Asked Questions

What is alerting-oncall?

Alerting-OnCall is a skill that enables effective alerting and on-call management for production systems, reducing alert fatigue and streamlining incident response workflows. It is perfect for Operations Agents that need advanced alerting and on-call management using Prometheus and PagerDuty.

How do I install alerting-oncall?

Run the command: npx killer-skills add allthingslinux/atl.services. It works with Cursor, Windsurf, VS Code, Claude Code, and 19+ other IDEs.

What are the use cases for alerting-oncall?

Key use cases include: Configuring alerting rules and thresholds for production systems, Automating on-call rotations and schedules using PagerDuty, Implementing alert routing and escalation workflows, Optimizing incident response workflows to reduce downtime.

Which IDEs are compatible with alerting-oncall?

This skill is compatible with Cursor, Windsurf, VS Code, Trae, Claude Code, OpenClaw, Aider, Codex, OpenCode, Goose, Cline, Roo Code, Kiro, Augment Code, Continue, GitHub Copilot, Sourcegraph Cody, and Amazon Q Developer. Use the Killer-Skills CLI for universal one-command installation.

Are there any limitations for alerting-oncall?

Requires a monitoring system like Prometheus or Datadog. Needs an on-call platform such as PagerDuty or Opsgenie. Dependent on communication channels like Slack for alert notifications.

How To Install

  1. Open your terminal

     Open the terminal or command line in your project directory.

  2. Run the install command

     Run: npx killer-skills add allthingslinux/atl.services. The CLI will automatically detect your IDE or AI agent and configure the skill.

  3. Start using the skill

     The skill is now active. Your AI agent can use alerting-oncall immediately in the current project.

Related Skills

Looking for an alternative to alerting-oncall or another community skill for your workflow? Explore these related open-source skills.

  • widget-generator (f): f.k.a. Awesome ChatGPT Prompts. Share, discover, and collect prompts from the community. Free and open source — self-host for your organization with complete privacy.
  • flags (vercel): a Next.js feature management skill that enables developers to efficiently add or modify framework feature flags, streamlining React application development.
  • zustand (lobehub): the ultimate space for work and life — find, build, and collaborate with agent teammates that grow with you.
  • data-fetching (lobehub): the ultimate space for work and life — find, build, and collaborate with agent teammates that grow with you.