
Monitoring Expert
Draft Prometheus alert rules for error rate, latency, service health, CPU, and memory so production issues page you with sane thresholds and annotations.
Install
npx skills add https://github.com/jeffallan/claude-skills --skill monitoring-expertWhat is this skill?
- Prometheus alert groups for application signals: HighErrorRate, HighLatency, ServiceDown
- Infrastructure rules: HighMemoryUsage and HighCPUUsage with `for` durations
- Example expr patterns using rate(), histogram_quantile, and `up`
- Severity labels (critical vs warning) and humanized annotation templates
Adoption & trust: 3k installs on skills.sh; 9.7k GitHub stars; 3/3 security scanners passed (skills.sh audits).
Recommended Skills
Azure Kubernetesmicrosoft/azure-skills
Github Actions Docsxixu-me/skills
Deploy To Vercelvercel-labs/agent-skills
Vercel Cli With Tokensvercel-labs/agent-skills
Turborepovercel/turborepo
Docker Expertsickn33/antigravity-awesome-skills
Journey fit
Common Questions / FAQ
Is Monitoring Expert safe to install?
skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Monitoring Expert
# Alerting Rules ## Prometheus Alert Rules ```yaml # alerts.yml groups: - name: application rules: - alert: HighErrorRate expr: | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05 for: 5m labels: severity: critical annotations: summary: High error rate detected description: Error rate is {{ $value | humanizePercentage }} - alert: HighLatency expr: | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1 for: 5m labels: severity: warning annotations: summary: High latency detected description: 95th percentile latency is {{ $value }}s - alert: ServiceDown expr: up == 0 for: 1m labels: severity: critical annotations: summary: Service {{ $labels.instance }} is down - name: infrastructure rules: - alert: HighMemoryUsage expr: | (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9 for: 5m labels: severity: warning annotations: summary: High memory usage on {{ $labels.instance }} - alert: HighCPUUsage expr: | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 for: 10m labels: severity: warning annotations: summary: High CPU usage on {{ $labels.instance }} - alert: DiskSpaceLow expr: | (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1 for: 5m labels: severity: critical annotations: summary: Disk space low on {{ $labels.instance }} ``` ## Alert Design Principles ```yaml # Good alert: Actionable, specific - alert: DatabaseConnectionPoolExhausted expr: db_pool_available_connections == 0 for: 2m annotations: runbook_url: https://wiki.example.com/runbooks/db-pool # Bad alert: Too noisy, not actionable - alert: AnyError expr: errors_total > 0 # Will always fire ``` ## Severity Levels | Severity | Response | Example | |----------|----------|---------| | `critical` | Page immediately | Service down, data loss | | `warning` | Investigate soon | High latency, low disk | | `info` | Check in morning | Unusual traffic pattern | ## Alertmanager Configuration ```yaml # alertmanager.yml global: slack_api_url: 'https://hooks.slack.com/...' route: receiver: 'slack-notifications' group_by: ['alertname', 'severity'] group_wait: 30s group_interval: 5m repeat_interval: 4h routes: - match: severity: critical receiver: 'pagerduty' - match: severity: warning receiver: 'slack-notifications' receivers: - name: 'slack-notifications' slack_configs: - channel: '#alerts' send_resolved: true - name: 'pagerduty' pagerduty_configs: - service_key: 'your-key' ``` ## Quick Reference | Field | Purpose | |-------|---------| | `expr` | PromQL query | | `for` | Duration before firing | | `labels` | Classification (severity) | | `annotations` | Human-readable info | | Threshold | Use | |-----------|-----| | Error rate > 5% | Critical | | p95 latency > 1s | Warning | | Disk < 10% | Critical | | Memory > 90% | Warning | # Application Profiling ## Node.js Profiling ### CPU Profiling with clinic.js ```bash # Install npm install -g clinic # CPU profiling clinic doctor -- node app.js # Flame graph clinic flame -- node app.js # Bubble profiling clinic bubbleprof -- node app.js # Generate report clinic doctor --collect-only -- node app.js clinic doctor --visualize-only PID.clinic-doctor ``` ### Built-in Node.js Profiler ```javascript // Start profiling node --prof app.js # Process the output node --prof-process isolate-0x*.log > processed.txt # Chrome