
Dd Monitors
List, inspect, and create Datadog monitors from files with pup, following alerting hygiene so a solo-run production setup does not drown in noise.
Overview
dd-monitors is an agent skill for the Operate phase. It manages Datadog monitors and downtimes through pup, using file-based creation and alerting best practices.
Install
npx skills add https://github.com/datadog-labs/agent-skills --skill dd-monitors
What is this skill?
- Token-efficient command order: context first, discovery, confirm, then target pup call
- List, get, and file-based create monitors via pup monitors commands
- Silence notifications using downtime create/cancel (no mute subcommands documented)
- Tag-scoped listing such as team:platform for scoped inventory
- Alert fatigue best-practices table for sustainable on-call as a solo operator
- Documents a 5-step token-efficient pup command execution order
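The token-efficient order listed above (context first, then discovery, then confirmation, then the target pup call) can be sketched in Python. This is a minimal illustration, not part of the skill: `resolve_value`, `discover`, and `ask_user` are hypothetical names, and the discovery callable stands in for whatever command (for example `pup monitors list`) would supply candidate values.

```python
from typing import Callable, Dict, List


def resolve_value(
    name: str,
    context: Dict[str, str],
    discover: Callable[[], List[str]],
    ask_user: Callable[[str, List[str]], str],
) -> str:
    """Resolve a required value before running the target command."""
    # Step 1: reuse a value already known from prior outputs or conversation.
    if name in context:
        return context[name]

    # Step 2: the value is missing, so run a discovery command.
    candidates = discover()

    # Step 5: with no candidates, refuse rather than run a speculative command.
    if not candidates:
        raise LookupError(f"no candidates found for {name!r}")

    # Step 3: still ambiguous, so ask the user to confirm.
    if len(candidates) > 1:
        value = ask_user(name, candidates)
    else:
        value = candidates[0]

    # Save the value so later commands hit step 1 directly.
    context[name] = value
    return value
```

Step 4, running the target command, happens only after `resolve_value` returns, so a wrong or guessed ID never reaches pup.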
Adoption & trust: 771 installs on skills.sh; 127 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have metrics in Datadog but no repeatable way for your agent to list, create, or safely silence monitors without burning tokens on wrong IDs.
Who is it for?
Solo operators already on Datadog who want file-based monitor definitions and pup-driven workflows in the agent.
Skip if: Teams not using Datadog or builders still in local-only dev with no production observability stack.
When should I use this skill?
The user works with Datadog monitors, pup monitors list/get/create, alerting, downtimes, or monitor YAML in the repo.
What do I get? / Deliverables
Monitors are created or inspected from versioned JSON/YAML and notifications are silenced via downtimes when you need controlled quiet periods.
- Monitor JSON/YAML definitions applied via pup
- Scoped monitor inventory from list/get commands
- Downtime windows for planned silence of notifications
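A versioned monitor definition can be generated and checked before it is passed to `pup monitors create --file monitor.json`. The sketch below assumes the file follows the shape of the Datadog monitor payload (`name`, `type`, `query`, `message`, `tags`, `options.thresholds`); the exact schema pup expects is not stated here, and `build_monitor` and `write_monitor_file` are hypothetical helper names.

```python
import json
from typing import Dict, List


def build_monitor(
    name: str,
    query: str,
    critical: float,
    critical_recovery: float,
    tags: List[str],
    message: str,
) -> Dict:
    """Build a metric-alert definition for a 'greater than' threshold."""
    # For a '>' comparison, recovery must sit below the alert threshold,
    # otherwise the monitor cannot recover cleanly and may flap.
    if critical_recovery >= critical:
        raise ValueError("critical_recovery must be lower than critical")
    return {
        "name": name,
        "type": "metric alert",
        "query": query,
        "message": message,
        "tags": tags,
        "options": {
            "thresholds": {
                "critical": critical,
                "critical_recovery": critical_recovery,
            }
        },
    }


def write_monitor_file(monitor: Dict, path: str = "monitor.json") -> None:
    """Write the definition to disk so it can be committed and reviewed."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(monitor, handle, indent=2)
        handle.write("\n")


cpu_monitor = build_monitor(
    name="High CPU on api hosts",
    query="avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
    critical=80,
    critical_recovery=70,
    tags=["team:platform"],
    message="CPU is high on {{host.name}}. @slack-ops",
)
```

Calling `write_monitor_file(cpu_monitor)` produces the `monitor.json` that the create command reads; keeping the file in the repo lets threshold changes go through normal review.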
Journey fit
Monitor lifecycle management is post-ship operations work: alerting rules belong in the Operate phase once something is in production. Monitoring is the matching subphase, covering monitor create/list/get, downtimes for silencing, and alert-fatigue guidance tied to observability.
How it compares
This skill orchestrates Datadog monitors through the pup CLI; it is not a generic log search or APM trace analysis skill.
Common Questions / FAQ
Who is dd-monitors for?
Developers shipping SaaS or APIs who own production alerting on Datadog and want agent-assisted monitor CRUD and downtime handling.
When should I use dd-monitors?
Use it for monitoring work in the Operate phase: adding alerts after a deploy, auditing tag-scoped monitor lists, or scheduling downtimes. It is not meant for idea or landing-page validation.
Is dd-monitors safe to install?
It directs the agent to run pup shell commands against Datadog and can change alerting in your org; review the Security Audits panel on this page and restrict credentials and scopes.
SKILL.md
# Datadog Monitors

Create, manage, and maintain monitors for alerting.

## Prerequisites

This requires pup in your path. See [Setup Pup](https://github.com/datadog-labs/agent-skills/tree/main?tab=readme-ov-file#setup-pup).

## Command Execution Order (Token-Efficient)

For scoped commands, use this order:

1. Check context first (prior outputs, conversation, saved values).
2. If a required value is missing, run a discovery command first.
3. If still ambiguous, ask the user to confirm.
4. Then run the target command.
5. Avoid speculative commands likely to fail.

## Quick Start

```bash
pup auth login
```

## Common Operations

### List Monitors

```bash
pup monitors list
pup monitors list --tags "team:platform"
```

### Get Monitor

```bash
pup monitors get <id>
```

### Create Monitor

```bash
pup monitors create --file monitor.json
```

### Silence Alerts (Downtime)

```bash
# No pup monitors mute/unmute commands.
# Use downtime payloads to silence monitor notifications.
pup downtime create --file downtime.json
pup downtime cancel <downtime_id>
```

## Monitor Creation Best Practices

### 1. Avoid Alert Fatigue

| Rule | Why |
|------|-----|
| **No flapping alerts** | Use `last_Xm` not `last_1m` |
| **Meaningful thresholds** | Based on SLOs, not guesses |
| **Actionable alerts** | If no action needed, don't alert |
| **Include runbook** | `@runbook-url` in message |

```python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
```

### 2. Use Proper Scoping

```python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
```

### 3. Set Recovery Thresholds

```python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
```

### 4. Include Context in Messages

```python
message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""
```

## NEVER Delete Monitors Directly

Use safe deletion workflow (same as dashboards):

```python
def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")

    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False

    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True
```

## Monitor Types

| Type | Use Case |
|------|----------|
| `metric alert` | CPU, memory, custom metrics |
| `query alert` | Complex metric queries |
| `service check` | Agent check status |
| `event alert` | Event stream patterns |
| `log alert` | Log pattern matching |
| `composite` | Combine multiple monitors |
| `apm` | APM metrics |

## Audit Monitors

```bash
# Find monitors without owners
pup monitors list | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find noisy monitors (high alert count)
pup monitors list | jq 'sort_