Dd Monitors

Name: Dd Monitors
Author: datadog-labs

datadog-labs/agent-skills

List, inspect, and create Datadog monitors from files with pup, following alerting hygiene so a solo-run production setup does not drown in noise.

Overview

dd-monitors is an agent skill for the Operate phase. It manages Datadog monitors and downtimes through pup, creating monitors from files and applying alerting best practices.

Install

npx skills add https://github.com/datadog-labs/agent-skills --skill dd-monitors

What is this skill?

Five-step token-efficient command order: check context, run discovery, confirm with the user, run the target pup command, and avoid speculative calls
List, get, and file-based create for monitors via pup monitors commands
Silence notifications using downtime create/cancel (no mute subcommands documented)
Tag-scoped listing such as team:platform for scoped inventory
Alert fatigue best-practices table for sustainable on-call as a solo operator

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 771 installs on skills.sh; 127 GitHub stars; 3/3 security scanners passed (skills.sh audits).

What problem does it solve?

You have metrics in Datadog but no repeatable way for your agent to list, create, or safely silence monitors without burning tokens on wrong IDs.

Who is it for?

Solo operators already on Datadog who want file-based monitor definitions and pup-driven workflows in the agent.

Skip if: your team does not use Datadog, or you are still in local-only development with no production observability stack.

When should I use this skill?

Use it when the user is working with Datadog monitors, pup monitors list/get/create, alerting, downtimes, or monitor YAML in the repo.

What do I get? / Deliverables

Monitors are created or inspected from versioned JSON/YAML and notifications are silenced via downtimes when you need controlled quiet periods.

Monitor JSON/YAML definitions applied via pup
Scoped monitor inventory from list/get commands
Downtime windows for planned silence of notifications

Journey fit

Primary fit

Operate: Monitoring & observability

Monitor lifecycle management is post-ship operations work—alerting rules belong on the operate shelf once something is in production. Monitoring is the exact subphase: create/list/get monitors, downtimes for silencing, and alert-fatigue guidance tied to observability.

How it compares

Datadog monitor CLI orchestration via pup—not generic log search or APM trace analysis skills.

Common Questions / FAQ

Who is dd-monitors for?

Developers shipping SaaS or APIs who own production alerting on Datadog and want agent-assisted monitor CRUD and downtime handling.

When should I use dd-monitors?

Use it during Operate-phase monitoring work: adding alerts after a deploy, auditing tag-scoped monitor lists, or scheduling downtimes. It is not meant for idea or landing-page validation.

Is dd-monitors safe to install?

It directs the agent to run pup shell commands against Datadog and can change alerting in your org; review the Security Audits panel on this page and restrict credentials and scopes.

SKILL.md


# Datadog Monitors

Create, manage, and maintain monitors for alerting.


## Prerequisites
This skill requires `pup` on your `PATH`. See [Setup Pup](https://github.com/datadog-labs/agent-skills/tree/main?tab=readme-ov-file#setup-pup).

## Command Execution Order (Token-Efficient)

For scoped commands, use this order:

1. Check context first (prior outputs, conversation, saved values).
2. If a required value is missing, run a discovery command first.
3. If still ambiguous, ask the user to confirm.
4. Then run the target command.
5. Avoid speculative commands likely to fail.
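
The ordering above can be sketched as a small resolver. This is only an illustration: `resolve_value`, `discover`, and `ask_user` are hypothetical names, not part of pup or of this skill, and nothing in the sketch calls pup itself.

```python
from typing import Callable, Dict, List, Optional


def resolve_value(
    name: str,
    context: Dict[str, str],
    discover: Callable[[str], List[str]],
    ask_user: Callable[[str, List[str]], Optional[str]],
) -> Optional[str]:
    """Resolve a required value (for example a monitor ID) in the cheapest order.

    All names are hypothetical; `discover` stands in for a wrapper around a
    discovery command such as `pup monitors list`.
    """
    # Step 1: reuse anything already known from prior output or conversation.
    if name in context:
        return context[name]

    # Step 2: run a single discovery call.
    candidates = discover(name)
    if len(candidates) == 1:
        context[name] = candidates[0]
        return candidates[0]

    # Step 3: zero or several candidates is ambiguous, so ask instead of guessing.
    choice = ask_user(name, candidates)
    if choice is not None:
        context[name] = choice

    # Steps 4-5: the caller runs the target command only if a value came back.
    return choice
```

Caching the resolved value back into `context` means a later command in the same session stops at step 1.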


## Quick Start

```bash
pup auth login
```

## Common Operations

### List Monitors

```bash
pup monitors list
pup monitors list --tags "team:platform"
```

### Get Monitor

```bash
pup monitors get <id>
```

### Create Monitor

```bash
pup monitors create --file monitor.json
```
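
The contents of `monitor.json` are not shown here. A minimal sketch of generating one, assuming pup accepts the same field names as the Datadog monitor API (`name`, `type`, `query`, `message`, `tags`, `options`). That file layout is an assumption, so check it against the pup documentation. The monitor name, service tag, and `@slack-ops` handle are placeholders, and `write_monitor_file` is a hypothetical helper.

```python
import json
from pathlib import Path

# Assumption: the file mirrors the Datadog monitor API payload.
monitor = {
    "name": "High CPU on api hosts",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80",
    "message": "CPU above 80% on {{host.name}}. @slack-ops",
    "tags": ["team:platform", "env:prod"],
    "options": {"thresholds": {"critical": 80, "critical_recovery": 70}},
}


def write_monitor_file(definition: dict, path: str = "monitor.json") -> Path:
    """Write the definition as indented JSON so it diffs cleanly in version control."""
    target = Path(path)
    target.write_text(json.dumps(definition, indent=2) + "\n", encoding="utf-8")
    return target
```

Calling `write_monitor_file(monitor)` produces a file that can then be passed to `pup monitors create --file monitor.json`.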

### Silence Alerts (Downtime)

```bash
# No pup monitors mute/unmute commands.
# Use downtime payloads to silence monitor notifications.
pup downtime create --file downtime.json
pup downtime cancel <downtime_id>
```
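
The shape of `downtime.json` is not documented above. As a hedged sketch, the helper below builds a one-time downtime payload in the shape used by the Datadog Downtimes v2 API (`data.attributes` with `monitor_identifier`, `scope`, `schedule`). Whether pup expects exactly this layout is an assumption, and `build_downtime` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_downtime(
    monitor_id: int,
    scope: str,
    hours: float,
    reason: str,
    start: Optional[datetime] = None,
) -> dict:
    """Build a one-time downtime payload with an explicit end time.

    Assumption: pup accepts the Datadog Downtimes v2 request body as-is.
    """
    start = start or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "data": {
            "type": "downtime",
            "attributes": {
                "monitor_identifier": {"monitor_id": monitor_id},
                "scope": scope,
                "message": reason,
                # ISO-8601 timestamps; a fixed end means the silence expires by itself.
                "schedule": {"start": start.isoformat(), "end": end.isoformat()},
            },
        }
    }
```

Serializing the result with `json.dumps(payload, indent=2)` into `downtime.json` gives a file for `pup downtime create --file downtime.json`. Always setting an end time is a deliberate choice: a forgotten downtime then cannot silence a monitor indefinitely.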

## Monitor Creation Best Practices

### 1. Avoid Alert Fatigue

| Rule | How |
|------|-----|
| **No flapping alerts** | Evaluate over a window of several minutes (for example `last_5m`), not `last_1m` |
| **Meaningful thresholds** | Derive them from SLOs, not guesses |
| **Actionable alerts** | If no action is needed, don't alert |
| **Include runbook** | Put a runbook link in the message (the `@runbook-url` placeholder) |

```python
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window
```
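
The window rule can be checked mechanically before a monitor file is applied. A minimal sketch, assuming queries use the `(last_<N><unit>)` form shown above; the function names and the 5-minute default are illustrative choices, not values defined by Datadog or pup.

```python
import re
from typing import Optional

# Matches the evaluation window in queries such as "avg(last_5m):...".
_WINDOW = re.compile(r"\(last_(\d+)([mhd])\)")
_MINUTES_PER_UNIT = {"m": 1, "h": 60, "d": 1440}


def evaluation_window_minutes(query: str) -> Optional[int]:
    """Return the evaluation window in minutes, or None if none is found."""
    match = _WINDOW.search(query)
    if match is None:
        return None
    return int(match.group(1)) * _MINUTES_PER_UNIT[match.group(2)]


def is_flap_prone(query: str, min_minutes: int = 5) -> bool:
    """Flag queries whose window is shorter than `min_minutes` (an assumed default)."""
    window = evaluation_window_minutes(query)
    return window is not None and window < min_minutes
```

With the two queries above, the `last_1m` query is flagged and the `last_5m` query is not.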

### 2. Use Proper Scoping

```python
# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅
```
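
A scope check can be sketched the same way. This hypothetical helper only reads the first `{...}` group of a query, which is enough for single-metric queries like the ones above; formulas that combine several metrics would need a fuller parser.

```python
import re
from typing import List

# Captures the contents of the first {...} group, e.g. "env:prod,service:api".
_SCOPE = re.compile(r"\{([^}]*)\}")


def query_scope(query: str) -> List[str]:
    """Return the tag filters of the first scope group; [] means unscoped."""
    match = _SCOPE.search(query)
    if match is None:
        return []
    tags = [tag.strip() for tag in match.group(1).split(",")]
    # "{*}" selects everything, so it does not count as a scope.
    return [tag for tag in tags if tag and tag != "*"]


def is_unscoped(query: str) -> bool:
    """True when the query has no tag filter at all."""
    return not query_scope(query)
```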

### 3. Set Recovery Thresholds

```python
monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}
```
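
Recovery thresholds only prevent flapping when they sit on the correct side of their trigger value. A small consistency check, assuming the `thresholds` keys shown above; `threshold_problems` is a hypothetical helper, and treating a missing `critical_recovery` as a problem reflects this section's advice, not an API requirement.

```python
from typing import Dict, List


def threshold_problems(thresholds: Dict[str, float], above: bool = True) -> List[str]:
    """Return a list of problems; an empty list means the thresholds are consistent.

    `above=True` models a `>` query, where each recovery value must sit below
    its trigger. Pass `above=False` for a `<` query.
    """
    def in_order(first: str, second: str) -> bool:
        a, b = thresholds.get(first), thresholds.get(second)
        if a is None or b is None:
            return True  # nothing to compare
        return a < b if above else a > b

    side = "below" if above else "above"
    problems = []
    if "critical_recovery" not in thresholds:
        problems.append("critical_recovery is missing")
    for first, second in [
        ("critical_recovery", "critical"),
        ("warning", "critical"),
        ("warning_recovery", "warning"),
    ]:
        if not in_order(first, second):
            problems.append(f"{first} must be {side} {second}")
    return problems
```

The thresholds in the example above pass this check with no problems reported.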

### 4. Include Context in Messages

```python
message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""
```
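
A message can be screened for the same ingredients before it ships. This is a rough sketch with a hypothetical `message_gaps` helper; the three checks (a notification handle, a template variable, a runbook mention) are this section's conventions, not rules enforced by Datadog.

```python
import re
from typing import List

# An @handle at the start of a word, e.g. "@slack-ops" (not the "@" in an email).
_HANDLE = re.compile(r"(?<!\S)@[\w.-]+")
# A simple template variable such as {{value}} or {{host.name}}.
_TEMPLATE_VAR = re.compile(r"\{\{\s*[\w.]+\s*\}\}")


def message_gaps(message: str) -> List[str]:
    """Return what a monitor message is missing; [] means all checks passed."""
    gaps = []
    if not _HANDLE.search(message):
        gaps.append("no @notification handle")
    if not _TEMPLATE_VAR.search(message):
        gaps.append("no template variable such as {{value}}")
    if "runbook" not in message.lower():
        gaps.append("no runbook section or link")
    return gaps
```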

## NEVER Delete Monitors Directly

Use the safe deletion workflow (the same one used for dashboards): rename the monitor to mark it instead of deleting it.

```python
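def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark a monitor for deletion by renaming it instead of deleting it.

    The source does not define `client`; any object exposing
    `get_monitor(id)` and `update_monitor(id, fields)` works here.
    Returns True if the monitor was renamed, False if it was already marked.
    """
    marker = "[MARKED FOR DELETION]"
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")

    if marker in name:
        print(f"Already marked: {name}")
        return False

    new_name = f"{marker} {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True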

## Monitor Types

| Type | Use Case |
|------|----------|
| `metric alert` | CPU, memory, custom metrics |
| `query alert` | Complex metric queries |
| `service check` | Agent check status |
| `event alert` | Event stream patterns |
| `log alert` | Log pattern matching |
| `composite` | Combine multiple monitors |
| `trace-analytics alert` | APM trace analytics (APM metric monitors use `query alert`) |

## Audit Monitors

```bash
# Find monitors without owners
pup monitors list | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find noisy monitors (high alert count)
pup monitors list | jq 'sort_
```

