Debugging Dags

Name: Debugging Dags
Author: astronomer

astronomer/agents

984 installs
412 repo stars
Updated July 27, 2026
astronomer/agents

debugging-dags is an Airflow DAG diagnosis skill that runs structured root cause analysis with the af CLI for developers who need deep pipeline failure investigation beyond a quick log check.

About

debugging-dags is an Airflow DAG diagnosis skill from astronomer/agents that performs comprehensive failure root cause analysis on data pipelines using the af CLI. The skill triggers on complex debugging requests like diagnose and fix the pipeline, full root cause analysis, or why is this failing and how to prevent it—beyond simple log peeks handled by the airflow entrypoint skill. debugging-dags follows a systematic data-engineering investigation approach to identify root causes and deliver actionable remediation and prevention recommendations. Developers reach for debugging-dags when Airflow DAG failures need structured CLI-driven diagnosis rather than surface-level status checks.

Structured multi-step flow: identify failure, pull task logs, categorize failure type, recommend prevention
Uses Astronomer af CLI (runs diagnose, dags stats, health, dags errors, tasks logs)
Separates simple log questions from deep RCA requests the airflow entrypoint handles lightly
Documents astro otto and uv tool install paths for getting af on PATH
Focuses on scrolling past Airflow boilerplate to the underlying exception

Debugging Dags by the numbers

984 all-time installs (skills.sh)
+22 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #47 of 610 Debugging skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/astronomer/agents --skill debugging-dags

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/astronomer/agents/debugging-dags.svg)](https://skillselion.com/skills/astronomer/agents/debugging-dags)

Installs	984
repo stars	★ 412
Security audit	3 / 3 scanners passed
Last updated	July 27, 2026
Repository	astronomer/agents ↗

How do you diagnose Airflow DAG failures deeply?

Run structured Airflow DAG failure diagnosis with af CLI when a pipeline needs root cause analysis—not just a quick log peek.

Who is it for?

Data engineers facing complex Airflow DAG failures who need systematic af CLI-driven root cause analysis and prevention guidance.

Skip if: Simple why did the DAG fail log requests or greenfield DAG authoring without existing failure context.

When should I use this skill?

A user requests full root cause analysis, pipeline diagnosis and fix, or prevention recommendations for a failing Airflow DAG.

What you get

Root cause analysis report, remediation steps, and prevention recommendations for failed Airflow DAG runs.

Root cause analysis
Remediation plan
Prevention recommendations

Files

SKILL.mdMarkdownGitHub ↗

DAG Diagnosis

You are a data engineer debugging a failed Airflow DAG. Follow this systematic approach to identify the root cause and provide actionable remediation.

Running the CLI

These commands assume af is on PATH. Run via astro otto to get it automatically, or install standalone with uv tool install astro-airflow-mcp.

---

Step 1: Identify the Failure

If a specific DAG was mentioned:

Run af runs diagnose <dag_id> <dag_run_id> (if run_id is provided)
If no run_id specified, run af dags stats to find recent failures

If no DAG was specified:

Run af health to find recent failures across all DAGs
Check for import errors with af dags errors
Show DAGs with recent failures
Ask which DAG to investigate further

Step 2: Get the Error Details

Once you have identified a failed task:

1. Get task logs using af tasks logs <dag_id> <dag_run_id> <task_id> 2. Look for the actual exception - scroll past the Airflow boilerplate to find the real error 3. Categorize the failure type:

Data issue: Missing data, schema change, null values, constraint violation
Code issue: Bug, syntax error, import failure, type error
Infrastructure issue: Connection timeout, resource exhaustion, permission denied
Dependency issue: Upstream failure, external API down, rate limiting

Step 3: Check Context

Gather additional context to understand WHY this happened:

1. Recent changes: Was there a code deploy? Check git history if available 2. Package version changes: Was a package upgraded — in the image, in a venv-style operator, or at the index? See Package version changes below. 3. Data volume: Did data volume spike? Run a quick count on source tables 4. Upstream health: Did upstream tasks succeed but produce unexpected data? 5. Historical pattern: Is this a recurring failure? Check if same task failed before 6. Timing: Did this fail at an unusual time? (resource contention, maintenance windows)

Use af runs get <dag_id> <dag_run_id> to compare the failed run against recent successful runs.

Package version changes

A common cause of failures with no git activity is dependency drift — the user's code didn't change, but a package they depend on did. Check in this order:

1. Worker image diff (preferred when available). Every Astro deploy = new image tag, so the registry has a "before" and "after". Diff pip freeze between current and previous image — that's ground truth for what changed:

   docker run --rm <current_image> pip freeze > /tmp/now.txt
   docker run --rm <previous_image> pip freeze > /tmp/prev.txt
   diff /tmp/prev.txt /tmp/now.txt

Also compare docker run --rm <image> python --version between the two — a Python minor-version bump (3.11 → 3.12, or even a patch) can break wheel compatibility even when pip freeze looks identical. af config providers lists currently installed provider versions, useful for cross-checking against modules named in the traceback.

2. Venv-style operators bypass the worker image. @task.virtualenv, PythonVirtualenvOperator, ExternalPythonOperator, and KubernetesPodOperator build their environment per task run, so an image diff won't catch failures inside them. If the failed task is one of these, read its requirements / image / python_version / python args directly:

Unbounded specifier (e.g. pandas>=2.0.0 with no upper bound, or no specifier at all) → a new upstream release is the prime suspect.
image="foo:latest" or no tag → the image moved underneath you.
python_version="3.11" (on @task.virtualenv / PythonVirtualenvOperator) or a python path (on ExternalPythonOperator) resolving to a different interpreter than it used to — a Python minor-version change can break wheel compatibility for unchanged requirements. Same vector applies to the worker image itself if the base Python changed there.

Fix is to pin: pandas>=2.0.0,<3.0.0, a lockfile, a specific image SHA, or a fully-qualified Python version (python_version="3.11.7" instead of "3.11").

3. Index lookup when image diff isn't conclusive (no image history, or a venv-style operator). Identify the configured index first — it may not be PyPI:

Env vars: UV_INDEX_URL, PIP_INDEX_URL, PIP_EXTRA_INDEX_URL
pyproject.toml → [[tool.uv.index]]
~/.pip/pip.conf, /etc/pip.conf
Dockerfile --index-url flags

Then query for releases of the suspect package since the first failure started. PyPI:

   curl -s https://pypi.org/pypi/<pkg>/json | jq '.releases | to_entries | map({version: .key, uploaded: .value[0].upload_time}) | sort_by(.uploaded) | reverse | .[:5]'

Private indexes usually expose the same /pypi/<pkg>/json shape; fall back to the Simple API (/simple/<pkg>/) or ask the user if neither works.

A release timestamp landing between the last green run and the first red run, for a package named in the traceback, is the answer.

On Astro

If you're running on Astro, these additional tools can help with diagnosis:

Deployment activity log: Check the Astro UI for recent deploys — a failed deploy or recent code change is often the cause of sudden failures
Astro alerts: Configure alerts in the Astro UI for proactive failure monitoring (DAG failure, task duration, SLA miss)
Observability: Use the Astro observability dashboard to track DAG health trends and spot recurring issues

On OSS Airflow

Airflow UI: Use the DAGs page, Graph view, and task logs to inspect recent runs and failures

Step 4: Provide Actionable Output

Structure your diagnosis as:

Root Cause

What actually broke? Be specific - not "the task failed" but "the task failed because column X was null in 15% of rows when the code expected 0%".

Impact Assessment

What data is affected? Which tables didn't get updated?
What downstream processes are blocked?
Is this blocking production dashboards or reports?

Immediate Fix

Specific steps to resolve RIGHT NOW: 1. If it's a data issue: SQL to fix or skip bad records 2. If it's a code issue: The exact code change needed 3. If it's infra: Who to contact or what to restart

Prevention

How to prevent this from happening again:

Add data quality checks?
Add better error handling?
Add alerting for edge cases?
Update documentation?
Pin dependencies (constraints file, lockfile, or upper-bound specifiers on venv/external/pod operators) to avoid silent upstream drift?

Quick Commands

Provide ready-to-use commands:

To clear and rerun the entire DAG run: af runs clear <dag_id> <run_id>
To clear and rerun specific failed tasks: af tasks clear <dag_id> <run_id> <task_ids> -D
To delete a stuck or unwanted run: af runs delete <dag_id> <run_id>

Related skills

Azure DiagnosticsSystematically diagnose and resolve production issues on Microsoft Azure using official Microsoft guidance.472k1.3k

Azure MessagingQuickly diagnose and fix connection, authentication, and message-processing failures when using Azure Event Hubs or Service Bus SDKs.460k1.3k

Use My BrowserWhen their agent task requires access to the live browser session, rendered DOM state, authenticated dashboards, localhost apps, or DevTools-selected elements instea269k72

Diagnosing BugsGet a systematic, step-by-step process that surfaces the real root cause instead of guessing at bugs.222k183k

Systematic DebuggingFollow a repeatable four-phase process that forces root-cause discovery before any code changes.197k260k

Safe DebugGet conservative, non-destructive diagnosis of deep learning and agent errors before any code changes are made.161k507

How it compares

Use debugging-dags for deep pipeline RCA rather than the lighter airflow entrypoint skill for quick log and status checks.

FAQ

When should debugging-dags be used over basic Airflow debugging?

debugging-dags handles complex failure investigation requiring root cause analysis and prevention recommendations. Simple requests like why did the DAG fail or show logs are handled by the airflow entrypoint skill directly.

What CLI does debugging-dags use?

debugging-dags assumes the Astronomer af CLI is on PATH and runs structured diagnostic commands through it. The skill follows a systematic data-engineering approach to identify root causes and remediation.

Is Debugging Dags safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Debuggingmonitoringinfra