
Tracing Upstream Lineage
Trace where warehouse tables or columns come from by following Airflow DAGs, tasks, and Astro Lineage when data breaks or definitions are unclear.
Overview
Tracing Upstream Lineage is an agent skill for the Operate phase that traces upstream data dependencies for Airflow tables, columns, and DAGs.
Install
npx skills add https://github.com/astronomer/agents --skill tracing-upstream-lineageWhat is this skill?
- Three-step investigation: target type (table, column, DAG), producing DAG discovery, upstream source trace
- CLI workflow: af dags list, af dags source, af tasks list to read INSERT/MERGE/C CREATE targets
- Astro Lineage tab called out for visual cross-DAG exploration without manual code search
- OSS Airflow path via DAG source and task logs when Astro UI is unavailable
- Naming heuristics linking DAG ids like load_customers to customers tables
- Three investigation steps: target type, producing DAG, data sources
Adoption & trust: 759 installs on skills.sh; 384 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You see wrong or missing data in a table but cannot quickly identify which DAG or source system feeds it.
Who is it for?
Builders running Airflow or Astro who need a structured lineage investigation during outages or schema surprises.
Skip if: Greenfield pipeline design from scratch or non-Airflow stacks without DAG-based orchestration.
When should I use this skill?
User asks where data comes from, what feeds a table, upstream dependencies, data sources, or needs to understand data origins.
What do I get? / Deliverables
You map the target to a producing DAG and upstream sources using af CLI exploration and, on Astro, the Lineage tab for faster cross-pipeline visibility.
- Identified producing DAG and task path for the target
- Upstream source list inferred from DAG source and tasks
- Lineage investigation notes for Astro vs OSS paths
Recommended Skills
Journey fit
Operate / monitoring is the canonical shelf because upstream lineage answers production questions—what feeds this table—during incidents and data-quality investigations. Lineage tracing is observability for data pipelines: it links symptoms in dashboards or tables back to producing DAGs and sources, matching monitoring and dependency clarity.
How it compares
Operational lineage investigation skill, not a dbt model generator or generic SQL debugger.
Common Questions / FAQ
Who is tracing-upstream-lineage for?
Data-minded solo developers and small teams operating Airflow/Astro pipelines who must explain data origins under time pressure.
When should I use tracing-upstream-lineage?
In Operate when a dashboard breaks, a column looks wrong, or someone asks what feeds a table—especially before changing downstream models or SLAs.
Is tracing-upstream-lineage safe to install?
It instructs read-oriented CLI and UI exploration; review the Security Audits panel on this Prism page and limit agent access to production Airflow credentials.
SKILL.md
READMESKILL.md - Tracing Upstream Lineage
# Upstream Lineage: Sources Trace the origins of data - answer "Where does this data come from?" ## Lineage Investigation ### Step 1: Identify the Target Type Determine what we're tracing: - **Table**: Trace what populates this table - **Column**: Trace where this specific column comes from - **DAG**: Trace what data sources this DAG reads from ### Step 2: Find the Producing DAG Tables are typically populated by Airflow DAGs. Find the connection: 1. **Search DAGs by name**: Use `af dags list` and look for DAG names matching the table name - `load_customers` -> `customers` table - `etl_daily_orders` -> `orders` table 2. **Explore DAG source code**: Use `af dags source <dag_id>` to read the DAG definition - Look for INSERT, MERGE, CREATE TABLE statements - Find the target table in the code 3. **Check DAG tasks**: Use `af tasks list <dag_id>` to see what operations the DAG performs ### On Astro If you're running on Astro, the **Lineage tab** in the Astro UI provides visual lineage exploration across DAGs and datasets. Use it to quickly trace upstream dependencies without manually searching DAG source code. ### On OSS Airflow Use DAG source code and task logs to trace lineage (no built-in cross-DAG UI). ### Step 3: Trace Data Sources From the DAG code, identify source tables and systems: **SQL Sources** (look for FROM clauses): ```python # In DAG code: SELECT * FROM source_schema.source_table # <- This is an upstream source ``` **External Sources** (look for connection references): - `S3Operator` -> S3 bucket source - `PostgresOperator` -> Postgres database source - `SalesforceOperator` -> Salesforce API source - `HttpOperator` -> REST API source **File Sources**: - CSV/Parquet files in object storage - SFTP drops - Local file paths ### Step 4: Build the Lineage Chain Recursively trace each source: ``` TARGET: analytics.orders_daily ^ +-- DAG: etl_daily_orders ^ +-- SOURCE: raw.orders (table) | ^ | +-- DAG: ingest_orders | ^ | +-- SOURCE: Salesforce API (external) | +-- SOURCE: dim.customers (table) ^ +-- DAG: load_customers ^ +-- SOURCE: PostgreSQL (external DB) ``` ### Step 5: Check Source Health For each upstream source: - **Tables**: Check freshness with the **checking-freshness** skill - **DAGs**: Check recent run status with `af dags stats` - **External systems**: Note connection info from DAG code ## Lineage for Columns When tracing a specific column: 1. Find the column in the target table schema 2. Search DAG source code for references to that column name 3. Trace through transformations: - Direct mappings: `source.col AS target_col` - Transformations: `COALESCE(a.col, b.col) AS target_col` - Aggregations: `SUM(detail.amount) AS total_amount` ## Output: Lineage Report ### Summary One-line answer: "This table is populated by DAG X from sources Y and Z" ### Lineage Diagram ``` [Salesforce] --> [raw.opportunities] --> [stg.opportunities] --> [fct.sales] | | DAG: ingest_sfdc DAG: transform_sales ``` ### Source Details | Source | Type | Connection | Freshness | Owner | |--------|------|------------|-----------|-------| | raw.orders | Table | Internal | 2h ago | data-team | | Salesforce | API | salesforce_conn | Real-time | sales-ops | ### Transformation Chain Describe how data flows and transforms: 1. Raw data lands in `raw.orders` via Salesforce API sync 2. DAG `transform_orders` cleans and dedupes into `stg.orders` 3. DAG `build_order_facts` joins