
Warehouse Init
Bootstrap a version-controlled `.astro/warehouse.md` so your agent can resolve table and column names without hammering the live warehouse on every question.
Overview
warehouse-init is an agent skill for the Build phase that discovers warehouse schema metadata and generates `.astro/warehouse.md` for fast, query-free table lookups.
Install
npx skills add https://github.com/astronomer/agents --skill warehouse-init
What is this skill?
- Discovers databases, schemas, tables, and columns from configured warehouse targets
- Enriches catalog entries with codebase context via a parallel Explore subagent (dbt YAML, SQL, schema docs)
- Records row counts and flags large tables for cost-aware querying
- Writes team-shareable `.astro/warehouse.md` for instant concept→table lookups
- One-time init with refresh when schema or models change; driven by `/astronomer-data:warehouse-init`
- Five-step process that includes a parallel codebase Explore subagent
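As a rough illustration of the "flags large tables" step, the sketch below marks tables whose row count exceeds 100M, the threshold named in the skill's discovery prompt. The `TableInfo` type, the `flag_large_tables` helper, and the sample rows are hypothetical and are not part of the skill itself.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# Threshold from the skill's discovery prompt: more than 100M rows counts as "large".
LARGE_TABLE_ROWS = 100_000_000


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical record for one table returned by schema discovery."""
    schema: str
    name: str
    row_count: Optional[int]  # None when the warehouse reports no row count


def flag_large_tables(tables: list[TableInfo], threshold: int = LARGE_TABLE_ROWS) -> list[str]:
    """Return schema-qualified names of tables whose row count exceeds the threshold."""
    return [
        f"{t.schema}.{t.name}"
        for t in tables
        if t.row_count is not None and t.row_count > threshold
    ]


# Made-up sample rows for illustration only, not real warehouse data.
sample = [
    TableInfo("MODEL_CORE", "EVENTS", 250_000_000),
    TableInfo("MODEL_CORE", "USERS", 1_200_000),
    TableInfo("RAW", "LOGS", None),
]
large = flag_large_tables(sample)
print(large)
```

Tables with an unknown row count are left unflagged here; a stricter policy could treat them as large until proven otherwise.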
Adoption & trust: 554 installs on skills.sh; 384 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your agent keeps guessing table names or running expensive discovery queries because nobody maintains a single, searchable schema map in the repo.
Who is it for?
Projects already on Astronomer’s analyzing-data stack with `warehouse.yml` configured and dbt or SQL models in the repo.
Skip if: you have a greenfield app with no warehouse yet, or your team forbids agents from shelling out to data CLI scripts.
When should I use this skill?
User says "/astronomer-data:warehouse-init" or asks to set up data discovery for the warehouse.
What do I get? / Deliverables
You get a maintained `.astro/warehouse.md` with enriched table and column context so later data-analysis skills can look up entities instantly without repeated warehouse introspection.
- `.astro/warehouse.md` with databases, tables, columns, row counts, and codebase-enriched descriptions
Journey fit
Warehouse schema discovery happens when you wire the product to real data models—after you commit to build but before day-to-day analytics work. It connects the codebase (dbt, SQL, docs) to external warehouse metadata, which is classic data-integration setup rather than app UI or shipping gates.
How it compares
Use for one-shot schema materialization in-repo—not as a live BI semantic layer or replacement for dbt docs hosting.
Common Questions / FAQ
Who is warehouse-init for?
Solo builders and small teams using Astronomer data agents who need a durable, git-friendly warehouse catalog for agent prompts and code search.
When should I use warehouse-init?
Run it during Build when wiring analytics integrations, after adding a new warehouse database, or whenever schema or dbt models change enough that `.astro/warehouse.md` is stale.
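One possible way to decide whether the file is stale is to compare its modification time against the model files in the repo. The sketch below is a minimal heuristic under that assumption; `catalog_is_stale`, the `models` directory, and the file patterns are hypothetical and not something the skill provides.

```python
from pathlib import Path


def catalog_is_stale(catalog: Path, model_root: Path, patterns=("*.yml", "*.sql")) -> bool:
    """Return True if the catalog is missing or older than any matching model file."""
    if not catalog.exists():
        return True
    catalog_mtime = catalog.stat().st_mtime
    for pattern in patterns:
        # rglob walks model_root recursively; a missing directory yields no files.
        for path in model_root.rglob(pattern):
            if path.stat().st_mtime > catalog_mtime:
                return True
    return False


if __name__ == "__main__":
    # Assumed layout: catalog at .astro/warehouse.md, dbt or SQL models under ./models.
    print(catalog_is_stale(Path(".astro/warehouse.md"), Path("models")))
```

Modification times are only a hint, since a fresh git checkout resets them. Schema changes made directly in the warehouse also leave repo files untouched, so this check cannot detect them.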
Is warehouse-init safe to install?
Review the Security Audits panel on this Prism page and treat warehouse credentials and CLI access as sensitive; the skill reads config and runs discovery scripts against your configured databases.
SKILL.md
# Initialize Warehouse Schema

Generate a comprehensive, user-editable schema reference file for the data warehouse.

**Scripts:** `../analyzing-data/scripts/`. All CLI commands below are relative to the `analyzing-data` skill's directory. Before running any `scripts/cli.py` command, `cd` to `../analyzing-data/` relative to this file.

## What This Does

1. Discovers all databases, schemas, tables, and columns from the warehouse
2. **Enriches with codebase context** (dbt models, gusty SQL, schema docs)
3. Records row counts and identifies large tables
4. Generates `.astro/warehouse.md` - a version-controllable, team-shareable reference
5. Enables instant concept→table lookups without warehouse queries

## Process

### Step 1: Read Warehouse Configuration

```bash
cat ~/.astro/agents/warehouse.yml
```

Get the list of databases to discover (e.g., `databases: [HQ, ANALYTICS, RAW]`).

### Step 2: Search Codebase for Context (Parallel)

**Launch a subagent to find business context in code:**

```
Task(
    subagent_type="Explore",
    prompt="""
    Search for data model documentation in the codebase:

    1. dbt models: **/models/**/*.yml, **/schema.yml
       - Extract table descriptions, column descriptions
       - Note primary keys and tests

    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
       - Parse frontmatter for: description, primary_key, tests
       - Note schema mappings

    3. AGENTS.md or CLAUDE.md files with data layer documentation

    Return a mapping of:
      table_name -> {description, primary_key, important_columns, layer}
    """
)
```

### Step 3: Parallel Warehouse Discovery

**Launch one subagent per database** using the Task tool:

```
For each database in configured_databases:
    Task(
        subagent_type="general-purpose",
        prompt="""
        Discover all metadata for database {DATABASE}.

        Use the CLI to run SQL queries:
        # Scripts are relative to ../analyzing-data/
        uv run scripts/cli.py exec "df = run_sql('...')"
        uv run scripts/cli.py exec "print(df)"

        1. Query schemas:
           SELECT SCHEMA_NAME FROM {DATABASE}.INFORMATION_SCHEMA.SCHEMATA

        2. Query tables with row counts:
           SELECT TABLE_SCHEMA, TABLE_NAME, ROW_COUNT, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.TABLES
           ORDER BY TABLE_SCHEMA, TABLE_NAME

        3. For important schemas (MODEL_*, METRICS_*, MART_*), query columns:
           SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COMMENT
           FROM {DATABASE}.INFORMATION_SCHEMA.COLUMNS
           WHERE TABLE_SCHEMA = 'X'

        Return a structured summary:
        - Database name
        - List of schemas with table counts
        - For each table: name, row_count, key columns
        - Flag any tables with >100M rows as "large"
        """
    )
```

**Run all subagents in parallel** (single message with multiple Task calls).

### Step 4: Discover Categorical Value Families

For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families:

```bash
uv run scripts/cli.py exec "df = run_sql('''
SELECT DISTINCT column_name, COUNT(*) as occurrences
FROM table
WHERE column_name IS NOT NULL
GROUP BY column_name
ORDER BY occurrences DESC
LIMIT 50
''')"
uv run scripts/cli.py exec "print(df)"
```

Group related values into families by common prefix/suffix (e.g., `Export*` for ExportCSV, ExportJSON, ExportParquet).

### Step 5: Merge Results

Combine warehouse metadata + codebase context:

1. **Quick Reference table** - concept → table mappings (pre-populated from code if found)
2. **Categorical Columns** - value families for key filter columns
3. **Database sections** - one per databas