
Data Analysis
Run SQL over Excel and CSV exports with DuckDB—schema inspection, queries, summaries, and exports—without standing up a warehouse.
Overview
data-analysis is an agent skill most often used in Grow (also Validate, Build) that analyzes Excel and CSV files with DuckDB SQL, schema inspection, summaries, and cached reloads.
Install
npx skills add https://github.com/bytedance/deer-flow --skill data-analysisWhat is this skill?
- DuckDB in-process SQL over Excel (.xlsx/.xls) and CSV inputs
- Schema inspection, arbitrary SQL, statistical summaries, and result export
- Persistent cache keyed by combined file hash under a temp DuckDB cache directory
- Auto-installs duckdb and openpyxl when dependencies are missing
- Table map JSON sidecar for stable table names across repeated runs
- Supports Excel (.xlsx/.xls) and CSV file inputs via DuckDB
- Uses combined SHA256 file-hash cache directory .data-analysis-cache under system temp
Adoption & trust: 2.1k installs on skills.sh; 70.7k GitHub stars; 2/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You have business data locked in spreadsheets or CSV exports and need SQL-level answers without building a full analytics stack first.
Who is it for?
Indie SaaS founders analyzing exports, campaign CSVs, or product usage spreadsheets inside an agent-driven workflow.
Skip if: Real-time production pipelines, multi-tenant warehouse modeling, or analyses that require GPU ML training stacks.
When should I use this skill?
You have Excel or CSV files on disk and need schema discovery, SQL, statistical summaries, or exports via DuckDB inside an agent session.
What do I get? / Deliverables
DuckDB loads your files, answers schema and SQL questions, and returns summaries or exports you can drop into decisions or docs.
- Schema description and table map for loaded files
- SQL query results and statistical summaries
- Exported analysis outputs from DuckDB runs
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Grow/analytics is the canonical shelf for turning exported business files into answers; the same script supports validation and build-time data checks earlier. Analytics covers ad-hoc measurement on spreadsheets and logs indie products already produce.
Where it fits
Profile a early-user survey CSV to see which segments justify building first.
Validate a webhook export file matches expected columns before wiring live ingestion.
Run SQL on weekly revenue spreadsheets to compute churn and expansion without a warehouse.
Re-query the same hashed export cache after fixing a data bug to confirm metrics moved.
How it compares
Lightweight local SQL over files—not a hosted BI product or a long-running Spark-style data platform.
Common Questions / FAQ
Who is data-analysis for?
Builders who work from Excel/CSV exports and want their coding agent to run DuckDB SQL, inspect schemas, and summarize metrics without a separate notebook habit.
When should I use data-analysis?
In Validate to sanity-check prototype or survey data; in Build when verifying integration exports; and in Grow when reviewing retention, revenue, or campaign CSVs for decisions.
Is data-analysis safe to install?
The skill executes local Python, reads files you point at, and may pip-install duckdb/openpyxl—inspect paths and data sensitivity; review the Security Audits panel on this Prism page before running on confidential exports.
SKILL.md
READMESKILL.md - Data Analysis
""" Data Analysis Script using DuckDB. Analyzes Excel (.xlsx/.xls) and CSV files using DuckDB's in-process SQL engine. Supports schema inspection, SQL queries, statistical summaries, and result export. """ import argparse import hashlib import json import logging import os import re import subprocess import sys import tempfile logging.basicConfig(level=logging.INFO, format="%(message)s") logger = logging.getLogger(__name__) try: import duckdb except ImportError: logger.error("duckdb is not installed. Installing...") subprocess.run([sys.executable, "-m", "pip", "install", "duckdb", "openpyxl", "-q"], check=True) import duckdb try: import openpyxl # noqa: F401 except ImportError: subprocess.run([sys.executable, "-m", "pip", "install", "openpyxl", "-q"], check=True) # Cache directory for persistent DuckDB databases CACHE_DIR = os.path.join(tempfile.gettempdir(), ".data-analysis-cache") TABLE_MAP_SUFFIX = ".table_map.json" def compute_files_hash(files: list[str]) -> str: """Compute a combined SHA256 hash of all input files for cache key.""" hasher = hashlib.sha256() for file_path in sorted(files): try: with open(file_path, "rb") as f: while chunk := f.read(8192): hasher.update(chunk) except OSError: # Include path as fallback if file can't be read hasher.update(file_path.encode()) return hasher.hexdigest() def get_cache_db_path(files_hash: str) -> str: """Get the path to the cached DuckDB database file.""" os.makedirs(CACHE_DIR, exist_ok=True) return os.path.join(CACHE_DIR, f"{files_hash}.duckdb") def get_table_map_path(files_hash: str) -> str: """Get the path to the cached table map JSON file.""" return os.path.join(CACHE_DIR, f"{files_hash}{TABLE_MAP_SUFFIX}") def save_table_map(files_hash: str, table_map: dict[str, str]) -> None: """Save table map to a JSON file alongside the cached DB.""" path = get_table_map_path(files_hash) with open(path, "w", encoding="utf-8") as f: json.dump(table_map, f, ensure_ascii=False) def load_table_map(files_hash: str) -> dict[str, str] | None: """Load table map from cache. Returns None if not found.""" path = get_table_map_path(files_hash) if not os.path.exists(path): return None try: with open(path, "r", encoding="utf-8") as f: return json.load(f) except Exception: return None def sanitize_table_name(name: str) -> str: """Sanitize a sheet/file name into a valid SQL table name.""" sanitized = re.sub(r"[^\w]", "_", name) if sanitized and sanitized[0].isdigit(): sanitized = f"t_{sanitized}" return sanitized def load_files(con: duckdb.DuckDBPyConnection, files: list[str]) -> dict[str, str]: """ Load Excel/CSV files into DuckDB tables. Returns a mapping of original_name -> sanitized_table_name. """ con.execute("INSTALL spatial; LOAD spatial;") table_map: dict[str, str] = {} for file_path in files: if not os.path.exists(file_path): logger.error(f"File not found: {file_path}") continue ext = os.path.splitext(file_path)[1].lower() if ext in (".xlsx", ".xls"): _load_excel(con, file_path, table_map) elif ext == ".csv": _load_csv(con, file_path, table_map) else: logger.warning(f"Unsupported file format: {ext} ({file_path})") return table_map def _load_excel( con: duckdb.DuckDBPyConnection, file_path: str, table_map: dict[str, str] ) -> None: """Load all sheets from an Excel file into DuckDB tables.""" import openpyxl wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True) sheet_names = wb.sheetnames wb.close() for sheet_name in sheet_names: table_name = sanitize_table_name(sheet_name) # Handle duplicate table names original_table_name =