
Chdb Sql
Run ClickHouse-flavored SQL locally on Parquet, CSV, and JSONL with chdb for quick analytics without standing up a server.
Overview
chdb SQL is an agent skill most often used in Build (also Grow, Operate) that provides runnable chdb SQL examples for file-backed analytics in Python.
Install
npx skills add https://github.com/clickhouse/agent-skills --skill chdb-sqlWhat is this skill?
- Nine self-contained sections from file queries through streaming and common errors
- Query Parquet, CSVWithNames, JSONEachRow, and glob patterns like logs/2024-*.parquet
- Cross-source SQL joins and session tables for analytical builds
- Python DataFrame outputs, parametrized queries, window functions, and UDF examples
- Streaming large results and a dedicated common-errors troubleshooting section
- 9 numbered example sections in the skill readme
Adoption & trust: 809 installs on skills.sh; 458 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have Parquet, CSV, or JSONL sitting on disk but spinning up ClickHouse or pandas glue code for every ad-hoc question wastes an afternoon.
Who is it for?
Indie builders doing local OLAP on exports, event logs, or warehouse extracts with minimal infrastructure.
Skip if: Production DBA work on managed ClickHouse clusters when you already need replication, RBAC, and migration runbooks instead of file() tutorials.
When should I use this skill?
You need copy-paste chdb SQL for files, joins, sessions, or streaming analytics in Python.
What do I get? / Deliverables
You run validated chdb.query patterns for files, joins, windows, and streaming outputs you can lift into production pipelines later.
- Runnable chdb.query snippets with documented expected output
- Patterns for parametrized queries and DataFrame/Pretty render modes
- Reference flows for cross-source joins and error fixes
Recommended Skills
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Local analytical SQL over files fits Build when you prototype data paths and backends, before promoting queries to production warehouses. Backend covers embedded chdb sessions, file() reads, joins across sources, and streaming large results from app-side Python.
Where it fits
Aggregate sales.csv and users.parquet to sanity-check revenue filters before API endpoints ship.
Rank countries by user count from a marketing export using GROUP BY on file().
Count log levels across logs/2024-*.parquet after an incident without uploading raw logs to a cloud warehouse.
How it compares
Skill cookbook for embedded chdb SQL—not a hosted ClickHouse MCP server or full ETL framework.
Common Questions / FAQ
Who is chdb-sql for?
Developers and solo data-curious builders who want ClickHouse SQL semantics on local files via Python without operating a separate database service.
When should I use chdb-sql?
In Build when prototyping metrics over Parquet/CSV, in Grow when exploring funnel or revenue slices on dumps, and in Operate when grouping log levels across dated Parquet globs.
Is chdb-sql safe to install?
Examples read local files you specify; review the Security Audits panel on this page and avoid querying directories that contain secrets or PII you do not intend to load into the agent context.
SKILL.md
READMESKILL.md - Chdb Sql
# chdb SQL Examples > All examples are self-contained and runnable. > Expected output is shown in comments. ## Table of Contents 1. [Query Any File](#1-query-any-file) 2. [Cross-Source SQL Joins](#2-cross-source-sql-joins) 3. [Session: Build Analytical Tables](#3-session-build-analytical-tables) 4. [Python Data as SQL Table](#4-python-data-as-sql-table) 5. [Parametrized Queries](#5-parametrized-queries) 6. [Window Functions](#6-window-functions) 7. [User-Defined Functions (UDF)](#7-user-defined-functions-udf) 8. [Streaming Large Results](#8-streaming-large-results) 9. [Common Errors & Fixes](#9-common-errors--fixes) --- ## 1. Query Any File ```python import chdb # Parquet result = chdb.query(""" SELECT country, count() AS cnt FROM file('users.parquet', Parquet) GROUP BY country ORDER BY cnt DESC LIMIT 10 """, "Pretty") result.show() # Expected: top 10 countries by user count, formatted table # CSV df = chdb.query(""" SELECT * FROM file('sales.csv', CSVWithNames) WHERE revenue > 10000 ORDER BY revenue DESC """, "DataFrame") print(df) # Expected: pandas DataFrame with high-revenue rows # JSON Lines chdb.query(""" SELECT * FROM file('events.jsonl', JSONEachRow) WHERE event_type = 'purchase' """).show() # Glob pattern — query all matching files df = chdb.query(""" SELECT level, count() AS cnt FROM file('logs/2024-*.parquet', Parquet) GROUP BY level ORDER BY cnt DESC """, "DataFrame") print(df) # Expected: # level cnt # 0 INFO 45230 # 1 WARN 3210 # 2 ERROR 890 ``` --- ## 2. Cross-Source SQL Joins ```python import chdb # MySQL + Parquet join chdb.query(""" SELECT u.name, u.email, o.product, o.amount FROM mysql('db:3306', 'crm', 'users', 'root', 'pass') AS u JOIN file('orders.parquet', Parquet) AS o ON u.id = o.user_id WHERE o.amount > 100 ORDER BY o.amount DESC LIMIT 20 """, "Pretty").show() # S3 + PostgreSQL join df = chdb.query(""" SELECT e.event_type, p.country, count() AS cnt FROM s3('s3://bucket/events.parquet', 'KEY', 'SECRET', 'Parquet') AS e JOIN postgresql('pg:5432', 'users', 'profiles', 'user', 'pass') AS p ON e.user_id = p.id GROUP BY e.event_type, p.country ORDER BY cnt DESC """, "DataFrame") print(df) # ClickHouse + local CSV chdb.query(""" SELECT r.host, l.status_code, count() AS requests FROM remote('ch:9000', 'logs', 'access_log', 'default', '') AS r JOIN file('server_config.csv', CSVWithNames) AS l ON r.host = l.hostname GROUP BY r.host, l.status_code ORDER BY requests DESC """).show() ``` --- ## 3. Session: Build Analytical Tables ```python from chdb import session as chs sess = chs.Session("./analytics_db") # Ingest from multiple external sources into local tables sess.query(""" CREATE TABLE users ENGINE = MergeTree() ORDER BY id AS SELECT * FROM mysql('db:3306', 'crm', 'users', 'root', 'pass') """) sess.query(""" CREATE TABLE events ENGINE = MergeTree() ORDER BY (ts, user_id) AS SELECT * FROM s3('s3://logs/events/*.parquet', NOSIGN) """) # Analyze locally — fast iterative queries result = sess.query(""" SELECT u.country, e.event_type, count() AS cnt, uniqExact(e.user_id) AS unique_users FROM events e JOIN users u ON e.user_id = u.id WHERE e.ts >= today() - 7 GROUP BY u.country, e.event_type ORDER BY cnt DESC LIMIT 20 """, "Pretty") result.show() # Expected: formatted table with country, event_type, count, unique users # Check table contents sess.query("SELECT count() FROM users").show() sess.query("SELECT count() FROM events").show() sess.close() ``` --- ## 4. Python Data as SQL Table ```python import chdb import pandas as pd # Query a Python dict directly in SQL scores = {"student": ["Alice", "Bob", "Carol"], "math": [95, 87, 92], "science": [88, 91, 85]} chdb.query("SELECT student, math + science AS total FROM Python(scores) ORDER BY total DESC").show() # Expect