Chdb Sql

Local analytical SQL over files fits Build when you prototype data paths and backends, before promoting queries to production warehouses. Backend covers embedded chdb sessions, file() reads, joins across sources, and streaming large results from app-side Python.

Also useful

Also useful

Where it fits

Example use

Aggregate sales.csv and users.parquet to sanity-check revenue filters before API endpoints ship.

Example use

Rank countries by user count from a marketing export using GROUP BY on file().

Example use

Count log levels across logs/2024-*.parquet after an incident without uploading raw logs to a cloud warehouse.

How it compares

Skill cookbook for embedded chdb SQL—not a hosted ClickHouse MCP server or full ETL framework.

Common Questions / FAQ

Who is chdb-sql for?

Developers and solo data-curious builders who want ClickHouse SQL semantics on local files via Python without operating a separate database service.

When should I use chdb-sql?

In Build when prototyping metrics over Parquet/CSV, in Grow when exploring funnel or revenue slices on dumps, and in Operate when grouping log levels across dated Parquet globs.

Is chdb-sql safe to install?

Examples read local files you specify; review the Security Audits panel on this page and avoid querying directories that contain secrets or PII you do not intend to load into the agent context.

SKILL.md

READMESKILL.md - Chdb Sql

# chdb SQL Examples

> All examples are self-contained and runnable.
> Expected output is shown in comments.

## Table of Contents

1. [Query Any File](#1-query-any-file)
2. [Cross-Source SQL Joins](#2-cross-source-sql-joins)
3. [Session: Build Analytical Tables](#3-session-build-analytical-tables)
4. [Python Data as SQL Table](#4-python-data-as-sql-table)
5. [Parametrized Queries](#5-parametrized-queries)
6. [Window Functions](#6-window-functions)
7. [User-Defined Functions (UDF)](#7-user-defined-functions-udf)
8. [Streaming Large Results](#8-streaming-large-results)
9. [Common Errors & Fixes](#9-common-errors--fixes)

---

## 1. Query Any File

```python
import chdb

# Parquet
result = chdb.query("""
    SELECT country, count() AS cnt
    FROM file('users.parquet', Parquet)
    GROUP BY country
    ORDER BY cnt DESC
    LIMIT 10
""", "Pretty")
result.show()
# Expected: top 10 countries by user count, formatted table

# CSV
df = chdb.query("""
    SELECT * FROM file('sales.csv', CSVWithNames)
    WHERE revenue > 10000
    ORDER BY revenue DESC
""", "DataFrame")
print(df)
# Expected: pandas DataFrame with high-revenue rows

# JSON Lines
chdb.query("""
    SELECT * FROM file('events.jsonl', JSONEachRow)
    WHERE event_type = 'purchase'
""").show()

# Glob pattern — query all matching files
df = chdb.query("""
    SELECT level, count() AS cnt
    FROM file('logs/2024-*.parquet', Parquet)
    GROUP BY level
    ORDER BY cnt DESC
""", "DataFrame")
print(df)
# Expected:
#     level    cnt
# 0   INFO   45230
# 1   WARN    3210
# 2  ERROR     890
```

---

## 2. Cross-Source SQL Joins

```python
import chdb

# MySQL + Parquet join
chdb.query("""
    SELECT u.name, u.email, o.product, o.amount
    FROM mysql('db:3306', 'crm', 'users', 'root', 'pass') AS u
    JOIN file('orders.parquet', Parquet) AS o ON u.id = o.user_id
    WHERE o.amount > 100
    ORDER BY o.amount DESC
    LIMIT 20
""", "Pretty").show()

# S3 + PostgreSQL join
df = chdb.query("""
    SELECT e.event_type, p.country, count() AS cnt
    FROM s3('s3://bucket/events.parquet', 'KEY', 'SECRET', 'Parquet') AS e
    JOIN postgresql('pg:5432', 'users', 'profiles', 'user', 'pass') AS p
        ON e.user_id = p.id
    GROUP BY e.event_type, p.country
    ORDER BY cnt DESC
""", "DataFrame")
print(df)

# ClickHouse + local CSV
chdb.query("""
    SELECT r.host, l.status_code, count() AS requests
    FROM remote('ch:9000', 'logs', 'access_log', 'default', '') AS r
    JOIN file('server_config.csv', CSVWithNames) AS l ON r.host = l.hostname
    GROUP BY r.host, l.status_code
    ORDER BY requests DESC
""").show()
```

---

## 3. Session: Build Analytical Tables

```python
from chdb import session as chs

sess = chs.Session("./analytics_db")

# Ingest from multiple external sources into local tables
sess.query("""
    CREATE TABLE users ENGINE = MergeTree() ORDER BY id AS
    SELECT * FROM mysql('db:3306', 'crm', 'users', 'root', 'pass')
""")

sess.query("""
    CREATE TABLE events ENGINE = MergeTree() ORDER BY (ts, user_id) AS
    SELECT * FROM s3('s3://logs/events/*.parquet', NOSIGN)
""")

# Analyze locally — fast iterative queries
result = sess.query("""
    SELECT
        u.country,
        e.event_type,
        count() AS cnt,
        uniqExact(e.user_id) AS unique_users
    FROM events e
    JOIN users u ON e.user_id = u.id
    WHERE e.ts >= today() - 7
    GROUP BY u.country, e.event_type
    ORDER BY cnt DESC
    LIMIT 20
""", "Pretty")
result.show()
# Expected: formatted table with country, event_type, count, unique users

# Check table contents
sess.query("SELECT count() FROM users").show()
sess.query("SELECT count() FROM events").show()

sess.close()
```

---

## 4. Python Data as SQL Table

```python
import chdb
import pandas as pd

# Query a Python dict directly in SQL
scores = {"student": ["Alice", "Bob", "Carol"], "math": [95, 87, 92], "science": [88, 91, 85]}
chdb.query("SELECT student, math + science AS total FROM Python(scores) ORDER BY total DESC").show()
# Expect

What is this skill?

Nine self-contained sections from file queries through streaming and common errors

Query Parquet, CSVWithNames, JSONEachRow, and glob patterns like logs/2024-*.parquet

Cross-source SQL joins and session tables for analytical builds

Python DataFrame outputs, parametrized queries, window functions, and UDF examples

Streaming large results and a dedicated common-errors troubleshooting section

9 numbered example sections in the skill readme

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 809 installs on skills.sh; 458 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What do I get? / Deliverables

You run validated chdb.query patterns for files, joins, windows, and streaming outputs you can lift into production pipelines later.

Runnable chdb.query snippets with documented expected output

Patterns for parametrized queries and DataFrame/Pretty render modes

Reference flows for cross-source joins and error fixes

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.

Primary fit

Also useful

Also useful

Where it fits

Example use

Aggregate sales.csv and users.parquet to sanity-check revenue filters before API endpoints ship.

Example use

Rank countries by user count from a marketing export using GROUP BY on file().

Example use