Chdb Datastore

The canonical shelf is Build, because the skill teaches how to integrate disparate data sources and query them from application or agent code. The Integrations category fits its cross-source joins, URI-backed stores, and lake-format access patterns, all of which glue external systems together.

Where it fits

Example use

Join Stripe export Parquet with a Postgres users table to prototype a revenue dashboard query.

Example use

Run a three-way join across logs on disk, RDS, and S3 to debug a billing discrepancy in production.

Example use

Aggregate campaign CSVs from GCS with in-app events to measure funnel conversion without a full warehouse migration.

How it compares

Use it instead of ad-hoc pandas scripts that load everything into memory. This skill package documents chdb's federated query model; it is not a hosted BI dashboard.

Common Questions / FAQ

Who is chdb-datastore for?

Solo and indie developers who ship Python backends or data agents and want ClickHouse chdb examples for joining files, SQL databases, and cloud storage without rebuilding their analytics stack.

When should I use chdb-datastore?

Use it during Build when wiring integrations, during Operate when reconciling production data sources, or during Grow when you need cross-source analytics for retention or revenue questions.

Is chdb-datastore safe to install?

Review the Security Audits panel on this Prism page for install risk and file-hash signals before pointing agents at production credentials or cloud buckets.

SKILL.md

# DataStore Examples

> All examples are self-contained and runnable.
> Expected output is shown in comments.

## Table of Contents

1. [Pandas Replacement: One Import Change](#1-pandas-replacement-one-import-change)
2. [Analyze Local Files](#2-analyze-local-files)
3. [Cross-Source Join: MySQL + Parquet](#3-cross-source-join-mysql--parquet)
4. [Cross-Source Join: S3 + PostgreSQL](#4-cross-source-join-s3--postgresql)
5. [Three-Way Join: File + Database + Cloud](#5-three-way-join-file--database--cloud)
6. [Data Lake Formats: Iceberg, Delta, Hudi](#6-data-lake-formats-iceberg-delta-hudi)
7. [URI Shorthand Access](#7-uri-shorthand-access)
8. [Cloud Storage Variants (S3/GCS/Azure/HDFS)](#8-cloud-storage-variants)
9. [Cross-Source Write](#9-cross-source-write)
10. [Explore Remote Schema](#10-explore-remote-schema)
11. [Common Errors & Fixes](#11-common-errors--fixes)

---

## 1. Pandas Replacement: One Import Change

The simplest way to use chdb — change one line, keep everything else:

```python
# Before (standard pandas):
# import pandas as pd

# After (chdb-accelerated):
import chdb.datastore as pd

df = pd.DataStore({"name": ["Alice", "Bob", "Carol", "Dave"],
                   "dept": ["Eng", "Sales", "Eng", "Sales"],
                   "salary": [95000, 72000, 110000, 68000]})

# Same pandas API — everything works
result = (df[df["salary"] > 70000]
    .groupby("dept")
    .agg({"salary": ["mean", "count"]})
    .sort_values("mean", ascending=False))

print(result)
# Expected output:
#     dept  mean  count
# 0    Eng  102500    2
# 1  Sales   72000    1
```

**Why it's faster:** Operations compile to ClickHouse SQL and execute as a single optimized query, instead of step-by-step Python evaluation.
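To make the "single optimized query" idea concrete, here is a minimal sketch of a lazy query builder that records each operation and only produces SQL at the end. The `LazyQuery` class and its methods are hypothetical and written for illustration; this is not chdb's implementation, and the SQL chdb generates may look different.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LazyQuery:
    """Hypothetical toy builder: records operations, emits one SQL string."""
    table: str
    conditions: Tuple[str, ...] = ()
    group_by: Optional[str] = None
    aggregates: Tuple[str, ...] = ()

    def where(self, condition: str) -> "LazyQuery":
        # Nothing is evaluated here; the condition is only recorded.
        return replace(self, conditions=self.conditions + (condition,))

    def groupby(self, column: str) -> "LazyQuery":
        return replace(self, group_by=column)

    def agg(self, column: str, func: str) -> "LazyQuery":
        expression = f"{func}({column}) AS {func}_{column}"
        return replace(self, aggregates=self.aggregates + (expression,))

    def to_sql(self) -> str:
        # All recorded steps collapse into one statement.
        select_items = ([self.group_by] if self.group_by else []) + list(self.aggregates)
        sql = f"SELECT {', '.join(select_items) or '*'} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        if self.group_by:
            sql += f" GROUP BY {self.group_by}"
        return sql


query = (LazyQuery("employees")
         .where("salary > 70000")
         .groupby("dept")
         .agg("salary", "avg")
         .agg("salary", "count"))
sql = query.to_sql()
print(sql)
```

Because each method only returns a new description of the query, the engine sees the filter, grouping, and aggregation together and can plan them as one statement rather than materializing an intermediate result after every step.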

---

## 2. Analyze Local Files

```python
from datastore import DataStore

# Parquet — pandas-style analysis
ds = DataStore.from_file("sales.parquet")
top_products = (ds[ds['revenue'] > 0]
    .groupby('product')
    .agg({'revenue': 'sum', 'quantity': 'sum'})
    .sort_values('revenue', ascending=False)
    .head(10))
print(top_products)

# CSV with filtering
ds = DataStore.from_file("employees.csv")
senior = ds[(ds['years'] > 5) & (ds['dept'] == 'Engineering')]
print(senior[['name', 'title', 'salary']].sort_values('salary', ascending=False))

# Glob pattern — query all matching files at once
ds = DataStore.from_file("logs/2024-*.csv")
errors = ds[ds['level'] == 'ERROR'].groupby('module')['message'].count()
print(errors.sort_values(ascending=False))

# See the SQL behind any query
print(top_products.to_sql())
```
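The examples above read `employees.csv` and `logs/2024-*.csv` from disk. If those files are not at hand, a small standard-library script along these lines can generate stand-ins. The rows are made-up sample values, and the column names are simply the ones the examples reference; `sales.parquet` is not covered because the standard library cannot write Parquet.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    """Write one CSV file, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def make_sample_files(root):
    """Create employees.csv and logs/2024-*.csv under root; return the paths."""
    employees = root / "employees.csv"
    # Made-up rows, only meant to exercise the filters in the examples.
    write_csv(employees, ["name", "title", "dept", "years", "salary"], [
        ["Alice", "Staff Engineer", "Engineering", 8, 150000],
        ["Bob", "Account Executive", "Sales", 6, 90000],
        ["Carol", "Engineer", "Engineering", 2, 110000],
    ])
    log_rows = {
        "2024-01": [["ERROR", "billing", "charge failed"], ["INFO", "auth", "login ok"]],
        "2024-02": [["ERROR", "billing", "refund failed"], ["ERROR", "auth", "token expired"]],
    }
    paths = [employees]
    for month, rows in log_rows.items():
        path = root / "logs" / f"{month}.csv"
        write_csv(path, ["level", "module", "message"], rows)
        paths.append(path)
    return paths


# A temporary directory keeps the sketch from overwriting real files.
root = Path(tempfile.mkdtemp(prefix="chdb_samples_"))
created = make_sample_files(root)
for path in created:
    print(path.relative_to(root))
```

Passing `Path(".")` instead of the temporary directory writes the files next to the script, which matches the relative paths used in the examples.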

---

## 3. Cross-Source Join: MySQL + Parquet

```python
from datastore import DataStore

customers = DataStore.from_mysql(
    host="db:3306", database="crm", table="customers",
    user="reader", password="pass")

orders = DataStore.from_file("orders.parquet")

result = (customers
    .join(orders, left_on="id", right_on="customer_id", how="inner")
    .groupby("country")
    .agg({"amount": ["sum", "mean"], "order_id": "count"})
    .sort_values("sum", ascending=False))

print(result)
# Expected: country-level order summary with total, average, and count

print(result.to_sql())
# Shows the cross-source SQL generated by chdb
```
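To check the shape of the expected result without a MySQL server or a Parquet file, the same inner join and aggregation can be sketched against an in-memory SQLite database. This is a stand-in for the logic only: it does not use chdb, the table contents are made-up sample rows, and the SQL is not what `to_sql()` prints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'US'), (3, 'US');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 20.0), (12, 3, 40.0), (13, 3, 60.0);
""")

# Inner join on customers.id = orders.customer_id, then one row per country
# with the total, the average, and the number of orders.
rows = conn.execute("""
    SELECT c.country,
           SUM(o.amount)     AS total,
           AVG(o.amount)     AS average,
           COUNT(o.order_id) AS order_count
    FROM customers AS c
    INNER JOIN orders AS o ON c.id = o.customer_id
    GROUP BY c.country
    ORDER BY total DESC
""").fetchall()

for row in rows:
    print(row)
# ('US', 120.0, 40.0, 3)
# ('DE', 50.0, 50.0, 1)
```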

---

## 4. Cross-Source Join: S3 + PostgreSQL

```python
from datastore import DataStore

events = DataStore.from_s3(
    "s3://analytics/events/2024-*.parquet",
    access_key_id="AKIA...", secret_access_key="secret...")

profiles = DataStore.from_postgresql(
    host="pg.example.com:5432", database="users",
    table="profiles", user="analyst", password="pass")

result = (events
    .join(profiles, left_on="user_id", right_on="id")
    .filter(events['event_type'] == 'purchase')
    .groupby(["country", "age_group"])
    .agg({"amount": "sum", "event_id": "count"})
    .sort_values("amount", ascending=False))

print(result)
# Expected: purchase events aggregated by country and age group
```
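The example above passes credentials as string literals. One way to keep secrets out of source files is to read them from environment variables, as sketched below. The helper name `s3_credentials_from_env` is hypothetical; the returned keys reuse the `access_key_id` and `secret_access_key` keyword names from the example above, and the environment variable names are the ones AWS tooling conventionally uses.

```python
import os


def s3_credentials_from_env(environ=os.environ):
    """Build keyword arguments for an S3 connection from environment variables."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        # Fail early with a clear message instead of sending empty credentials.
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "access_key_id": environ["AWS_ACCESS_KEY_ID"],
        "secret_access_key": environ["AWS_SECRET_ACCESS_KEY"],
    }


# Demo with a plain dict so the sketch runs without touching real credentials.
demo_env = {"AWS_ACCESS_KEY_ID": "example-key-id",
            "AWS_SECRET_ACCESS_KEY": "example-secret"}
creds = s3_credentials_from_env(demo_env)
print(sorted(creds))
```

Assuming `from_s3` accepts those keyword arguments as shown in the example above, the call would become `DataStore.from_s3("s3://analytics/events/2024-*.parquet", **s3_credentials_from_env())`.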

---

## 5. Three-Way Join: File + Database + Cloud

```python
from datastore import DataStore

# products ... (the remainder of this example is missing from the source)
```

What is this skill?

Swap pandas for chdb.datastore with one import while keeping familiar groupby/agg workflows

Runnable examples for MySQL + Parquet, S3 + PostgreSQL, and three-way file + DB + cloud joins

Coverage of Iceberg, Delta, and Hudi data-lake formats plus cross-source write paths

URI shorthand and S3, GCS, Azure, and HDFS storage variants in self-contained scripts

Common errors section with fixes for faster agent debugging

11 numbered example sections in the skill readme

Self-contained runnable examples with expected output in comments

Compatible agents: Claude Code, Cursor, Codex, any compatible agent

Adoption & trust: 774 installs on skills.sh; 458 GitHub stars; 2/3 security scanners passed (skills.sh audits).

What do I get? / Deliverables

After running the skill, your agent can implement chdb DataStore pipelines that join cross-source data in Python and optionally write results back to remote targets.

Runnable Python scripts following documented join and write patterns

Cross-source query snippets agents can drop into backends or notebooks

Journey fit

Spans multiple journey phases - primary shelf plus alternate fits below.
