
Chdb Datastore
Wire solo-builder analytics and ETL-style queries across files, SQL databases, and object storage without migrating everything into one warehouse first.
Overview
chdb-datastore is an agent skill for the Build phase that teaches chdb DataStore patterns to query and join local files, databases, and cloud objects with a pandas-like Python API.
Install
npx skills add https://github.com/clickhouse/agent-skills --skill chdb-datastore
What is this skill?
- Swap pandas for chdb.datastore with one import while keeping familiar groupby/agg workflows
- Runnable examples for MySQL + Parquet, S3 + PostgreSQL, and three-way file + DB + cloud joins
- Coverage of Iceberg, Delta, and Hudi data-lake formats plus cross-source write paths
- URI shorthand and S3, GCS, Azure, and HDFS storage variants in self-contained scripts
- Common errors section with fixes for faster agent debugging
- 11 numbered example sections in the skill readme
- Self-contained runnable examples with expected output in comments
Adoption & trust: 774 installs on skills.sh; 458 GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have metrics and tables scattered across Parquet files, live SQL databases, and cloud storage but no fast way to analyze them together from your app or agent.
Who is it for?
Indie builders adding analytics, reporting, or agent-readable data fusion on top of existing Postgres, MySQL, or object-storage assets.
Skip if: You need a governed enterprise warehouse, real-time streaming ingest, or pure spreadsheet workflows with no Python.
When should I use this skill?
You need chdb DataStore code patterns for local files, cross-database joins, lake formats, or cloud URI access in Python.
What do I get? / Deliverables
After running the skill, your agent can implement chdb DataStore pipelines that join cross-source data in Python and optionally write results back to remote targets.
- Runnable Python scripts following documented join and write patterns
- Cross-source query snippets agents can drop into backends or notebooks
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Canonical shelf is Build because the skill teaches how to integrate disparate data sources and query them from application or agent code. Integrations fits cross-source joins, URI-backed stores, and lake-format access patterns that glue external systems together.
Where it fits
Join Stripe export Parquet with a Postgres users table to prototype a revenue dashboard query.
Run a three-way join across logs on disk, RDS, and S3 to debug a billing discrepancy in production.
Aggregate campaign CSVs from GCS with in-app events to measure funnel conversion without a full warehouse migration.
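The first scenario above (a Parquet export joined to a Postgres table) might look like the following sketch. It uses only DataStore calls that appear in the skill README excerpted further down this page (`from_file`, `from_postgresql`, `join`, `groupby`, `agg`, `sort_values`). The file name, host, credentials, and column names (`customer_id`, `id`, `plan`, `amount`) are hypothetical placeholders, and the sketch has not been checked against a live database.

```python
def revenue_by_plan(charges, users):
    """Join charge rows to user rows and total revenue per plan.

    Both arguments are expected to be DataStore objects. The column names
    are hypothetical: charges(customer_id, amount) and users(id, plan).
    """
    return (charges
            .join(users, left_on="customer_id", right_on="id", how="inner")
            .groupby("plan")
            .agg({"amount": "sum"})
            .sort_values("amount", ascending=False))


def load_sources():
    """Open both sources. Every literal below is a placeholder."""
    # Deferred import, so this module still imports where chdb is absent.
    from datastore import DataStore

    charges = DataStore.from_file("stripe_charges.parquet")  # hypothetical file
    users = DataStore.from_postgresql(
        host="pg.example.com:5432", database="app", table="users",
        user="analyst", password="pass")  # placeholder credentials
    return charges, users


def main():
    """Run the prototype query and print the per-plan totals."""
    charges, users = load_sources()
    print(revenue_by_plan(charges, users))
```

Keeping the query in `revenue_by_plan` separate from the connection setup lets the same chain be reused with different sources. Call `main()` once the Parquet file and the database are reachable.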
How it compares
Use it instead of ad-hoc pandas scripts that load everything into memory. The skill package documents chdb's federated query model; it is not a hosted BI dashboard.
Common Questions / FAQ
Who is chdb-datastore for?
Solo and indie developers who ship Python backends or data agents and want ClickHouse chdb examples for joining files, SQL databases, and cloud storage without rebuilding their analytics stack.
When should I use chdb-datastore?
Use it during Build when wiring integrations, during Operate when reconciling production data sources, or during Grow when you need cross-source analytics for retention or revenue questions.
Is chdb-datastore safe to install?
Review the Security Audits panel on this Prism page for install risk and file-hash signals before pointing agents at production credentials or cloud buckets.
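The README examples further down pass literal credentials such as `password="pass"`. One way to keep real secrets out of agent-generated scripts is to read them from the environment, as in this minimal sketch. The helper name `pg_kwargs_from_env` and the `APP_PG_*` variable names are assumptions made for illustration; they are not part of the skill or of chdb.

```python
import os


def pg_kwargs_from_env(table, env=None):
    """Build keyword arguments for a PostgreSQL source from environment variables.

    The APP_PG_* variable names are hypothetical. The returned keys mirror
    the keyword names used in the README's from_postgresql example.
    """
    env = os.environ if env is None else env
    required = {
        "host": "APP_PG_HOST",
        "database": "APP_PG_DATABASE",
        "user": "APP_PG_USER",
        "password": "APP_PG_PASSWORD",
    }
    # Fail early and name every missing variable instead of connecting
    # with empty credentials.
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    kwargs = {key: env[var] for key, var in required.items()}
    kwargs["table"] = table
    return kwargs
```

Assuming the keyword names match the README example, the result could be passed as `DataStore.from_postgresql(**pg_kwargs_from_env("profiles"))`.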
SKILL.md
# DataStore Examples

> All examples are self-contained and runnable.
> Expected output is shown in comments.

## Table of Contents

1. [Pandas Replacement: One Import Change](#1-pandas-replacement-one-import-change)
2. [Analyze Local Files](#2-analyze-local-files)
3. [Cross-Source Join: MySQL + Parquet](#3-cross-source-join-mysql--parquet)
4. [Cross-Source Join: S3 + PostgreSQL](#4-cross-source-join-s3--postgresql)
5. [Three-Way Join: File + Database + Cloud](#5-three-way-join-file--database--cloud)
6. [Data Lake Formats: Iceberg, Delta, Hudi](#6-data-lake-formats-iceberg-delta-hudi)
7. [URI Shorthand Access](#7-uri-shorthand-access)
8. [Cloud Storage Variants (S3/GCS/Azure/HDFS)](#8-cloud-storage-variants)
9. [Cross-Source Write](#9-cross-source-write)
10. [Explore Remote Schema](#10-explore-remote-schema)
11. [Common Errors & Fixes](#11-common-errors--fixes)

---

## 1. Pandas Replacement: One Import Change

The simplest way to use chdb: change one line, keep everything else.

```python
# Before (standard pandas):
# import pandas as pd

# After (chdb-accelerated):
import chdb.datastore as pd

df = pd.DataStore({"name": ["Alice", "Bob", "Carol", "Dave"],
                   "dept": ["Eng", "Sales", "Eng", "Sales"],
                   "salary": [95000, 72000, 110000, 68000]})

# Same pandas API: everything works
result = (df[df["salary"] > 70000]
          .groupby("dept")
          .agg({"salary": ["mean", "count"]})
          .sort_values("mean", ascending=False))
print(result)
# Expected output:
#     dept    mean  count
# 0    Eng  102500      2
# 1  Sales   72000      1
```

**Why it's faster:** Operations compile to ClickHouse SQL and execute as a single optimized query, instead of step-by-step Python evaluation.

---

## 2. Analyze Local Files

```python
from datastore import DataStore

# Parquet: pandas-style analysis
ds = DataStore.from_file("sales.parquet")
top_products = (ds[ds['revenue'] > 0]
                .groupby('product')
                .agg({'revenue': 'sum', 'quantity': 'sum'})
                .sort_values('revenue', ascending=False)
                .head(10))
print(top_products)

# CSV with filtering
ds = DataStore.from_file("employees.csv")
senior = ds[(ds['years'] > 5) & (ds['dept'] == 'Engineering')]
print(senior[['name', 'title', 'salary']].sort_values('salary', ascending=False))

# Glob pattern: query all matching files at once
ds = DataStore.from_file("logs/2024-*.csv")
errors = ds[ds['level'] == 'ERROR'].groupby('module')['message'].count()
print(errors.sort_values(ascending=False))

# See the SQL behind any query
print(top_products.to_sql())
```

---

## 3. Cross-Source Join: MySQL + Parquet

```python
from datastore import DataStore

customers = DataStore.from_mysql(
    host="db:3306", database="crm", table="customers",
    user="reader", password="pass")
orders = DataStore.from_file("orders.parquet")

result = (customers
          .join(orders, left_on="id", right_on="customer_id", how="inner")
          .groupby("country")
          .agg({"amount": ["sum", "mean"], "order_id": "count"})
          .sort_values("sum", ascending=False))
print(result)
# Expected: country-level order summary with total, average, and count

print(result.to_sql())
# Shows the cross-source SQL generated by chdb
```

---

## 4. Cross-Source Join: S3 + PostgreSQL

```python
from datastore import DataStore

events = DataStore.from_s3(
    "s3://analytics/events/2024-*.parquet",
    access_key_id="AKIA...", secret_access_key="secret...")
profiles = DataStore.from_postgresql(
    host="pg.example.com:5432", database="users", table="profiles",
    user="analyst", password="pass")

result = (events
          .join(profiles, left_on="user_id", right_on="id")
          .filter(events['event_type'] == 'purchase')
          .groupby(["country", "age_group"])
          .agg({"amount": "sum", "event_id": "count"})
          .sort_values("sum", ascending=False))
print(result)
# Expected: purchase events aggregated by country and age group
```

---

## 5. Three-Way Join: File + Database + Cloud

```python
from datastore import DataStore

products