chdb-datastore


chdb DataStore — It's Just Faster Pandas


The Key Insight


Change this:

```python
import pandas as pd
```

To this:

```python
import chdb.datastore as pd
```

Everything else stays the same.


DataStore is a **lazy, ClickHouse-backed pandas replacement**. Your existing pandas code works unchanged — but operations compile to optimized SQL and execute only when results are needed (e.g., `print()`, `len()`, iteration).

```bash
pip install chdb
```
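The lazy model described above can be sketched in a few lines: operations record themselves instead of executing, and SQL runs only when a result is demanded. This is a toy illustration using stdlib `sqlite3`, not chdb's actual implementation (chdb compiles to ClickHouse SQL):

```python
import sqlite3

class LazyTable:
    """Toy lazy query: records filters, compiles to SQL only on demand."""
    def __init__(self, conn, table, where=None):
        self.conn, self.table, self.where = conn, table, where or []

    def filter(self, cond):
        # records the condition; nothing executes here
        return LazyTable(self.conn, self.table, self.where + [cond])

    def to_sql(self):
        # compile step: turn recorded operations into one SQL string
        sql = f"SELECT * FROM {self.table}"
        if self.where:
            sql += " WHERE " + " AND ".join(self.where)
        return sql

    def __iter__(self):
        # execution is deferred until the result is actually needed
        return iter(self.conn.execute(self.to_sql()).fetchall())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [("Ann", 30), ("Bob", 20)])

q = LazyTable(conn, "people").filter("age > 25")   # nothing has run yet
print(q.to_sql())     # SELECT * FROM people WHERE age > 25
print(list(q))        # [('Ann', 30)]  (query executes here)
```

Iterating (or printing) is what triggers execution, which mirrors how DataStore defers work until `print()`, `len()`, or iteration.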

Decision Tree: Pick the Right Approach


1. "I have a file/database and want to analyze it with pandas"
   → DataStore.from_file() / from_mysql() / from_s3() etc.
   → See references/connectors.md

2. "I need to join data from different sources"
   → Create DataStores from each source, use .join()
   → See examples/examples.md #3-5

3. "My pandas code is too slow"
   → import chdb.datastore as pd — change one line, keep the rest

4. "I need raw SQL queries"
   → Use the chdb-sql skill instead

Connect to Any Data Source — One Pattern


```python
from datastore import DataStore

# Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)
ds = DataStore.from_file("sales.parquet")

# Database
ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")

# Cloud storage
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)

# URI shorthand — auto-detects source type
ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")
```

All 16+ sources and URI schemes → [connectors.md](references/connectors.md)
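For intuition, a connector URI like the one above decomposes into the same fields the explicit `from_mysql()` call takes. The sketch below uses stdlib `urllib.parse` to show that decomposition; the variable names are our own illustration, not chdb's internal API:

```python
from urllib.parse import urlparse

# Illustration only: how a connector URI plausibly maps onto
# explicit connection parameters. Not chdb's actual parser.
uri = "mysql://root:pass@db:3306/shop/orders"
parts = urlparse(uri)

scheme = parts.scheme                      # "mysql" selects the connector type
user, password = parts.username, parts.password
host = f"{parts.hostname}:{parts.port}"    # "db:3306"
database, table = parts.path.strip("/").split("/")  # "shop", "orders"
print(scheme, host, database, table)       # mysql db:3306 shop orders
```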

After Connecting — Full Pandas API


```python
result = ds[ds["age"] > 25]                                          # filter
result = ds[["name", "city"]]                                        # select columns
result = ds.sort_values("revenue", ascending=False)                  # sort
result = ds.groupby("dept")["salary"].mean()                         # groupby
result = ds.assign(margin=lambda x: x["profit"] / x["revenue"])      # computed column
ds["name"].str.upper()                                               # string accessor
ds["date"].dt.year                                                   # datetime accessor
result = ds1.join(ds2, on="id")                                      # join
result = ds.head(10)                                                 # preview
print(ds.to_sql())                                                   # see generated SQL
```

209 DataFrame methods supported. Full API → api-reference.md
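The groupby line above corresponds to an ordinary aggregate query. The mapping can be checked with stdlib `sqlite3`; chdb emits ClickHouse SQL, but the shape is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("eng", 100.0), ("eng", 120.0), ("ops", 80.0)])

# ds.groupby("dept")["salary"].mean() compiles to roughly this query
rows = conn.execute(
    "SELECT dept, avg(salary) FROM emp GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)   # [('eng', 110.0), ('ops', 80.0)]
```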

Cross-Source Join — The Killer Feature


```python
from datastore import DataStore

customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass")
orders = DataStore.from_file("orders.parquet")

result = (orders
    .join(customers, left_on="customer_id", right_on="id")
    .groupby("country")
    .agg({"amount": "sum", "rating": "mean"})
    .sort_values("sum", ascending=False))
print(result)
```

More join examples → examples.md
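The join-then-aggregate pipeline above is plain SQL under the hood. A self-contained `sqlite3` sketch of the equivalent query (illustrative only; chdb federates the MySQL table and the Parquet file itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INT, country TEXT, rating REAL)")
conn.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "US", 4.0), (2, "DE", 5.0)])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (1, 20.0), (2, 5.0)])

# join on customer_id = id, then aggregate per country, sorted by the sum
rows = conn.execute("""
    SELECT c.country, sum(o.amount) AS total, avg(c.rating) AS rating
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country ORDER BY total DESC
""").fetchall()
print(rows)   # [('US', 30.0, 4.0), ('DE', 5.0, 5.0)]
```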

Writing Data


```python
source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
target = DataStore("file", path="summary.parquet", format="Parquet")

target.insert_into("category", "total", "count").select_from(
    source.groupby("category").select("category", "sum(amount) AS total", "count() AS count")
).execute()
```
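The `insert_into(...).select_from(...)` chain corresponds to SQL's `INSERT ... SELECT`. A minimal `sqlite3` equivalent of what the pipeline above expresses (illustrative only; chdb writes Parquet via ClickHouse):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.execute("CREATE TABLE summary (category TEXT, total REAL, cnt INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("books", 10.0), ("books", 5.0), ("toys", 7.0)])

# the DataStore pipeline compiles to roughly this statement
conn.execute("""
    INSERT INTO summary
    SELECT category, sum(amount), count(*) FROM orders GROUP BY category
""")
rows = conn.execute("SELECT * FROM summary ORDER BY category").fetchall()
print(rows)   # [('books', 15.0, 2), ('toys', 7.0, 1)]
```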

Troubleshooting


| Problem | Fix |
| --- | --- |
| `ImportError: No module named 'chdb'` | `pip install chdb` |
| `ImportError: cannot import 'DataStore'` | Use `from datastore import DataStore` or `from chdb.datastore import DataStore` |
| Database connection timeout | Include the port in host: `host="db:3306"`, not `host="db"` |
| Join returns empty result | Check that key types match (both int or both string); use `.to_sql()` to inspect |
| Unexpected results | Call `ds.to_sql()` to see the generated SQL and debug |
| Environment check | Run `python scripts/verify_install.py` (from the skill directory) |
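The empty-join pitfall is easy to reproduce: in SQL, an integer key never equals its string form. A `sqlite3` demonstration (columns are declared without a type so SQLite applies no affinity conversion; the same mismatch bites in ClickHouse-backed joins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# no declared column types, so inserted values keep their own types
conn.execute("CREATE TABLE a (id)")
conn.execute("CREATE TABLE b (id)")
conn.execute("INSERT INTO a VALUES (1)")     # integer key
conn.execute("INSERT INTO b VALUES ('1')")   # string key

# mismatched types: integer 1 never equals text '1', so the join is empty
empty = conn.execute("SELECT * FROM a JOIN b ON a.id = b.id").fetchall()

# casting one side to match the other fixes it
fixed = conn.execute(
    "SELECT * FROM a JOIN b ON a.id = CAST(b.id AS INT)").fetchall()
print(empty, fixed)   # [] [(1, '1')]
```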

References


  • API Reference — Full DataStore method signatures
  • Connectors — All 16+ data source connection methods
  • Examples — 10+ runnable examples with expected output
  • Verify Install — Environment verification script
  • Official Docs

Note: This skill teaches how to use chdb DataStore. For raw SQL queries, use the chdb-sql skill. For contributing to chdb source code, see CLAUDE.md in the project root.