chdb-datastore
chdb DataStore: It's Just Faster Pandas
The Key Insight
```python
# Change this:
import pandas as pd

# To this:
import chdb.datastore as pd

# Everything else stays the same.
```
DataStore is a **lazy, ClickHouse-backed pandas replacement**. Your existing pandas code works unchanged — but operations compile to optimized SQL and execute only when results are needed (e.g., `print()`, `len()`, iteration).
```bash
pip install chdb
```
Decision Tree: Pick the Right Approach
1. "I have a file/database and want to analyze it with pandas"
→ DataStore.from_file() / from_mysql() / from_s3() etc.
→ See references/connectors.md
2. "I need to join data from different sources"
→ Create DataStores from each source, use .join()
→ See examples/examples.md #3-5
3. "My pandas code is too slow"
→ import chdb.datastore as pd — change one line, keep the rest
4. "I need raw SQL queries"
→ Use the chdb-sql skill instead
Connect to Any Data Source: One Pattern
```python
from datastore import DataStore

# Local file (auto-detects .parquet, .csv, .json, .arrow, .orc, .avro, .tsv, .xml)
ds = DataStore.from_file("sales.parquet")

# Database
ds = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")

# Cloud storage
ds = DataStore.from_s3("s3://bucket/data.parquet", nosign=True)

# URI shorthand: auto-detects source type
ds = DataStore.uri("mysql://root:pass@db:3306/shop/orders")
```
All 16+ sources and URI schemes → [connectors.md](references/connectors.md)
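To show how the URI shorthand relates to the keyword form, the sketch below splits the example URI into the same pieces passed to `from_mysql()` above. `split_source_uri` is a hypothetical helper built on the standard library, and the `scheme://user:password@host:port/database/table` layout is an assumption read off the example; it is not chdb's actual parser.

```python
from urllib.parse import urlsplit


def split_source_uri(uri):
    """Break a source URI into connection parts (illustrative helper, not chdb's parser)."""
    parts = urlsplit(uri)
    # Assumed layout: scheme://user:password@host:port/database/table
    database, _, table = parts.path.lstrip("/").partition("/")
    return {
        "source": parts.scheme,
        "host": f"{parts.hostname}:{parts.port}" if parts.port else parts.hostname,
        "user": parts.username,
        "password": parts.password,
        "database": database,
        "table": table,
    }


info = split_source_uri("mysql://root:pass@db:3306/shop/orders")
```

Here `info` carries the same host, database, table, user, and password values as the explicit `from_mysql()` call.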
After Connecting: Full Pandas API
```python
result = ds[ds["age"] > 25]                                       # filter
result = ds[["name", "city"]]                                     # select columns
result = ds.sort_values("revenue", ascending=False)               # sort
result = ds.groupby("dept")["salary"].mean()                      # groupby
result = ds.assign(margin=lambda x: x["profit"] / x["revenue"])   # computed column
ds["name"].str.upper()                                            # string accessor
ds["date"].dt.year                                                # datetime accessor
result = ds1.join(ds2, on="id")                                   # join
result = ds.head(10)                                              # preview
print(ds.to_sql())                                                # see generated SQL
```
209 DataFrame methods supported. Full API → api-reference.md
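As a rough guide to what "compiles to SQL" means for two of these operations, the sketch below runs hand-written SQL equivalents against an in-memory SQLite table. The `staff` table and the queries are assumptions made for illustration; the SQL that DataStore actually generates for ClickHouse may differ, and `to_sql()` is the way to see it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary REAL, age INTEGER)")
conn.executemany("INSERT INTO staff VALUES (?, ?, ?, ?)", [
    ("Ann", "eng", 120.0, 31),
    ("Bob", "eng", 100.0, 22),
    ("Cy", "ops", 90.0, 40),
])

# ds[ds["age"] > 25][["name", "dept"]]  ~  WHERE plus a column list
# (ORDER BY is added here only to make the output order deterministic)
filtered = conn.execute(
    "SELECT name, dept FROM staff WHERE age > 25 ORDER BY name").fetchall()

# ds.groupby("dept")["salary"].mean()  ~  GROUP BY with an average
mean_salary = conn.execute(
    "SELECT dept, avg(salary) FROM staff GROUP BY dept ORDER BY dept").fetchall()
```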
Cross-Source Join — The Killer Feature
```python
from datastore import DataStore

# Two different sources: a MySQL table and a local Parquet file
customers = DataStore.from_mysql(host="db:3306", database="crm", table="customers", user="root", password="pass")
orders = DataStore.from_file("orders.parquet")

result = (orders
    .join(customers, left_on="customer_id", right_on="id")
    .groupby("country")
    .agg({"amount": "sum", "rating": "mean"})
    # With a dict passed to agg, pandas keeps the column name "amount" (not "sum")
    .sort_values("amount", ascending=False))
print(result)
```
More join examples → examples.md
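For readers who want to see what the pipeline above computes, here is a standard-library sketch of the same join-then-aggregate logic. SQLite stands in for MySQL and an inline CSV string stands in for `orders.parquet`; both, along with the sample rows, are assumptions made so the example runs anywhere. It illustrates the semantics only, not how chdb executes the query.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Source 1: a database table (SQLite stands in for MySQL here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "DE"), (2, "US"), (3, "US")])
country_by_id = dict(conn.execute("SELECT id, country FROM customers"))

# Source 2: a file (inline CSV stands in for orders.parquet).
orders_csv = "customer_id,amount,rating\n1,10.0,4\n2,30.0,5\n3,20.0,3\n2,5.0,4\n"
orders = csv.DictReader(io.StringIO(orders_csv))

# Join on customer_id = id, then aggregate per country.
amounts = defaultdict(float)
ratings = defaultdict(list)
for row in orders:
    country = country_by_id.get(int(row["customer_id"]))  # cast: CSV values are strings
    if country is None:
        continue  # inner join: drop orders without a matching customer
    amounts[country] += float(row["amount"])
    ratings[country].append(int(row["rating"]))

# sum(amount) and mean(rating) per country, sorted by total amount descending.
result = sorted(
    ((c, amounts[c], sum(ratings[c]) / len(ratings[c])) for c in amounts),
    key=lambda r: r[1],
    reverse=True,
)
```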
Writing Data
```python
from datastore import DataStore

# Read from MySQL, write an aggregated summary to a Parquet file
source = DataStore.from_mysql(host="db:3306", database="shop", table="orders", user="root", password="pass")
target = DataStore("file", path="summary.parquet", format="Parquet")

target.insert_into("category", "total", "count").select_from(
    source.groupby("category").select("category", "sum(amount) AS total", "count() AS count")
).execute()
```
Troubleshooting
| Problem | Fix |
|---|---|
| |
| Use |
| Database connection timeout | Include port in host: `host="db:3306"` |
| Join returns empty result | Check key types match (both int or both string); use |
| Unexpected results | Call `ds.to_sql()` to see the generated SQL |
| Environment check | Run |
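The empty-join row is easiest to understand with a small standalone example. The sketch below uses plain dictionaries rather than DataStore, and the data and `inner_join` helper are hypothetical; it shows why integer keys from one source never match string keys from another until both are cast to a common type.

```python
# Keys from a database are often integers; keys read from CSV are often strings.
customers = {1: "Ann", 2: "Bob"}                     # id -> name, int keys
orders = [{"customer_id": "1", "amount": 10.0},     # string keys
          {"customer_id": "2", "amount": 30.0}]


def inner_join(orders, customers, cast=None):
    """Match each order to a customer; optionally cast the order key first."""
    joined = []
    for order in orders:
        key = order["customer_id"]
        if cast is not None:
            key = cast(key)
        if key in customers:
            joined.append((customers[key], order["amount"]))
    return joined


mismatched = inner_join(orders, customers)           # "1" != 1, so nothing matches
fixed = inner_join(orders, customers, cast=int)      # cast keys to a common type
```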
References
- API Reference — Full DataStore method signatures
- Connectors — All 16+ data source connection methods
- Examples — 10+ runnable examples with expected output
- Verify Install — Environment verification script
- Official Docs
Note: This skill teaches how to use chdb DataStore. For raw SQL queries, use the chdb-sql skill. For contributing to chdb source code, see CLAUDE.md in the project root.