# Connect to Data Sources
Spice federates SQL queries across 30+ data sources without ETL. Connect databases, data lakes, warehouses, and APIs, then query across them with standard SQL.
## How Federation Works
Configure datasets pointing to different sources. Spice's query planner (built on Apache DataFusion) optimizes and routes queries with filter pushdown and column projection:
```yaml
datasets:
  - from: postgres:customers
    name: customers
    params:
      pg_host: db.example.com
      pg_user: ${secrets:PG_USER}
  - from: s3://bucket/orders/
    name: orders
    params:
      file_format: parquet
  - from: snowflake:analytics.sales
    name: sales
```

```sql
-- Query across all three sources in one statement
SELECT c.name, o.order_total, s.region
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN sales s ON o.id = s.order_id
WHERE s.region = 'EMEA';
```
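What filter pushdown and column projection buy can be sketched in plain Python (this is a conceptual illustration, not DataFusion's actual planner): the source evaluates the predicate and returns only the requested columns, so filtered-out rows and unused columns never cross the network.

```python
def push_down(predicate, columns, source_rows):
    """Simulate a source applying a pushed-down filter and projection:
    only matching rows, restricted to the requested columns, are returned.
    (Illustrative sketch; real pushdown happens inside the connector.)"""
    return [
        {c: row[c] for c in columns}
        for row in source_rows
        if predicate(row)
    ]

# Hypothetical rows as the remote `sales` source might hold them
sales = [
    {"order_id": 1, "region": "EMEA", "internal_note": "x"},
    {"order_id": 2, "region": "APAC", "internal_note": "y"},
]
result = push_down(lambda r: r["region"] == "EMEA", ["order_id", "region"], sales)
print(result)  # [{'order_id': 1, 'region': 'EMEA'}]
```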
## Dataset Configuration
```yaml
datasets:
  - from: <connector>:<identifier>
    name: <dataset_name>
    params:
      # connector-specific parameters
    acceleration:
      enabled: true # optional: materialize locally (see spice-acceleration)
```
## Supported Connectors
### Databases
| Connector | From Format | Status |
|---|---|---|
| PostgreSQL | | Stable (also Amazon Redshift) |
| MySQL | | Stable |
| DuckDB | | Stable |
| MS SQL Server | | Beta |
| DynamoDB | | Release Candidate |
| MongoDB | | Alpha |
| ClickHouse | | Alpha |
### Data Warehouses
| Connector | From Format | Status |
|---|---|---|
| Snowflake | | Beta |
| Databricks (Delta Lake) | | Stable |
| Spark | | Beta |
### Data Lakes & Object Storage
| Connector | From Format | Status |
|---|---|---|
| S3 | | Stable |
| Delta Lake | | Stable |
| Iceberg | | Beta |
| Azure BlobFS | | Alpha |
| File (local) | | Stable |
### Other Sources
| Connector | From Format | Status |
|---|---|---|
| Spice.ai | | Stable |
| Dremio | | Stable |
| GitHub | | Stable |
| GraphQL | | Release Candidate |
| FlightSQL | | Beta |
| ODBC | | Beta |
| FTP/SFTP | | Alpha |
| HTTP/HTTPS | | Alpha |
| Kafka | | Alpha |
| Debezium CDC | | Alpha |
| SharePoint | | Alpha |
| IMAP | | Alpha |
## Common Examples
### PostgreSQL
```yaml
datasets:
  - from: postgres:public.users
    name: users
    params:
      pg_host: localhost
      pg_port: 5432
      pg_user: ${ env:PG_USER }
      pg_pass: ${ env:PG_PASS }
    acceleration:
      enabled: true
```
### S3 with Parquet
```yaml
datasets:
  - from: s3://my-bucket/data/sales/
    name: sales
    params:
      file_format: parquet
      s3_region: us-east-1
    acceleration:
      enabled: true
      engine: duckdb
```
### GitHub Issues
```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${ secrets:GITHUB_TOKEN }
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 24h
      refresh_data_window: 14d
```
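The interaction of `refresh_mode: append` with `refresh_data_window` can be sketched conceptually: each refresh cycle considers only source rows inside the trailing window and appends the ones not yet materialized locally, rather than re-fetching the whole dataset. (Illustrative Python only; the `id`/`updated_at` row shape is hypothetical, not Spice's internal model.)

```python
from datetime import datetime, timedelta

def append_refresh(accelerated, source, now, window=timedelta(days=14)):
    """Sketch of an append-mode refresh: scan only source rows inside
    the trailing data window, append those not already materialized."""
    cutoff = now - window
    seen = {row["id"] for row in accelerated}
    fresh = [r for r in source
             if r["updated_at"] >= cutoff and r["id"] not in seen]
    return accelerated + fresh

now = datetime(2024, 6, 15)
local = [{"id": 1, "updated_at": datetime(2024, 6, 10)}]
remote = [
    {"id": 1, "updated_at": datetime(2024, 6, 10)},
    {"id": 2, "updated_at": datetime(2024, 6, 14)},
    {"id": 3, "updated_at": datetime(2024, 1, 1)},  # outside the 14-day window
]
print([r["id"] for r in append_refresh(local, remote, now)])  # [1, 2]
```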
### Local File
```yaml
datasets:
  - from: file:./data/sales.parquet
    name: sales
```
## File Formats
Connectors reading from object stores (S3, ABFS) or network storage (FTP, SFTP) support the following file formats:
| Format | | Type |
|---|---|---|
| Apache Parquet | | Structured |
| CSV | | Structured |
| Markdown | | Document |
| Text | | Document |
| | | Document |
| Microsoft Word | | Document |
Document files produce a table with `location` and `content` columns:

```yaml
datasets:
  - from: file:docs/decisions/
    name: my_documents
    params:
      file_format: md
```
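The two-column table that document formats produce can be illustrated with a small sketch (plain Python, not Spice's loader; `load_documents` is a hypothetical helper): each file becomes one row, its path in `location` and its text in `content`.

```python
import pathlib
import tempfile

def load_documents(directory, suffix=".md"):
    """Turn a directory of document files into (location, content) rows,
    mirroring the two-column table a document-format dataset exposes."""
    rows = []
    for path in sorted(pathlib.Path(directory).glob(f"*{suffix}")):
        rows.append({"location": str(path), "content": path.read_text()})
    return rows

# Demo against a throwaway directory with one markdown file
with tempfile.TemporaryDirectory() as d:
    (pathlib.Path(d) / "adr-001.md").write_text("# Use Postgres")
    rows = load_documents(d)

print([r["content"] for r in rows])  # ['# Use Postgres']
```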
## Hive Partitioning
With `hive_partitioning_enabled`, Hive-style path segments such as `year=2024/month=01` are exposed as queryable partition columns:

```yaml
datasets:
  - from: s3://bucket/data/
    name: partitioned_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```

```sql
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
```
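How predicates on partition columns avoid scanning whole directories can be sketched as follows (illustrative Python; `prune_partitions` is a hypothetical helper, not part of Spice): the `key=value` path segments are parsed and compared against the filters, and only matching files are read.

```python
def prune_partitions(paths, **filters):
    """Keep only Hive-partitioned paths whose key=value segments match
    the given filters, mimicking partition pruning from a WHERE clause."""
    kept = []
    for path in paths:
        # Parse key=value directory segments, e.g. year=2024, month=01
        parts = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
        if all(parts.get(k) == v for k, v in filters.items()):
            kept.append(path)
    return kept

paths = [
    "data/year=2024/month=01/part-0.parquet",
    "data/year=2024/month=02/part-0.parquet",
    "data/year=2023/month=01/part-0.parquet",
]
print(prune_partitions(paths, year="2024", month="01"))
# ['data/year=2024/month=01/part-0.parquet']
```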
## Dataset Naming
- `name: foo` → queryable as `spice.public.foo`
- `name: myschema.foo` → queryable as `spice.myschema.foo`
- Use `.` in the name to organize datasets into schemas
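The naming rule above can be sketched as a small helper (hypothetical, for illustration only): a bare name lands in the default `public` schema, while a dotted name supplies its own schema.

```python
def qualified_name(name: str, catalog: str = "spice") -> str:
    """Map a dataset `name` to its fully qualified table path,
    following the naming rule sketched above."""
    parts = name.split(".")
    if len(parts) == 1:
        # No schema given: default to the `public` schema
        return f"{catalog}.public.{parts[0]}"
    schema, table = parts[0], ".".join(parts[1:])
    return f"{catalog}.{schema}.{table}"

print(qualified_name("foo"))           # spice.public.foo
print(qualified_name("myschema.foo"))  # spice.myschema.foo
```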
## Catalogs
Catalog connectors expose external data catalogs, preserving the source schema hierarchy. Tables are accessed as `<catalog>.<schema>.<table>`.

Note: Acceleration is not supported for catalog tables. Use datasets for accelerated access.

```yaml
catalogs:
  - from: <connector>
    name: <catalog_name>
    params:
      # connector-specific parameters
    include:
      - 'schema.*' # optional: filter with glob patterns
```
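The effect of `include` globs can be illustrated with Python's `fnmatch` (a sketch of the idea; Spice's actual glob semantics may differ in detail): only `<schema>.<table>` entries matching at least one pattern are exposed.

```python
from fnmatch import fnmatch

def filter_tables(tables, patterns):
    """Keep only <schema>.<table> entries matching any include glob.
    Illustrates the `include` option, not Spice's implementation."""
    return [t for t in tables if any(fnmatch(t, p) for p in patterns)]

tables = ["sales.orders", "sales.customers", "internal.audit_log"]
print(filter_tables(tables, ["sales.*"]))
# ['sales.orders', 'sales.customers']
```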
### Supported Catalogs
| Connector | From Value | Status |
|---|---|---|
| Unity Catalog | | Stable |
| Databricks | | Beta |
| Iceberg | | Beta |
| Spice.ai | | Beta |
| AWS Glue | | Alpha |
### Catalog Example
```yaml
catalogs:
  - from: unity_catalog
    name: unity
    params:
      unity_catalog_endpoint: https://my-workspace.cloud.databricks.com
      databricks_token: ${ secrets:DATABRICKS_TOKEN }
    include:
      - 'my_schema.*'
```

```sql
SELECT * FROM unity.my_schema.customers LIMIT 10;
```
## Views
Views are virtual tables defined by SQL queries, useful for pre-aggregations, transformations, and simplified access:
```yaml
views:
  - name: daily_sales
    sql: |
      SELECT DATE(created_at) as date, SUM(amount) as total, COUNT(*) as orders
      FROM orders
      GROUP BY DATE(created_at)
  - name: order_details
    sql: |
      SELECT o.id, c.name as customer, p.name as product, o.quantity
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
      JOIN products p ON o.product_id = p.id
```

Views can be accelerated:
```yaml
views:
  - name: rankings
    sql: |
      SELECT product_id, SUM(quantity) as total_sold
      FROM orders GROUP BY product_id ORDER BY total_sold DESC LIMIT 100
    acceleration:
      enabled: true
      refresh_check_interval: 1h
```

Views are read-only and queried like regular tables: `SELECT * FROM daily_sales`.
## Writing Data
Spice supports writing to Apache Iceberg tables and Amazon S3 Tables via `INSERT INTO`:

```yaml
datasets:
  - from: iceberg:https://catalog.example.com/v1/namespaces/sales/tables/transactions
    name: transactions
    access: read_write # required for writes
```

```sql
INSERT INTO transactions SELECT * FROM staging_transactions;
```
## Referencing Secrets
Use `${ store_name:KEY }` syntax in params. See spice-secrets for full configuration:

```yaml
params:
  pg_user: ${ env:PG_USER }
  pg_pass: ${ secrets:PG_PASSWORD }
```
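The placeholder syntax can be sketched with a simple interpolator (illustrative Python only; `interpolate` is a hypothetical helper, not Spice's secret resolver): the store name selects where the key is looked up.

```python
import os
import re

def interpolate(value, secrets=None):
    """Expand `${ store:KEY }` placeholders: `env` reads the process
    environment; other stores are looked up in a provided dict here."""
    secrets = secrets or {}

    def resolve(match):
        store, key = match.group(1), match.group(2)
        if store == "env":
            return os.environ.get(key, "")
        return secrets.get(key, "")

    return re.sub(r"\$\{\s*(\w+):(\w+)\s*\}", resolve, value)

os.environ["PG_USER"] = "analyst"
print(interpolate("${ env:PG_USER }"))  # analyst
print(interpolate("${ secrets:PG_PASSWORD }", {"PG_PASSWORD": "s3cret"}))  # s3cret
```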