spice-connect-data


Connect to Data Sources


Spice federates SQL queries across 30+ data sources without ETL. Connect databases, data lakes, warehouses, and APIs, then query across them with standard SQL.

How Federation Works


Configure datasets pointing to different sources. Spice's query planner (built on Apache DataFusion) optimizes and routes queries with filter pushdown and column projection:
```yaml
datasets:
  - from: postgres:customers
    name: customers
    params:
      pg_host: db.example.com
      pg_user: ${secrets:PG_USER}
  - from: s3://bucket/orders/
    name: orders
    params:
      file_format: parquet
  - from: snowflake:analytics.sales
    name: sales
```

```sql
-- Query across all three sources in one statement
SELECT c.name, o.order_total, s.region
FROM customers c
  JOIN orders o ON c.id = o.customer_id
  JOIN sales s ON o.id = s.order_id
WHERE s.region = 'EMEA';
```

Dataset Configuration


```yaml
datasets:
  - from: <connector>:<identifier>
    name: <dataset_name>
    params:
      # connector-specific parameters
    acceleration:
      enabled: true # optional: materialize locally (see spice-acceleration)
```
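As a concrete instance of the template, a MySQL-backed dataset might look like the sketch below. The `shop.orders` table is hypothetical, and the `mysql_host`/`mysql_user`/`mysql_pass` parameter names are assumed from the MySQL connector's `mysql_*` convention; check the connector reference for the exact set:

```yaml
datasets:
  - from: mysql:shop.orders # hypothetical source table
    name: orders
    params:
      mysql_host: localhost
      mysql_user: ${ env:MYSQL_USER }
      mysql_pass: ${ env:MYSQL_PASS }
    acceleration:
      enabled: true # materialize locally for faster repeated queries
```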

Supported Connectors


Databases


| Connector | From Format | Status |
| --- | --- | --- |
| PostgreSQL | `postgres:schema.table` | Stable (also Amazon Redshift) |
| MySQL | `mysql:schema.table` | Stable |
| DuckDB | `duckdb:database.table` | Stable |
| MS SQL Server | `mssql:db.table` | Beta |
| DynamoDB | `dynamodb:table` | Release Candidate |
| MongoDB | `mongodb:collection` | Alpha |
| ClickHouse | `clickhouse:db.table` | Alpha |

Data Warehouses


| Connector | From Format | Status |
| --- | --- | --- |
| Snowflake | `snowflake:db.schema.table` | Beta |
| Databricks (Delta Lake) | `databricks:catalog.schema.table` | Stable |
| Spark | `spark:db.table` | Beta |

Data Lakes & Object Storage


| Connector | From Format | Status |
| --- | --- | --- |
| S3 | `s3://bucket/path/` | Stable |
| Delta Lake | `delta_lake:/path/to/delta/` | Stable |
| Iceberg | `iceberg:table` | Beta |
| Azure BlobFS | `abfs://container/path/` | Alpha |
| File (local) | `file:./path/to/data` | Stable |

Other Sources


| Connector | From Format | Status |
| --- | --- | --- |
| Spice.ai | `spice.ai:path/to/dataset` | Stable |
| Dremio | `dremio:source.table` | Stable |
| GitHub | `github:github.com/owner/repo/issues` | Stable |
| GraphQL | `graphql:endpoint` | Release Candidate |
| FlightSQL | `flightsql:query` | Beta |
| ODBC | `odbc:connection` | Beta |
| FTP/SFTP | `sftp://host/path/` | Alpha |
| HTTP/HTTPS | `https://url/path/data.csv` | Alpha |
| Kafka | `kafka:topic` | Alpha |
| Debezium CDC | `debezium:topic` | Alpha |
| SharePoint | `sharepoint:site/path` | Alpha |
| IMAP | `imap:mailbox` | Alpha |

Common Examples


PostgreSQL


```yaml
datasets:
  - from: postgres:public.users
    name: users
    params:
      pg_host: localhost
      pg_port: 5432
      pg_user: ${ env:PG_USER }
      pg_pass: ${ env:PG_PASS }
    acceleration:
      enabled: true
```

S3 with Parquet


```yaml
datasets:
  - from: s3://my-bucket/data/sales/
    name: sales
    params:
      file_format: parquet
      s3_region: us-east-1
    acceleration:
      enabled: true
      engine: duckdb
```

GitHub Issues


```yaml
datasets:
  - from: github:github.com/spiceai/spiceai/issues
    name: spiceai.issues
    params:
      github_token: ${ secrets:GITHUB_TOKEN }
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 24h
      refresh_data_window: 14d
```

Local File


```yaml
datasets:
  - from: file:./data/sales.parquet
    name: sales
```

File Formats


Connectors reading from object stores (S3, ABFS) or network storage (FTP, SFTP) support:
| Format | `file_format` | Type |
| --- | --- | --- |
| Apache Parquet | `parquet` | Structured |
| CSV | `csv` | Structured |
| Markdown | `md` | Document |
| Text | `txt` | Document |
| PDF | `pdf` | Document |
| Microsoft Word | `docx` | Document |

Document files produce a table with `location` and `content` columns:

```yaml
datasets:
  - from: file:docs/decisions/
    name: my_documents
    params:
      file_format: md
```
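A document dataset is then queried like any other table. The sketch below assumes the `my_documents` dataset above has loaded, and the `'%architecture%'` filter is purely illustrative:

```sql
-- Each row is one file: its path (location) and its full text (content)
SELECT location, content
FROM my_documents
WHERE content LIKE '%architecture%';
```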

Hive Partitioning


```yaml
datasets:
  - from: s3://bucket/data/
    name: partitioned_data
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```

```sql
-- year and month are partition columns inferred from paths like
-- s3://bucket/data/year=2024/month=01/...
SELECT * FROM partitioned_data WHERE year = '2024' AND month = '01';
```

Dataset Naming


- `name: foo` is queryable as `spice.public.foo`
- `name: myschema.foo` is queryable as `spice.myschema.foo`
- Use `.` to organize datasets into schemas
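A quick sketch of how a schema-qualified name plays out; the `reporting.foo` dataset and its file path are hypothetical:

```yaml
datasets:
  - from: file:./data/foo.parquet
    name: reporting.foo # exposed as spice.reporting.foo
```

The dataset can then be queried as `SELECT * FROM spice.reporting.foo`.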

Catalogs


Catalog connectors expose external data catalogs, preserving the source schema hierarchy. Tables are accessed as `<catalog>.<schema>.<table>`.

Note: Acceleration is not supported for catalog tables. Use datasets for accelerated access.

```yaml
catalogs:
  - from: <connector>
    name: <catalog_name>
    params:
      # connector-specific parameters
    include:
      - 'schema.*' # optional: filter with glob patterns
```

Supported Catalogs


| Connector | From Value | Status |
| --- | --- | --- |
| Unity Catalog | `unity_catalog` | Stable |
| Databricks | `databricks` | Beta |
| Iceberg | `iceberg` | Beta |
| Spice.ai | `spice.ai` | Beta |
| AWS Glue | `glue` | Alpha |

Catalog Example


```yaml
catalogs:
  - from: unity_catalog
    name: unity
    params:
      unity_catalog_endpoint: https://my-workspace.cloud.databricks.com
      databricks_token: ${ secrets:DATABRICKS_TOKEN }
    include:
      - 'my_schema.*'
```

```sql
SELECT * FROM unity.my_schema.customers LIMIT 10;
```

Views


Views are virtual tables defined by SQL queries — useful for pre-aggregations, transformations, and simplified access:
```yaml
views:
  - name: daily_sales
    sql: |
      SELECT DATE(created_at) as date, SUM(amount) as total, COUNT(*) as orders
      FROM orders
      GROUP BY DATE(created_at)

  - name: order_details
    sql: |
      SELECT o.id, c.name as customer, p.name as product, o.quantity
      FROM orders o
      JOIN customers c ON o.customer_id = c.id
      JOIN products p ON o.product_id = p.id
```

Views can be accelerated:

```yaml
views:
  - name: rankings
    sql: |
      SELECT product_id, SUM(quantity) as total_sold
      FROM orders GROUP BY product_id ORDER BY total_sold DESC LIMIT 100
    acceleration:
      enabled: true
      refresh_check_interval: 1h
```

Views are read-only and queried like regular tables: `SELECT * FROM daily_sales`.

Writing Data


Spice supports writing to Apache Iceberg tables and Amazon S3 Tables via `INSERT INTO`:

```yaml
datasets:
  - from: iceberg:https://catalog.example.com/v1/namespaces/sales/tables/transactions
    name: transactions
    access: read_write # required for writes
```

```sql
INSERT INTO transactions SELECT * FROM staging_transactions;
```

Referencing Secrets


Use `${ store_name:KEY }` syntax in params. See spice-secrets for full configuration:

```yaml
params:
  pg_user: ${ env:PG_USER }
  pg_pass: ${ secrets:PG_PASSWORD }
```

Documentation
