Spice Data Accelerators
Accelerators materialize data locally from connected sources for faster queries and reduced load on source systems.
Basic Configuration
```yaml
datasets:
  - from: postgres:my_table
    name: my_table
    acceleration:
      enabled: true
      engine: duckdb # arrow, duckdb, sqlite, cayenne, postgres, turso
      mode: memory # memory or file
      refresh_check_interval: 1h
```

Choosing an Accelerator
| Use Case | Engine | Why |
|---|---|---|
| Small datasets (<1 GB), max speed | `arrow` | In-memory, lowest latency |
| Medium datasets (1-100 GB), complex SQL | `duckdb` | Mature SQL, memory management |
| Large datasets (100 GB-1+ TB), analytics | `cayenne` | Built on Vortex (Linux Foundation), 10-20x faster scans |
| Point lookups on large datasets | `cayenne` | 100x faster random access vs Parquet |
| Simple queries, low resource usage | `sqlite` | Lightweight, minimal overhead |
| Async operations, concurrent workloads | `turso` | Native async, modern connection pooling |
| External database integration | `postgres` | Leverage existing PostgreSQL infra |
Cayenne vs DuckDB
Choose Cayenne when datasets exceed ~1 TB, multi-file ingestion is needed, or point lookups are common.
Choose DuckDB when datasets are under ~1 TB, complex SQL (window functions, CTEs) is needed, or DuckDB tooling is beneficial.
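For the large-dataset case, a minimal Cayenne acceleration might look like the sketch below. The S3 source path and dataset name are hypothetical placeholders, and `mode: file` is assumed since Cayenne is listed as a file-mode engine:

```yaml
datasets:
  - from: s3://my_bucket/events/ # hypothetical source path
    name: events
    acceleration:
      enabled: true
      engine: cayenne
      mode: file # assumed: cayenne is listed as file-mode only
```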
Supported Engines
| Engine | Mode | Status |
|---|---|---|
| arrow | memory | Stable |
| duckdb | memory, file | Stable |
| sqlite | memory, file | Release Candidate |
| cayenne | file | Beta |
| postgres | N/A (attached) | Release Candidate |
| turso | memory, file | Beta |
Refresh Modes
| Mode | Description | Use Case |
|---|---|---|
| `full` | Complete dataset replacement on each refresh | Small, slowly-changing datasets |
| `append` | Adds new records based on a `time_column` | Append-only logs, time-series data |
| `append` (streaming) | Continuous streaming without a time column | Real-time event streams (Kafka, Debezium) |
| `changes` | CDC-based incremental updates via Debezium or DynamoDB Streams | Frequently updated transactional data |
| | Request-based row-level caching | API responses, HTTP endpoints |
Full refresh every 8 hours
```yaml
acceleration:
  refresh_mode: full
  refresh_check_interval: 8h
```
Append mode: check for new records from the last day every 10 minutes
```yaml
acceleration:
  refresh_mode: append
  time_column: created_at
  refresh_check_interval: 10m
  refresh_data_window: 1d
```
Continuous ingestion using Kafka
```yaml
acceleration:
  refresh_mode: append
```
CDC with Debezium or DynamoDB Streams
```yaml
acceleration:
  refresh_mode: changes
```

Common Configurations
In-Memory with Interval Refresh
```yaml
acceleration:
  enabled: true
  engine: arrow
  refresh_check_interval: 5m
```

File-Based with Append and Time Window
```yaml
datasets:
  - from: postgres:events
    name: events
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_check_interval: 1h
      refresh_data_window: 7d
```

With Retention Policy
Retention policies prevent unbounded growth of accelerated datasets. Spice supports time-based and custom SQL-based retention strategies:
```yaml
datasets:
  - from: postgres:events
    name: events
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb
      retention_check_enabled: true
      retention_period: 30d
      retention_check_interval: 1h
```

With SQL-Based Retention
```yaml
acceleration:
  retention_check_enabled: true
  retention_check_interval: 1h
  retention_sql: "DELETE FROM logs WHERE status = 'archived'"
```

With Indexes (DuckDB, SQLite, Turso)
```yaml
acceleration:
  enabled: true
  engine: sqlite
  indexes:
    user_id: enabled
    '(created_at, status)': unique
  primary_key: id
```

Engine-Specific Parameters
DuckDB
```yaml
acceleration:
  engine: duckdb
  mode: file
  params:
    duckdb_file: ./data/cache.db
```

SQLite
```yaml
acceleration:
  engine: sqlite
  mode: file
  params:
    sqlite_file: ./data/cache.sqlite
```

Constraints and Indexes
Accelerated datasets support primary key constraints and indexes:
```yaml
acceleration:
  enabled: true
  engine: duckdb
  primary_key: order_id # Creates non-null unique index
  indexes:
    customer_id: enabled # Single column index
    '(created_at, status)': unique # Multi-column unique index
```

Snapshots (DuckDB, SQLite & Cayenne file mode)
Bootstrap file-based accelerations from S3 or filesystem snapshots on startup. This dramatically reduces cold-start latency in distributed deployments.
Snapshot triggers vary by refresh mode:
- `refresh_complete`: Creates snapshots after each refresh (full and batch-append modes)
- `time_interval`: Creates snapshots on a fixed schedule (all refresh modes)
- `stream_batches`: Creates snapshots after every N batches (streaming modes: Kafka, Debezium, DynamoDB Streams)
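A sketch of how selecting one of these triggers might look in configuration; note the `trigger` and `interval` key names here are assumptions for illustration, not confirmed by this page:

```yaml
snapshots:
  enabled: true
  # Hypothetical keys: the exact schema for choosing a trigger is assumed
  trigger: time_interval
  interval: 1h
```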
```yaml
snapshots:
  enabled: true
  location: s3://my_bucket/snapshots/
  bootstrap_on_failure_behavior: warn # warn | retry | fallback
  params:
    s3_auth: iam_role
```

Per-dataset opt-in:

```yaml
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  snapshots:
    enabled: true
```

Memory Considerations
When using `mode: memory` (the default), the dataset is loaded into RAM. Ensure sufficient memory is available, including overhead for queries and the runtime. Mitigate with `mode: file` for the duckdb, sqlite, turso, or cayenne accelerators.
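For example, a DuckDB acceleration that would otherwise exhaust RAM can be moved to disk, reusing the file path pattern from the engine-specific examples above:

```yaml
acceleration:
  enabled: true
  engine: duckdb
  mode: file # persists to disk instead of holding the dataset in RAM
  params:
    duckdb_file: ./data/cache.db
```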