spice-accelerators

Spice Data Accelerators

Accelerators materialize data locally from connected sources for faster queries and reduced load on source systems.

Basic Configuration

```yaml
datasets:
  - from: postgres:my_table
    name: my_table
    acceleration:
      enabled: true
      engine: duckdb # arrow, duckdb, sqlite, cayenne, postgres, turso
      mode: memory # memory or file
      refresh_check_interval: 1h
```

Choosing an Accelerator

| Use Case | Engine | Why |
|---|---|---|
| Small datasets (<1 GB), max speed | arrow | In-memory, lowest latency |
| Medium datasets (1-100 GB), complex SQL | duckdb | Mature SQL, memory management |
| Large datasets (100 GB-1+ TB), analytics | cayenne | Built on Vortex (Linux Foundation), 10-20x faster scans |
| Point lookups on large datasets | cayenne | 100x faster random access vs Parquet |
| Simple queries, low resource usage | sqlite | Lightweight, minimal overhead |
| Async operations, concurrent workloads | turso | Native async, modern connection pooling |
| External database integration | postgres | Leverage existing PostgreSQL infra |

Cayenne vs DuckDB

Choose Cayenne when datasets exceed ~1 TB, multi-file ingestion is needed, or point lookups are common. Choose DuckDB when datasets are under ~1 TB, complex SQL (window functions, CTEs) is needed, or DuckDB tooling is beneficial.
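As an illustration of the large-dataset case, a Cayenne acceleration pairs the engine with file mode (Cayenne is file-based). This is a minimal sketch; the source URI and dataset name are illustrative, not from this page:

```yaml
datasets:
  - from: s3://my_bucket/events/ # illustrative source
    name: events
    acceleration:
      enabled: true
      engine: cayenne
      mode: file # cayenne runs in file mode
      refresh_check_interval: 1h
```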

Supported Engines

| Engine | Mode | Status |
|---|---|---|
| arrow | memory | Stable |
| duckdb | memory, file | Stable |
| sqlite | memory, file | Release Candidate |
| cayenne | file | Beta |
| postgres | N/A (attached) | Release Candidate |
| turso | memory, file | Beta |

Refresh Modes

| Mode | Description | Use Case |
|---|---|---|
| full | Complete dataset replacement on each refresh | Small, slowly-changing datasets |
| append (batch) | Adds new records based on a time_column | Append-only logs, time-series data |
| append (stream) | Continuous streaming without a time column | Real-time event streams (Kafka, Debezium) |
| changes | CDC-based incremental updates via Debezium or DynamoDB Streams | Frequently updated transactional data |
| caching | Request-based row-level caching | API responses, HTTP endpoints |

Full refresh every 8 hours

```yaml
acceleration:
  refresh_mode: full
  refresh_check_interval: 8h
```

Append mode: check for new records from the last day every 10 minutes

```yaml
acceleration:
  refresh_mode: append
  time_column: created_at
  refresh_check_interval: 10m
  refresh_data_window: 1d
```

Continuous ingestion using Kafka

```yaml
acceleration:
  refresh_mode: append
```

CDC with Debezium or DynamoDB Streams

```yaml
acceleration:
  refresh_mode: changes
```
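The refresh modes table also lists a caching mode, which has no example on this page. Assuming it is selected through `refresh_mode` like the other modes (an assumption, not confirmed here), a minimal sketch would be:

```yaml
acceleration:
  refresh_mode: caching # assumption: selected like the other refresh modes
```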

Common Configurations

In-Memory with Interval Refresh

```yaml
acceleration:
  enabled: true
  engine: arrow
  refresh_check_interval: 5m
```

File-Based with Append and Time Window

```yaml
datasets:
  - from: postgres:events
    name: events
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_check_interval: 1h
      refresh_data_window: 7d
```

With Retention Policy

Retention policies prevent unbounded growth of accelerated datasets. Spice supports time-based and custom SQL-based retention strategies:
```yaml
datasets:
  - from: postgres:events
    name: events
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb
      retention_check_enabled: true
      retention_period: 30d
      retention_check_interval: 1h
```

With SQL-Based Retention

```yaml
acceleration:
  retention_check_enabled: true
  retention_check_interval: 1h
  retention_sql: "DELETE FROM logs WHERE status = 'archived'"
```

With Indexes (DuckDB, SQLite, Turso)

```yaml
acceleration:
  enabled: true
  engine: sqlite
  indexes:
    user_id: enabled
    '(created_at, status)': unique
  primary_key: id
```

Engine-Specific Parameters

DuckDB

```yaml
acceleration:
  engine: duckdb
  mode: file
  params:
    duckdb_file: ./data/cache.db
```

SQLite

```yaml
acceleration:
  engine: sqlite
  mode: file
  params:
    sqlite_file: ./data/cache.sqlite
```

Constraints and Indexes

Accelerated datasets support primary key constraints and indexes:
```yaml
acceleration:
  enabled: true
  engine: duckdb
  primary_key: order_id # Creates non-null unique index
  indexes:
    customer_id: enabled # Single column index
    '(created_at, status)': unique # Multi-column unique index
```

Snapshots (DuckDB, SQLite & Cayenne file mode)

Bootstrap file-based accelerations from S3 or filesystem snapshots on startup. This dramatically reduces cold-start latency in distributed deployments.
Snapshot triggers vary by refresh mode:

- `refresh_complete`: Creates snapshots after each refresh (full and batch-append modes)
- `time_interval`: Creates snapshots on a fixed schedule (all refresh modes)
- `stream_batches`: Creates snapshots after every N batches (streaming modes: Kafka, Debezium, DynamoDB Streams)
```yaml
snapshots:
  enabled: true
  location: s3://my_bucket/snapshots/
  bootstrap_on_failure_behavior: warn # warn | retry | fallback
  params:
    s3_auth: iam_role
```
Per-dataset opt-in:
```yaml
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  snapshots:
    enabled: true
```

Memory Considerations

When using `mode: memory` (the default), the dataset is loaded entirely into RAM. Ensure sufficient memory is available, including overhead for queries and the runtime. To reduce memory pressure, use `mode: file` with the duckdb, sqlite, turso, or cayenne accelerators.
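For example, the DuckDB file-mode parameters shown earlier keep the accelerated working set on disk instead of in RAM (the path is illustrative):

```yaml
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  params:
    duckdb_file: ./data/cache.db # illustrative path
```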

Documentation
