Workflow Orchestration Patterns

Master workflow orchestration architecture with Temporal, covering fundamental design decisions, resilience patterns, and best practices for building reliable distributed systems.

When to Use Workflow Orchestration

Ideal Use Cases (Source: docs.temporal.io)

  • Multi-step processes spanning machines/services/databases
  • Distributed transactions requiring all-or-nothing semantics
  • Long-running workflows (hours to years) with automatic state persistence
  • Failure recovery that must resume from last successful step
  • Business processes: bookings, orders, campaigns, approvals
  • Entity lifecycle management: inventory tracking, account management, cart workflows
  • Infrastructure automation: CI/CD pipelines, provisioning, deployments
  • Human-in-the-loop systems requiring timeouts and escalations

When NOT to Use

  • Simple CRUD operations (use direct API calls)
  • Pure data processing pipelines (use Airflow, batch processing)
  • Stateless request/response (use standard APIs)
  • Real-time streaming (use Kafka, event processors)

Critical Design Decision: Workflows vs Activities

The Fundamental Rule (Source: temporal.io/blog/workflow-engine-principles):
  • Workflows = Orchestration logic and decision-making
  • Activities = External interactions (APIs, databases, network calls)

Workflows (Orchestration)

Characteristics:
  • Contain business logic and coordination
  • MUST be deterministic (replaying the same history must produce the same decisions)
  • Cannot perform direct external calls
  • State automatically preserved across failures
  • Can run for years despite infrastructure failures
Example workflow tasks:
  • Decide which steps to execute
  • Handle compensation logic
  • Manage timeouts and retries
  • Coordinate child workflows

Activities (External Interactions)

Characteristics:
  • Handle all external system interactions
  • Can be non-deterministic (API calls, DB writes)
  • Include built-in timeouts and retry logic
  • Must be idempotent (calling N times = calling once)
  • Short-lived (seconds to minutes typically)
Example activity tasks:
  • Call payment gateway API
  • Write to database
  • Send emails or notifications
  • Query external services

Design Decision Framework

Does it touch external systems? → Activity
Is it orchestration/decision logic? → Workflow

Core Workflow Patterns

1. Saga Pattern with Compensation

Purpose: Implement distributed transactions with rollback capability
Pattern (Source: temporal.io/blog/compensating-actions-part-of-a-complete-breakfast-with-sagas):
For each step:
  1. Register compensation BEFORE executing
  2. Execute the step (via activity)
  3. On failure, run all compensations in reverse order (LIFO)
Example: Payment Workflow
  1. Reserve inventory (compensation: release inventory)
  2. Charge payment (compensation: refund payment)
  3. Fulfill order (compensation: cancel fulfillment)
Critical Requirements:
  • Compensations must be idempotent
  • Register compensation BEFORE executing step
  • Run compensations in reverse order
  • Handle partial failures gracefully
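The register-then-execute loop above can be sketched in plain Python. This is a minimal illustration that does not use the Temporal SDK: `run_saga`, `fail_fulfillment`, and the no-op lambdas are hypothetical stand-ins for activities and their compensations.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run (name, action, compensation) steps; undo in LIFO order on failure."""
    log: List[str] = []
    compensations: List[Tuple[str, Callable[[], None]]] = []
    try:
        for name, action, compensate in steps:
            # Register the compensation BEFORE executing the step, so a step
            # that fails partway through is still undone.
            compensations.append((name, compensate))
            action()
            log.append(f"done:{name}")
    except Exception as exc:
        log.append(f"failed:{exc}")
        # Unwind in reverse registration order (LIFO).
        for name, compensate in reversed(compensations):
            try:
                compensate()
                log.append(f"compensated:{name}")
            except Exception as comp_exc:
                # Keep unwinding so one failed compensation does not block the rest.
                log.append(f"compensation-failed:{name}:{comp_exc}")
    return log


def fail_fulfillment() -> None:
    raise RuntimeError("fulfillment unavailable")


saga_log = run_saga([
    ("reserve_inventory", lambda: None, lambda: None),
    ("charge_payment", lambda: None, lambda: None),
    ("fulfill_order", fail_fulfillment, lambda: None),
])
```

Because the compensation is registered first, the failed step's own compensation also runs. That is one reason compensations need to be idempotent and safe to call when the step never took effect.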

2. Entity Workflows (Actor Model)

Purpose: A long-lived workflow that represents a single entity instance
Pattern (Source: docs.temporal.io/evaluate/use-cases-design-patterns):
  • One workflow execution = one entity (cart, account, inventory item)
  • Workflow persists for entity lifetime
  • Receives signals for state changes
  • Supports queries for current state
Example Use Cases:
  • Shopping cart (add items, checkout, expiration)
  • Bank account (deposits, withdrawals, balance checks)
  • Product inventory (stock updates, reservations)
Benefits:
  • Encapsulates entity behavior
  • Guarantees consistency per entity
  • Natural event sourcing
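A toy model of the entity pattern, written in plain Python rather than against the Temporal SDK. `CartEntity`, its `signal` and `query` methods, and the signal names are hypothetical; the sketch only shows the shape of "one object per entity, signals mutate, queries read".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CartEntity:
    """One instance stands in for one long-lived workflow execution."""
    cart_id: str
    items: Dict[str, int] = field(default_factory=dict)
    checked_out: bool = False
    history: List[Tuple[str, tuple]] = field(default_factory=list)

    def signal(self, name: str, *args) -> None:
        """Apply a state change; signals are handled one at a time, in order."""
        self.history.append((name, args))  # every signal is recorded
        if self.checked_out:
            return  # recorded, but ignored once the cart is checked out
        if name == "add_item":
            sku, qty = args
            self.items[sku] = self.items.get(sku, 0) + qty
        elif name == "checkout":
            self.checked_out = True

    def query(self) -> dict:
        """Read-only snapshot of the current state; never mutates the entity."""
        return {"items": dict(self.items), "checked_out": self.checked_out}


cart = CartEntity("cart-42")
cart.signal("add_item", "sku-1", 2)
cart.signal("add_item", "sku-1", 1)
cart.signal("checkout")
cart.signal("add_item", "sku-2", 5)  # arrives after checkout, so it has no effect
```

The recorded `history` list is what makes the "natural event sourcing" point concrete: the current state can be rebuilt by replaying the signals in order.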

3. Fan-Out/Fan-In (Parallel Execution)

Purpose: Execute multiple tasks in parallel, aggregate results
Pattern:
  • Spawn child workflows or parallel activities
  • Wait for all to complete
  • Aggregate results
  • Handle partial failures
Scaling Rule (Source: temporal.io/blog/workflow-engine-principles):
  • Scale out across many workflows instead of growing a single workflow
  • For 1M tasks: spawn 1K child workflows × 1K tasks each
  • Keep each workflow bounded
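The 1M = 1K × 1K split can be sketched with `asyncio` from the standard library. This does not call the Temporal SDK: `child_workflow` and `parent_workflow` are hypothetical coroutines standing in for real child workflows, and doubling each task is placeholder work.

```python
import asyncio
from typing import List, Sequence, Tuple


def chunk(tasks: Sequence[int], size: int) -> List[Sequence[int]]:
    """Split the task list into bounded batches, one per child."""
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]


async def child_workflow(batch: Sequence[int]) -> int:
    # Stand-in for a child workflow that handles one bounded batch.
    return sum(task * 2 for task in batch)


async def parent_workflow(tasks: Sequence[int], batch_size: int = 1_000) -> Tuple[int, int]:
    batches = chunk(tasks, batch_size)
    # Fan out: start one child per batch. Fan in: wait for all of them.
    partials = await asyncio.gather(*(child_workflow(b) for b in batches))
    return len(batches), sum(partials)


child_count, total = asyncio.run(parent_workflow(range(1_000_000)))
```

The parent only ever tracks 1,000 children, and each child only ever tracks 1,000 tasks, so no single execution grows with the total input size.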

4. Async Callback Pattern

Purpose: Wait for external event or human approval
Pattern:
  • Workflow sends request and waits for signal
  • External system processes asynchronously
  • External system sends a signal to resume the workflow
  • Workflow continues with response
Use Cases:
  • Human approval workflows
  • Webhook callbacks
  • Long-running external processes
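A small `asyncio` sketch of waiting for a callback with a timeout. It is a plain-Python analogy, not Temporal SDK code: `ApprovalWorkflow`, `signal_decision`, and the "escalated" outcome are hypothetical names chosen for the illustration.

```python
import asyncio
from typing import Optional, Tuple


class ApprovalWorkflow:
    """Parks until an external decision arrives or the timeout expires."""

    def __init__(self) -> None:
        self._decision: Optional[str] = None
        self._received = asyncio.Event()

    def signal_decision(self, decision: str) -> None:
        # Called by the external system (or a human) to resume the workflow.
        self._decision = decision
        self._received.set()

    async def run(self, timeout_s: float) -> str:
        try:
            await asyncio.wait_for(self._received.wait(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return "escalated"  # nobody answered in time
        return f"continued:{self._decision}"


async def demo() -> Tuple[str, str]:
    answered = ApprovalWorkflow()
    # The external system replies shortly after the workflow starts waiting.
    asyncio.get_running_loop().call_later(0.01, answered.signal_decision, "approve")
    first = await answered.run(timeout_s=1.0)

    unanswered = ApprovalWorkflow()
    second = await unanswered.run(timeout_s=0.05)
    return first, second


callback_results = asyncio.run(demo())
```

Pairing every wait with a timeout is what turns "wait for approval" into "wait for approval, then escalate", which the human-in-the-loop use cases rely on.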

State Management and Determinism

Automatic State Preservation

How Temporal Works (Source: docs.temporal.io/workflows):
  • Complete program state preserved automatically
  • Event History records every command and event
  • Seamless recovery from crashes
  • Applications restore pre-failure state

Determinism Constraints

Workflows Execute as State Machines:
  • Replay behavior must be consistent
  • Same inputs → identical outputs every time
Prohibited in Workflows (Source: docs.temporal.io/workflows):
  • ❌ Threading, locks, synchronization primitives
  • ❌ Random number generation (random())
  • ❌ Global state or static variables
  • ❌ System time (datetime.now())
  • ❌ Direct file I/O or network calls
  • ❌ Non-deterministic libraries
Allowed in Workflows:
  • ✅ workflow.now() (deterministic time)
  • ✅ workflow.random() (deterministic random)
  • ✅ Pure functions and calculations
  • ✅ Calling activities (non-deterministic operations)
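To see why time and randomness have to come from the runtime, here is a toy replay model in plain Python. `ReplayContext` is a hypothetical stand-in for the deterministic helpers; the assumption is that the seed and the timestamp are values the runtime recorded in history, so a replay sees exactly what the first run saw.

```python
from datetime import datetime, timezone
from random import Random
from typing import Tuple


class ReplayContext:
    """Toy stand-in for the deterministic helpers a workflow runtime provides."""

    def __init__(self, seed: int, started_at: datetime) -> None:
        self._rng = Random(seed)   # assumed: seed comes from recorded history
        self._now = started_at     # assumed: time comes from recorded events

    def now(self) -> datetime:
        return self._now

    def random(self) -> Random:
        return self._rng


def decide_discount(ctx: ReplayContext) -> Tuple[int, str]:
    # Workflow-style logic: reads time and randomness only through the context.
    roll = ctx.random().randint(1, 100)
    weekend = ctx.now().weekday() >= 5
    return roll, "weekend-promo" if weekend else "standard"


started = datetime(2024, 1, 6, 12, 0, tzinfo=timezone.utc)  # a Saturday
first_run = decide_discount(ReplayContext(seed=7, started_at=started))
replayed = decide_discount(ReplayContext(seed=7, started_at=started))
```

Had `decide_discount` called the wall clock or an unseeded generator, the replay could take a different branch than the original run, and the recorded history would no longer match the code.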

Versioning Strategies

Challenge: Changing workflow code while old executions are still running
Solutions:
  1. Versioning API: Use the SDK's versioning call for safe changes (workflow.patched() in the Python SDK, GetVersion in the Go and Java SDKs)
  2. New Workflow Type: Create new workflow, route new executions to it
  3. Backward Compatibility: Ensure old events replay correctly
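The idea behind a versioning call can be modeled with a dictionary of markers. This is a simplified sketch and not the SDK's implementation: `get_version`, `DEFAULT_VERSION`, the `replaying` flag, and the "switch-to-sms" change ID are all hypothetical, loosely modeled on how version markers are described.

```python
from typing import Dict

DEFAULT_VERSION = -1  # assumed sentinel meaning "history predates this change"


def get_version(markers: Dict[str, int], change_id: str,
                max_supported: int, replaying: bool) -> int:
    """Return the version to use for change_id on this execution."""
    if change_id in markers:
        return markers[change_id]      # replay follows what was recorded
    if replaying:
        return DEFAULT_VERSION         # old history: keep the old code path
    markers[change_id] = max_supported  # new execution: record the new version
    return max_supported


def notify_step(markers: Dict[str, int], replaying: bool) -> str:
    version = get_version(markers, "switch-to-sms", 1, replaying)
    return "send_email" if version == DEFAULT_VERSION else "send_sms"


old_history: Dict[str, int] = {}
new_history: Dict[str, int] = {}
old_replay = notify_step(old_history, replaying=True)   # started before the change
new_run = notify_step(new_history, replaying=False)     # started after the change
new_replay = notify_step(new_history, replaying=True)   # later replay of the new run
```

Each execution stays on the branch it originally took, which is what lets old and new code paths coexist in one deployed workflow definition.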

Resilience and Error Handling

Retry Policies

Default Behavior: Temporal retries failed activities indefinitely, with exponential backoff, until a retry policy limit or an activity timeout stops it
Configure Retry:
  • Initial retry interval
  • Backoff coefficient (exponential backoff)
  • Maximum interval (cap retry delay)
  • Maximum attempts (eventually fail)
Non-Retryable Errors:
  • Invalid input (validation failures)
  • Business rule violations
  • Permanent failures (resource not found)
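How the four settings interact is easiest to see as arithmetic. The sketch below is plain Python, not the SDK's retry policy class: `RetrySettings` and both helper functions are hypothetical, and the default values (1 s initial interval, coefficient 2.0, cap of 100 s) are assumptions to check against the Temporal documentation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RetrySettings:
    initial_interval_s: float = 1.0
    backoff_coefficient: float = 2.0
    maximum_interval_s: float = 100.0
    maximum_attempts: Optional[int] = None   # None = no attempt limit
    non_retryable: Tuple[str, ...] = ()      # error type names that fail fast


def delay_before_attempt(settings: RetrySettings, attempt: int) -> float:
    """Seconds to wait before attempt N, where attempt 2 is the first retry."""
    raw = settings.initial_interval_s * settings.backoff_coefficient ** (attempt - 2)
    return min(raw, settings.maximum_interval_s)


def should_retry(settings: RetrySettings, attempts_made: int, error_type: str) -> bool:
    if error_type in settings.non_retryable:
        return False  # e.g. validation failures never succeed on retry
    if settings.maximum_attempts is not None and attempts_made >= settings.maximum_attempts:
        return False  # attempt budget exhausted
    return True


defaults = RetrySettings()
delays = [delay_before_attempt(defaults, n) for n in range(2, 10)]
strict = RetrySettings(maximum_attempts=3, non_retryable=("ValidationError",))
```

With these assumed defaults the delay doubles from 1 s until it reaches the 100 s cap, then stays there for every later retry.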

Idempotency Requirements

Why Critical (Source: docs.temporal.io/activities):
  • Activities may execute multiple times
  • Network failures trigger retries
  • Duplicate execution must be safe
Implementation Strategies:
  • Idempotency keys (deduplication)
  • Check-then-act with unique constraints
  • Upsert operations instead of insert
  • Track processed request IDs
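The idempotency-key strategy can be sketched as follows. `PaymentGateway` is a hypothetical in-memory service, and building the key from a workflow ID plus a step name is one possible convention, assumed here for illustration.

```python
from typing import Dict


class PaymentGateway:
    """Hypothetical in-memory gateway that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key in self._results:
            # Duplicate request: return the stored result, charge nothing.
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}"
        self._results[idempotency_key] = receipt
        return receipt


def charge_activity(gateway: PaymentGateway, workflow_id: str,
                    step: str, amount_cents: int) -> str:
    # The key is stable across retries of the same step in the same workflow.
    key = f"{workflow_id}:{step}"
    return gateway.charge(key, amount_cents)


gateway = PaymentGateway()
first_receipt = charge_activity(gateway, "order-1001", "charge", 4999)
retry_receipt = charge_activity(gateway, "order-1001", "charge", 4999)
other_receipt = charge_activity(gateway, "order-1002", "charge", 1500)
```

The key must be derived from values that do not change between attempts; a key generated fresh inside each attempt would defeat the deduplication.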

Activity Heartbeats

Purpose: Detect stalled long-running activities
Pattern:
  • Activity sends periodic heartbeat
  • Includes progress information
  • Activity times out if no heartbeat arrives within the heartbeat timeout
  • Lets a retry resume from the last reported progress
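Progress-based retry can be illustrated without the SDK. In this sketch `process_rows`, `WorkerCrash`, and the `crash_at` parameter are hypothetical; the `heartbeat` callback stands in for the SDK's heartbeat call, and `last_heartbeat` stands in for the progress details a retried attempt would read back.

```python
from typing import Callable, List, Optional


class WorkerCrash(Exception):
    """Simulates the worker dying partway through the activity."""


def process_rows(rows: List[str], last_heartbeat: Optional[int],
                 heartbeat: Callable[[int], None],
                 crash_at: Optional[int] = None) -> List[str]:
    # Resume right after the last row that was reported as finished.
    start = 0 if last_heartbeat is None else last_heartbeat + 1
    done: List[str] = []
    for index in range(start, len(rows)):
        if crash_at == index:
            raise WorkerCrash(f"crashed before row {index}")
        done.append(rows[index].upper())
        heartbeat(index)  # report progress after each completed row
    return done


progress: List[int] = []
rows = ["a", "b", "c", "d", "e"]

try:
    process_rows(rows, None, progress.append, crash_at=3)  # first attempt
except WorkerCrash:
    pass

progress_after_crash = list(progress)
resumed = process_rows(rows, progress[-1], progress.append)  # retry attempt
```

The retry skips the three rows already reported and processes only the remaining two, which is the saving heartbeat details are meant to provide for long tasks.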

Best Practices

Workflow Design

  1. Keep workflows focused - Single responsibility per workflow
  2. Small workflows - Use child workflows for scalability
  3. Clear boundaries - Workflow orchestrates, activities execute
  4. Test locally - Use time-skipping test environment

Activity Design

  1. Idempotent operations - Safe to retry
  2. Short-lived - Seconds to minutes, not hours
  3. Timeout configuration - Always set timeouts
  4. Heartbeat for long tasks - Report progress
  5. Error handling - Distinguish retryable vs non-retryable

Common Pitfalls

Workflow Violations:
  • Using datetime.now() instead of workflow.now()
  • Threading or concurrency that bypasses the SDK's workflow primitives
  • Calling external APIs directly from workflow
  • Non-deterministic logic in workflows
Activity Mistakes:
  • Non-idempotent operations (can't handle retries)
  • Missing timeouts (activities run forever)
  • No error classification (validation errors get retried)
  • Ignoring payload limits (2MB per argument)

Operational Considerations

Monitoring:
  • Workflow execution duration
  • Activity failure rates
  • Retry attempts and backoff
  • Pending workflow counts
Scalability:
  • Horizontal scaling with workers
  • Task queue partitioning
  • Child workflow decomposition
  • Activity batching when appropriate

Additional Resources

Official Documentation:
  • Temporal Core Concepts: docs.temporal.io/workflows
  • Workflow Patterns: docs.temporal.io/evaluate/use-cases-design-patterns
  • Best Practices: docs.temporal.io/develop/best-practices
  • Saga Pattern: temporal.io/blog/saga-pattern-made-easy
Key Principles:
  1. Workflows = orchestration, Activities = external calls
  2. Determinism is non-negotiable for workflows
  3. Idempotency is critical for activities
  4. State preservation is automatic
  5. Design for failure and recovery