+-------------------------------------------------------------------+
|                      Circuit Breaker States                       |
+-------------------------------------------------------------------+
|                                                                   |
|   +----------+     failures >= threshold     +----------+         |
|   |  CLOSED  | ----------------------------> |   OPEN   |         |
|   | (normal) |                               | (reject) |         |
|   +----+-----+                               +----+-----+         |
|        |                                          |               |
|        | success                          timeout |               |
|        |                                  expires |               |
|        |            +------------+                |               |
|        |            | HALF_OPEN  |<---------------+               |
|        +------------+  (probe)   |                                |
|                     +------------+                                |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|                                                                   |
+-------------------------------------------------------------------+

- failure_threshold
- recovery_timeout
- half_open_requests
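
The state machine in the diagram can be sketched in a few dozen lines. This is a minimal, single-threaded illustration rather than the skill's actual implementation: the source names the three parameters but does not define them, so the meanings assumed here (`failure_threshold` = consecutive failures before opening, `recovery_timeout` = seconds spent OPEN before probing, `half_open_requests` = successful probes required before closing), the default values, and the names `CircuitBreaker`, `CircuitOpenError`, and `clock` are all assumptions.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold    # consecutive failures before opening
        self.recovery_timeout = recovery_timeout      # seconds to stay OPEN before probing
        self.half_open_requests = half_open_requests  # successful probes needed to close
        self._clock = clock                           # injectable so tests can fake time
        self.state = State.CLOSED
        self._failures = 0
        self._probe_successes = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open; request rejected")
            # Timeout expired: let probe requests through.
            self.state = State.HALF_OPEN
            self._probe_successes = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        if self.state is State.HALF_OPEN:
            self._probe_successes += 1
            if self._probe_successes < self.half_open_requests:
                return  # keep probing before trusting the service again
        self.state = State.CLOSED
        self._failures = 0

    def _record_failure(self):
        self._failures += 1
        # Any failed probe reopens the circuit; in CLOSED, wait for the threshold.
        if self.state is State.HALF_OPEN or self._failures >= self.failure_threshold:
            self.state = State.OPEN
            self._opened_at = self._clock()
```

A caller would wrap each outbound request, for example `breaker.call(fetch, url)`, and catch `CircuitOpenError` to return the fallback that the OPEN state calls for.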
+-------------------------------------------------------------------+
|                        Bulkhead Isolation                         |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+    +------------------+                    |
|   | TIER 1: Critical |    | TIER 2: Standard |                    |
|   |   (5 workers)    |    |   (3 workers)    |                    |
|   | +-+ +-+ +-+      |    | +-+ +-+ +-+      |                    |
|   | |#| |#| | |      |    | |#| | | | |      |                    |
|   | +-+ +-+ +-+      |    | +-+ +-+ +-+      |                    |
|   | +-+ +-+          |    |                  |                    |
|   | | | | |          |    | Queue: 2         |                    |
|   | +-+ +-+          |    |                  |                    |
|   | Queue: 0         |    +------------------+                    |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |    # = Active request                      |
|   |   (2 workers)    |      = Available slot                      |
|   | +-+ +-+          |                                            |
|   | |#| |#|  FULL!   |    Tier 1: synthesis, quality_gate         |
|   | +-+ +-+          |    Tier 2: analysis agents                 |
|   | Queue: 5         |    Tier 3: enrichment, optional features   |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+

| Tier | Workers | Queue | Timeout | Use Case |
|---|---|---|---|---|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
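
The tier limits in the table map naturally onto one semaphore per tier. The sketch below is one possible shape, not the skill's own code: it assumes "Workers" caps concurrent executions, "Queue" caps callers waiting for a worker, and "Timeout" applies to each execution. `Bulkhead`, `BulkheadFullError`, `TIER_LIMITS`, and the `enrich` task are hypothetical names.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a tier has no free worker and no free queue slot."""


class Bulkhead:
    def __init__(self, name, workers, queue_size, timeout):
        self.name = name
        self.queue_size = queue_size   # max callers allowed to wait for a worker
        self.timeout = timeout         # seconds allowed per execution (assumed meaning)
        self._workers = asyncio.Semaphore(workers)  # caps concurrent executions
        self._waiting = 0

    async def run(self, coro_fn, *args):
        # Reject immediately if every worker is busy and the queue is full.
        if self._workers.locked() and self._waiting >= self.queue_size:
            raise BulkheadFullError(f"{self.name}: workers and queue are full")
        self._waiting += 1
        try:
            await self._workers.acquire()
        finally:
            self._waiting -= 1
        try:
            return await asyncio.wait_for(coro_fn(*args), timeout=self.timeout)
        finally:
            self._workers.release()


# (workers, queue_size, timeout_seconds) per tier, copied from the table above.
TIER_LIMITS = {1: (5, 10, 300.0), 2: (3, 5, 120.0), 3: (2, 3, 60.0)}


async def main():
    # Build the bulkheads inside the running event loop.
    tiers = {n: Bulkhead(f"tier-{n}", *limits) for n, limits in TIER_LIMITS.items()}

    async def enrich(doc):  # hypothetical Tier 3 task
        await asyncio.sleep(0.01)
        return f"enriched:{doc}"

    print(await tiers[3].run(enrich, "doc-1"))


if __name__ == "__main__":
    asyncio.run(main())
```

Because each tier owns its semaphore, a saturated Tier 3 rejects new optional work without consuming any Tier 1 capacity, which is the isolation property the diagram illustrates.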
+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1: --> X (fail)                                         |
|              wait: 1s +/- 0.5s                                    |
|                                                                   |
|   Attempt 2: --> X (fail)                                         |
|              wait: 2s +/- 1s                                      |
|                                                                   |
|   Attempt 3: --> X (fail)                                         |
|              wait: 4s +/- 2s                                      |
|                                                                   |
|   Attempt 4: --> OK (success)                                     |
|                                                                   |
|   Formula: delay = min(base * 2^attempt, max_delay) * jitter      |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+

RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,   # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors
    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",      # Retry with truncation
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
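
The backoff formula and the error classification above could be combined roughly as follows. This sketch assumes `attempt` is zero-based, so the first retry waits about `base` seconds as in the diagram, and that errors expose a `status_code` or `code` attribute; those attribute names, the helper names, and the default `max_delay` are illustrative assumptions. `context_length_exceeded` is left out of the automatic path because, per the comment above, it only makes sense to retry after truncating the input.

```python
import random
import time

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}
RETRYABLE_CODES = {"rate_limit_exceeded", "model_overloaded"}


def backoff_delay(attempt, base=1.0, max_delay=30.0, rng=random.uniform):
    """delay = min(base * 2^attempt, max_delay) * jitter, with zero-based attempt."""
    jitter = rng(0.5, 1.5)
    return min(base * 2 ** attempt, max_delay) * jitter


def is_retryable(error):
    if isinstance(error, (ConnectionError, TimeoutError)):
        return True
    # Hypothetical attributes; adapt to the client library in use.
    status = getattr(error, "status_code", None)
    code = getattr(error, "code", None)
    return status in RETRYABLE_STATUS or code in RETRYABLE_CODES


def retry(func, max_attempts=4, base=1.0, max_delay=30.0, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt, base, max_delay))
```

Note that the formula applies jitter after the cap, so an individual delay can exceed `max_delay` by up to 50%.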
+-------------------------------------------------------------------+
|                        LLM Fallback Chain                         |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                      |                                            |
|                    fail                                           |
|                      v                                            |
|              [Fallback Model] --success--> Response               |
|                      |                                            |
|                    fail                                           |
|                      v                                            |
|              [Cached Response] --hit--> Response                  |
|                      |                                            |
|                    miss                                           |
|                      v                                            |
|              [Default Response] --> Graceful Degradation          |
|                                                                   |
|   Example Chain:                                                  |
|     1. claude-sonnet-4-5-20251101 (primary)                       |
|     2. gpt-5.2-mini (fallback)                                    |
|     3. Semantic cache lookup                                      |
|     4. "Analysis unavailable" + partial results                   |
|                                                                   |
+-------------------------------------------------------------------+

+-------------------------------------------------------------------+
|                        Token Budget Guard                         |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                                 ^                 |
|                                                 |                 |
|                                        Context Limit (16K)        |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
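
Two of the four strategies in the Token Budget Guard diagram, dropping low-priority content and failing fast, are simple enough to sketch directly. Summarizing and splitting are omitted because they depend on the model and the task. The four-characters-per-token estimate is a rough assumption standing in for a real tokenizer, and `fit_to_budget`, `ContentTooLargeError`, and `reserve_for_output` are hypothetical names.

```python
class ContentTooLargeError(Exception):
    """Raised when the required content alone exceeds the token budget."""


def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    # Replace with the tokenizer of the model actually in use.
    return max(1, len(text) // 4)


def fit_to_budget(required, optional, context_limit=16_000, reserve_for_output=2_000):
    """Keep all required parts and as many optional parts as fit.

    `optional` is assumed to be ordered from most to least important.
    Returns (kept_parts, estimated_tokens_used).
    """
    budget = context_limit - reserve_for_output
    used = sum(estimate_tokens(part) for part in required)
    if used > budget:
        # Strategy 4: fail fast instead of sending a request that cannot succeed.
        raise ContentTooLargeError(
            f"content too large: {used} tokens exceeds budget of {budget}"
        )
    kept = list(required)
    for part in optional:
        cost = estimate_tokens(part)
        if used + cost > budget:
            continue  # Strategy 2: drop content that does not fit
        kept.append(part)
        used += cost
    return kept, used
```

Running the check before the request is sent means an oversized input is rejected locally instead of failing at the provider with a context-length error.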

| Pattern | When to Use | Key Benefit |
|---|---|---|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |

- references/circuit-breaker.md
- references/bulkhead-pattern.md
- references/retry-strategies.md
- references/llm-resilience.md
- references/error-classification.md

- scripts/circuit-breaker.py
- scripts/bulkhead.py
- scripts/retry-handler.py
- scripts/llm-fallback-chain.py
- scripts/token-budget.py

- examples/orchestkit-workflow-resilience.md

- checklists/pre-deployment-resilience.md
- checklists/circuit-breaker-setup.md

- observability-monitoring
- caching-strategies
- error-handling-rfc9457
- background-jobs

| Decision | Choice | Rationale |
|---|---|---|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |
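
The "model chain with cache" choice in the last row can be illustrated with a short sketch. It assumes each model is supplied as a `(name, callable)` pair and uses an exact-match dictionary as a stand-in for the semantic cache; `run_fallback_chain` and the shape of the returned dictionary are hypothetical, not an API from the skill.

```python
def run_fallback_chain(prompt, models, cache, default_text="Analysis unavailable"):
    """Try each model in order, then the cache, then a default response."""
    errors = []
    for name, call_model in models:
        try:
            return {"source": name, "text": call_model(prompt), "degraded": False}
        except Exception as exc:  # in practice, catch only errors worth falling back on
            errors.append(f"{name}: {exc}")
    cached = cache.get(prompt)  # exact-match stand-in for a semantic cache lookup
    if cached is not None:
        return {"source": "cache", "text": cached, "degraded": True}
    # Last resort: a default response that signals graceful degradation.
    return {"source": "default", "text": default_text, "degraded": True, "errors": errors}


if __name__ == "__main__":
    def primary(prompt):   # stub that simulates an outage
        raise ConnectionError("primary unavailable")

    def fallback(prompt):  # stub that simulates a working fallback model
        return f"summary of: {prompt}"

    result = run_fallback_chain("quarterly report", [("primary", primary), ("fallback", fallback)], {})
    print(result["source"], "->", result["text"])
```

Returning the `source` and a `degraded` flag lets callers decide whether a cached or default answer is acceptable for the operation at hand.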