Production-grade fault tolerance for distributed systems. Use when implementing circuit breakers, retry with exponential backoff, bulkhead isolation patterns, or building resilience into LLM API integrations.
npx skill4agent add yonatangross/skillforge-claude-plugin resilience-patterns

+-------------------------------------------------------------------+
| Circuit Breaker States |
+-------------------------------------------------------------------+
| |
| +----------+ failures >= threshold +----------+ |
| | CLOSED | ----------------------------> | OPEN | |
| | (normal) | | (reject) | |
| +----+-----+ +----+-----+ |
| | | |
| | success timeout | |
| | expires | |
| | +------------+ | |
| | | HALF_OPEN |<-----------------+ |
| +---------+ (probe) | |
| +------------+ |
| |
| CLOSED: Allow requests, count failures |
| OPEN: Reject immediately, return fallback |
| HALF_OPEN: Allow probe request to test recovery |
| |
+-------------------------------------------------------------------+

- failure_threshold
- recovery_timeout
- half_open_requests

+-------------------------------------------------------------------+
| Bulkhead Isolation |
+-------------------------------------------------------------------+
| |
| +------------------+ +------------------+ |
| | TIER 1: Critical | | TIER 2: Standard | |
| | (5 workers) | | (3 workers) | |
| | +-+ +-+ +-+ | | +-+ +-+ +-+ | |
| | |#| |#| | | | | |#| | | | | | |
| | +-+ +-+ +-+ | | +-+ +-+ +-+ | |
| | +-+ +-+ | | | |
| | | | | | | | Queue: 2 | |
| | +-+ +-+ | | | |
| | Queue: 0 | +------------------+ |
| +------------------+ |
| |
| +------------------+ |
| | TIER 3: Optional | # = Active request |
| | (2 workers) | = Available slot |
| | +-+ +-+ | |
| | |#| |#| FULL! | Tier 1: synthesis, quality_gate |
| | +-+ +-+ | Tier 2: analysis agents |
| | Queue: 5 | Tier 3: enrichment, optional features |
| +------------------+ |
| |
+-------------------------------------------------------------------+

| Tier | Workers | Queue | Timeout | Use Case |
|---|---|---|---|---|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
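The tier table above maps naturally onto semaphores. The following is a minimal sketch, assuming an asyncio-based service. `Bulkhead`, `BulkheadFullError`, and `make_tiers` are hypothetical names chosen for illustration and are not taken from `scripts/bulkhead.py`; only the worker, queue, and timeout values come from the table.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a tier has no free worker and its wait queue is full."""


class Bulkhead:
    """One isolation tier: bounded concurrency, bounded queue, per-call timeout."""

    def __init__(self, workers, queue_size, timeout):
        self.workers = workers        # max concurrent calls
        self.queue_size = queue_size  # max callers allowed to wait for a slot
        self.timeout = timeout        # seconds allowed per call
        self._slots = asyncio.Semaphore(workers)
        self._waiting = 0

    async def run(self, make_call):
        """Run the coroutine returned by make_call() inside this tier."""
        # Reject immediately when every worker is busy and the queue is full.
        if self._slots.locked() and self._waiting >= self.queue_size:
            raise BulkheadFullError("tier saturated")
        self._waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self._waiting -= 1
        try:
            # Raises asyncio.TimeoutError if the call exceeds the tier timeout.
            return await asyncio.wait_for(make_call(), timeout=self.timeout)
        finally:
            self._slots.release()


def make_tiers():
    """Build the three tiers from the table; call this inside a running loop."""
    return {
        "critical": Bulkhead(workers=5, queue_size=10, timeout=300),
        "standard": Bulkhead(workers=3, queue_size=5, timeout=120),
        "optional": Bulkhead(workers=2, queue_size=3, timeout=60),
    }
```

Because each tier owns its own semaphore, a saturated optional tier rejects new work without consuming slots that the critical tier needs.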
+-------------------------------------------------------------------+
| Exponential Backoff + Jitter |
+-------------------------------------------------------------------+
| |
| Attempt 1: --> X (fail) |
| wait: 1s +/- 0.5s |
| |
| Attempt 2: --> X (fail) |
| wait: 2s +/- 1s |
| |
| Attempt 3: --> X (fail) |
| wait: 4s +/- 2s |
| |
| Attempt 4: --> OK (success) |
| |
| Formula: delay = min(base * 2^attempt, max_delay) * jitter |
| Jitter: random(0.5, 1.5) to prevent thundering herd |
| |
+-------------------------------------------------------------------+

```python
RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,  # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors
    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retry with truncation
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
```

+-------------------------------------------------------------------+
| LLM Fallback Chain |
+-------------------------------------------------------------------+
| |
| Request --> [Primary Model] --success--> Response |
| | |
| fail |
| v |
| [Fallback Model] --success--> Response |
| | |
| fail |
| v |
| [Cached Response] --hit--> Response |
| | |
| miss |
| v |
| [Default Response] --> Graceful Degradation |
| |
| Example Chain: |
| 1. claude-sonnet-4-5-20251101 (primary) |
| 2. gpt-5.2-mini (fallback) |
| 3. Semantic cache lookup |
| 4. "Analysis unavailable" + partial results |
| |
+-------------------------------------------------------------------+

+-------------------------------------------------------------------+
| Token Budget Guard |
+-------------------------------------------------------------------+
| |
| Input: 8,000 tokens |
| +---------------------------------------------+ |
| |################################# | |
| +---------------------------------------------+ |
| ^ |
| | |
| Context Limit (16K) |
| |
| Strategy when approaching limit: |
| 1. Summarize earlier context (compress 4:1) |
| 2. Drop low-priority content (optional fields) |
| 3. Split into multiple requests |
| 4. Fail fast with "content too large" error |
| |
+-------------------------------------------------------------------+

| Pattern | When to Use | Key Benefit |
|---|---|---|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |
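The last two rows of this table, the fallback chain and the token budget guard, can be sketched together. This is an illustration under stated assumptions, not the code in `scripts/llm-fallback-chain.py` or `scripts/token-budget.py`: every function name is hypothetical, the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and a step returning `None` is assumed to model a cache miss. The budget guard covers only two of the four listed strategies: dropping low-priority content and failing fast.

```python
class ContentTooLargeError(Exception):
    """Raised when even the required content exceeds the token budget."""


def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def fit_to_budget(required, optional, limit=16_000):
    """Keep required text, then add optional parts while they still fit.

    `optional` is ordered from highest to lowest priority, so the
    lowest-priority parts are the first to be dropped.
    """
    used = estimate_tokens(required)
    if used > limit:
        # Nothing left to drop: fail fast instead of sending the request.
        raise ContentTooLargeError("content too large")
    kept = [required]
    for part in optional:
        cost = estimate_tokens(part)
        if used + cost > limit:
            break
        kept.append(part)
        used += cost
    return "\n".join(kept), used


def fallback_chain(steps, default):
    """Try each zero-argument callable in order; return the first result.

    A step that raises, or that returns None (a cache miss), passes
    control to the next step. `default` is the graceful-degradation reply.
    """
    for step in steps:
        try:
            result = step()
        except Exception:
            continue
        if result is not None:
            return result
    return default
```

A caller would pass the primary model call, the fallback model call, and the cache lookup as `steps`, in that order, with a message such as "Analysis unavailable" as `default`.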

- references/circuit-breaker.md
- references/bulkhead-pattern.md
- references/retry-strategies.md
- references/llm-resilience.md
- references/error-classification.md
- scripts/circuit-breaker.py
- scripts/bulkhead.py
- scripts/retry-handler.py
- scripts/llm-fallback-chain.py
- scripts/token-budget.py
- examples/orchestkit-workflow-resilience.md
- checklists/pre-deployment-resilience.md
- checklists/circuit-breaker-setup.md

Related skills: observability-monitoring, caching-strategies, error-handling-rfc9457, background-jobs

| Decision | Choice | Rationale |
|---|---|---|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |
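The first two decisions, half-open probing and exponential backoff with jitter, compose well: the retry loop wraps the breaker, and an open circuit stops the retries at once. The sketch below is illustrative only and is not the code in `scripts/circuit-breaker.py` or `scripts/retry-handler.py`. The parameter names `failure_threshold`, `recovery_timeout`, and `half_open_requests` come from this document; their defaults and the class and function names are assumptions. In particular, `half_open_requests` is assumed to mean the number of successful probes required before closing, and failures are counted consecutively. `attempt` is zero-based, so the delays center on 1s, 2s, and 4s as in the backoff diagram.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 half_open_requests=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_requests = half_open_requests  # successes needed to close
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._probe_successes = 0
        self._opened_at = 0.0

    def call(self, func):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; caller should use a fallback")
            # Recovery timeout expired: let probe requests through.
            self.state = "HALF_OPEN"
            self._probe_successes = 0
        try:
            result = func()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed probe, or too many failures while closed, opens the circuit.
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()

    def _record_success(self):
        if self.state == "HALF_OPEN":
            self._probe_successes += 1
            if self._probe_successes < self.half_open_requests:
                return  # keep probing before closing
        self.state = "CLOSED"
        self._failures = 0


def backoff_delay(attempt, base=1.0, max_delay=60.0, rng=random.uniform):
    """delay = min(base * 2^attempt, max_delay) * jitter, jitter in [0.5, 1.5]."""
    return min(base * 2 ** attempt, max_delay) * rng(0.5, 1.5)


def retry(func, max_attempts=4, retryable=(ConnectionError, TimeoutError),
          sleep=time.sleep, rng=random.uniform):
    """Call func until it succeeds, sleeping with backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
```

Calling `retry(lambda: breaker.call(fetch))`, where `fetch` is a hypothetical request function, retries transient errors with backoff. `CircuitOpenError` is not in `retryable`, so it propagates immediately. Because jitter is applied after the cap, as in the formula above, a delay can reach 1.5 times `max_delay`.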