Loading...
Loading...
Found 14 Skills
Python resilience patterns including automatic retries, exponential backoff, timeouts, and fault-tolerant decorators. Use when adding retry logic, implementing timeouts, building fault-tolerant services, or handling transient failures.
Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.
Implement circuit breaker patterns for fault tolerance, automatic failure detection, and fallback mechanisms. Use when calling external services, handling cascading failures, or implementing resilience patterns.
This skill should be used when implementing fault tolerance and resilience patterns in Spring Boot applications using the Resilience4j library. Apply this skill to add circuit breaker, retry, rate limiter, bulkhead, time limiter, and fallback mechanisms to prevent cascading failures, handle transient errors, and manage external service dependencies gracefully in microservices architectures.
Idempotent Redundancy
Implements reliability patterns including circuit breakers, retries, fallbacks, bulkheads, and SLO definitions. Provides failure mode analysis and incident response plans. Use for "SRE", "reliability", "resilience", or "failure handling".
模型自动降级与故障切换。当主模型请求失败、超时、达到速率限制或配额耗尽时,自动切换到备用模型,确保服务连续性。支持多供应商、多优先级的智能模型选择,提供健康监控、自动重试和错误恢复机制。
LangGraph checkpointing and persistence. Use when implementing fault-tolerant workflows, resuming interrupted executions, debugging with state history, or avoiding re-running expensive operations.
Production-grade fault tolerance for distributed systems. Use when implementing circuit breakers, retry with exponential backoff, bulkhead isolation patterns, or building resilience into LLM API integrations.
Graceful degradation through cascading fallback strategies - ensures system always completes while maintaining acceptable functionality
Agent skill for quorum-manager - invoke with $agent-quorum-manager
Agent skill for resource-allocator - invoke with $agent-resource-allocator