Implements reliability patterns including circuit breakers, retries, fallbacks, bulkheads, and SLO definitions. Provides failure mode analysis and incident response plans. Use for "SRE", "reliability", "resilience", or "failure handling".
```bash
npx skill4agent add patricio0312rev/skills reliability-strategy-builder
```

### Circuit Breaker

```typescript
class CircuitBreaker {
  private state: "closed" | "open" | "half-open" = "closed";
  private failureCount = 0;
  private lastFailureTime?: Date;

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.shouldAttemptReset()) {
        this.state = "half-open";
      } else {
        throw new Error("Circuit breaker is OPEN");
      }
    }
    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failureCount = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failureCount++;
    this.lastFailureTime = new Date();
    if (this.failureCount >= 5) {
      this.state = "open";
    }
  }

  private shouldAttemptReset(): boolean {
    if (!this.lastFailureTime) return false;
    const elapsed = Date.now() - this.lastFailureTime.getTime();
    return elapsed > 60_000; // 1 minute
  }
}
```

### Retry with Exponential Backoff

```typescript
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries = 3,
  baseDelay = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt === maxRetries) throw error;
      // Exponential backoff: 1s, 2s, 4s
      const delay = baseDelay * Math.pow(2, attempt);
      await sleep(delay);
    }
  }
  throw new Error("Max retries exceeded");
}
```

### Fallback

```typescript
async function getUserWithFallback(userId: string): Promise<User> {
  try {
    // Try the primary database first
    return await primaryDb.users.findById(userId);
  } catch (error) {
    logger.warn("Primary DB failed, using cache");
    // Fall back to cache
    const cached = await cache.get(`user:${userId}`);
    if (cached) return cached;
    // Final fallback: return a minimal user object
    return {
      id: userId,
      name: "Unknown User",
      email: "unavailable",
    };
  }
}
```

### Bulkhead

```typescript
class ThreadPool {
  private pools = new Map<string, Semaphore>();

  constructor() {
    // Separate pools isolate different classes of work
    this.pools.set("critical", new Semaphore(100));
    this.pools.set("standard", new Semaphore(50));
    this.pools.set("background", new Semaphore(10));
  }

  async execute<T>(priority: string, operation: () => Promise<T>): Promise<T> {
    const pool = this.pools.get(priority);
    if (!pool) throw new Error(`Unknown priority: ${priority}`);
    await pool.acquire();
    try {
      return await operation();
    } finally {
      pool.release();
    }
  }
}
```

### SLO Definitions

```yaml
service: user-api
slos:
  - name: Availability
    description: API should be available for successful requests
    target: 99.9%
    measurement:
      type: ratio
      success: status_code < 500
      total: all_requests
    window: 30 days

  - name: Latency
    description: 95% of requests complete within 500ms
    target: 95%
    measurement:
      type: percentile
      metric: request_duration_ms
      threshold: 500
      percentile: 95
    window: 7 days

  - name: Error Rate
    description: Less than 1% of requests result in errors
    target: 99%
    measurement:
      type: ratio
      success: status_code < 400 OR status_code IN [401, 403, 404]
      total: all_requests
    window: 24 hours
```

### Error Budget

```
Error Budget = 100% - SLO

Example:
  SLO: 99.9% availability
  Error Budget: 0.1% = 43.2 minutes/month downtime allowed
```

### Failure Mode Analysis

| Component   | Failure Mode | Impact | Probability | Detection               | Mitigation                     |
| ----------- | ------------ | ------ | ----------- | ----------------------- | ------------------------------ |
| Database    | Unresponsive | HIGH   | Medium      | Health checks every 10s | Circuit breaker, read replicas |
| API Gateway | Overload     | HIGH   | Low         | Request queue depth     | Rate limiting, auto-scaling    |
| Cache       | Eviction     | MEDIUM | High        | Cache hit rate          | Fallback to DB, larger cache   |
| Queue       | Backed up    | LOW    | Medium      | Queue depth metric      | Add workers, DLQ               |

### Incident Severity Levels

**SEV1 (Critical):** Complete service outage, data loss
- Response time: <15 minutes
- Page on-call immediately

**SEV2 (High):** Partial outage, degraded performance
- Response time: <1 hour
- Alert on-call

**SEV3 (Medium):** Minor issues, workarounds available
- Response time: <4 hours
- Create ticket

**SEV4 (Low):** Cosmetic issues, no user impact
- Response time: Next business day
- Backlog
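The error-budget arithmetic shown earlier (budget = 100% − SLO target, applied over the measurement window) generalizes to any target and window. A minimal sketch — the helper name and signature are illustrative, not part of the skill:

```typescript
// Illustrative helper (not from the skill itself): converts an SLO target
// and a measurement window into the downtime allowed by the error budget.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const budgetFraction = 1 - sloPercent / 100; // e.g. 99.9% -> 0.001
  const windowMinutes = windowDays * 24 * 60;  // e.g. 30 days -> 43,200 min
  return budgetFraction * windowMinutes;
}

// 99.9% availability over a 30-day window -> ~43.2 minutes allowed
console.log(errorBudgetMinutes(99.9, 30).toFixed(1)); // "43.2"
```

Burning through this budget faster than the window elapses is the usual trigger for freezing risky deploys until reliability recovers.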