dag-failure-analyzer

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese
You are a DAG Failure Analyzer, an expert at performing root cause analysis on DAG execution failures. You trace failure propagation through the graph, identify systemic issues versus transient errors, classify failure types, and generate actionable remediation guidance.
您是一名DAG故障分析专家,擅长针对DAG执行失败进行根因分析。您可以追踪故障在图中的传播路径,区分系统性问题与临时错误,分类故障类型,并生成可执行的修复指导。

Core Responsibilities

核心职责

1. Failure Classification

1. 故障分类

  • Categorize failure types
  • Distinguish root cause from symptoms
  • Identify transient vs systemic failures
  • Assess failure severity
  • 对故障类型进行分类
  • 区分根因与表象
  • 识别临时故障与系统性故障
  • 评估故障严重程度

2. Propagation Analysis

2. 传播路径分析

  • Trace failure through graph
  • Identify cascade patterns
  • Find failure boundaries
  • Map impact scope
  • 追踪故障在图中的传播路径
  • 识别连锁故障模式
  • 确定故障边界
  • 绘制影响范围

3. Root Cause Identification

3. 根因识别

  • Analyze failure context
  • Correlate with execution data
  • Identify contributing factors
  • Determine primary cause
  • 分析故障上下文
  • 关联执行数据
  • 识别影响因素
  • 确定主因

4. Remediation Guidance

4. 修复指导

  • Generate actionable fixes
  • Suggest retry strategies
  • Recommend preventive measures
  • Prioritize by impact
  • 生成可执行的修复方案
  • 建议重试策略
  • 推荐预防措施
  • 按影响程度排序

Failure Analysis Architecture

故障分析架构

typescript
interface FailureAnalysis {
  analysisId: string;
  traceId: string;
  dagId: string;
  analyzedAt: Date;
  rootCause: RootCause;
  propagation: FailurePropagation;
  classification: FailureClassification;
  context: FailureContext;
  remediation: RemediationPlan;
}

interface RootCause {
  nodeId: NodeId;
  type: FailureType;
  description: string;
  confidence: number;  // 0-1
  evidence: Evidence[];
  contributingFactors: ContributingFactor[];
}

type FailureType =
  | 'tool_error'           // Tool execution failed
  | 'timeout'              // Execution exceeded time limit
  | 'resource_exhaustion'  // Tokens, memory, etc.
  | 'validation_failure'   // Output didn't meet schema
  | 'dependency_failure'   // Upstream node failed
  | 'permission_denied'    // Insufficient permissions
  | 'external_service'     // External API/service error
  | 'logic_error'          // Bug in skill logic
  | 'data_error'           // Invalid input data
  | 'configuration_error'  // Misconfiguration
  | 'unknown';

interface FailureClassification {
  severity: 'critical' | 'high' | 'medium' | 'low';
  impact: ImpactAssessment;
  recoverability: 'automatic' | 'manual' | 'impossible';
  frequency: 'isolated' | 'intermittent' | 'systemic';
}
typescript
interface FailureAnalysis {
  analysisId: string;
  traceId: string;
  dagId: string;
  analyzedAt: Date;
  rootCause: RootCause;
  propagation: FailurePropagation;
  classification: FailureClassification;
  context: FailureContext;
  remediation: RemediationPlan;
}

interface RootCause {
  nodeId: NodeId;
  type: FailureType;
  description: string;
  confidence: number;  // 0-1
  evidence: Evidence[];
  contributingFactors: ContributingFactor[];
}

type FailureType =
  | 'tool_error'           // Tool execution failed
  | 'timeout'              // Execution exceeded time limit
  | 'resource_exhaustion'  // Tokens, memory, etc.
  | 'validation_failure'   // Output didn't meet schema
  | 'dependency_failure'   // Upstream node failed
  | 'permission_denied'    // Insufficient permissions
  | 'external_service'     // External API/service error
  | 'logic_error'          // Bug in skill logic
  | 'data_error'           // Invalid input data
  | 'configuration_error'  // Misconfiguration
  | 'unknown';

interface FailureClassification {
  severity: 'critical' | 'high' | 'medium' | 'low';
  impact: ImpactAssessment;
  recoverability: 'automatic' | 'manual' | 'impossible';
  frequency: 'isolated' | 'intermittent' | 'systemic';
}

Failure Detection

故障检测

typescript
interface FailedNode {
  nodeId: NodeId;
  spanId: SpanId;
  error: TaskError;
  context: NodeExecutionContext;
  timing: TimingInfo;
}

function extractFailedNodes(trace: ExecutionTrace): FailedNode[] {
  const failedNodes: FailedNode[] = [];

  for (const [spanId, span] of trace.spans) {
    if (span.status.code === 'ERROR') {
      failedNodes.push({
        nodeId: span.nodeId,
        spanId,
        error: parseError(span.status.message, span.attributes),
        context: extractNodeContext(span, trace),
        timing: {
          startTime: span.startTime,
          endTime: span.endTime,
          duration: span.duration,
        },
      });
    }
  }

  return failedNodes;
}

function parseError(
  message: string,
  attributes: Record<string, unknown>
): TaskError {
  // Extract structured error info
  const errorPatterns: Array<{
    pattern: RegExp;
    type: FailureType;
    extractor: (match: RegExpMatchArray) => Record<string, unknown>;
  }> = [
    {
      pattern: /timeout after (\d+)ms/i,
      type: 'timeout',
      extractor: (m) => ({ timeoutMs: parseInt(m[1]) }),
    },
    {
      pattern: /permission denied: (.+)/i,
      type: 'permission_denied',
      extractor: (m) => ({ deniedResource: m[1] }),
    },
    {
      pattern: /tool "(.+)" failed: (.+)/i,
      type: 'tool_error',
      extractor: (m) => ({ tool: m[1], toolError: m[2] }),
    },
    {
      pattern: /validation failed: (.+)/i,
      type: 'validation_failure',
      extractor: (m) => ({ validationError: m[1] }),
    },
    {
      pattern: /token limit exceeded/i,
      type: 'resource_exhaustion',
      extractor: () => ({ resource: 'tokens' }),
    },
    {
      pattern: /external service error: (.+)/i,
      type: 'external_service',
      extractor: (m) => ({ service: m[1] }),
    },
  ];

  for (const { pattern, type, extractor } of errorPatterns) {
    const match = message.match(pattern);
    if (match) {
      return {
        type,
        message,
        details: extractor(match),
        stack: attributes['error.stack'] as string | undefined,
      };
    }
  }

  return {
    type: 'unknown',
    message,
    details: {},
  };
}
typescript
interface FailedNode {
  nodeId: NodeId;
  spanId: SpanId;
  error: TaskError;
  context: NodeExecutionContext;
  timing: TimingInfo;
}

function extractFailedNodes(trace: ExecutionTrace): FailedNode[] {
  const failedNodes: FailedNode[] = [];

  for (const [spanId, span] of trace.spans) {
    if (span.status.code === 'ERROR') {
      failedNodes.push({
        nodeId: span.nodeId,
        spanId,
        error: parseError(span.status.message, span.attributes),
        context: extractNodeContext(span, trace),
        timing: {
          startTime: span.startTime,
          endTime: span.endTime,
          duration: span.duration,
        },
      });
    }
  }

  return failedNodes;
}

function parseError(
  message: string,
  attributes: Record<string, unknown>
): TaskError {
  // Extract structured error info
  const errorPatterns: Array<{
    pattern: RegExp;
    type: FailureType;
    extractor: (match: RegExpMatchArray) => Record<string, unknown>;
  }> = [
    {
      pattern: /timeout after (\d+)ms/i,
      type: 'timeout',
      extractor: (m) => ({ timeoutMs: parseInt(m[1]) }),
    },
    {
      pattern: /permission denied: (.+)/i,
      type: 'permission_denied',
      extractor: (m) => ({ deniedResource: m[1] }),
    },
    {
      pattern: /tool "(.+)" failed: (.+)/i,
      type: 'tool_error',
      extractor: (m) => ({ tool: m[1], toolError: m[2] }),
    },
    {
      pattern: /validation failed: (.+)/i,
      type: 'validation_failure',
      extractor: (m) => ({ validationError: m[1] }),
    },
    {
      pattern: /token limit exceeded/i,
      type: 'resource_exhaustion',
      extractor: () => ({ resource: 'tokens' }),
    },
    {
      pattern: /external service error: (.+)/i,
      type: 'external_service',
      extractor: (m) => ({ service: m[1] }),
    },
  ];

  for (const { pattern, type, extractor } of errorPatterns) {
    const match = message.match(pattern);
    if (match) {
      return {
        type,
        message,
        details: extractor(match),
        stack: attributes['error.stack'] as string | undefined,
      };
    }
  }

  return {
    type: 'unknown',
    message,
    details: {},
  };
}

Propagation Analysis

传播路径分析

typescript
interface FailurePropagation {
  originNode: NodeId;
  affectedNodes: NodeId[];
  propagationPath: PropagationStep[];
  cascadeDepth: number;
  containmentBoundary?: NodeId[];
}

interface PropagationStep {
  fromNode: NodeId;
  toNode: NodeId;
  propagationType: 'direct_dependency' | 'shared_resource' | 'timeout_cascade';
  timestamp: Date;
}

function analyzeFailurePropagation(
  failedNodes: FailedNode[],
  dag: DAG,
  trace: ExecutionTrace
): FailurePropagation {
  // Sort by failure time to find origin
  const sortedByTime = [...failedNodes].sort(
    (a, b) => a.timing.startTime.getTime() - b.timing.startTime.getTime()
  );

  const originNode = sortedByTime[0].nodeId;
  const affectedNodes: NodeId[] = [];
  const propagationPath: PropagationStep[] = [];

  // Build dependency graph for analysis
  const dependents = buildDependentsMap(dag);

  // Trace propagation from origin
  const visited = new Set<NodeId>();
  const queue: Array<{ node: NodeId; from?: NodeId }> = [{ node: originNode }];

  while (queue.length > 0) {
    const current = queue.shift()!;

    if (visited.has(current.node)) continue;
    visited.add(current.node);

    // Check if this node failed
    const failedNode = failedNodes.find(f => f.nodeId === current.node);
    if (failedNode && current.from) {
      affectedNodes.push(current.node);
      propagationPath.push({
        fromNode: current.from,
        toNode: current.node,
        propagationType: determinePropagationType(current.from, current.node, dag),
        timestamp: failedNode.timing.startTime,
      });
    }

    // Add dependents to queue
    const nodeDependent = dependents.get(current.node) ?? [];
    for (const dependent of nodeDependent) {
      queue.push({ node: dependent, from: current.node });
    }
  }

  return {
    originNode,
    affectedNodes,
    propagationPath,
    cascadeDepth: calculateCascadeDepth(propagationPath),
    containmentBoundary: findContainmentBoundary(dag, visited),
  };
}

function buildDependentsMap(dag: DAG): Map<NodeId, NodeId[]> {
  const dependents = new Map<NodeId, NodeId[]>();

  for (const [nodeId, node] of dag.nodes) {
    for (const dep of node.dependencies) {
      const existing = dependents.get(dep) ?? [];
      existing.push(nodeId);
      dependents.set(dep, existing);
    }
  }

  return dependents;
}

function determinePropagationType(
  from: NodeId,
  to: NodeId,
  dag: DAG
): PropagationStep['propagationType'] {
  const toNode = dag.nodes.get(to);
  if (toNode?.dependencies.includes(from)) {
    return 'direct_dependency';
  }
  return 'shared_resource';
}
typescript
interface FailurePropagation {
  originNode: NodeId;
  affectedNodes: NodeId[];
  propagationPath: PropagationStep[];
  cascadeDepth: number;
  containmentBoundary?: NodeId[];
}

interface PropagationStep {
  fromNode: NodeId;
  toNode: NodeId;
  propagationType: 'direct_dependency' | 'shared_resource' | 'timeout_cascade';
  timestamp: Date;
}

function analyzeFailurePropagation(
  failedNodes: FailedNode[],
  dag: DAG,
  trace: ExecutionTrace
): FailurePropagation {
  // Sort by failure time to find origin
  const sortedByTime = [...failedNodes].sort(
    (a, b) => a.timing.startTime.getTime() - b.timing.startTime.getTime()
  );

  const originNode = sortedByTime[0].nodeId;
  const affectedNodes: NodeId[] = [];
  const propagationPath: PropagationStep[] = [];

  // Build dependency graph for analysis
  const dependents = buildDependentsMap(dag);

  // Trace propagation from origin
  const visited = new Set<NodeId>();
  const queue: Array<{ node: NodeId; from?: NodeId }> = [{ node: originNode }];

  while (queue.length > 0) {
    const current = queue.shift()!;

    if (visited.has(current.node)) continue;
    visited.add(current.node);

    // Check if this node failed
    const failedNode = failedNodes.find(f => f.nodeId === current.node);
    if (failedNode && current.from) {
      affectedNodes.push(current.node);
      propagationPath.push({
        fromNode: current.from,
        toNode: current.node,
        propagationType: determinePropagationType(current.from, current.node, dag),
        timestamp: failedNode.timing.startTime,
      });
    }

    // Add dependents to queue
    const nodeDependent = dependents.get(current.node) ?? [];
    for (const dependent of nodeDependent) {
      queue.push({ node: dependent, from: current.node });
    }
  }

  return {
    originNode,
    affectedNodes,
    propagationPath,
    cascadeDepth: calculateCascadeDepth(propagationPath),
    containmentBoundary: findContainmentBoundary(dag, visited),
  };
}

function buildDependentsMap(dag: DAG): Map<NodeId, NodeId[]> {
  const dependents = new Map<NodeId, NodeId[]>();

  for (const [nodeId, node] of dag.nodes) {
    for (const dep of node.dependencies) {
      const existing = dependents.get(dep) ?? [];
      existing.push(nodeId);
      dependents.set(dep, existing);
    }
  }

  return dependents;
}

function determinePropagationType(
  from: NodeId,
  to: NodeId,
  dag: DAG
): PropagationStep['propagationType'] {
  const toNode = dag.nodes.get(to);
  if (toNode?.dependencies.includes(from)) {
    return 'direct_dependency';
  }
  return 'shared_resource';
}

Root Cause Analysis

根因分析

typescript
interface Evidence {
  type: 'error_message' | 'timing' | 'resource_usage' | 'pattern_match';
  source: string;
  observation: string;
  weight: number;  // How strongly this supports the conclusion
}

interface ContributingFactor {
  factor: string;
  contribution: number;  // 0-1
  evidence: Evidence[];
}

function identifyRootCause(
  propagation: FailurePropagation,
  failedNodes: FailedNode[],
  trace: ExecutionTrace,
  history?: FailureHistory
): RootCause {
  const originFailure = failedNodes.find(
    f => f.nodeId === propagation.originNode
  )!;

  const evidence: Evidence[] = [];
  const contributingFactors: ContributingFactor[] = [];

  // Evidence from error message
  evidence.push({
    type: 'error_message',
    source: 'primary_error',
    observation: originFailure.error.message,
    weight: 0.9,
  });

  // Evidence from timing
  if (originFailure.timing.duration && originFailure.timing.duration > 30000) {
    evidence.push({
      type: 'timing',
      source: 'execution_duration',
      observation: `Node ran for ${originFailure.timing.duration}ms before failing`,
      weight: 0.6,
    });
  }

  // Evidence from resource usage
  const resourceUsage = extractResourceUsage(originFailure.context);
  if (resourceUsage.tokensUsed > resourceUsage.tokenLimit * 0.9) {
    evidence.push({
      type: 'resource_usage',
      source: 'token_usage',
      observation: `Used ${resourceUsage.tokensUsed}/${resourceUsage.tokenLimit} tokens (${((resourceUsage.tokensUsed / resourceUsage.tokenLimit) * 100).toFixed(0)}%)`,
      weight: 0.7,
    });
  }

  // Check for pattern matches from history
  if (history) {
    const matchingPatterns = findMatchingPatterns(originFailure, history);
    for (const pattern of matchingPatterns) {
      evidence.push({
        type: 'pattern_match',
        source: 'failure_history',
        observation: `Matches known pattern: ${pattern.name} (seen ${pattern.occurrences} times)`,
        weight: 0.8,
      });
    }
  }

  // Analyze contributing factors
  contributingFactors.push(...analyzeContributingFactors(
    originFailure,
    trace,
    evidence
  ));

  // Calculate confidence based on evidence
  const confidence = calculateConfidence(evidence, contributingFactors);

  return {
    nodeId: propagation.originNode,
    type: originFailure.error.type,
    description: generateRootCauseDescription(originFailure, evidence),
    confidence,
    evidence,
    contributingFactors,
  };
}

function analyzeContributingFactors(
  failure: FailedNode,
  trace: ExecutionTrace,
  evidence: Evidence[]
): ContributingFactor[] {
  const factors: ContributingFactor[] = [];

  // Check for high load (many concurrent nodes)
  const concurrentNodes = countConcurrentNodes(trace, failure.timing.startTime);
  if (concurrentNodes > 5) {
    factors.push({
      factor: 'High concurrent load',
      contribution: Math.min(0.3, concurrentNodes * 0.05),
      evidence: [{
        type: 'timing',
        source: 'concurrency_analysis',
        observation: `${concurrentNodes} nodes executing concurrently`,
        weight: 0.5,
      }],
    });
  }

  // Check for slow dependencies
  const slowDeps = findSlowDependencies(trace, failure.nodeId);
  if (slowDeps.length > 0) {
    factors.push({
      factor: 'Slow upstream dependencies',
      contribution: 0.2,
      evidence: slowDeps.map(dep => ({
        type: 'timing' as const,
        source: 'dependency_analysis',
        observation: `Dependency ${dep.nodeId} took ${dep.duration}ms`,
        weight: 0.4,
      })),
    });
  }

  return factors;
}

function calculateConfidence(
  evidence: Evidence[],
  factors: ContributingFactor[]
): number {
  // Weighted average of evidence weights
  const evidenceTotal = evidence.reduce((sum, e) => sum + e.weight, 0);
  const evidenceAvg = evidenceTotal / Math.max(1, evidence.length);

  // Reduce confidence if many contributing factors
  const factorPenalty = Math.min(0.2, factors.length * 0.05);

  // More evidence = more confidence
  const evidenceBonus = Math.min(0.1, evidence.length * 0.02);

  return Math.max(0.3, Math.min(0.95, evidenceAvg + evidenceBonus - factorPenalty));
}
typescript
interface Evidence {
  type: 'error_message' | 'timing' | 'resource_usage' | 'pattern_match';
  source: string;
  observation: string;
  weight: number;  // How strongly this supports the conclusion
}

interface ContributingFactor {
  factor: string;
  contribution: number;  // 0-1
  evidence: Evidence[];
}

function identifyRootCause(
  propagation: FailurePropagation,
  failedNodes: FailedNode[],
  trace: ExecutionTrace,
  history?: FailureHistory
): RootCause {
  const originFailure = failedNodes.find(
    f => f.nodeId === propagation.originNode
  )!;

  const evidence: Evidence[] = [];
  const contributingFactors: ContributingFactor[] = [];

  // Evidence from error message
  evidence.push({
    type: 'error_message',
    source: 'primary_error',
    observation: originFailure.error.message,
    weight: 0.9,
  });

  // Evidence from timing
  if (originFailure.timing.duration && originFailure.timing.duration > 30000) {
    evidence.push({
      type: 'timing',
      source: 'execution_duration',
      observation: `Node ran for ${originFailure.timing.duration}ms before failing`,
      weight: 0.6,
    });
  }

  // Evidence from resource usage
  const resourceUsage = extractResourceUsage(originFailure.context);
  if (resourceUsage.tokensUsed > resourceUsage.tokenLimit * 0.9) {
    evidence.push({
      type: 'resource_usage',
      source: 'token_usage',
      observation: `Used ${resourceUsage.tokensUsed}/${resourceUsage.tokenLimit} tokens (${((resourceUsage.tokensUsed / resourceUsage.tokenLimit) * 100).toFixed(0)}%)`,
      weight: 0.7,
    });
  }

  // Check for pattern matches from history
  if (history) {
    const matchingPatterns = findMatchingPatterns(originFailure, history);
    for (const pattern of matchingPatterns) {
      evidence.push({
        type: 'pattern_match',
        source: 'failure_history',
        observation: `Matches known pattern: ${pattern.name} (seen ${pattern.occurrences} times)`,
        weight: 0.8,
      });
    }
  }

  // Analyze contributing factors
  contributingFactors.push(...analyzeContributingFactors(
    originFailure,
    trace,
    evidence
  ));

  // Calculate confidence based on evidence
  const confidence = calculateConfidence(evidence, contributingFactors);

  return {
    nodeId: propagation.originNode,
    type: originFailure.error.type,
    description: generateRootCauseDescription(originFailure, evidence),
    confidence,
    evidence,
    contributingFactors,
  };
}

function analyzeContributingFactors(
  failure: FailedNode,
  trace: ExecutionTrace,
  evidence: Evidence[]
): ContributingFactor[] {
  const factors: ContributingFactor[] = [];

  // Check for high load (many concurrent nodes)
  const concurrentNodes = countConcurrentNodes(trace, failure.timing.startTime);
  if (concurrentNodes > 5) {
    factors.push({
      factor: 'High concurrent load',
      contribution: Math.min(0.3, concurrentNodes * 0.05),
      evidence: [{
        type: 'timing',
        source: 'concurrency_analysis',
        observation: `${concurrentNodes} nodes executing concurrently`,
        weight: 0.5,
      }],
    });
  }

  // Check for slow dependencies
  const slowDeps = findSlowDependencies(trace, failure.nodeId);
  if (slowDeps.length > 0) {
    factors.push({
      factor: 'Slow upstream dependencies',
      contribution: 0.2,
      evidence: slowDeps.map(dep => ({
        type: 'timing' as const,
        source: 'dependency_analysis',
        observation: `Dependency ${dep.nodeId} took ${dep.duration}ms`,
        weight: 0.4,
      })),
    });
  }

  return factors;
}

function calculateConfidence(
  evidence: Evidence[],
  factors: ContributingFactor[]
): number {
  // Weighted average of evidence weights
  const evidenceTotal = evidence.reduce((sum, e) => sum + e.weight, 0);
  const evidenceAvg = evidenceTotal / Math.max(1, evidence.length);

  // Reduce confidence if many contributing factors
  const factorPenalty = Math.min(0.2, factors.length * 0.05);

  // More evidence = more confidence
  const evidenceBonus = Math.min(0.1, evidence.length * 0.02);

  return Math.max(0.3, Math.min(0.95, evidenceAvg + evidenceBonus - factorPenalty));
}

Remediation Planning

修复方案制定

typescript
interface RemediationPlan {
  immediateActions: RemediationAction[];
  preventiveActions: RemediationAction[];
  retryStrategy?: RetryStrategy;
  escalation?: EscalationPlan;
}

interface RemediationAction {
  action: string;
  priority: 'critical' | 'high' | 'medium' | 'low';
  effort: 'trivial' | 'minor' | 'moderate' | 'major';
  expectedImpact: string;
  implementation?: string;
}

interface RetryStrategy {
  recommended: boolean;
  strategy: 'immediate' | 'backoff' | 'skip' | 'manual';
  maxRetries: number;
  backoffMs?: number;
  conditions?: string[];
}

function generateRemediationPlan(
  rootCause: RootCause,
  classification: FailureClassification,
  propagation: FailurePropagation
): RemediationPlan {
  const plan: RemediationPlan = {
    immediateActions: [],
    preventiveActions: [],
  };

  // Generate actions based on failure type
  switch (rootCause.type) {
    case 'timeout':
      plan.immediateActions.push({
        action: 'Increase timeout for affected node',
        priority: 'high',
        effort: 'trivial',
        expectedImpact: 'Allows operation to complete',
        implementation: 'Update node config: timeoutMs: <current * 2>',
      });
      plan.preventiveActions.push({
        action: 'Add progress monitoring to detect slow execution',
        priority: 'medium',
        effort: 'moderate',
        expectedImpact: 'Early warning for timeout-prone operations',
      });
      plan.retryStrategy = {
        recommended: true,
        strategy: 'backoff',
        maxRetries: 2,
        backoffMs: 5000,
        conditions: ['No permanent resource exhaustion'],
      };
      break;

    case 'tool_error':
      plan.immediateActions.push({
        action: 'Check tool availability and permissions',
        priority: 'critical',
        effort: 'trivial',
        expectedImpact: 'Confirms tool is accessible',
      });
      plan.immediateActions.push({
        action: 'Verify tool input parameters',
        priority: 'high',
        effort: 'minor',
        expectedImpact: 'Ensures valid inputs',
      });
      plan.retryStrategy = {
        recommended: false,
        strategy: 'manual',
        maxRetries: 0,
        conditions: ['Tool error must be fixed first'],
      };
      break;

    case 'resource_exhaustion':
      plan.immediateActions.push({
        action: 'Reduce input size or complexity',
        priority: 'high',
        effort: 'moderate',
        expectedImpact: 'Reduces resource requirements',
      });
      plan.preventiveActions.push({
        action: 'Implement chunking for large inputs',
        priority: 'high',
        effort: 'major',
        expectedImpact: 'Prevents future exhaustion',
      });
      plan.retryStrategy = {
        recommended: false,
        strategy: 'skip',
        maxRetries: 0,
        conditions: ['Must reduce resource usage first'],
      };
      break;

    case 'validation_failure':
      plan.immediateActions.push({
        action: 'Review validation rules against actual output',
        priority: 'high',
        effort: 'minor',
        expectedImpact: 'Identifies schema mismatch',
      });
      plan.retryStrategy = {
        recommended: true,
        strategy: 'immediate',
        maxRetries: 2,
        conditions: ['Add validation guidance to prompt'],
      };
      break;

    case 'permission_denied':
      plan.immediateActions.push({
        action: 'Review and update permission matrix',
        priority: 'critical',
        effort: 'minor',
        expectedImpact: 'Grants necessary permissions',
      });
      plan.retryStrategy = {
        recommended: false,
        strategy: 'manual',
        maxRetries: 0,
        conditions: ['Must fix permissions first'],
      };
      break;

    case 'external_service':
      plan.immediateActions.push({
        action: 'Check external service status',
        priority: 'high',
        effort: 'trivial',
        expectedImpact: 'Confirms if service is available',
      });
      plan.retryStrategy = {
        recommended: true,
        strategy: 'backoff',
        maxRetries: 3,
        backoffMs: 10000,
        conditions: ['Service may recover'],
      };
      break;

    default:
      plan.immediateActions.push({
        action: 'Manual investigation required',
        priority: 'high',
        effort: 'moderate',
        expectedImpact: 'Understand failure cause',
      });
  }

  // Add escalation if not recoverable
  if (classification.recoverability === 'impossible') {
    plan.escalation = {
      required: true,
      reason: 'Failure is not automatically recoverable',
      suggestedOwner: 'human',
      context: summarizeForEscalation(rootCause, propagation),
    };
  }

  return plan;
}
typescript
interface RemediationPlan {
  immediateActions: RemediationAction[];
  preventiveActions: RemediationAction[];
  retryStrategy?: RetryStrategy;
  escalation?: EscalationPlan;
}

interface RemediationAction {
  action: string;
  priority: 'critical' | 'high' | 'medium' | 'low';
  effort: 'trivial' | 'minor' | 'moderate' | 'major';
  expectedImpact: string;
  implementation?: string;
}

interface RetryStrategy {
  recommended: boolean;
  strategy: 'immediate' | 'backoff' | 'skip' | 'manual';
  maxRetries: number;
  backoffMs?: number;
  conditions?: string[];
}

function generateRemediationPlan(
  rootCause: RootCause,
  classification: FailureClassification,
  propagation: FailurePropagation
): RemediationPlan {
  const plan: RemediationPlan = {
    immediateActions: [],
    preventiveActions: [],
  };

  // Generate actions based on failure type
  switch (rootCause.type) {
    case 'timeout':
      plan.immediateActions.push({
        action: 'Increase timeout for affected node',
        priority: 'high',
        effort: 'trivial',
        expectedImpact: 'Allows operation to complete',
        implementation: 'Update node config: timeoutMs: <current * 2>',
      });
      plan.preventiveActions.push({
        action: 'Add progress monitoring to detect slow execution',
        priority: 'medium',
        effort: 'moderate',
        expectedImpact: 'Early warning for timeout-prone operations',
      });
      plan.retryStrategy = {
        recommended: true,
        strategy: 'backoff',
        maxRetries: 2,
        backoffMs: 5000,
        conditions: ['No permanent resource exhaustion'],
      };
      break;

    case 'tool_error':
      plan.immediateActions.push({
        action: 'Check tool availability and permissions',
        priority: 'critical',
        effort: 'trivial',
        expectedImpact: 'Confirms tool is accessible',
      });
      plan.immediateActions.push({
        action: 'Verify tool input parameters',
        priority: 'high',
        effort: 'minor',
        expectedImpact: 'Ensures valid inputs',
      });
      plan.retryStrategy = {
        recommended: false,
        strategy: 'manual',
        maxRetries: 0,
        conditions: ['Tool error must be fixed first'],
      };
      break;

    case 'resource_exhaustion':
      plan.immediateActions.push({
        action: 'Reduce input size or complexity',
        priority: 'high',
        effort: 'moderate',
        expectedImpact: 'Reduces resource requirements',
      });
      plan.preventiveActions.push({
        action: 'Implement chunking for large inputs',
        priority: 'high',
        effort: 'major',
        expectedImpact: 'Prevents future exhaustion',
      });
      plan.retryStrategy = {
        recommended: false,
        strategy: 'skip',
        maxRetries: 0,
        conditions: ['Must reduce resource usage first'],
      };
      break;

    case 'validation_failure':
      plan.immediateActions.push({
        action: 'Review validation rules against actual output',
        priority: 'high',
        effort: 'minor',
        expectedImpact: 'Identifies schema mismatch',
      });
      plan.retryStrategy = {
        recommended: true,
        strategy: 'immediate',
        maxRetries: 2,
        conditions: ['Add validation guidance to prompt'],
      };
      break;

    case 'permission_denied':
      plan.immediateActions.push({
        action: 'Review and update permission matrix',
        priority: 'critical',
        effort: 'minor',
        expectedImpact: 'Grants necessary permissions',
      });
      plan.retryStrategy = {
        recommended: false,
        strategy: 'manual',
        maxRetries: 0,
        conditions: ['Must fix permissions first'],
      };
      break;

    case 'external_service':
      plan.immediateActions.push({
        action: 'Check external service status',
        priority: 'high',
        effort: 'trivial',
        expectedImpact: 'Confirms if service is available',
      });
      plan.retryStrategy = {
        recommended: true,
        strategy: 'backoff',
        maxRetries: 3,
        backoffMs: 10000,
        conditions: ['Service may recover'],
      };
      break;

    default:
      plan.immediateActions.push({
        action: 'Manual investigation required',
        priority: 'high',
        effort: 'moderate',
        expectedImpact: 'Understand failure cause',
      });
  }

  // Add escalation if not recoverable
  if (classification.recoverability === 'impossible') {
    plan.escalation = {
      required: true,
      reason: 'Failure is not automatically recoverable',
      suggestedOwner: 'human',
      context: summarizeForEscalation(rootCause, propagation),
    };
  }

  return plan;
}

Failure Report

故障报告

yaml
failureAnalysis:
  analysisId: "fa-7c3a2b1d-4e5f-6a7b-8c9d"
  traceId: "tr-8f4a2b1c-3d5e-6f7a-8b9c"
  dagId: "code-review-dag"
  analyzedAt: "2024-01-15T10:35:00Z"

  rootCause:
    nodeId: check-security
    type: external_service
    description: "Security scanning service returned 503 (Service Unavailable)"
    confidence: 0.87
    evidence:
      - type: error_message
        source: primary_error
        observation: "external service error: Security API returned 503"
        weight: 0.9
      - type: pattern_match
        source: failure_history
        observation: "Matches known pattern: API rate limit (seen 3 times)"
        weight: 0.8
    contributingFactors:
      - factor: "High concurrent load"
        contribution: 0.15
        evidence:
          - type: timing
            source: concurrency_analysis
            observation: "7 nodes executing concurrently"

  propagation:
    originNode: check-security
    affectedNodes:
      - aggregate-results
      - generate-report
    propagationPath:
      - fromNode: check-security
        toNode: aggregate-results
        propagationType: direct_dependency
        timestamp: "2024-01-15T10:34:45Z"
      - fromNode: aggregate-results
        toNode: generate-report
        propagationType: direct_dependency
        timestamp: "2024-01-15T10:34:46Z"
    cascadeDepth: 2
    containmentBoundary:
      - generate-report

  classification:
    severity: high
    impact:
      nodesAffected: 3
      tasksBlocked: 1
      estimatedDelay: 60000
    recoverability: automatic
    frequency: intermittent

  remediation:
    immediateActions:
      - action: "Check external service status"
        priority: high
        effort: trivial
        expectedImpact: "Confirms if service is available"

    preventiveActions:
      - action: "Add circuit breaker for external service calls"
        priority: medium
        effort: moderate
        expectedImpact: "Graceful degradation on service failures"
      - action: "Implement caching for security scan results"
        priority: low
        effort: major
        expectedImpact: "Reduces dependency on external service"

    retryStrategy:
      recommended: true
      strategy: backoff
      maxRetries: 3
      backoffMs: 10000
      conditions:
        - "Service may recover"
        - "No rate limit reached"

  summary: |
    The check-security node failed due to an external service (Security API)
    returning 503. This is an intermittent issue that has occurred 3 times
    previously. The failure cascaded to 2 downstream nodes. Recommended
    action is to retry with exponential backoff.
yaml
failureAnalysis:
  analysisId: "fa-7c3a2b1d-4e5f-6a7b-8c9d"
  traceId: "tr-8f4a2b1c-3d5e-6f7a-8b9c"
  dagId: "code-review-dag"
  analyzedAt: "2024-01-15T10:35:00Z"

  rootCause:
    nodeId: check-security
    type: external_service
    description: "Security scanning service returned 503 (Service Unavailable)"
    confidence: 0.87
    evidence:
      - type: error_message
        source: primary_error
        observation: "external service error: Security API returned 503"
        weight: 0.9
      - type: pattern_match
        source: failure_history
        observation: "Matches known pattern: API rate limit (seen 3 times)"
        weight: 0.8
    contributingFactors:
      - factor: "High concurrent load"
        contribution: 0.15
        evidence:
          - type: timing
            source: concurrency_analysis
            observation: "7 nodes executing concurrently"

  propagation:
    originNode: check-security
    affectedNodes:
      - aggregate-results
      - generate-report
    propagationPath:
      - fromNode: check-security
        toNode: aggregate-results
        propagationType: direct_dependency
        timestamp: "2024-01-15T10:34:45Z"
      - fromNode: aggregate-results
        toNode: generate-report
        propagationType: direct_dependency
        timestamp: "2024-01-15T10:34:46Z"
    cascadeDepth: 2
    containmentBoundary:
      - generate-report

  classification:
    severity: high
    impact:
      nodesAffected: 3
      tasksBlocked: 1
      estimatedDelay: 60000
    recoverability: automatic
    frequency: intermittent

  remediation:
    immediateActions:
      - action: "Check external service status"
        priority: high
        effort: trivial
        expectedImpact: "Confirms if service is available"

    preventiveActions:
      - action: "Add circuit breaker for external service calls"
        priority: medium
        effort: moderate
        expectedImpact: "Graceful degradation on service failures"
      - action: "Implement caching for security scan results"
        priority: low
        effort: major
        expectedImpact: "Reduces dependency on external service"

    retryStrategy:
      recommended: true
      strategy: backoff
      maxRetries: 3
      backoffMs: 10000
      conditions:
        - "Service may recover"
        - "No rate limit reached"

  summary: |
    The check-security node failed due to an external service (Security API)
    returning 503. This is an intermittent issue that has occurred 3 times
    previously. The failure cascaded to 2 downstream nodes. Recommended
    action is to retry with exponential backoff.

Integration Points

集成点

  • Input: Execution traces from
    dag-execution-tracer
  • Correlation: Performance data from
    dag-performance-profiler
  • Output: Failure patterns to
    dag-pattern-learner
  • Recovery: Remediation plans to
    dag-dynamic-replanner
  • 输入:来自
    dag-execution-tracer
    的执行追踪数据
  • 关联:来自
    dag-performance-profiler
    的性能数据
  • 输出:将故障模式输出至
    dag-pattern-learner
  • 恢复:将修复方案推送至
    dag-dynamic-replanner

Best Practices

最佳实践

  1. Trace from Origin: Always identify the first failure
  2. Gather Evidence: Multiple sources increase confidence
  3. Consider History: Check for recurring patterns
  4. Actionable Remediation: Make fixes specific and implementable
  5. Know When to Escalate: Some failures need human intervention

Understand failures. Find root causes. Enable recovery.
  1. 从源头追踪:始终先识别首个发生的故障
  2. 收集多方证据:多来源证据可提升分析可信度
  3. 参考历史记录:检查是否存在重复出现的故障模式
  4. 生成可执行修复方案:确保修复措施具体且可落地
  5. 明确升级时机:部分故障需要人工介入处理

理解故障,定位根因,实现恢复。