fault-tree-analysis

Fault Tree Analysis (FTA)

Conduct systematic Fault Tree Analysis using a structured, Q&A-based approach with Boolean logic gates, minimal cut set identification, and optional probability calculations.

Overview

Fault Tree Analysis is a top-down, deductive failure analysis method that maps how combinations of lower-level events (basic events) lead to an undesired system-level event (top event). It uses Boolean logic gates (AND, OR) to represent the relationships between events.
Key Principle: One fault tree analyzes one specific undesired event. Start at the top (what failed?) and work down (what caused it?).
Analysis Types:
  • Qualitative: Identify failure pathways, minimal cut sets, single points of failure
  • Quantitative: Calculate failure probabilities using component failure data

Workflow

Phase 1: System Definition & Scope

Collect from user:
  1. What system or process is being analyzed?
  2. What are the system boundaries (what's in scope vs. out of scope)?
  3. What are the operating conditions and assumptions?
  4. What documentation exists (schematics, P&IDs, operating procedures)?
  5. What is the purpose of this analysis (design review, incident investigation, safety case)?
Outputs:
  • System description with boundaries
  • Operating mode(s) under analysis
  • List of assumptions and exclusions

Phase 2: Top Event Definition

Collect from user:
  1. What is the single undesired outcome to analyze?
  2. How is this event defined (what state constitutes "failure")?
  3. What is the severity/criticality of this event?
  4. What is the mission time or exposure period?
Quality Gate - Top Event Must Be:
  • Single, specific, unambiguous event
  • Clearly defined failure state (not vague)
  • At appropriate system level (not too high or too low)
  • Observable or detectable
Good Example: "Pump fails to deliver required flow rate (>100 GPM) during normal operation"
Poor Example: "System doesn't work" (too vague)

Phase 3: Fault Tree Construction

Build the tree iteratively from top to bottom:
For each event (starting with top event):
  1. Identify immediate causes: "What events could directly cause this?"
  2. Determine gate type:
    • OR gate: ANY one cause is sufficient (independent causes)
    • AND gate: ALL causes required simultaneously (redundancy/barriers)
  3. Classify event type:
    • Intermediate event (rectangle): Requires further development
    • Basic event (circle): Component failure, terminal point
    • Undeveloped event (diamond): Insufficient data or out of scope
    • House event (house symbol): Normal occurrence, switch on/off
    • External event (house): Environmental or expected condition
  4. Continue developing until every branch terminates in a basic, undeveloped, or house event
Stopping Criteria for Branch Development:
  • Component-level failure reached (basic event)
  • Out of scope (undeveloped event)
  • Normal expected condition (house event)
  • Insufficient information available
Critical Rules:
  • Each event must have clear, unambiguous description
  • No redundant events (the same failure modeled as separate events in multiple places; reuse one event identifier instead)
  • No "miracles" (events that cannot physically occur)
  • Consistent naming conventions throughout
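The construction steps above can be sketched as a small data structure. This is an illustrative sketch only: the class names (`BasicEvent`, `Gate`), the `occurs` helper, and the pump example are hypothetical and are not taken from the skill's scripts.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class BasicEvent:
    """Terminal event (circle): a component-level failure."""
    name: str


@dataclass
class Gate:
    """Intermediate event (rectangle) fed by an AND or OR gate."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]] = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in ("AND", "OR"):
            raise ValueError(f"unknown gate type: {self.kind}")


def occurs(node, failed):
    """Return True if `node` occurs given the set of failed basic-event names."""
    if isinstance(node, BasicEvent):
        return node.name in failed
    results = [occurs(child, failed) for child in node.inputs]
    return all(results) if node.kind == "AND" else any(results)


# Hypothetical example: flow is lost if the shared power supply fails,
# or if both redundant pumps fail.
pump_a = BasicEvent("PUMP_A_FAILS")
pump_b = BasicEvent("PUMP_B_FAILS")
power = BasicEvent("POWER_SUPPLY_FAILS")
both_pumps = Gate("Both pumps fail", "AND", [pump_a, pump_b])
top = Gate("No flow delivered", "OR", [both_pumps, power])

print(occurs(top, {"PUMP_A_FAILS"}))                  # False
print(occurs(top, {"PUMP_A_FAILS", "PUMP_B_FAILS"}))  # True
print(occurs(top, {"POWER_SUPPLY_FAILS"}))            # True
```

Reusing one `BasicEvent` object wherever the same failure appears keeps naming consistent and avoids double-counting later.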

Phase 4: Qualitative Analysis

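Identify Minimal Cut Sets (MCS): A minimal cut set is a smallest combination of basic events that, occurring together, causes the top event; if any event is removed from the set, the remaining events no longer cause the top event on their own.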
  • Order 1 MCS (single events): Most critical - single points of failure
  • Order 2 MCS (pairs): Critical for redundant systems
  • Higher order MCS: Less critical, require multiple failures
Analysis Tasks:
  1. List all minimal cut sets by order
  2. Identify single points of failure (Order 1)
  3. Assess common cause failure potential
  4. Evaluate effectiveness of redundancy
Run python scripts/calculate_fta.py --qualitative for automated MCS extraction.
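To make the idea of minimal cut sets concrete, here is a minimal sketch of a top-down expansion with Boolean absorption, in the spirit of classic MOCUS-style algorithms. It is not a description of what scripts/calculate_fta.py does internally; the dictionary layout and the gate and event names are assumptions made for illustration.

```python
from itertools import product


def cut_sets(node, gates):
    """Return the cut sets of `node` as a set of frozensets of basic-event names."""
    if node not in gates:  # anything not defined as a gate is treated as a basic event
        return {frozenset([node])}
    kind, inputs = gates[node]
    child_sets = [cut_sets(child, gates) for child in inputs]
    if kind == "OR":  # any input suffices: take the union of the children's cut sets
        return set().union(*child_sets)
    # AND: every input is needed, so merge one cut set from each child
    return {frozenset().union(*combo) for combo in product(*child_sets)}


def minimize(sets):
    """Drop every cut set that strictly contains another one (absorption)."""
    return {s for s in sets if not any(other < s for other in sets)}


def minimal_cut_sets(top, gates):
    """Return minimal cut sets sorted by order (size), then alphabetically."""
    mcs = minimize(cut_sets(top, gates))
    return sorted((sorted(s) for s in mcs), key=lambda s: (len(s), s))


# Hypothetical tree: VALVE is shared by both pump trains.
gates = {
    "TOP": ("OR", ["G1", "POWER"]),
    "G1": ("AND", ["G2", "G3"]),
    "G2": ("OR", ["PUMP_A", "VALVE"]),
    "G3": ("OR", ["PUMP_B", "VALVE"]),
}

print(minimal_cut_sets("TOP", gates))
# [['POWER'], ['VALVE'], ['PUMP_A', 'PUMP_B']]
```

The shared VALVE event shows why absorption matters: the raw expansion also yields {PUMP_A, VALVE} and {PUMP_B, VALVE}, but both are absorbed by the Order 1 set {VALVE}, which is a single point of failure.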

Phase 5: Quantitative Analysis (Optional)

If failure probability data is available:
Collect failure data for each basic event:
  • Failure rate (λ) or probability (P)
  • Mission time or exposure period
  • Data source (field data, handbook, estimate)
  • Confidence level
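When the data source gives a failure rate rather than a probability, the two are commonly linked through the mission time. The sketch below assumes a constant failure rate (exponential model) and a non-repairable component; the function name and the example numbers are illustrative only.

```python
import math


def failure_probability(failure_rate_per_hour, mission_time_hours):
    """P(failure within mission time) = 1 - exp(-lambda * t), constant-rate assumption."""
    # expm1 keeps precision when lambda * t is very small
    return -math.expm1(-failure_rate_per_hour * mission_time_hours)


p = failure_probability(1e-5, 1000)  # lambda = 1e-5 per hour, 1000-hour mission
print(f"{p:.5f}")  # 0.00995
```

For small λt the result is close to λ × t (here 0.01), which is why that shortcut is often used for rough estimates.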
Calculations:
  • OR gate: P(output) = P(A) + P(B) - P(A)×P(B) (for independent events) ≈ P(A) + P(B) for small probabilities
  • AND gate: P(output) = P(A) × P(B) (for independent events)
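The two gate formulas generalize to any number of independent inputs. A minimal sketch (function names are illustrative, and independence of the inputs is assumed throughout):

```python
from math import prod


def and_gate(probs):
    """All inputs must occur: product of the input probabilities."""
    return prod(probs)


def or_gate(probs):
    """At least one input occurs: 1 minus the probability that none occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


inputs = [0.01, 0.02]
print(f"{and_gate(inputs):.4f}")  # 0.0002
print(f"{or_gate(inputs):.4f}")   # 0.0298
print(f"{sum(inputs):.4f}")       # 0.0300 (small-probability approximation)
```

The gap between 0.0298 and 0.0300 is the P(A)×P(B) term; it grows as the input probabilities grow, so the simple sum should only be used when they are small.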
Calculate:
  1. Probability of each minimal cut set
  2. Top event probability (the sum of MCS probabilities is a rare-event upper bound; correct it with inclusion-exclusion when cut sets share events or probabilities are not small)
  3. Importance measures (Fussell-Vesely, Birnbaum)
Run python scripts/calculate_fta.py --quantitative with probability data.
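The three calculation steps can be sketched as follows, assuming independent basic events. This is an illustration, not the behavior of scripts/calculate_fta.py; the event names and probabilities are made up, and the inclusion-exclusion loop grows exponentially with the number of cut sets, so it only suits small trees.

```python
from itertools import combinations
from math import prod


def top_probability(mcs, p):
    """Exact top-event probability by inclusion-exclusion over the minimal cut sets."""
    total = 0.0
    for k in range(1, len(mcs) + 1):
        sign = 1.0 if k % 2 else -1.0
        for combo in combinations(mcs, k):
            events = set().union(*combo)  # shared events are counted once
            total += sign * prod(p[e] for e in events)
    return total


def rare_event_sum(mcs, p):
    """Rare-event approximation: plain sum of the cut set probabilities."""
    return sum(prod(p[e] for e in cs) for cs in mcs)


def birnbaum(mcs, p, event):
    """P(top | event certain) minus P(top | event impossible)."""
    return top_probability(mcs, {**p, event: 1.0}) - top_probability(mcs, {**p, event: 0.0})


def fussell_vesely(mcs, p, event):
    """Share of top-event probability from cut sets that contain `event`."""
    containing = [cs for cs in mcs if event in cs]
    return top_probability(containing, p) / top_probability(mcs, p)


# Hypothetical data
mcs = [{"POWER"}, {"VALVE"}, {"PUMP_A", "PUMP_B"}]
p = {"POWER": 0.001, "VALVE": 0.002, "PUMP_A": 0.01, "PUMP_B": 0.01}

print(f"{top_probability(mcs, p):.6f}")  # 0.003098
print(f"{rare_event_sum(mcs, p):.6f}")   # 0.003100
for event in p:
    print(event, round(birnbaum(mcs, p, event), 6), round(fussell_vesely(mcs, p, event), 4))
```

In this example the Order 1 cut sets dominate both importance rankings, which matches the qualitative finding that single points of failure deserve attention first.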

Phase 6: Common Cause Failure Analysis

Identify potential common causes across basic events:
  • Environmental (temperature, humidity, EMI)
  • Manufacturing (batch defects, supplier issues)
  • Maintenance (common procedures, same personnel)
  • Design (same components, shared software)
  • Human error (operator mistakes, procedure gaps)
For AND gates (redundant systems): Common cause failures can defeat redundancy. Apply the beta-factor model if quantifying:
  • P(CCF) = β × P(total component failure), leaving (1 - β) × P(total component failure) as the independent part
  • Typical β values: 1-10% depending on diversity measures
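A short numerical sketch shows how strongly a common cause can erode redundancy. It assumes the common beta-factor formulation in which β is the fraction of a component's total failure probability attributed to common cause, applied to a hypothetical pair of identical redundant components; the function name and numbers are illustrative.

```python
def redundant_pair_failure(q_total, beta):
    """Failure probability of a 1-out-of-2 redundant pair under the beta-factor model."""
    q_independent = (1.0 - beta) * q_total  # part that fails each component separately
    q_common = beta * q_total               # part that fails both components at once
    # The pair fails if both fail independently OR the common cause occurs.
    return 1.0 - (1.0 - q_independent ** 2) * (1.0 - q_common)


q = 0.01  # hypothetical total failure probability of one component
print(f"{redundant_pair_failure(q, 0.0):.2e}")   # 1.00e-04
print(f"{redundant_pair_failure(q, 0.05):.2e}")  # 5.90e-04
```

With β = 5%, the common cause term is roughly five times larger than the independent double failure, so ignoring it would overstate the benefit of redundancy.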

Phase 7: Documentation & Reporting

Generate professional outputs:
  • python scripts/generate_diagram.py (SVG fault tree diagram)
  • python scripts/generate_report.py (Comprehensive HTML report)

Symbols Reference

| Symbol | Name | Description |
|---|---|---|
| Rectangle | Intermediate Event | Fault resulting from combination of inputs; requires gate |
| Circle | Basic Event | Component failure; terminal event with probability data |
| Diamond | Undeveloped Event | Not further developed (out of scope or insufficient data) |
| House | House Event | Expected occurrence; can be set TRUE/FALSE |
| Flat OR gate | OR Gate | Output if ANY input occurs |
| Flat AND gate | AND Gate | Output if ALL inputs occur |
| Triangle | Transfer | Connects to another tree section |

Quality Scoring

Each analysis is scored on six dimensions (see references/quality-rubric.md):

| Dimension | Weight | Description |
|---|---|---|
| System Definition | 15% | Clear boundaries, assumptions, operating conditions |
| Top Event Clarity | 15% | Specific, unambiguous, appropriate level |
| Tree Completeness | 25% | All pathways developed, no gaps, consistent logic |
| Minimal Cut Sets | 20% | Correctly identified, analyzed for SPOFs |
| Quantification | 15% | Accurate calculations, appropriate data sources |
| Actionability | 10% | Identifies design improvements, risk mitigations |

Scoring Scale: Each dimension rated 1-5 (Inadequate to Excellent)
Overall Score: Weighted average × 20, which gives 20-100 points on a 100-point scale
Passing Threshold: 70 points minimum
Run python scripts/score_analysis.py to calculate scores.
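The scoring arithmetic can be checked by hand with a few lines. This sketch only mirrors the weights and threshold listed above; it is not the contents of scripts/score_analysis.py, and the example ratings are invented.

```python
WEIGHTS = {
    "System Definition": 0.15,
    "Top Event Clarity": 0.15,
    "Tree Completeness": 0.25,
    "Minimal Cut Sets": 0.20,
    "Quantification": 0.15,
    "Actionability": 0.10,
}
PASSING_THRESHOLD = 70


def overall_score(ratings):
    """Weighted average of the 1-5 ratings, scaled by 20."""
    for dimension in WEIGHTS:
        if not 1 <= ratings[dimension] <= 5:
            raise ValueError(f"{dimension}: rating must be between 1 and 5")
    weighted_average = sum(WEIGHTS[d] * ratings[d] for d in WEIGHTS)
    return weighted_average * 20


# Hypothetical ratings for one analysis
example = {
    "System Definition": 4,
    "Top Event Clarity": 4,
    "Tree Completeness": 3,
    "Minimal Cut Sets": 4,
    "Quantification": 3,
    "Actionability": 4,
}

score = overall_score(example)
print(round(score, 1), "PASS" if score >= PASSING_THRESHOLD else "FAIL")  # 72.0 PASS
```

Because Tree Completeness carries the largest weight, raising it from 3 to 4 in this example would add 5 points, more than any other single-dimension improvement.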

Common Pitfalls

See references/common-pitfalls.md for:
  • Incorrect gate selection (AND vs OR confusion)
  • Top event too vague or at wrong level
  • Missing common cause failures
  • Incomplete branch development
  • Ignoring human factors
  • Double-counting events

Examples

See references/examples.md for worked examples:
  • Pump system failure
  • Control system loss of function
  • Safety interlock bypass
  • Manufacturing equipment hazard

Integration with Other Tools

  • FMEA/FMECA: Bottom-up complements top-down FTA; use FMEA to identify basic events
  • 5 Whys: Use for detailed investigation of specific failure pathways
  • Fishbone Diagram: Brainstorm potential causes before structuring in FTA
  • Reliability Block Diagram: Alternative view of system reliability
  • Event Tree Analysis: Use FTA for initiating event probabilities

When to Use FTA

Good candidates:
  • Safety-critical system design review
  • Accident/incident investigation
  • Regulatory compliance demonstration
  • Redundancy effectiveness evaluation
  • System failure probability estimation
Consider alternatives when:
  • Need to catalog ALL failure modes (use FMEA)
  • Analyzing success paths (use Success Tree/RBD)
  • Time-sequential dependencies critical (use Event Tree)