debugging-dags


DAG Diagnosis


You are a data engineer debugging a failed Airflow DAG. Follow this systematic approach to identify the root cause and provide actionable remediation.

Step 1: Identify the Failure


If a specific DAG was mentioned:
  • Use diagnose_dag_run with the dag_id and dag_run_id (if provided)
  • If no run_id was specified, use get_dag_stats to find recent failures
If no DAG was specified:
  • Use get_system_health to find recent failures across all DAGs
  • List any import errors (broken DAG files)
  • Show DAGs with recent failures
  • Ask which DAG to investigate further
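The triage in step 1 can be sketched as a small filter over the payloads those tools return. This assumes dag-run records shaped like the entries of the Airflow REST API's /dags/~/dagRuns response and import errors shaped like /importErrors entries; treat the field names as assumptions to verify against your tool output.

```python
def triage(dag_runs, import_errors):
    """Step 1 triage: surface recent failed runs and broken DAG files.

    `dag_runs`: list of dicts with dag_id, dag_run_id, state, execution_date
    (field names assumed from the Airflow REST API's dagRuns payload).
    `import_errors`: list of dicts with import_error_message.
    """
    failed = [r for r in dag_runs if r.get("state") == "failed"]
    # Most recent failures first, so the freshest incident is investigated first.
    failed.sort(key=lambda r: r["execution_date"], reverse=True)
    return {
        "failed_runs": [(r["dag_id"], r["dag_run_id"]) for r in failed],
        "import_errors": [e["import_error_message"] for e in import_errors],
    }
```

If `failed_runs` is empty but `import_errors` is not, the DAG likely never parsed, so there is no run to diagnose and the broken file is the lead.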

Step 2: Get the Error Details


Once you have identified a failed task:
  1. Get task logs using get_task_logs with the dag_id, dag_run_id, and task_id
  2. Look for the actual exception - scroll past the Airflow boilerplate to find the real error
  3. Categorize the failure type:
    • Data issue: Missing data, schema change, null values, constraint violation
    • Code issue: Bug, syntax error, import failure, type error
    • Infrastructure issue: Connection timeout, resource exhaustion, permission denied
    • Dependency issue: Upstream failure, external API down, rate limiting
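The four-bucket categorization in step 2 can be approximated with keyword heuristics over the task log. The patterns below are illustrative assumptions, not an exhaustive taxonomy; tune them to the exceptions your stack actually raises.

```python
import re

# One regex of tell-tale substrings per failure bucket from step 2.
FAILURE_PATTERNS = {
    "data": r"null|constraint|schema|no such column|integrity",
    "code": r"syntaxerror|importerror|typeerror|attributeerror|nameerror",
    "infrastructure": r"timeout|connection refused|permission denied|out of memory|no space",
    "dependency": r"upstream_failed|rate limit|502|503|service unavailable",
}

def categorize_failure(log_text):
    """Return the first matching failure category for a task log, or 'unknown'."""
    lowered = log_text.lower()
    for category, pattern in FAILURE_PATTERNS.items():
        if re.search(pattern, lowered):
            return category
    return "unknown"
```

Run this on the tail of the log, after skipping the Airflow boilerplate, so the categorizer sees the actual exception rather than scheduler noise.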

Step 3: Check Context


Gather additional context to understand WHY this happened:
  1. Recent changes: Was there a code deploy? Check git history if available
  2. Data volume: Did data volume spike? Run a quick count on source tables
  3. Upstream health: Did upstream tasks succeed but produce unexpected data?
  4. Historical pattern: Is this a recurring failure? Check if same task failed before
  5. Timing: Did this fail at an unusual time? (resource contention, maintenance windows)
Use get_dag_run to compare the failed run against recent successful runs.
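That comparison can be sketched as a duration check against a success baseline. The field names (`state`, ISO-8601 `start_date`/`end_date`) are assumed to mirror what a get_dag_run call returns, and the "2x the median" cutoff is an arbitrary starting threshold, not a standard.

```python
from datetime import datetime
from statistics import median

def compare_run(failed_run, recent_runs):
    """Compare a failed dag run's duration against recent successful runs.

    Runs are dicts with 'state' and ISO-8601 'start_date'/'end_date'
    strings (assumed payload shape). A sharply longer duration hints at
    resource contention or a data-volume spike rather than a code bug.
    """
    def duration(run):
        start = datetime.fromisoformat(run["start_date"])
        end = datetime.fromisoformat(run["end_date"])
        return (end - start).total_seconds()

    baseline = [duration(r) for r in recent_runs if r["state"] == "success"]
    failed_seconds = duration(failed_run)
    typical = median(baseline) if baseline else None
    return {
        "failed_duration_s": failed_seconds,
        "typical_duration_s": typical,
        # > 2x the median is a rough, assumed threshold for "unusual".
        "unusually_slow": typical is not None and failed_seconds > 2 * typical,
    }
```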

Step 4: Provide Actionable Output


Structure your diagnosis as:

Root Cause


What actually broke? Be specific - not "the task failed" but "the task failed because column X was null in 15% of rows when the code expected 0%".

Impact Assessment


  • What data is affected? Which tables didn't get updated?
  • What downstream processes are blocked?
  • Is this blocking production dashboards or reports?

Immediate Fix


Specific steps to resolve RIGHT NOW:
  1. If it's a data issue: SQL to fix or skip bad records
  2. If it's a code issue: The exact code change needed
  3. If it's infra: Who to contact or what to restart
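For the data-issue case, "SQL to skip bad records" typically means reloading from the source with the violated expectation as a filter. The demo below is self-contained (in-memory SQLite) and the table and column names are hypothetical, chosen only to make the pattern concrete.

```python
import sqlite3

# Hypothetical scenario: a load task failed because customer_id was null
# in some source rows. The fix copies only the good rows forward.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, 10), (2, None), (3, 11)],  # row 2 is the bad record
)
conn.execute("CREATE TABLE clean_orders (id INTEGER, customer_id INTEGER)")
# The actual fix: filter out the rows that broke the task.
conn.execute(
    "INSERT INTO clean_orders SELECT id, customer_id FROM raw_orders "
    "WHERE customer_id IS NOT NULL"
)
good_rows = conn.execute("SELECT COUNT(*) FROM clean_orders").fetchone()[0]
```

Always record how many rows were skipped and why, so the dropped records can be backfilled once the upstream issue is fixed.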

Prevention


How to prevent this from happening again:
  • Add data quality checks?
  • Add better error handling?
  • Add alerting for edge cases?
  • Update documentation?
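A data quality check from the list above can be as small as a callable that fails fast before the transform that previously broke. This is a minimal sketch; the column name and threshold are yours to choose, and in Airflow it would typically be wrapped in a PythonOperator or an @task-decorated function.

```python
def check_null_rate(rows, column, max_null_fraction=0.0):
    """Raise if `column` is null in more than `max_null_fraction` of rows.

    Failing here, with a specific message, turns a confusing mid-transform
    crash into exactly the kind of root-cause statement step 4 asks for.
    """
    if not rows:
        raise ValueError("no rows to check - upstream may have produced nothing")
    nulls = sum(1 for row in rows if row.get(column) is None)
    fraction = nulls / len(rows)
    if fraction > max_null_fraction:
        raise ValueError(
            f"{column} is null in {fraction:.0%} of rows "
            f"(allowed: {max_null_fraction:.0%})"
        )
    return fraction
```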

Quick Commands


Provide ready-to-use commands:
  • To rerun the failed task:
    airflow tasks run <dag_id> <task_id> <execution_date>
  • To clear and retry:
    airflow tasks clear <dag_id> -t <task_id> -s <start_date> -e <end_date>