resiliency

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Resiliency

弹性能力

Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md Card: @skills/resiliency/card.yaml
稳定版文档:@docs/training/resiliency.md、@docs/training/checkpointing.md 技能卡片:@skills/resiliency/card.yaml

Enablement

功能启用

Fault tolerance (Slurm only)

容错(仅支持Slurm)

Option 1: NeMo Run plugin (recommended)

方案1:NeMo Run插件(推荐)

python
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)
Plugin parameterDefaultDescription
num_in_job_restarts
3Max restarts within same job
num_job_retries_on_failure
2Max new job launches on failure
initial_rank_heartbeat_timeout
1800First heartbeat timeout (seconds)
rank_heartbeat_timeout
300Subsequent heartbeat timeout (seconds)
python
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)
插件参数默认值描述
num_in_job_restarts
3同任务内的最大重启次数
num_job_retries_on_failure
2任务失败后的最大重新启动次数
initial_rank_heartbeat_timeout
1800首次心跳超时时间(秒)
rank_heartbeat_timeout
300后续心跳超时时间(秒)

Option 2: Direct config + ft_launcher

方案2:直接配置 + ft_launcher

python
from megatron.bridge.training.config import FaultToleranceConfig

cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)
Launch with
ft_launcher
(not
torchrun
):
bash
export GROUP_RANK=0  # required for non-Slurm
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py
Config parameterDefaultDescription
enable_ft_package
FalseEnable fault tolerance
calc_ft_timeouts
FalseAuto-compute optimal timeouts
simulate_fault
FalseEnable fault simulation for testing
simulated_fault_type
"random"
"rank_hung"
,
"rank_killed"
, or
"random"
simulated_fault_rank
NoneSpecific rank to fault (random if None)
simulated_fault_base_delay
0Base delay before simulating fault
Section-based timeout monitoring covers setup, training steps, checkpointing, and out-of-section time independently. Timeouts are saved to
ft_state.json
for subsequent runs when
calc_ft_timeouts=True
.
python
from megatron.bridge.training.config import FaultToleranceConfig

cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)
使用
ft_launcher
启动(而非
torchrun
):
bash
export GROUP_RANK=0  # 非Slurm环境必填
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py
配置参数默认值描述
enable_ft_package
False启用容错功能
calc_ft_timeouts
False自动计算最优超时时间
simulate_fault
False启用故障模拟用于测试
simulated_fault_type
"random"
可选值:
"rank_hung"
"rank_killed"
"random"
simulated_fault_rank
None指定要模拟故障的rank(为None时随机选择)
simulated_fault_base_delay
0模拟故障前的基础延迟时间
基于阶段的超时监控独立覆盖初始化、训练步骤、 checkpointing以及阶段外时间。当
calc_ft_timeouts=True
时,超时时间会保存到
ft_state.json
供后续运行使用。

NVRx straggler detection

NVRx掉队节点检测

python
from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)
ParameterDefaultDescription
enabled
FalseEnable straggler detection
report_time_interval
300.0Seconds between straggler checks
calc_relative_gpu_perf
TrueCompare ranks against each other
calc_individual_gpu_perf
TrueTrack per-rank degradation over time
gpu_relative_perf_threshold
0.7Threshold for relative performance (0-1)
gpu_individual_perf_threshold
0.7Threshold for individual performance (0-1)
stop_if_detected
FalseTerminate training on straggler
num_gpu_perf_scores_to_print
5Number of best/worst scores to print
profiling_interval
1Profiling interval for detector
python
from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)
参数默认值描述
enabled
False启用掉队节点检测
report_time_interval
300.0掉队节点检测的时间间隔(秒)
calc_relative_gpu_perf
True对比不同rank的性能
calc_individual_gpu_perf
True跟踪单个rank的性能退化情况
gpu_relative_perf_threshold
0.7相对性能阈值(0-1)
gpu_individual_perf_threshold
0.7单个性能阈值(0-1)
stop_if_detected
False检测到掉队节点时终止训练
num_gpu_perf_scores_to_print
5要打印的最优/最差性能分数数量
profiling_interval
1检测器的性能分析间隔

Preemption

抢占

Plugin (Slurm)

插件(Slurm)

python
from megatron.bridge.recipes.run_plugins import PreemptionPlugin

plugins = [
    PreemptionPlugin(
        preempt_time=60,
        enable_exit_handler=True,
        enable_exit_handler_for_data_loader=False,
    )
]
Plugin parameterDefaultDescription
preempt_time
60Seconds before job limit to send signal
enable_exit_handler
TrueEnable signal handler in training
enable_exit_handler_for_data_loader
FalseEnable for dataloader workers
python
from megatron.bridge.recipes.run_plugins import PreemptionPlugin

plugins = [
    PreemptionPlugin(
        preempt_time=60,
        enable_exit_handler=True,
        enable_exit_handler_for_data_loader=False,
    )
]
插件参数默认值描述
preempt_time
60到达任务限制前发送信号的提前时间(秒)
enable_exit_handler
True在训练中启用信号处理器
enable_exit_handler_for_data_loader
False为数据加载器工作进程启用信号处理器

Direct config

直接配置

python
import signal
cfg.train.exit_signal_handler = True
cfg.train.exit_signal = signal.SIGTERM
cfg.train.exit_signal_handler_for_dataloader = False
python
import signal
cfg.train.exit_signal_handler = True
cfg.train.exit_signal = signal.SIGTERM
cfg.train.exit_signal_handler_for_dataloader = False

Re-run state machine (experimental)

重运行状态机(实验性)

python
from megatron.bridge.training.config import RerunStateMachineConfig

cfg.rerun_state_machine = RerunStateMachineConfig(
    rerun_mode="validate_results",
    check_for_nan_in_loss=True,
    check_for_spiky_loss=False,
    spiky_loss_factor=10.0,
)
ParameterDefaultDescription
rerun_mode
"disabled"
"disabled"
,
"validate_results"
,
"report_determinism_stats"
check_for_nan_in_loss
TrueCheck for NaN in loss
check_for_spiky_loss
FalseCheck for unexpectedly large loss
spiky_loss_factor
10.0Loss flagged if > factor * max observed (increase for large models)
Exit codes: 16 = resume to disambiguate, 17 = failed validation.
python
from megatron.bridge.training.config import RerunStateMachineConfig

cfg.rerun_state_machine = RerunStateMachineConfig(
    rerun_mode="validate_results",
    check_for_nan_in_loss=True,
    check_for_spiky_loss=False,
    spiky_loss_factor=10.0,
)
参数默认值描述
rerun_mode
"disabled"
可选值:
"disabled"
"validate_results"
"report_determinism_stats"
check_for_nan_in_loss
True检查损失值是否为NaN
check_for_spiky_loss
False检查是否出现异常激增的损失值
spiky_loss_factor
10.0当损失值大于该系数乘以历史最大值时标记(大模型可适当增大)
退出码:16 = 需要恢复以消除歧义,17 = 验证失败。

In-process restart (experimental)

进程内重启(实验性)

python
from megatron.bridge.training.config import InProcessRestartConfig

cfg.inprocess_restart = InProcessRestartConfig(
    enabled=True,
    granularity="node",
    soft_timeout=60.0,
    hard_timeout=90.0,
)
ParameterDefaultDescription
enabled
FalseEnable in-process restart
active_world_size
NoneRanks executing workload (rest are warm reserves)
granularity
"node"
"node"
or
"rank"
restart granularity
max_iterations
NoneMax restart attempts (None = unlimited)
soft_timeout
60.0Detect GIL-released hangs (seconds)
hard_timeout
90.0Force-terminate hung ranks (seconds)
heartbeat_interval
30.0Heartbeat interval (seconds)
heartbeat_timeout
60.0Missing heartbeat timeout (seconds)
barrier_timeout
120.0Distributed barrier timeout (seconds)
completion_timeout
120.0Completion barrier timeout (seconds)
empty_cuda_cache
TrueClear CUDA cache during restart
max_rank_faults
NoneMax rank faults before terminating
monitor_process_logdir
NoneDirectory for monitor logs
Required environment variables:
bash
export TORCH_CPP_LOG_LEVEL=error
export TORCH_NCCL_RETHROW_CUDA_ERRORS=0
export NCCL_NVLS_ENABLE=0
The PyTorch NCCL watchdog timeout must exceed
hard_timeout
. NeMo-Run's Slurm Executor is not supported; launch directly with
srun --kill-on-bad-exit=0
.
python
from megatron.bridge.training.config import InProcessRestartConfig

cfg.inprocess_restart = InProcessRestartConfig(
    enabled=True,
    granularity="node",
    soft_timeout=60.0,
    hard_timeout=90.0,
)
参数默认值描述
enabled
False启用进程内重启
active_world_size
None执行工作负载的rank数量(其余为热备)
granularity
"node"
重启粒度:
"node"
"rank"
max_iterations
None最大重启尝试次数(None表示无限制)
soft_timeout
60.0检测GIL释放后的挂起情况(秒)
hard_timeout
90.0强制终止挂起的rank(秒)
heartbeat_interval
30.0心跳间隔(秒)
heartbeat_timeout
60.0心跳丢失超时时间(秒)
barrier_timeout
120.0分布式屏障超时时间(秒)
completion_timeout
120.0完成屏障超时时间(秒)
empty_cuda_cache
True重启时清理CUDA缓存
max_rank_faults
None终止前允许的最大rank故障次数
monitor_process_logdir
None监控日志的存储目录
必填环境变量:
bash
export TORCH_CPP_LOG_LEVEL=error
export TORCH_NCCL_RETHROW_CUDA_ERRORS=0
export NCCL_NVLS_ENABLE=0
PyTorch NCCL看门狗超时时间必须大于
hard_timeout
。不支持NeMo-Run的Slurm执行器;需直接使用
srun --kill-on-bad-exit=0
启动。

Async checkpoint save

异步Checkpoint保存

python
cfg.checkpoint.async_save = True
cfg.checkpoint.ckpt_format = "torch_dist"
python
cfg.checkpoint.async_save = True
cfg.checkpoint.ckpt_format = "torch_dist"

Local checkpointing (NVRx)

本地Checkpointing(NVRx)

python
cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt"
cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel"
python
cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt"
cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel"

Code Anchors

代码锚点

Fault tolerance

容错

  • Config:
    src/megatron/bridge/training/config.py
    FaultToleranceConfig
  • Runtime:
    src/megatron/bridge/training/fault_tolerance.py
  • Plugin:
    src/megatron/bridge/recipes/run_plugins.py
    FaultTolerancePlugin
  • Perf plugin:
    scripts/performance/resiliency_plugins.py
  • Tests:
    tests/unit_tests/training/test_fault_tolerance.py
  • Example:
    examples/training_features/resiliency/fault_tolerance/
  • 配置:
    src/megatron/bridge/training/config.py
    FaultToleranceConfig
  • 运行时:
    src/megatron/bridge/training/fault_tolerance.py
  • 插件:
    src/megatron/bridge/recipes/run_plugins.py
    FaultTolerancePlugin
  • 性能插件:
    scripts/performance/resiliency_plugins.py
  • 测试:
    tests/unit_tests/training/test_fault_tolerance.py
  • 示例:
    examples/training_features/resiliency/fault_tolerance/

Straggler detection

掉队节点检测

  • Config:
    src/megatron/bridge/training/config.py
    NVRxStragglerDetectionConfig
  • Runtime:
    src/megatron/bridge/training/nvrx_straggler.py
  • Train loop:
    src/megatron/bridge/training/train.py
    check_nvrx_straggler_detection
  • Tests:
    tests/unit_tests/training/test_nvrx_straggler.py
    ,
    tests/functional_tests/training/test_nvrx_straggler.py
  • Example:
    examples/training_features/resiliency/straggler_detection/
  • 配置:
    src/megatron/bridge/training/config.py
    NVRxStragglerDetectionConfig
  • 运行时:
    src/megatron/bridge/training/nvrx_straggler.py
  • 训练循环:
    src/megatron/bridge/training/train.py
    check_nvrx_straggler_detection
  • 测试:
    tests/unit_tests/training/test_nvrx_straggler.py
    tests/functional_tests/training/test_nvrx_straggler.py
  • 示例:
    examples/training_features/resiliency/straggler_detection/

In-process restart

进程内重启

  • Config:
    src/megatron/bridge/training/config.py
    InProcessRestartConfig
  • Runtime:
    src/megatron/bridge/training/inprocess_restart.py
  • Entry point:
    src/megatron/bridge/training/pretrain.py
    maybe_wrap_for_inprocess_restart
  • Tests:
    tests/unit_tests/training/test_inprocess_restart.py
    ,
    tests/functional_tests/training/test_inprocess_restart.py
  • 配置:
    src/megatron/bridge/training/config.py
    InProcessRestartConfig
  • 运行时:
    src/megatron/bridge/training/inprocess_restart.py
  • 入口:
    src/megatron/bridge/training/pretrain.py
    maybe_wrap_for_inprocess_restart
  • 测试:
    tests/unit_tests/training/test_inprocess_restart.py
    tests/functional_tests/training/test_inprocess_restart.py

Preemption

抢占

  • Plugin:
    src/megatron/bridge/recipes/run_plugins.py
    PreemptionPlugin
  • Signal handler:
    src/megatron/bridge/training/utils/sig_utils.py
  • Tests:
    tests/unit_tests/recipes/test_run_plugins.py
  • 插件:
    src/megatron/bridge/recipes/run_plugins.py
    PreemptionPlugin
  • 信号处理器:
    src/megatron/bridge/training/utils/sig_utils.py
  • 测试:
    tests/unit_tests/recipes/test_run_plugins.py

Re-run state machine

重运行状态机

  • Config:
    src/megatron/bridge/training/config.py
    RerunStateMachineConfig
  • Init:
    src/megatron/bridge/training/initialize.py
    init_rerun_state
  • 配置:
    src/megatron/bridge/training/config.py
    RerunStateMachineConfig
  • 初始化:
    src/megatron/bridge/training/initialize.py
    init_rerun_state

Checkpointing

Checkpointing

  • Async save:
    src/megatron/bridge/training/checkpointing.py
    schedule_async_save
  • Local ckpt:
    src/megatron/bridge/training/checkpointing.py
    LocalCheckpointManager
  • Tests:
    tests/functional_tests/training/test_local_checkpointing.py
  • 异步保存:
    src/megatron/bridge/training/checkpointing.py
    schedule_async_save
  • 本地Checkpoint:
    src/megatron/bridge/training/checkpointing.py
    LocalCheckpointManager
  • 测试:
    tests/functional_tests/training/test_local_checkpointing.py

Pitfalls

注意事项

  1. ft_launcher, not torchrun: Direct
    FaultToleranceConfig
    requires
    ft_launcher
    . Using
    torchrun
    silently disables FT. For non-Slurm, set
    GROUP_RANK=0
    .
  2. Async save requires torch_dist:
    async_save=True
    only works with
    ckpt_format="torch_dist"
    . Other formats silently fail or error.
  3. IPR + NeMo-Run: In-process restart is not compatible with NeMo-Run or Slurm preemption plugins. Requires specific PyTorch/NCCL versions and env vars.
  4. NVRx vs legacy straggler: Two detectors exist. Use NVRx (
    nvrx_straggler
    ); do not enable both.
  5. stop_if_detected default: NVRx logs but does not stop training by default. Set
    stop_if_detected=True
    for automatic termination.
  6. NCCL watchdog vs hard_timeout: For IPR, NCCL watchdog timeout must exceed
    hard_timeout
    or PyTorch kills the process before recovery.
  7. Rerun state machine is alpha: Use
    check_for_nan_in_loss=True
    for NaN detection, but don't rely on full rerun workflows yet.
  1. 使用ft_launcher而非torchrun:直接配置
    FaultToleranceConfig
    需要使用
    ft_launcher
    。使用
    torchrun
    会静默禁用容错功能。非Slurm环境需设置
    GROUP_RANK=0
  2. 异步保存仅支持torch_dist格式
    async_save=True
    仅在
    ckpt_format="torch_dist"
    时生效。其他格式会静默失败或报错。
  3. 进程内重启与NeMo-Run不兼容:进程内重启无法与NeMo-Run或Slurm抢占插件配合使用。需要特定版本的PyTorch/NCCL及环境变量。
  4. NVRx与旧版掉队节点检测器:存在两种检测器。请使用NVRx版本(
    nvrx_straggler
    );不要同时启用两者。
  5. stop_if_detected默认行为:NVRx默认仅记录日志,不会终止训练。若需自动终止需设置
    stop_if_detected=True
  6. NCCL看门狗与hard_timeout:对于进程内重启,NCCL看门狗超时时间必须大于
    hard_timeout
    ,否则PyTorch会在恢复前终止进程。
  7. 重运行状态机处于alpha阶段:可使用
    check_for_nan_in_loss=True
    检测NaN值,但暂不要依赖完整的重运行工作流。

Verification

验证方法

Fault tolerance

容错

bash
./examples/training_features/resiliency/fault_tolerance/run_fault_tolerance.sh
./examples/training_features/resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault
Look for
[FaultTolerance]
/
[RankMonitorServer]
log lines with section timeouts. Simulated fault should trigger restart from checkpoint.
bash
./examples/training_features/resiliency/fault_tolerance/run_fault_tolerance.sh
./examples/training_features/resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault
查找包含
[FaultTolerance]
/
[RankMonitorServer]
的日志行,查看阶段超时信息。模拟故障应触发从checkpoint重启。

Straggler detection

掉队节点检测

bash
uv run python -m torch.distributed.run --nproc_per_node=2 \
    examples/training_features/resiliency/straggler_detection/straggler_detection_example.py
Look for
GPU relative performance
and
GPU individual performance
reports with per-rank scores.
bash
uv run python -m torch.distributed.run --nproc_per_node=2 \
    examples/training_features/resiliency/straggler_detection/straggler_detection_example.py
查找包含
GPU relative performance
GPU individual performance
的报告,查看各rank的性能分数。

Async checkpoint

异步Checkpoint

Look for
Scheduling async checkpoint save
in logs. Training iterations should continue while checkpoint files are being written.
在日志中查找
Scheduling async checkpoint save
信息。训练迭代应在checkpoint文件写入期间持续进行。

In-process restart

进程内重启

bash
pytest tests/functional_tests/training/test_inprocess_restart.py -v
Requires compatible PyTorch/NCCL versions.
bash
pytest tests/functional_tests/training/test_inprocess_restart.py -v
需要兼容版本的PyTorch/NCCL。