Resiliency
Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md
Card: @skills/resiliency/card.yaml
Enablement
Fault tolerance (Slurm only)
Option 1: NeMo Run plugin (recommended)
```python
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)
```

| Plugin parameter | Default | Description |
|---|---|---|
| `num_in_job_restarts` | 3 | Max restarts within same job |
| `num_job_retries_on_failure` | 2 | Max new job launches on failure |
| `initial_rank_heartbeat_timeout` | 1800 | First heartbeat timeout (seconds) |
| `rank_heartbeat_timeout` | 300 | Subsequent heartbeat timeout (seconds) |
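
The two heartbeat timeouts in the table can be read as a longer grace period before a rank's first heartbeat and a shorter one afterwards. The sketch below illustrates that reading only; `heartbeat_expired` is a hypothetical helper, not part of the fault tolerance package, and the package's actual monitoring logic is not shown in this document.

```python
from typing import Optional


def heartbeat_expired(
    now: float,
    job_start: float,
    last_heartbeat: Optional[float],
    initial_timeout: float = 1800.0,
    timeout: float = 300.0,
) -> bool:
    """Return True if a rank should be considered hung (illustrative only)."""
    if last_heartbeat is None:
        # No heartbeat seen yet: apply the longer first-heartbeat timeout.
        return now - job_start > initial_timeout
    # At least one heartbeat seen: apply the shorter subsequent timeout.
    return now - last_heartbeat > timeout


# A rank that has not reported after 1000 s is still within the initial window.
still_starting = heartbeat_expired(now=1000.0, job_start=0.0, last_heartbeat=None)
# A rank whose last heartbeat was 400 s ago has exceeded the 300 s timeout.
hung = heartbeat_expired(now=2500.0, job_start=0.0, last_heartbeat=2100.0)
```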
Option 2: Direct config + ft_launcher
```python
from megatron.bridge.training.config import FaultToleranceConfig

cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)
```

Launch with `ft_launcher` (not `torchrun`):

```bash
export GROUP_RANK=0  # required for non-Slurm
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py
```

| Config parameter | Default | Description |
|---|---|---|
| `enable_ft_package` | False | Enable fault tolerance |
| `calc_ft_timeouts` | False | Auto-compute optimal timeouts |
| `simulate_fault` | False | Enable fault simulation for testing |
| `simulated_fault_type` | | |
| | None | Specific rank to fault (random if None) |
| | 0 | Base delay before simulating fault |

Section-based timeout monitoring covers setup, training steps, checkpointing,
and out-of-section time independently. Timeouts are saved to `ft_state.json`
for subsequent runs when `calc_ft_timeouts=True`.
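
To make the section-based timeouts concrete, here is a minimal sketch that parses a value shaped like the `--ft-rank_section_timeouts` argument above and picks the limit that applies to a rank. Both functions are hypothetical helpers written for illustration; `ft_launcher` does its own parsing and monitoring, which is not reproduced here.

```python
from typing import Dict, Optional


def parse_section_timeouts(spec: str) -> Dict[str, float]:
    """Parse 'name:seconds,name:seconds' into a dict (hypothetical helper)."""
    timeouts = {}
    for item in spec.split(","):
        name, _, value = item.strip().partition(":")
        if not name or not value:
            raise ValueError(f"malformed section timeout: {item!r}")
        timeouts[name] = float(value)
    return timeouts


def section_timed_out(
    section: Optional[str],
    elapsed: float,
    section_timeouts: Dict[str, float],
    out_of_section_timeout: float,
) -> bool:
    """Each section has its own limit; time outside any section uses a separate one."""
    limit = out_of_section_timeout if section is None else section_timeouts[section]
    return elapsed > limit


timeouts = parse_section_timeouts("setup:600,step:180,checkpointing:420")
# 200 s inside a training step exceeds the 180 s step limit...
step_hung = section_timed_out("step", 200.0, timeouts, out_of_section_timeout=300.0)
# ...but 200 s outside any section is still under the 300 s limit.
idle_hung = section_timed_out(None, 200.0, timeouts, out_of_section_timeout=300.0)
```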
NVRx straggler detection
```python
from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)
```

| Parameter | Default | Description |
|---|---|---|
| `enabled` | False | Enable straggler detection |
| `report_time_interval` | 300.0 | Seconds between straggler checks |
| `calc_relative_gpu_perf` | True | Compare ranks against each other |
| `calc_individual_gpu_perf` | True | Track per-rank degradation over time |
| `gpu_relative_perf_threshold` | 0.7 | Threshold for relative performance (0-1) |
| `gpu_individual_perf_threshold` | 0.7 | Threshold for individual performance (0-1) |
| `stop_if_detected` | False | Terminate training on straggler |
| `num_gpu_perf_scores_to_print` | 5 | Number of best/worst scores to print |
| | 1 | Profiling interval for detector |
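
As a rough illustration of what a relative-performance threshold of 0.7 means, the sketch below scores each rank against the fastest one and flags ranks that fall below the threshold. The scoring formula and both function names are assumptions made for this example; NVRx's actual scoring method is not described in this document.

```python
from typing import Dict, List


def relative_scores(step_times: Dict[int, float]) -> Dict[int, float]:
    """Score each rank as fastest_time / rank_time, so the fastest rank gets 1.0."""
    fastest = min(step_times.values())
    return {rank: fastest / t for rank, t in step_times.items()}


def find_stragglers(scores: Dict[int, float], threshold: float = 0.7) -> List[int]:
    """Return ranks whose score is below the threshold, in rank order."""
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Hypothetical per-rank step times in seconds; rank 2 is noticeably slower.
step_times = {0: 1.00, 1: 1.05, 2: 1.60, 3: 0.98}
scores = relative_scores(step_times)
stragglers = find_stragglers(scores, threshold=0.7)
```

With these made-up timings only rank 2 scores below 0.7, so it is the only rank flagged.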
Preemption
Plugin (Slurm)
```python
from megatron.bridge.recipes.run_plugins import PreemptionPlugin

plugins = [
    PreemptionPlugin(
        preempt_time=60,
        enable_exit_handler=True,
        enable_exit_handler_for_data_loader=False,
    )
]
```

| Plugin parameter | Default | Description |
|---|---|---|
| `preempt_time` | 60 | Seconds before job limit to send signal |
| `enable_exit_handler` | True | Enable signal handler in training |
| `enable_exit_handler_for_data_loader` | False | Enable for dataloader workers |
Direct config
```python
import signal

cfg.train.exit_signal_handler = True
cfg.train.exit_signal = signal.SIGTERM
cfg.train.exit_signal_handler_for_dataloader = False
```
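
The general pattern behind an exit signal handler is to record the signal in a flag and let the training loop stop at a safe point. The following is a minimal standard-library sketch of that pattern; `ExitFlag` and `train` are hypothetical names, and this is not the implementation in `sig_utils.py`.

```python
import signal


class ExitFlag:
    """Records that an exit signal arrived instead of terminating immediately."""

    def __init__(self, sig=signal.SIGTERM):
        self.received = False
        self._sig = sig
        self._previous = signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        self.received = True

    def restore(self):
        # Put back whatever handler was installed before.
        signal.signal(self._sig, self._previous)


def train(num_steps, flag, on_step=None):
    """Toy loop: finish the current step, then stop if a signal was seen."""
    completed = 0
    for step in range(num_steps):
        if on_step is not None:
            on_step(step)
        completed += 1
        if flag.received:
            # A real loop would save a checkpoint here before exiting.
            break
    return completed


def send_sigterm_at_step_2(step):
    # Simulate the scheduler sending the signal during the third step.
    if step == 2:
        signal.raise_signal(signal.SIGTERM)


flag = ExitFlag()
steps_done = train(10, flag, on_step=send_sigterm_at_step_2)
flag.restore()
```

The loop completes the step that was in flight when the signal arrived and then exits, rather than being killed mid-step.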
Re-run state machine (experimental)
```python
from megatron.bridge.training.config import RerunStateMachineConfig

cfg.rerun_state_machine = RerunStateMachineConfig(
    rerun_mode="validate_results",
    check_for_nan_in_loss=True,
    check_for_spiky_loss=False,
    spiky_loss_factor=10.0,
)
```

| Parameter | Default | Description |
|---|---|---|
| `rerun_mode` | | |
| `check_for_nan_in_loss` | True | Check for NaN in loss |
| `check_for_spiky_loss` | False | Check for unexpectedly large loss |
| `spiky_loss_factor` | 10.0 | Loss flagged if > factor * max observed (increase for large models) |

Exit codes: 16 = resume to disambiguate, 17 = failed validation.
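
The NaN and spiky-loss checks can be pictured with a small standalone checker that applies the "factor * max observed" rule from the table. `LossChecker` is a hypothetical class for illustration; it does not reproduce the re-run state machine's actual bookkeeping or its rerun behavior.

```python
import math


class LossChecker:
    """Flags NaN losses and losses above factor * largest loss seen so far."""

    def __init__(self, check_for_nan=True, check_for_spiky=False, spiky_loss_factor=10.0):
        self.check_for_nan = check_for_nan
        self.check_for_spiky = check_for_spiky
        self.spiky_loss_factor = spiky_loss_factor
        self.max_observed = None

    def check(self, loss: float) -> str:
        if self.check_for_nan and math.isnan(loss):
            return "nan"
        if (
            self.check_for_spiky
            and self.max_observed is not None
            and loss > self.spiky_loss_factor * self.max_observed
        ):
            return "spiky"
        # Only healthy, non-NaN losses update the running maximum.
        if not math.isnan(loss):
            if self.max_observed is None or loss > self.max_observed:
                self.max_observed = loss
        return "ok"


checker = LossChecker(check_for_spiky=True, spiky_loss_factor=10.0)
results = [checker.check(x) for x in [2.0, 1.5, 30.0, float("nan"), 1.2]]
```

Here 30.0 is flagged because it exceeds 10.0 * 2.0; a larger factor would let it through, which is why the table suggests increasing the factor for large models.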
In-process restart (experimental)
```python
from megatron.bridge.training.config import InProcessRestartConfig

cfg.inprocess_restart = InProcessRestartConfig(
    enabled=True,
    granularity="node",
    soft_timeout=60.0,
    hard_timeout=90.0,
)
```

| Parameter | Default | Description |
|---|---|---|
| `enabled` | False | Enable in-process restart |
| | None | Ranks executing workload (rest are warm reserves) |
| `granularity` | | Restart granularity |
| | None | Max restart attempts (None = unlimited) |
| `soft_timeout` | 60.0 | Detect GIL-released hangs (seconds) |
| `hard_timeout` | 90.0 | Force-terminate hung ranks (seconds) |
| | 30.0 | Heartbeat interval (seconds) |
| | 60.0 | Missing heartbeat timeout (seconds) |
| | 120.0 | Distributed barrier timeout (seconds) |
| | 120.0 | Completion barrier timeout (seconds) |
| | True | Clear CUDA cache during restart |
| | None | Max rank faults before terminating |
| | None | Directory for monitor logs |

Required environment variables:

```bash
export TORCH_CPP_LOG_LEVEL=error
export TORCH_NCCL_RETHROW_CUDA_ERRORS=0
export NCCL_NVLS_ENABLE=0
```

The PyTorch NCCL watchdog timeout must exceed `hard_timeout`. NeMo-Run's
Slurm Executor is not supported; launch directly with `srun --kill-on-bad-exit=0`.
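
A pre-launch sanity check along these lines could catch the two requirements stated above: the three environment variables and the watchdog timeout exceeding `hard_timeout`. This is a sketch with a hypothetical helper name; how the NCCL watchdog timeout is actually configured in a given job is outside what this document covers, so it is passed in as a plain number.

```python
from typing import List, Mapping

# Values taken from the "Required environment variables" list above.
REQUIRED_ENV = {
    "TORCH_CPP_LOG_LEVEL": "error",
    "TORCH_NCCL_RETHROW_CUDA_ERRORS": "0",
    "NCCL_NVLS_ENABLE": "0",
}


def check_inprocess_restart_setup(
    env: Mapping[str, str],
    hard_timeout: float,
    nccl_watchdog_timeout: float,
) -> List[str]:
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    for name, expected in REQUIRED_ENV.items():
        if env.get(name) != expected:
            problems.append(f"{name} should be {expected!r}, got {env.get(name)!r}")
    if nccl_watchdog_timeout <= hard_timeout:
        problems.append(
            f"NCCL watchdog timeout ({nccl_watchdog_timeout}s) must exceed "
            f"hard_timeout ({hard_timeout}s)"
        )
    return problems


good_env = dict(REQUIRED_ENV)
ok = check_inprocess_restart_setup(good_env, hard_timeout=90.0, nccl_watchdog_timeout=600.0)
bad = check_inprocess_restart_setup({}, hard_timeout=90.0, nccl_watchdog_timeout=60.0)
```

In a real job the first argument would typically be `os.environ`.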
Async checkpoint save
```python
cfg.checkpoint.async_save = True
cfg.checkpoint.ckpt_format = "torch_dist"
```
Local checkpointing (NVRx)
```python
cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt"
cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel"
```
Code Anchors
Fault tolerance
- Config: `src/megatron/bridge/training/config.py` (`FaultToleranceConfig`)
- Runtime: `src/megatron/bridge/training/fault_tolerance.py`
- Plugin: `src/megatron/bridge/recipes/run_plugins.py` (`FaultTolerancePlugin`)
- Perf plugin: `scripts/performance/resiliency_plugins.py`
- Tests: `tests/unit_tests/training/test_fault_tolerance.py`
- Example: `examples/training_features/resiliency/fault_tolerance/`
Straggler detection
- Config: `src/megatron/bridge/training/config.py` (`NVRxStragglerDetectionConfig`)
- Runtime: `src/megatron/bridge/training/nvrx_straggler.py`
- Train loop: `src/megatron/bridge/training/train.py` (`check_nvrx_straggler_detection`)
- Tests: `tests/unit_tests/training/test_nvrx_straggler.py`, `tests/functional_tests/training/test_nvrx_straggler.py`
- Example: `examples/training_features/resiliency/straggler_detection/`
In-process restart
- Config: `src/megatron/bridge/training/config.py` (`InProcessRestartConfig`)
- Runtime: `src/megatron/bridge/training/inprocess_restart.py`
- Entry point: `src/megatron/bridge/training/pretrain.py` (`maybe_wrap_for_inprocess_restart`)
- Tests: `tests/unit_tests/training/test_inprocess_restart.py`, `tests/functional_tests/training/test_inprocess_restart.py`
Preemption
- Plugin: `src/megatron/bridge/recipes/run_plugins.py` (`PreemptionPlugin`)
- Signal handler: `src/megatron/bridge/training/utils/sig_utils.py`
- Tests: `tests/unit_tests/recipes/test_run_plugins.py`
Re-run state machine
- Config: `src/megatron/bridge/training/config.py` (`RerunStateMachineConfig`)
- Init: `src/megatron/bridge/training/initialize.py` (`init_rerun_state`)
Checkpointing
- Async save: `src/megatron/bridge/training/checkpointing.py` (`schedule_async_save`)
- Local ckpt: `src/megatron/bridge/training/checkpointing.py` (`LocalCheckpointManager`)
- Tests: `tests/functional_tests/training/test_local_checkpointing.py`
Pitfalls
- ft_launcher, not torchrun: Direct `FaultToleranceConfig` requires `ft_launcher`. Using `torchrun` silently disables FT. For non-Slurm, set `GROUP_RANK=0`.
- Async save requires torch_dist: `async_save=True` only works with `ckpt_format="torch_dist"`. Other formats silently fail or error.
- IPR + NeMo-Run: In-process restart is not compatible with NeMo-Run or Slurm preemption plugins. Requires specific PyTorch/NCCL versions and env vars.
- NVRx vs legacy straggler: Two detectors exist. Use NVRx (`nvrx_straggler`); do not enable both.
- stop_if_detected default: NVRx logs but does not stop training by default. Set `stop_if_detected=True` for automatic termination.
- NCCL watchdog vs hard_timeout: For IPR, NCCL watchdog timeout must exceed `hard_timeout` or PyTorch kills the process before recovery.
- Rerun state machine is alpha: Use `check_for_nan_in_loss=True` for NaN detection, but don't rely on full rerun workflows yet.
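
Several of these pitfalls are combinations that fail silently, so a small pre-flight check can be useful. The sketch below encodes three of them against a plain dictionary; the function name and the dictionary keys (`ft_enabled`, `launcher`, `inprocess_restart`, `preemption_plugin`) are hypothetical and do not correspond to real config fields.

```python
from typing import Any, Dict, List


def find_pitfalls(settings: Dict[str, Any]) -> List[str]:
    """Return warnings for risky combinations named in the Pitfalls list."""
    warnings = []
    if settings.get("async_save") and settings.get("ckpt_format") != "torch_dist":
        warnings.append("async_save=True only works with ckpt_format='torch_dist'")
    if settings.get("ft_enabled") and settings.get("launcher") != "ft_launcher":
        warnings.append("fault tolerance needs ft_launcher; torchrun silently disables it")
    if settings.get("inprocess_restart") and settings.get("preemption_plugin"):
        warnings.append("in-process restart is not compatible with the preemption plugin")
    return warnings


risky = find_pitfalls(
    {"async_save": True, "ckpt_format": "zarr", "ft_enabled": True, "launcher": "torchrun"}
)
safe = find_pitfalls(
    {"async_save": True, "ckpt_format": "torch_dist", "ft_enabled": True, "launcher": "ft_launcher"}
)
```

The `"zarr"` value is only a placeholder for "some format other than `torch_dist`".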
Verification
Fault tolerance
```bash
./examples/training_features/resiliency/fault_tolerance/run_fault_tolerance.sh
./examples/training_features/resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault
```

Look for `[FaultTolerance]` / `[RankMonitorServer]` log lines with section
timeouts. Simulated fault should trigger restart from checkpoint.
Straggler detection
```bash
uv run python -m torch.distributed.run --nproc_per_node=2 \
    examples/training_features/resiliency/straggler_detection/straggler_detection_example.py
```

Look for `GPU relative performance` and `GPU individual performance` reports
with per-rank scores.
Async checkpoint
Look for `Scheduling async checkpoint save` in logs. Training iterations
should continue while checkpoint files are being written.
In-process restart
```bash
pytest tests/functional_tests/training/test_inprocess_restart.py -v
```

Requires compatible PyTorch/NCCL versions.