Search Results: megatron-lm

Found 14 Skills

mcore-linting-and-formatting

Linting and formatting for Megatron-LM. Covers running autoformat.sh, tools (ruff, black, isort, pylint, mypy), and code style rules.

DevOps & Cloud Services | nvidia/skills

bump-base-image

Bump the NVIDIA PyTorch base image (`nvcr.io/nvidia/pytorch:<YY.MM>-py3`) used by Megatron-LM CI. Covers the two pin sites (GitHub CI in `docker/.ngc_version.dev` and GitLab CI in `.gitlab/stages/01.build.yml`), the post-bump CI loop (re-run functional tests, refresh golden values, mark broken tests), and the gotchas that bit PRs

AI & Machine Learning | nvidia/skills

nemo-mbridge-mlm-bridge-training

Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.

Testing & QA | nvidia/skills

mcore-testing

Test system for Megatron-LM. Covers test layout, recipe YAML structure, adding and running unit and functional tests, golden values, marker filters, and CI parity.

AI & Machine Learning | kiterlin/intelligent-dete...

slime-rl-training

Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.

AI & Machine Learning | nvidia/skills

mlm-bridge-training

Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.

AI & Machine Learning | nvidia/skills

mcore-run-on-slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.

AI & Machine Learning | nvidia/skills

run-on-slurm

AI & Machine Learning | ascend/agent-skills

megatron-commit-tracker

Track and normalize change requests against the official Megatron-LM repository by branch, PR, commit, commit range, or time window. Use when Codex needs to collect the exact upstream change set before deeper analysis, especially for branch-aware Megatron and MindSpeed migration work, daily/periodic tracking, or preparing inputs for change analysis and migration generation.

3 scripts/Checked

AI & Machine Learning | ascend-ai-coding/awesome-...

mindspeed-llm-env-setup

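MindSpeed-LLM environment setup guide for Huawei Ascend NPUs. Covers CANN environment activation, PyTorch + torch_npu installation, MindSpeed acceleration library installation, Megatron-LM core module integration, MindSpeed-LLM installation, and environment verification. Use when the user needs to set up a MindSpeed-LLM training environment on Ascend NPUs.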

AI & Machine Learning | ascend/agent-skills

megatron-change-analyzer

Analyze official Megatron-LM commits, PRs, and branch change sets to identify feature evolution, candidate breaking changes, and migration-relevant events. Use when Codex already has a normalized Megatron change set and needs to explain what changed, which new features matter, and which changes should flow into MindSpeed adaptation work.

1 scripts/Checked

DevOps & Cloud Services | nvidia/skills

bump-dependency

Bump a pinned dependency (TransformerEngine, Megatron-LM, NRX, etc.), regenerate the lockfile, open a PR, and drive it to green by attaching a watchdog to the "CICD NeMo" workflow and quarantining failing functional tests as flaky until the run is green.
