Search Results: megatron-lm

Found 14 Skills

nemo-mbridge-mlm-bridge-training

Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.

Code Quality | nvidia/skills

mcore-linting-and-formatting

Linting and formatting for Megatron-LM. Covers running autoformat.sh, tools (ruff, black, isort, pylint, mypy), and code style rules.

Testing & QA | nvidia/skills

mcore-testing

Test system for Megatron-LM. Covers test layout, recipe YAML structure, adding and running unit and functional tests, golden values, marker filters, and CI parity.

AI & Machine Learning | nvidia/skills

mcore-run-on-slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.

DevOps & Cloud Services | nvidia/skills

bump-base-image

Bump the NVIDIA PyTorch base image (`nvcr.io/nvidia/pytorch:<YY.MM>-py3`) used by Megatron-LM CI. Covers the two pin sites (GitHub CI in `docker/.ngc_version.dev` and GitLab CI in `.gitlab/stages/01.build.yml`), the post-bump CI loop (re-run functional tests, refresh golden values, mark broken tests), and the gotchas that bit PRs

AI & Machine Learning | kiterlin/intelligent-dete...

slime-rl-training

Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.

AI & Machine Learning | nvidia/skills

mlm-bridge-training

Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.

AI & Machine Learning | ascend/agent-skills

megatron-commit-tracker

Track and normalize change requests against the official Megatron-LM repository by branch, PR, commit, commit range, or time window. Use when Codex needs to collect the exact upstream change set before deeper analysis, especially for branch-aware Megatron and MindSpeed migration work, daily/periodic tracking, or preparing inputs for change analysis and migration generation.

3 scripts | Checked

AI & Machine Learning | nvidia/skills

run-on-slurm

AI & Machine Learning | ascend/agent-skills

megatron-change-analyzer

Analyze official Megatron-LM commits, PRs, and branch change sets to identify feature evolution, candidate breaking changes, and migration-relevant events. Use when Codex already has a normalized Megatron change set and needs to explain what changed, which new features matter, and which changes should flow into MindSpeed adaptation work.

1 script | Checked

DevOps & Cloud Services | nvidia/skills

bump-dependency

Bump a pinned dependency (TransformerEngine, Megatron-LM, NRX, etc.), regenerate the lockfile, open a PR, and drive it to green by attaching a watchdog to the "CICD NeMo" workflow and quarantining failing functional tests as flaky until the run is green.

DevOps & Cloud Services | nvidia/skills

mcore-create-issue

Investigate a failing GitHub Actions run or job and create a GitHub issue for the failure.
