Total 50,537 skills, AI & Machine Learning has 8483 skills
Showing 12 of 8483 skills
Generates a self-contained Python experiment client that uses the ddtrace.llmobs SDK. Emits either a runnable .py script or a Jupyter .ipynb notebook matching the canonical DataDog reference notebook style. Use when the user says "generate Python experiment", "write an SDK experiment", "create a ddtrace experiment", "Python notebook experiment", "use the LLM Obs SDK", or has `ddtrace` installed and wants idiomatic SDK code.
End-to-end pipeline from unlabeled ml_app traces to a bootstrapped evaluator suite. Runs trace classification → root cause analysis → eval bootstrap in sequence with user checkpoints. Use when user says "run the eval pipeline", "go from traces to evals", "bootstrap evals end to end", "classify then RCA then bootstrap", "build an eval set from scratch", or wants a guided walkthrough from production data to evaluator code.
A CLI tool that calls the Wind Alice Agent (A2A protocol, SSE streaming) to execute specified Skills and obtain analysis results. It applies to scenarios where users explicitly request actions like "run a certain Skill with Alice", "generate a research question list for a company", "create a one-page investment memo", "verify a piece of financial information", etc., that involve Alice sub-Skills.
Add a new cuTile GPU kernel operator to TileGym. Covers dispatch registration in ops.py, cuTile backend implementation, __init__.py exports, test creation, and benchmark in tests/benchmark. Use when adding, creating, or implementing a new cuTile operator/kernel in TileGym, or when asking how to register a new cuTile op.
Nsight Systems (nsys) CLI for system-level timeline profiling. Use when the user wants to run nsys profile, analyze .nsys-rep reports, use nsys stats/analyze/recipe commands, diagnose GPU idle time from timeline traces, or profile distributed training with NCCL overlap analysis. NOT for kernel-level metrics like SOL%, occupancy, or roofline (use perf-nsight-compute-analysis for ncu). NOT for writing or generating kernels. NOT for applying optimizations like CUDA Graphs.
Search video archives using natural language — find events, objects, actions, and people across recorded video using fusion search (Cosmos Embed1 semantic search + CV attribute search). Use when asked to search for something in video, find actions and events, locate objects and people, or query video archives. For these types of questions, default to this top-level fusion search unless user specifies otherwise. Requires the search profile to be deployed.
Debug AutoDeploy accuracy regressions vs a reference score (PyTorch backend or published baseline). Use when an AutoDeploy model's eval score is significantly below the reference and the root cause is unknown.
Expert cuTile programming assistant. Write high-performance GPU kernels using cuTile's tile-based programming model with proper validation and optimization. Supports deep agent orchestration for complex multi-kernel tasks.
Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.
Configuration conventions for NeMo-RL. YAML is the single source of truth for defaults. Covers TypedDict usage, exemplar YAML updates, and forbidden default patterns.
Compile TensorRT-LLM on a compute node inside a Docker container. Use this when already on a compute node with GPUs visible.
Analyze host/CPU overhead in TensorRT-LLM inference from nsys traces. Detect whether host overhead is the bottleneck using GPU idle ratio, host prep exposed ratio, and per-phase evidence. For regressions, isolate forward steps via allreduce/NVTX patterns, compare host operation breakdowns across versions, and identify scheduling or request-management overhead. Supports optional inter-kernel gap, eager-vs-graph, pattern mapping, and multi-rank straggler drill-down. Use standalone or within perf-analysis. Triggers: host overhead, inter-step gap, scheduling overhead, forward step isolation, nsys iteration analysis, NVTX breakdown, request management overhead, GPU idle, host bottleneck, host prep exposed, inter-kernel gap, bubble analysis, graph coverage, eager kernel, rank imbalance, straggler detection.