Query and browse evaluation results stored in MLflow. Use when the user wants to look up runs by invocation ID, compare metrics across models, fetch artifacts (configs, logs, results), or set up the MLflow MCP server. ALWAYS triggers on mentions of MLflow, experiment results, run comparison, invocation IDs in the context of results, or MLflow MCP setup.
npx skill4agent add nvidia/skills accessing-mlflow

# Find runs by invocation_id
MLflow:search_runs_by_tags(experiment_id, {"invocation_id": "<invocation_id>"})
# Query for example model/task runs
MLflow:query_runs(experiment_id, "tags.model LIKE '%<model>%'")
MLflow:query_runs(experiment_id, "tags.task_name LIKE '%<task_name>%'")
# Get a config from run's artifacts
MLflow:get_artifact_content(run_id, "config.yml")
# Get nested stats from run's artifacts
MLflow:get_artifact_content(run_id, "artifacts/eval_factory_metrics.json")

uv run --with pandas python3 << 'EOF'
import pandas as pd
# ... compute deltas, averages, etc.
EOF

<harness>.<task>/
├── artifacts/
│   ├── config.yml                       # Fully resolved config used during the evaluation
│   ├── launcher_unresolved_config.yaml  # Unresolved config passed to the launcher
│   ├── results.yml                      # All results in YAML format
│   ├── eval_factory_metrics.json        # Runtime stats (latency, token counts, memory)
│   ├── report.html                      # Request-response pair samples in HTML format (if enabled)
│   └── report.json                      # Request-response pair samples in JSON format (if enabled)
└── logs/
    ├── client-*.log    # Evaluation client
    ├── server-*-N.log  # Deployment per node
    ├── slurm-*.log     # Slurm job
    └── proxy-*.log     # Request proxy

uvx
curl -LsSf https://astral.sh/uv/install.sh | sh

.claude/settings.json
"mcpServers"
"MLflow": {
  "command": "uvx",
  "args": ["mlflow-mcp"],
  "env": {
    "MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/"
  }
}

~/.cursor/mcp.json
{
  "mcpServers": {
    "MLflow": {
      "command": "uvx",
      "args": ["mlflow-mcp"],
      "env": {
        "MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/"
      }
    }
  }
}
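
If an MCP config file already lists other servers, pasting the JSON above over it would drop them. The following is a minimal sketch, not part of the skill itself, of merging the MLflow entry into an existing file. The helper name `add_mlflow_server` and the constant `MLFLOW_SERVER` are hypothetical, and the sketch assumes the file is plain JSON with a top-level "mcpServers" object as shown above.

```python
import copy
import json
from pathlib import Path

# Server entry copied from the config above; the tracking URI is a placeholder.
MLFLOW_SERVER = {
    "command": "uvx",
    "args": ["mlflow-mcp"],
    "env": {"MLFLOW_TRACKING_URI": "https://<your-mlflow-server>/"},
}


def add_mlflow_server(config_path: Path, tracking_uri: str) -> dict:
    """Add or update the "MLflow" entry under "mcpServers", keeping other servers."""
    config = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)

    # Create "mcpServers" only if it is missing; existing entries are preserved.
    servers = config.setdefault("mcpServers", {})

    # Deep copy so the module-level template is never mutated.
    entry = copy.deepcopy(MLFLOW_SERVER)
    entry["env"]["MLFLOW_TRACKING_URI"] = tracking_uri
    servers["MLflow"] = entry

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `add_mlflow_server(Path.home() / ".cursor" / "mcp.json", "https://<your-mlflow-server>/")` would update the Cursor config shown above. Back up the file first, since the sketch rewrites it in place.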