Testing framework for evaluating Databricks skills. Use when building test cases for skills, running skill evaluations, comparing skill versions, or creating ground truth datasets with the Generate-Review-Promote (GRP) pipeline. Triggers include "test skill", "evaluate skill", "skill regression", "ground truth", "GRP pipeline", "skill quality", and "skill metrics".
Install:

```
npx skill4agent add databricks-solutions/ai-dev-kit skill-test/skill-test
```

Usage:

```
/skill-test <skill-name> [subcommand]
```

| Subcommand | Description |
|---|---|
| `run` | Run evaluation against ground truth (default) |
| `regression` | Compare current results against baseline |
| `init` | Initialize test scaffolding for a new skill |
| `add` | Interactive: prompt -> invoke skill -> test -> save |
| `add --trace` | Add test case with trace evaluation |
| `review` | Review pending candidates interactively |
| `review --batch` | Batch approve all pending candidates |
| `baseline` | Save current results as regression baseline |
| `mlflow` | Run full MLflow evaluation with LLM judges |
| `trace-eval` | Evaluate traces against skill expectations |
| `list-traces` | List available traces (MLflow or local) |
| `scorers` | List configured scorers for a skill |
| `scorers-update` | Add/remove scorers or update default guidelines |
| `sync` | Sync YAML to Unity Catalog (Phase 2) |
```
/skill-test spark-declarative-pipelines run
/skill-test spark-declarative-pipelines add --trace
/skill-test spark-declarative-pipelines review --batch --filter-success
```
```
/skill-test my-new-skill init
```

Setup: install the package with `uv pip install -e .test/`, and configure the environment variables `DATABRICKS_CONFIG_PROFILE`, `MLFLOW_TRACKING_URI`, and `MLFLOW_EXPERIMENT_NAME`. The wrapper scripts live in `.test/scripts/` and are invoked as:

```
uv run python .test/scripts/{subcommand}.py {skill_name} [options]
```

| Subcommand | Script |
|---|---|
| `run` | `run_eval.py` |
| `regression` | `regression.py` |
| `init` | `init_skill.py` |
| `add` | `add.py` |
| `baseline` | `baseline.py` |
| `mlflow` | `mlflow_eval.py` |
| `routing` | `routing_eval.py` |
| `trace-eval` | `trace_eval.py` |
| `list-traces` | `list_traces.py` |
| `scorers` | `scorers.py` |
| `scorers-update` | `scorers_update.py` |
| `sync` | `sync.py` |
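As a minimal sketch of how these wrapper invocations can be assembled programmatically, assuming the `uv run python .test/scripts/{subcommand}.py {skill_name}` pattern above (the helper names here are hypothetical, not part of the skill):

```python
import os
import subprocess


def build_wrapper_command(skill_name, subcommand="run_eval", *options):
    """Assemble the `uv run` invocation for a .test/scripts wrapper."""
    return ["uv", "run", "python", f".test/scripts/{subcommand}.py", skill_name, *options]


def run_wrapper(skill_name, subcommand="run_eval", profile="DEFAULT"):
    """Run a wrapper with the Databricks profile exported in the environment."""
    env = {**os.environ, "DATABRICKS_CONFIG_PROFILE": profile}
    return subprocess.run(build_wrapper_command(skill_name, subcommand), env=env, check=True)


print(build_wrapper_command("spark-declarative-pipelines"))
# ['uv', 'run', 'python', '.test/scripts/run_eval.py', 'spark-declarative-pipelines']
```

Building the argument list explicitly (rather than a shell string) avoids quoting issues with skill names and option flags.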
Every script supports `--help`. The `/skill-test` command takes the skill name as `args[0]` and the subcommand as `args[1]`:

| Subcommand | Action |
|---|---|
| `run` | Execute `run_eval.py` |
| `regression` | Execute `regression.py` |
| `init` | Execute `init_skill.py` |
| `add` | Prompt for test input, invoke skill, run `add.py` |
| `baseline` | Execute `baseline.py` |
| `mlflow` | Execute `mlflow_eval.py` |
| `trace-eval` | Execute `trace_eval.py` |
| `scorers` | Execute `scorers.py` |
| `sync` | Execute `sync.py` |
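The dispatch logic can be sketched as follows; the subcommand-to-script mapping is an assumption inferred from the script names under `.test/scripts/`, not a confirmed part of the skill:

```python
# Hypothetical mapping from /skill-test subcommands to wrapper scripts,
# inferred from the file names under .test/scripts/.
SCRIPTS = {
    "run": "run_eval.py",
    "regression": "regression.py",
    "init": "init_skill.py",
    "add": "add.py",
    "baseline": "baseline.py",
    "mlflow": "mlflow_eval.py",
    "scorers": "scorers.py",
    "sync": "sync.py",
}


def dispatch(args):
    """args[0] is the skill name; args[1] is the subcommand (default: run)."""
    skill = args[0]
    subcommand = args[1] if len(args) > 1 else "run"
    script = SCRIPTS[subcommand]
    return f"uv run python .test/scripts/{script} {skill}"


print(dispatch(["spark-declarative-pipelines"]))
# uv run python .test/scripts/run_eval.py spark-declarative-pipelines
```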
`/skill-test <skill-name> init` creates `manifest.yaml`, `ground_truth.yaml`, and `candidates.yaml`; `/skill-test <skill-name> add` then appends test cases.

| File Type | Path |
|---|---|
| Ground truth | `.test/skills/{skill-name}/ground_truth.yaml` |
| Candidates | `.test/skills/{skill-name}/candidates.yaml` |
| Manifest | `.test/skills/{skill-name}/manifest.yaml` |
| Routing tests | `.test/skills/_routing/` |
| Baselines | `.test/baselines/` |
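A minimal sketch of resolving these locations in code, assuming the repository-root `.test/` layout (the helper name is hypothetical):

```python
from pathlib import Path


def skill_test_paths(repo_root, skill_name):
    """Resolve per-skill test files; .test/ lives at the repository root."""
    base = Path(repo_root) / ".test"
    skill_dir = base / "skills" / skill_name
    return {
        "ground_truth": skill_dir / "ground_truth.yaml",
        "candidates": skill_dir / "candidates.yaml",
        "manifest": skill_dir / "manifest.yaml",
        "routing": base / "skills" / "_routing",
        "baselines": base / "baselines",
    }


paths = skill_test_paths("/repo", "spark-declarative-pipelines")
print(paths["ground_truth"])
# /repo/.test/skills/spark-declarative-pipelines/ground_truth.yaml
```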
For example, for `spark-declarative-pipelines`:

```
/Users/.../ai-dev-kit/.test/skills/spark-declarative-pipelines/ground_truth.yaml   # CORRECT
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...                         # WRONG
```

Repository layout:

```
.test/                       # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml           # Package config (pip install -e ".test/")
├── README.md                # Contributor documentation
├── SKILL.md                 # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh    # Sync script
├── scripts/                 # Wrapper scripts
│   ├── _common.py           # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py        # Trace evaluation
│   ├── list_traces.py       # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/          # Python package
│       ├── cli/             # CLI commands module
│       ├── fixtures/        # Test fixture setup
│       ├── scorers/         # Evaluation scorers
│       ├── grp/             # Generate-Review-Promote pipeline
│       └── runners/         # Evaluation runners
├── skills/                  # Per-skill test definitions
│   ├── _routing/            # Routing test cases
│   └── {skill-name}/        # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                   # Unit tests
├── references/              # Documentation references
└── baselines/               # Regression baselines
```