# Databricks Skills Testing Framework

Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.

## Quick References

- Scorers - Available scorers and quality gates
- YAML Schemas - Manifest and ground truth formats
- Python API - Programmatic usage examples
- Workflows - Detailed example workflows
- Trace Evaluation - Session trace analysis

## /skill-test Command

The `/skill-test` command provides an interactive CLI for testing Databricks skills with real execution on Databricks.

### Basic Usage

```
/skill-test <skill-name> [subcommand]
```

### Subcommands

| Subcommand | Description |
| --- | --- |
| `run` | Run evaluation against ground truth (default) |
| `regression` | Compare current results against baseline |
| `init` | Initialize test scaffolding for a new skill |
| `add` | Interactive: prompt -> invoke skill -> test -> save |
| `add --trace` | Add test case with trace evaluation |
| `review` | Review pending candidates interactively |
| `review --batch` | Batch approve all pending candidates |
| `baseline` | Save current results as regression baseline |
| `mlflow` | Run full MLflow evaluation with LLM judges |
| `trace-eval` | Evaluate traces against skill expectations |
| `list-traces` | List available traces (MLflow or local) |
| `scorers` | List configured scorers for a skill |
| `scorers update` | Add/remove scorers or update default guidelines |
| `sync` | Sync YAML to Unity Catalog (Phase 2) |

### Quick Examples

```
/skill-test spark-declarative-pipelines run
/skill-test spark-declarative-pipelines add --trace
/skill-test spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
```

See Workflows for detailed examples of each subcommand.

## Execution Instructions

### Environment Setup

```bash
uv pip install -e .test/
```

Environment variables for Databricks MLflow:

- `DATABRICKS_CONFIG_PROFILE` - Databricks CLI profile (default: `"DEFAULT"`)
- `MLFLOW_TRACKING_URI` - Set to `"databricks"` for Databricks MLflow
- `MLFLOW_EXPERIMENT_NAME` - Experiment path (e.g., `"/Users/{user}/skill-test"`)
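For instance, a test session might set these up front. The values below are illustrative placeholders, particularly the experiment path, not required settings:

```python
import os

# Illustrative values; adjust the profile and experiment path for your workspace.
os.environ.setdefault("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
os.environ.setdefault("MLFLOW_TRACKING_URI", "databricks")
os.environ.setdefault("MLFLOW_EXPERIMENT_NAME", "/Users/you@example.com/skill-test")
```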

### Running Scripts

All subcommands have corresponding scripts in `.test/scripts/`:

```bash
uv run python .test/scripts/{subcommand}.py {skill_name} [options]
```

| Subcommand | Script |
| --- | --- |
| `run` | `run_eval.py` |
| `regression` | `regression.py` |
| `init` | `init_skill.py` |
| `add` | `add.py` |
| `review` | `review.py` |
| `baseline` | `baseline.py` |
| `mlflow` | `mlflow_eval.py` |
| `scorers` | `scorers.py` |
| `scorers update` | `scorers_update.py` |
| `sync` | `sync.py` |
| `trace-eval` | `trace_eval.py` |
| `list-traces` | `list_traces.py` |
| `_routing mlflow` | `routing_eval.py` |

Use `--help` on any script for available options.
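The wrapper scripts can also be driven programmatically. A minimal sketch, assuming the working directory is the repository root and that a failing evaluation exits non-zero (both assumptions, not documented guarantees):

```python
import subprocess

# Run the `run` subcommand's wrapper script for one skill.
result = subprocess.run(
    ["uv", "run", "python", ".test/scripts/run_eval.py", "spark-declarative-pipelines"],
    capture_output=True,
    text=True,
)
print(result.stdout)
result.check_returncode()  # assumption: a failed evaluation produces a non-zero exit code
```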

## Command Handler

When `/skill-test` is invoked, parse arguments and execute the appropriate command.

### Argument Parsing

- `args[0]` = skill_name (required)
- `args[1]` = subcommand (optional, default: `"run"`)

### Subcommand Routing

| Subcommand | Action |
| --- | --- |
| `run` | Execute `run(skill_name, ctx)` and display results |
| `regression` | Execute `regression(skill_name, ctx)` and display comparison |
| `init` | Execute `init(skill_name, ctx)` to create scaffolding |
| `add` | Prompt for test input, invoke skill, run `interactive()` |
| `review` | Execute `review(skill_name, ctx)` to review pending candidates |
| `baseline` | Execute `baseline(skill_name, ctx)` to save as regression baseline |
| `mlflow` | Execute `mlflow_eval(skill_name, ctx)` with MLflow logging |
| `scorers` | Execute `scorers(skill_name, ctx)` to list configured scorers |
| `scorers update` | Execute `scorers_update(skill_name, ctx, ...)` to modify scorers |
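Put together, the parsing and routing above amount to a small dispatch table. A minimal sketch, assuming the command functions are importable from the `skill_test.cli` package shown in the directory layout below (the import path is an assumption; `add` and `scorers update` are omitted because they need extra interaction and arguments):

```python
# Hypothetical dispatch sketch; the real handler lives in the skill_test package.
from skill_test.cli import baseline, init, mlflow_eval, regression, review, run, scorers

COMMANDS = {
    "run": run,
    "regression": regression,
    "init": init,
    "review": review,
    "baseline": baseline,
    "mlflow": mlflow_eval,
    "scorers": scorers,
}

def handle(args, ctx):
    skill_name = args[0]                              # required
    subcommand = args[1] if len(args) > 1 else "run"  # optional, defaults to "run"
    return COMMANDS[subcommand](skill_name, ctx)
```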

### init Behavior

When running `/skill-test <skill-name> init`:

1. Read the skill's SKILL.md to understand its purpose
2. Create `manifest.yaml` with appropriate scorers and trace_expectations
3. Create empty `ground_truth.yaml` and `candidates.yaml` templates
4. Recommend test prompts based on documentation examples

Follow with `/skill-test <skill-name> add` using recommended prompts.
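In terms of files on disk, steps 2 and 3 boil down to creating the per-skill test directory described under File Locations. An illustrative sketch only; `init_skill.py` is the real implementation and also fills in scorers and trace_expectations:

```python
from pathlib import Path

skill = "my-new-skill"                   # hypothetical skill name
test_dir = Path(".test/skills") / skill  # per-skill path from the File Locations table
test_dir.mkdir(parents=True, exist_ok=True)

# Empty templates; the real init populates manifest.yaml rather than leaving it blank.
for name in ("manifest.yaml", "ground_truth.yaml", "candidates.yaml"):
    (test_dir / name).touch()
```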

### Context Setup

Create CLIContext with MCP tools before calling any command. See Python API for details.
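The intended call pattern looks roughly like this. The constructor arguments are deliberately elided because they are specified in the Python API reference, and the import path is assumed from the package layout below:

```python
from skill_test.cli import CLIContext, run  # import path assumed from src/skill_test/cli/

ctx = CLIContext(...)  # wire in MCP tools here; see the Python API reference for arguments
results = run("spark-declarative-pipelines", ctx)
```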

## File Locations

Important: All test files are stored at the repository root level, not relative to this skill's directory.

| File Type | Path |
| --- | --- |
| Ground truth | `{repo_root}/.test/skills/{skill-name}/ground_truth.yaml` |
| Candidates | `{repo_root}/.test/skills/{skill-name}/candidates.yaml` |
| Manifest | `{repo_root}/.test/skills/{skill-name}/manifest.yaml` |
| Routing tests | `{repo_root}/.test/skills/_routing/ground_truth.yaml` |
| Baselines | `{repo_root}/.test/baselines/{skill-name}/baseline.yaml` |

For example, to test `spark-declarative-pipelines` in this repository:

```
/Users/.../ai-dev-kit/.test/skills/spark-declarative-pipelines/ground_truth.yaml
```

Not relative to the skill definition:

```
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...  # WRONG
```

## Directory Structure

```
.test/                          # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml              # Package config (pip install -e ".test/")
├── README.md                   # Contributor documentation
├── SKILL.md                    # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh       # Sync script
├── scripts/                    # Wrapper scripts
│   ├── _common.py              # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py           # Trace evaluation
│   ├── list_traces.py          # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/             # Python package
│       ├── cli/                # CLI commands module
│       ├── fixtures/           # Test fixture setup
│       ├── scorers/            # Evaluation scorers
│       ├── grp/                # Generate-Review-Promote pipeline
│       └── runners/            # Evaluation runners
├── skills/                     # Per-skill test definitions
│   ├── _routing/               # Routing test cases
│   └── {skill-name}/           # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                      # Unit tests
├── references/                 # Documentation references
└── baselines/                  # Regression baselines
```

## References

- Scorers - Available scorers and quality gates
- YAML Schemas - Manifest and ground truth formats
- Python API - Programmatic usage examples
- Workflows - Detailed example workflows
- Trace Evaluation - Session trace analysis