# Databricks Skills Testing Framework

Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.

## Quick References

- Scorers - Available scorers and quality gates
- YAML Schemas - Manifest and ground truth formats
- Python API - Programmatic usage examples
- Workflows - Detailed example workflows
- Trace Evaluation - Session trace analysis

## /skill-test Command

The `/skill-test` command provides an interactive CLI for testing Databricks skills with real execution on Databricks.

### Basic Usage

```
/skill-test <skill-name> [subcommand]
```

### Subcommands

| Subcommand | Description |
| --- | --- |
| `run` | Run evaluation against ground truth (default) |
| `regression` | Compare current results against baseline |
| `init` | Initialize test scaffolding for a new skill |
| `add` | Interactive: prompt -> invoke skill -> test -> save |
| `add --trace` | Add test case with trace evaluation |
| `review` | Review pending candidates interactively |
| `review --batch` | Batch approve all pending candidates |
| `baseline` | Save current results as regression baseline |
| `mlflow` | Run full MLflow evaluation with LLM judges |
| `trace-eval` | Evaluate traces against skill expectations |
| `list-traces` | List available traces (MLflow or local) |
| `scorers` | List configured scorers for a skill |
| `scorers update` | Add/remove scorers or update default guidelines |
| `sync` | Sync YAML to Unity Catalog (Phase 2) |

### Quick Examples

```
/skill-test spark-declarative-pipelines run
/skill-test spark-declarative-pipelines add --trace
/skill-test spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
```

See Workflows for detailed examples of each subcommand.

## Execution Instructions

### Environment Setup

```bash
uv pip install -e .test/
```

Environment variables for Databricks MLflow:

- `DATABRICKS_CONFIG_PROFILE` - Databricks CLI profile (default: `"DEFAULT"`)
- `MLFLOW_TRACKING_URI` - Set to `"databricks"` for Databricks MLflow
- `MLFLOW_EXPERIMENT_NAME` - Experiment path (e.g., `"/Users/{user}/skill-test"`)
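For instance, a test session might set these up front. The values below are illustrative placeholders, particularly the experiment path, not required settings:

```python
import os

# Illustrative values; adjust the profile and experiment path for your workspace.
os.environ.setdefault("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
os.environ.setdefault("MLFLOW_TRACKING_URI", "databricks")
os.environ.setdefault("MLFLOW_EXPERIMENT_NAME", "/Users/you@example.com/skill-test")
```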

### Running Scripts

All subcommands have corresponding scripts in `.test/scripts/`:

```bash
uv run python .test/scripts/{subcommand}.py {skill_name} [options]
```

| Subcommand | Script |
| --- | --- |
| `run` | `run_eval.py` |
| `regression` | `regression.py` |
| `init` | `init_skill.py` |
| `add` | `add.py` |
| `review` | `review.py` |
| `baseline` | `baseline.py` |
| `mlflow` | `mlflow_eval.py` |
| `scorers` | `scorers.py` |
| `scorers update` | `scorers_update.py` |
| `sync` | `sync.py` |
| `trace-eval` | `trace_eval.py` |
| `list-traces` | `list_traces.py` |
| `_routing mlflow` | `routing_eval.py` |

Use `--help` on any script for available options.
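The wrapper scripts can also be driven programmatically. A minimal sketch, assuming the working directory is the repository root and that a failing evaluation exits non-zero (both assumptions, not documented guarantees):

```python
import subprocess

# Run the `run` subcommand's wrapper script for one skill.
result = subprocess.run(
    ["uv", "run", "python", ".test/scripts/run_eval.py", "spark-declarative-pipelines"],
    capture_output=True,
    text=True,
)
print(result.stdout)
result.check_returncode()  # assumption: a failed evaluation produces a non-zero exit code
```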

## Command Handler

When `/skill-test` is invoked, parse arguments and execute the appropriate command.

### Argument Parsing

- `args[0]` = skill_name (required)
- `args[1]` = subcommand (optional, default: `"run"`)

### Subcommand Routing

| Subcommand | Action |
| --- | --- |
| `run` | Execute `run(skill_name, ctx)` and display results |
| `regression` | Execute `regression(skill_name, ctx)` and display comparison |
| `init` | Execute `init(skill_name, ctx)` to create scaffolding |
| `add` | Prompt for test input, invoke skill, run `interactive()` |
| `review` | Execute `review(skill_name, ctx)` to review pending candidates |
| `baseline` | Execute `baseline(skill_name, ctx)` to save as regression baseline |
| `mlflow` | Execute `mlflow_eval(skill_name, ctx)` with MLflow logging |
| `scorers` | Execute `scorers(skill_name, ctx)` to list configured scorers |
| `scorers update` | Execute `scorers_update(skill_name, ctx, ...)` to modify scorers |
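Put together, the parsing and routing above amount to a small dispatch table. A minimal sketch, assuming the command functions are importable from the `skill_test.cli` package shown in the directory layout below (the import path is an assumption; `add` and `scorers update` are omitted because they need extra interaction and arguments):

```python
# Hypothetical dispatch sketch; the real handler lives in the skill_test package.
from skill_test.cli import baseline, init, mlflow_eval, regression, review, run, scorers

COMMANDS = {
    "run": run,
    "regression": regression,
    "init": init,
    "review": review,
    "baseline": baseline,
    "mlflow": mlflow_eval,
    "scorers": scorers,
}

def handle(args, ctx):
    skill_name = args[0]                              # required
    subcommand = args[1] if len(args) > 1 else "run"  # optional, defaults to "run"
    return COMMANDS[subcommand](skill_name, ctx)
```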

### init Behavior

When running `/skill-test <skill-name> init`:

1. Read the skill's SKILL.md to understand its purpose
2. Create `manifest.yaml` with appropriate scorers and trace_expectations
3. Create empty `ground_truth.yaml` and `candidates.yaml` templates
4. Recommend test prompts based on documentation examples

Follow with `/skill-test <skill-name> add` using recommended prompts.
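In terms of files on disk, steps 2 and 3 boil down to creating the per-skill test directory described under File Locations. An illustrative sketch only; `init_skill.py` is the real implementation and also fills in scorers and trace_expectations:

```python
from pathlib import Path

skill = "my-new-skill"                   # hypothetical skill name
test_dir = Path(".test/skills") / skill  # per-skill path from the File Locations table
test_dir.mkdir(parents=True, exist_ok=True)

# Empty templates; the real init populates manifest.yaml rather than leaving it blank.
for name in ("manifest.yaml", "ground_truth.yaml", "candidates.yaml"):
    (test_dir / name).touch()
```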

### Context Setup

Create CLIContext with MCP tools before calling any command. See Python API for details.
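The intended call pattern looks roughly like this. The constructor arguments are deliberately elided because they are specified in the Python API reference, and the import path is assumed from the package layout below:

```python
from skill_test.cli import CLIContext, run  # import path assumed from src/skill_test/cli/

ctx = CLIContext(...)  # wire in MCP tools here; see the Python API reference for arguments
results = run("spark-declarative-pipelines", ctx)
```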

## File Locations

Important: All test files are stored at the repository root level, not relative to this skill's directory.

| File Type | Path |
| --- | --- |
| Ground truth | `{repo_root}/.test/skills/{skill-name}/ground_truth.yaml` |
| Candidates | `{repo_root}/.test/skills/{skill-name}/candidates.yaml` |
| Manifest | `{repo_root}/.test/skills/{skill-name}/manifest.yaml` |
| Routing tests | `{repo_root}/.test/skills/_routing/ground_truth.yaml` |
| Baselines | `{repo_root}/.test/baselines/{skill-name}/baseline.yaml` |

For example, to test `spark-declarative-pipelines` in this repository:

```
/Users/.../ai-dev-kit/.test/skills/spark-declarative-pipelines/ground_truth.yaml
```

Not relative to the skill definition:

```
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/...  # WRONG
```

## Directory Structure

```
.test/                          # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml              # Package config (pip install -e ".test/")
├── README.md                   # Contributor documentation
├── SKILL.md                    # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh       # Sync script
├── scripts/                    # Wrapper scripts
│   ├── _common.py              # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py           # Trace evaluation
│   ├── list_traces.py          # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/             # Python package
│       ├── cli/                # CLI commands module
│       ├── fixtures/           # Test fixture setup
│       ├── scorers/            # Evaluation scorers
│       ├── grp/                # Generate-Review-Promote pipeline
│       └── runners/            # Evaluation runners
├── skills/                     # Per-skill test definitions
│   ├── _routing/               # Routing test cases
│   └── {skill-name}/           # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                      # Unit tests
├── references/                 # Documentation references
└── baselines/                  # Regression baselines
```

## References

- Scorers - Available scorers and quality gates
- YAML Schemas - Manifest and ground truth formats
- Python API - Programmatic usage examples
- Workflows - Detailed example workflows
- Trace Evaluation - Session trace analysis