# Promptfoo Evaluation

## Overview
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
## Quick Start

```bash
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
```
## Configuration Structure

A typical Promptfoo project structure:

```
project/
├── promptfooconfig.yaml   # Main configuration
├── prompts/
│   ├── system.md          # System prompt
│   └── chat.json          # Chat format prompt
├── tests/
│   └── cases.yaml         # Test cases
└── scripts/
    └── metrics.py         # Custom Python assertions
```

## Core Configuration (promptfooconfig.yaml)
```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```
## Prompt Formats
### Text Prompt (system.md)

```markdown
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
```

### Chat Format (chat.json)
```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "{{user_input}}"}
]
```
### Few-Shot Pattern
Embed examples directly in the prompt, or use the chat format with assistant messages:
```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Example input: {{example_input}}"},
  {"role": "assistant", "content": "{{example_output}}"},
  {"role": "user", "content": "Now process: {{actual_input}}"}
]
```

## Test Cases (tests/cases.yaml)
```yaml
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
```

## Python Custom Assertions
Create a Python file for custom assertions (e.g., `scripts/metrics.py`):

```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})
    # Access test variables
    expected = vars_dict.get('expected', '')
    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }

def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```

Key points:
- The default function name is `get_assert`
- Specify a named function with `file://path.py:function_name`
- Return a `bool`, a `float` (score), or a `dict` with pass/score/reason
- Access test variables via `context['vars']`
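Because custom assertions are plain functions, they can be sanity-checked locally before an eval run. A minimal harness for the `custom_check` example above (the context dict mimics the shape Promptfoo passes in):

```python
def custom_check(output: str, context: dict) -> dict:
    """Word-count assertion, as defined in scripts/metrics.py above."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }

# Simulate Promptfoo's call with a fake output and context
ok = custom_check(" ".join(["word"] * 250), {"vars": {"user_input": "Hello world"}})
bad = custom_check("far too short", {"vars": {}})
print(ok["pass"], bad["pass"])  # prints: True False
```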
## LLM-as-Judge (llm-rubric)

```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness
      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override the grader model
```

Best practices:
- Provide clear scoring criteria
- Use `threshold` to set the minimum passing score
- The default grader uses available API keys (OpenAI → Anthropic → Google)
## Common Assertion Types

| Type | Usage |
|---|---|
| `contains` | Check substring |
| `icontains` | Case-insensitive substring check |
| `equals` | Exact match |
| `regex` | Pattern match |
| `python` | Custom logic |
| `llm-rubric` | LLM grading |
| `latency` | Response time |
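When several of these checks need to pass together with one combined score, they can be folded into a single `python` assertion. A hypothetical `combined_check` (the substrings and regex below are placeholders, not from the original project):

```python
import re

def combined_check(output: str, context: dict) -> dict:
    """Reproduce contains / icontains / regex checks as one custom assertion."""
    checks = {
        "contains": "expected text" in output,
        "icontains": "hello" in output.lower(),
        "regex": re.search(r"\d{4}-\d{2}-\d{2}", output) is not None,  # ISO date present
    }
    return {
        "pass": all(checks.values()),
        "score": sum(checks.values()) / len(checks),
        "reason": ", ".join(f"{name}={ok}" for name, ok in checks.items()),
    }
```

Reference it like any other custom assertion, e.g. `file://scripts/metrics.py:combined_check`.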
## File References

All paths are relative to the config file location:

```yaml
# Load file content as a variable
vars:
  content: file://data/input.txt

# Load a prompt from a file
prompts:
  - file://prompts/main.md

# Load test cases from a file
tests: file://tests/cases.yaml

# Load a Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```
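The same `file://` convention is easy to mimic in tooling that post-processes configs. A hypothetical helper (not part of Promptfoo) that resolves a `file://` value relative to the config directory:

```python
from pathlib import Path

def resolve_ref(value, config_dir: str):
    """Load the referenced file's text for file:// values; pass others through."""
    if isinstance(value, str) and value.startswith("file://"):
        return (Path(config_dir) / value[len("file://"):]).read_text()
    return value

# resolve_ref("file://data/input.txt", ".") returns the contents of ./data/input.txt
```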
## Running Evaluations
```bash
# Basic run
npx promptfoo@latest eval

# With a specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to a file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view
```
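Writing results with `--output results.json` makes runs easy to post-process. A minimal sketch of computing the pass rate, assuming the JSON contains a `results.results` list whose items carry a boolean `success` field; verify the exact shape against your Promptfoo version before relying on it:

```python
import json

def pass_rate(path: str) -> float:
    """Fraction of passing test results in an eval output file."""
    with open(path) as f:
        data = json.load(f)
    # Assumed shape: {"results": {"results": [{"success": true, ...}, ...]}}
    cases = data.get("results", {}).get("results", [])
    if not cases:
        return 0.0
    return sum(1 for c in cases if c.get("success")) / len(cases)
```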
## Troubleshooting
**Python not found:**

```bash
export PROMPTFOO_PYTHON=python3
```

**Large outputs truncated:** Outputs over 30,000 characters are truncated. Use `head_limit` in assertions.

**File not found errors:** Ensure paths are relative to the `promptfooconfig.yaml` location.

## Echo Provider (Preview Mode)
Use the echo provider to preview rendered prompts without making API calls:

```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns the prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```

**Use cases:**
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure

```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```

**Cost:** Free - no API tokens consumed.
## Advanced Few-Shot Implementation
### Multi-turn Conversation Pattern
For complex few-shot learning with full examples:

```json
[
  {"role": "system", "content": "{{system_prompt}}"},
  // Few-shot Example 1
  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},
  // Few-shot Example 2 (optional)
  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},
  // Actual test
  {"role": "user", "content": "Task: {{actual_input}}"}
]
```

Test case configuration:

```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```

Best practices:
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use the echo provider first to verify structure
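Because few-shot variables point at many files, a small pre-flight script can catch a missing file before a paid eval run. A hypothetical helper (`check_inputs` is not a Promptfoo feature; the paths are from the test case above):

```python
from pathlib import Path

def check_inputs(base: str, paths: list) -> list:
    """Return the referenced paths that are missing under the project base dir."""
    return [p for p in paths if not (Path(base) / p).exists()]

required = [
    "prompts/system.md",
    "data/examples/input1.txt",
    "data/examples/output1.txt",
    "data/examples/input2.txt",
    "data/examples/output2.txt",
    "data/test1.txt",
]

missing = check_inputs(".", required)
if missing:
    print("Missing files:", ", ".join(missing))
```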
## Long Text Handling
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```

Python assertion for text metrics:

```python
import re

def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)

def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')
    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))
    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0
    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```
## Real-World Example
Project: Chinese short-video content curation from long transcripts

Structure:

```
tiaogaoren/
├── promptfooconfig.yaml           # Production config
├── promptfooconfig-preview.yaml   # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json     # Chat format with few-shot
│   └── v4/system-v4.md            # System prompt
├── tests/cases.yaml               # 3 test samples
├── scripts/metrics.py             # Custom metrics (reduction ratio, etc.)
├── data/                          # 5 samples (2 few-shot, 3 eval)
└── results/
```

See `/Users/tiansheng/Workspace/prompts/tiaogaoren/` for the full implementation.

## Resources
For detailed API reference and advanced patterns, see `references/promptfoo_api.md`.