# LLM Evaluation & Testing

Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.
## Quick Start

```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

# Run security scan
npx promptfoo@latest redteam run
```

## Core Concepts
### Why Evaluate?

LLM outputs are non-deterministic. "It looks good" isn't testing. You need:

- Regression detection: Catch quality drops before production
- Security scanning: Find jailbreaks, injections, PII leaks
- A/B comparison: Compare prompts/models side-by-side
- CI/CD gates: Block bad changes from merging
### Evaluation Types

| Type | Purpose | Assertions |
|---|---|---|
| Functional | Does it work? | `contains`, `equals`, `is-json` |
| Semantic | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| Performance | Is it fast/cheap? | `cost`, `latency` |
| Security | Is it safe? | `moderation`, red team plugins |
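A single test can mix assertions from all four types. The sketch below combines one of each category; values and thresholds are illustrative, not recommendations:

```yaml
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains       # functional
        value: "Paris"
      - type: llm-rubric     # semantic
        value: "Correctly names the capital of France"
      - type: cost           # performance
        threshold: 0.01
      - type: latency        # performance
        threshold: 2000
      - type: moderation     # security
        value: violence
```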
## Configuration

### Basic promptfooconfig.yaml

```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01

  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```

### With Environment Variables
```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```
## Assertions Reference
### Basic Assertions

```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}" # Date pattern

  # JSON validation
  - type: is-json
  - type: is-valid-json-schema
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```
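When built-in string checks aren't enough, Promptfoo also supports custom code assertions. The sketch below assumes Promptfoo's Python assertion contract (a `get_assert(output, context)` function referenced from config via `type: python` and `value: file://...`); the file name and PII rule here are hypothetical, so check the Promptfoo docs for the exact signature in your version:

```python
# assert_no_pii.py - hypothetical custom assertion: fail if the output leaks an email address
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def get_assert(output: str, context: dict) -> dict:
    """Return a grading result: pass unless the output contains an email address."""
    leaked = EMAIL_RE.findall(output)
    return {
        "pass": not leaked,
        "score": 0.0 if leaked else 1.0,
        "reason": f"Found email(s): {leaked}" if leaked else "No email addresses found",
    }
```

A config would then reference it with `- type: python` and `value: file://assert_no_pii.py`.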
### Semantic Assertions
```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8 # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      1. Be factually accurate
      2. Be under 100 words
      3. Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```

### Performance Assertions
```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05 # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000 # Max 2 seconds
```

### Security Assertions
```yaml
assert:
  # Content moderation
  - type: moderation
    value: violence

  # PII detection
  - type: not-contains
    value: "{{email}}" # From test vars
```

## CI/CD Integration

### GitHub Action
```yaml
name: 'Prompt Evaluation'

on:
  pull_request:
    paths: ['prompts/**', 'src/**/*prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }} # Or other provider keys
```

### Quality Gates
```yaml
# promptfooconfig.yaml
evaluateOptions:
  # Fail if any assertion fails
  maxConcurrency: 5

# Or set pass threshold
threshold: 0.9 # 90% of tests must pass
```

Output to JSON (for custom CI):

```bash
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ "$(jq '.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```
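For richer gating than the jq one-liner (e.g. tolerating a small failure budget instead of failing on any miss), a short script can parse the same results.json. A sketch, assuming the output exposes failure/success counts under `stats.failures` and `stats.successes` as in the check above; verify the field names against your Promptfoo version:

```python
# check_eval.py - gate CI on a promptfoo results.json (field names assumed, see above)

def should_fail(results: dict, max_failure_rate: float = 0.0) -> bool:
    """Return True when the eval's failure rate exceeds the allowed budget."""
    stats = results.get("stats", {})
    failures = stats.get("failures", 0)
    successes = stats.get("successes", 0)
    total = failures + successes
    if total == 0:
        return True  # no tests ran: treat as a failure
    return failures / total > max_failure_rate

# In CI, load the file promptfoo wrote and exit non-zero on failure:
#   results = json.load(open("results.json"))
#   sys.exit(1 if should_fail(results, max_failure_rate=0.1) else 0)
sample = {"stats": {"failures": 0, "successes": 12}}
print(should_fail(sample))  # a clean run stays within any budget
```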
## Security Testing (Red Team)
### Quick Scan

```bash
# Run red team against your prompts
npx promptfoo@latest redteam run

# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```

### Configuration
```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
    - jailbreak
    - prompt-injection
  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```

### OWASP Top 10 Coverage
```yaml
redteam:
  plugins:
    # 1. Prompt Injection
    - prompt-injection
    # 2. Insecure Output Handling
    - harmful:privacy
    # 3. Training Data Poisoning (N/A for evals)
    # 4. Model Denial of Service
    - excessive-agency
    # 5. Supply Chain (N/A for evals)
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
    # 8. Excessive Agency
    - excessive-agency
    # 9. Overreliance (use factuality checks)
    # 10. Model Theft (N/A for evals)
```

## RAG Evaluation

### Context-Aware Testing
```yaml
prompts:
  - |
    Context: {{context}}
    Question: {{question}}

    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900" # Common hallucination
```
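Beyond string checks, Promptfoo also ships model-graded RAG assertions (`context-faithfulness`, `context-recall`, `context-relevance`). The fragment below is a sketch assuming these assertion types are available in your Promptfoo version and that the retrieved context is passed as a test var; thresholds are illustrative:

```yaml
tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      # Is the answer grounded in the retrieved context?
      - type: context-faithfulness
        threshold: 0.8
      # Does the context contain what's needed to answer?
      - type: context-recall
        value: "The Eiffel Tower was built in 1889"
        threshold: 0.8
      # Is the retrieved context relevant to the question?
      - type: context-relevance
        threshold: 0.8
```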
### Retrieval Quality
```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know" # Shouldn't punt on this query
```

## Comparing Models/Prompts

### A/B Testing
```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```

### Model Comparison
```yaml
# Compare multiple models
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
  - openrouter:google/gemini-flash-1.5
```

Run `npx promptfoo@latest eval`, then `npx promptfoo@latest view` to compare cost, latency, and quality side by side.

## Best Practices
### 1. Golden Test Cases

Maintain a set of critical test cases that must always pass:

```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
        options:
          critical: true # Fail entire suite if this fails
```

### 2. Regression Suite Structure
```
prompts/
├── production.txt   # Current production prompt
├── candidate.txt    # New prompt being tested
tests/
├── golden/          # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/      # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/        # Red team tests
    └── redteam.yaml
```

### 3. Test Categories
```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to" # Should refuse
```

## References
- references/promptfoo-guide.md - Detailed setup and configuration
- references/evaluation-metrics.md - Metrics deep dive
- references/ci-cd-integration.md - CI/CD patterns
- references/alternatives.md - Braintrust, DeepEval, LangSmith comparison
## Templates

Copy-paste ready templates:

- templates/promptfooconfig.yaml - Basic config
- templates/github-action-eval.yml - GitHub Action
- templates/regression-test-suite.yaml - Full regression suite