promptfoo-evals
Writing Promptfoo Evals
You produce maintainable promptfoo eval suites: clear test cases, deterministic
assertions where possible, model-graded only when needed.
See `references/cheatsheet.md` for the full assertion and provider reference.
For deep questions about promptfoo features, consult https://www.promptfoo.dev/llms-full.txt

Inputs (infer from repo context if not provided)
- What is being evaluated (prompt, agent, endpoint, RAG pipeline)?
- What are the inputs and outputs (text, JSON, multi-turn chat, tool calls)?
- What does "good" look like (acceptance criteria, failure modes)?
If context is insufficient, scaffold with TODO markers and starter tests.

Workflow

1. Find or create the eval suite
Search for existing configs: `promptfooconfig.yaml`, `promptfooconfig.yml`,
or any `promptfoo`/`evals` folder. Extend existing suites when possible.
For new suites, use this layout (unless the repo uses another convention):
```text
evals/<suite-name>/
  promptfooconfig.yaml
  prompts/
  tests/
```
Always add `# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json`
at the top of config files.
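The search step above could be automated with a short script. The following is only a sketch: the function name `find_eval_suites` and the skipped directories are assumptions for illustration, not part of promptfoo.

```python
from pathlib import Path

# File and folder names the search step looks for.
CONFIG_NAMES = ("promptfooconfig.yaml", "promptfooconfig.yml")
FOLDER_NAMES = ("promptfoo", "evals")
# Assumption: these directories are not worth scanning.
SKIP_DIRS = {"node_modules", ".git"}


def find_eval_suites(repo_root):
    """Return (config_files, eval_folders) found under repo_root, sorted."""
    root = Path(repo_root)
    configs, folders = [], []
    for path in root.rglob("*"):
        # Ignore anything that sits inside a skipped directory.
        if SKIP_DIRS.intersection(path.relative_to(root).parts):
            continue
        if path.is_file() and path.name in CONFIG_NAMES:
            configs.append(path)
        elif path.is_dir() and path.name in FOLDER_NAMES:
            folders.append(path)
    return sorted(configs), sorted(folders)
```

If either list is non-empty, extending what it points to is usually preferable to creating a new suite.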

2. Write prompts
- Put prompts in `prompts/*.txt` (plain) or `prompts/*.json` (chat format)
- Reference via `file://prompts/main.txt`
- Use `{{variable}}` for test inputs
- If the app builds prompts dynamically, use a JS/Python provider instead of duplicating logic
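For the last point, a Python provider might look roughly like the sketch below. promptfoo calls `call_api(prompt, options, context)` in the provider file and expects a dict with an `output` (or `error`) key. The file name `provider.py` and the helpers `build_prompt` and `run_model` are hypothetical stand-ins for the application's own code.

```python
# provider.py (hypothetical file name), referenced from the config as file://provider.py


def build_prompt(question, style="concise"):
    """Hypothetical stand-in for the app's own prompt-building function."""
    return f"Answer in a {style} style.\n\nQuestion: {question}"


def run_model(prompt):
    """Hypothetical stand-in for the app's model call; replace with the real one."""
    return f"[stub response to {len(prompt)} chars of prompt]"


def call_api(prompt, options, context):
    """Entry point promptfoo calls for a Python provider."""
    test_vars = context.get("vars", {}) or {}
    config = options.get("config", {}) or {}
    try:
        # Reuse the app's prompt builder instead of copying its logic into prompts/.
        full_prompt = build_prompt(
            test_vars.get("question", prompt),
            style=config.get("style", "concise"),
        )
        return {"output": run_model(full_prompt)}
    except Exception as exc:
        # Surface failures as provider errors rather than crashing the eval.
        return {"error": str(exc)}
```

Because the provider imports the real prompt builder, the eval exercises the same code path as production and cannot drift from it.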

3. Choose providers
Pick the simplest option that matches the real system:
| Scenario | Provider pattern |
|---|---|
| Compare models | |
| Test an HTTP API | |
| Test local code | |
| Echo/passthrough | |
Keep provider count small: 1 for regression, 2 for comparison.
For JSON output, add `response_format` to the provider config:
```yaml
config:
  temperature: 0
  response_format:
    type: json_object
```

4. Write tests
Use file-based tests so they scale:
```yaml
tests: file://tests/*.yaml
```
For larger suites, use dataset-backed tests:
```yaml
tests: file://tests.csv
```
or
```yaml
tests: file://generate_tests.py:create_tests
```
Every test should have:
- `description` - short, specific
- `vars` - the inputs
- `assert` - validations (when automatable)
Cover: happy paths, edge cases, known regressions, safety/refusal checks,
output format compliance.
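As an illustration of the `file://generate_tests.py:create_tests` form, the generator can be a plain function that returns a list of test-case dicts. This is a minimal sketch: the `CASES` data and the `text` variable are made up, and the optional `config` parameter is an assumption so the function works whether or not promptfoo passes a config argument.

```python
# generate_tests.py, referenced as file://generate_tests.py:create_tests

# Hypothetical cases; in practice these might come from a dataset or database.
CASES = [
    {"name": "happy path", "text": "The meeting moved to 3pm on Friday.", "must_contain": "Friday"},
    {"name": "edge case: empty input", "text": "", "must_contain": None},
]


def create_tests(config=None):
    """Return promptfoo test cases as plain dicts."""
    tests = []
    for case in CASES:
        asserts = []
        if case["must_contain"]:
            # Deterministic, case-insensitive substring check.
            asserts.append({"type": "icontains", "value": case["must_contain"]})
        tests.append(
            {
                "description": f"Summarize: {case['name']}",
                "vars": {"text": case["text"]},
                "assert": asserts,
            }
        )
    return tests
```

Each generated case carries the same three fields listed above, so generated and hand-written tests stay interchangeable.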

5. Add assertions
Deterministic first (fast, reliable, free):
`equals`, `contains`, `icontains`, `regex`, `is-json`, `contains-json`, `starts-with`,
`cost`, `latency`, `javascript`, `python`
Model-graded sparingly (slow, costs money, non-deterministic):
`llm-rubric`, `factuality`, `answer-relevance`, `context-faithfulness`
Assertions support optional `weight` (for scoring relative importance) and
`metric` (named score in reports). `threshold` is assertion-specific: for
graded assertions it is usually a minimum score (0-1), while for assertions
like `cost`/`latency` it is a maximum allowed value.
For model-graded assertions, explicitly set the grader provider so grading is
stable across runs:
```yaml
defaultTest:
  options:
    provider: openai:gpt-5-mini
tests:
  - description: 'Model-graded quality check'
    assert:
      - type: llm-rubric
        value: 'Accurate and concise'
        # Optional per-assertion override:
        # provider: anthropic:messages:claude-sonnet-4-6
```
Hallucination / faithfulness pattern:
When checking that output is grounded in source material, include the source in
the rubric so the grader can compare. Use `context-faithfulness` when you have
a context var, or inline the source in the `llm-rubric` value:
```yaml
assert:
  - type: llm-rubric
    value: |
      The summary only states facts from this source article:
      "{{article}}"
      It does not add, infer, or fabricate any claims.
```
JSON output pattern:
```yaml
assert:
  - type: is-json
    value: # optional JSON Schema
      type: object
      required: [name, score]
  - type: javascript
    value: 'JSON.parse(output).score >= 0.8'
```
Transform pattern (preprocess output before assertions):
When models wrap JSON in markdown fences or add preamble text, use
`options.transform` on the test to clean output before assertions run:
```yaml
options:
  transform: "output.replace(/```json\\n?|```/g, '').trim()"
```
Use `defaultTest` for assertions shared across all tests (cost limits, format
checks, etc.).
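When a deterministic check outgrows a one-line `javascript` expression, the `python` assertion type can load a function from a file. The sketch below combines the JSON and transform patterns above in one `get_assert(output, context)` function, returning a dict with `pass`, `score`, and `reason`. The 0.8 minimum mirrors the JSON example; the constant name `MIN_SCORE` is illustrative.

```python
import json
import re

MIN_SCORE = 0.8  # same minimum as the JSON output pattern
FENCE = "`" * 3  # markdown code fence, built here to avoid a literal fence in this file
# Matches an opening fence (optionally tagged json) or a closing fence.
FENCE_RE = re.compile(re.escape(FENCE) + r"(?:json)?\s*")


def _fail(reason):
    return {"pass": False, "score": 0.0, "reason": reason}


def get_assert(output, context):
    """Check that output is JSON with a name and a numeric score >= MIN_SCORE."""
    cleaned = FENCE_RE.sub("", output).strip()
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        return _fail(f"not valid JSON: {exc}")
    if not isinstance(data, dict) or "name" not in data or "score" not in data:
        return _fail("missing required keys: name, score")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return _fail("score is not a number")
    passed = score >= MIN_SCORE
    return {
        "pass": passed,
        "score": 1.0 if passed else 0.0,
        "reason": f"score={score}, minimum={MIN_SCORE}",
    }
```

It would be wired in with `type: python` and a `file://` value pointing at the file. The `reason` string shows up in reports, which makes failures easier to triage than a bare boolean.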

6. Validate and run
Before finishing, validate and provide run commands. Always use `--no-cache`
during development to avoid stale results. Only run eval if credentials are
available and safe to call.
```bash
npx promptfoo@latest validate config -c <config>
npx promptfoo@latest eval -c <config> -o output.json --no-cache --no-share
```
For CI/non-UI workflows, prefer the `-o output.json` option and inspect the
`success`, `score`, and `error` fields.
If working in the promptfoo repo itself, prefer the local build:
```bash
source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c <config>
npm run local -- eval -c <config> -o output.json --no-cache --no-share
```
Add `--env-file .env` only when the eval needs local credentials and that file
exists.
Do not run `npm run local -- view` unless explicitly asked.
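For the CI case, a small script can read `output.json` and collect the failing rows. This sketch assumes the per-test rows sit under `results.results` in the JSON file; that layout is an assumption to verify against the file your promptfoo version writes. The function name `summarize_results` is illustrative.

```python
import json


def summarize_results(payload):
    """Return (failures, total) from a parsed promptfoo JSON output.

    Assumption: per-test rows live under payload["results"]["results"].
    """
    rows = payload.get("results", {}).get("results", [])
    failures = []
    for row in rows:
        # A row fails if it reports an error or is not marked successful.
        if row.get("error") or not row.get("success", False):
            failures.append({"score": row.get("score"), "error": row.get("error")})
    return failures, len(rows)


def check_file(path="output.json"):
    """Print a short summary and return a CI-friendly exit code (0 or 1)."""
    with open(path, encoding="utf-8") as fh:
        failures, total = summarize_results(json.load(fh))
    print(f"{total - len(failures)}/{total} passed")
    for failure in failures:
        print(f"  FAIL score={failure['score']} error={failure['error']}")
    return 1 if failures else 0
```

A CI job could pass the return value of `check_file()` to `sys.exit` so the build fails whenever a test fails.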

Common mistakes
```yaml
# ❌ WRONG: shell-style env vars don't work in YAML configs
apiKey: $OPENAI_API_KEY

# ✅ CORRECT: use Nunjucks syntax with quotes
apiKey: '{{env.OPENAI_API_KEY}}'
```
```yaml
# ❌ WRONG: rubric references "the article" but the grader can't see it
- type: llm-rubric
  value: 'Only contains info from the original article'

# ✅ CORRECT: inline the source so the grader can compare
- type: llm-rubric
  value: |
    Only states facts from: "{{article}}"
```
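The first mistake is easy to catch mechanically before running an eval. The following is a rough sketch of such a check. The helper name `find_shell_style_vars` is hypothetical, and the pattern only looks for upper-case `$NAME` or `${NAME}` references outside comment lines.

```python
import re

# Matches $NAME or ${NAME} with an upper-case, shell-style variable name.
SHELL_VAR_RE = re.compile(r"\$\{?[A-Z_][A-Z0-9_]*\}?")


def find_shell_style_vars(config_text):
    """Return (line_number, line) pairs that contain shell-style env vars."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comments such as the $schema header
        if SHELL_VAR_RE.search(stripped):
            hits.append((number, stripped))
    return hits
```

Any hit is a candidate for rewriting in the quoted `'{{env.NAME}}'` form shown above.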

Output contract
When done, state:
- What the suite evaluates (1-3 bullets)
- Files created/modified (paths)
- How to run (copy-pastable commands)
- Required env vars
- TODOs left behind (only if unavoidable)