promptfoo-evals

Writing Promptfoo Evals

You produce maintainable promptfoo eval suites: clear test cases, deterministic assertions where possible, model-graded only when needed.
See `references/cheatsheet.md` for the full assertion and provider reference. For deep questions about promptfoo features, consult https://www.promptfoo.dev/llms-full.txt

Inputs (infer from repo context if not provided)

  • What is being evaluated (prompt, agent, endpoint, RAG pipeline)?
  • What are the inputs and outputs (text, JSON, multi-turn chat, tool calls)?
  • What does "good" look like (acceptance criteria, failure modes)?
If context is insufficient, scaffold with TODO markers and starter tests.

Workflow

1. Find or create the eval suite

Search for existing configs: `promptfooconfig.yaml`, `promptfooconfig.yml`, or any `promptfoo`/`evals` folder. Extend existing suites when possible.
For new suites, use this layout (unless the repo uses another convention):
text
evals/<suite-name>/
  promptfooconfig.yaml
  prompts/
  tests/
Always add `# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json` at the top of config files.
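The search step can be sketched as a small helper. This is a minimal sketch, not part of promptfoo: `find_eval_suites` and the skipped directory names are hypothetical and only illustrate looking for the config filenames and folder names listed above.

```python
from pathlib import Path

# Filenames and folder names taken from the search step above.
CONFIG_NAMES = {"promptfooconfig.yaml", "promptfooconfig.yml"}
SUITE_DIRS = {"promptfoo", "evals"}
# Assumption: directories that should never be scanned for suites.
SKIP_DIRS = {"node_modules", ".git"}


def find_eval_suites(repo_root: str) -> dict[str, list[str]]:
    """Return existing promptfoo configs and candidate suite folders under repo_root."""
    root = Path(repo_root)
    configs: list[str] = []
    folders: list[str] = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if any(part in SKIP_DIRS for part in rel.parts):
            continue
        if path.is_file() and path.name in CONFIG_NAMES:
            configs.append(rel.as_posix())
        elif path.is_dir() and path.name in SUITE_DIRS:
            folders.append(rel.as_posix())
    return {"configs": sorted(configs), "folders": sorted(folders)}
```

If `configs` is non-empty, extend the suite that owns the closest config; otherwise create the layout shown above.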

2. Write prompts

  • Put prompts in `prompts/*.txt` (plain) or `prompts/*.json` (chat format)
  • Reference via `file://prompts/main.txt`
  • Use `{{variable}}` for test inputs
  • If the app builds prompts dynamically, use a JS/Python provider instead of duplicating logic
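For the dynamic-prompt case, a Python provider might look like the sketch below. It assumes promptfoo's Python provider entry point `call_api(prompt, options, context)` returning a dict with `output` (or `error`). `build_prompt` and `run_model` are hypothetical stand-ins for the application's own prompt builder and model call, and the `question` / `documents` var names are assumptions.

```python
# provider.py (hypothetical), referenced from the config as file://provider.py


def build_prompt(question: str, documents: list[str]) -> str:
    """Stand-in for the app's real prompt builder; import the real one instead."""
    context_block = "\n".join(f"- {doc}" for doc in documents)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


def run_model(full_prompt: str) -> str:
    """Placeholder for the real model call; here it only echoes the prompt."""
    return f"[stub] {full_prompt}"


def call_api(prompt, options, context):
    """Build the prompt the same way the app does, then return the model output."""
    variables = (context or {}).get("vars") or {}
    question = variables.get("question", prompt)
    documents = variables.get("documents") or []
    try:
        full_prompt = build_prompt(question, documents)
        return {"output": run_model(full_prompt)}
    except Exception as exc:  # surface failures to the eval instead of crashing it
        return {"error": str(exc)}
```

Importing the application's builder directly keeps the eval exercising the same prompt logic that ships, rather than a copy that can drift.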

3. Choose providers

Pick the simplest option that matches the real system:

| Scenario | Provider pattern |
| --- | --- |
| Compare models | `openai:chat:gpt-4.1-mini`, `anthropic:messages:claude-sonnet-4-6` |
| Test an HTTP API | `id: https` with `config.url`, `config.body`, and `transformResponse` |
| Test local code | `file://provider.py` or `file://provider.js` |
| Echo/passthrough | `echo` (returns prompt as-is, useful for testing assertions) |

Keep provider count small: 1 for regression, 2 for comparison.
For JSON output, add `response_format` to the provider config:
yaml
config:
  temperature: 0
  response_format:
    type: json_object

4. Write tests

Use file-based tests so they scale:
yaml
tests: file://tests/*.yaml
For larger suites, use dataset-backed tests:
yaml
tests: file://tests.csv
# or
tests: file://generate_tests.py:create_tests

Every test should have:

- `description` - short, specific
- `vars` - the inputs
- `assert` - validations (when automatable)

Cover: happy paths, edge cases, known regressions, safety/refusal checks,
output format compliance.
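A generator file for `tests: file://generate_tests.py:create_tests` could look like this. It is a sketch under assumptions: the function returns a list of test-case dicts with the same keys as YAML tests, the optional `config` argument is accepted defensively, and the `question` var and the sample cases are invented for illustration.

```python
# generate_tests.py (hypothetical), referenced as file://generate_tests.py:create_tests

# (description, input question, substring the answer should contain)
CASES = [
    ("Happy path: refund question", "How do I get a refund?", "refund"),
    ("Edge case: empty input", "", "clarify"),
    ("Regression: order status wording", "Where is my order?", "order"),
]


def create_tests(config=None):
    """Build one test case per row, each with description, vars, and assert."""
    tests = []
    for description, question, expected in CASES:
        tests.append(
            {
                "description": description,
                "vars": {"question": question},
                # Deterministic check; icontains is case-insensitive.
                "assert": [{"type": "icontains", "value": expected}],
            }
        )
    return tests
```

Generating tests from a table keeps each case to one line, which makes it cheap to add a regression case whenever a failure is found.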

5. Add assertions

Deterministic first (fast, reliable, free): `equals`, `contains`, `icontains`, `regex`, `is-json`, `contains-json`, `starts-with`, `cost`, `latency`, `javascript`, `python`
Model-graded sparingly (slow, costs money, non-deterministic): `llm-rubric`, `factuality`, `answer-relevance`, `context-faithfulness`
Assertions support optional `weight` (for scoring relative importance) and `metric` (named score in reports). `threshold` is assertion-specific: for graded assertions it is usually a minimum score (0-1), while for assertions like `cost`/`latency` it is a maximum allowed value.
For model-graded assertions, explicitly set the grader provider so grading is stable across runs:
yaml
defaultTest:
  options:
    provider: openai:gpt-5-mini

tests:
  - description: 'Model-graded quality check'
    assert:
      - type: llm-rubric
        value: 'Accurate and concise'
        # Optional per-assertion override:
        # provider: anthropic:messages:claude-sonnet-4-6
Hallucination / faithfulness pattern: When checking that output is grounded in source material, include the source in the rubric so the grader can compare. Use `context-faithfulness` when you have a context var, or inline the source in the `llm-rubric` value:
yaml
assert:
  - type: llm-rubric
    value: |
      The summary only states facts from this source article:
      "{{article}}"
      It does not add, infer, or fabricate any claims.
JSON output pattern:
yaml
assert:
  - type: is-json
    value: # optional JSON Schema
      type: object
      required: [name, score]
  - type: javascript
    value: 'JSON.parse(output).score >= 0.8'
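The same JSON check can be written as a `python` assertion when the logic outgrows a one-liner. This sketch assumes the assertion is declared as `type: python` with `value: file://assert_score.py` (a hypothetical filename) and that promptfoo calls `get_assert(output, context)`, accepting a dict with `pass`, `score`, and `reason`.

```python
# assert_score.py (hypothetical)
import json

REQUIRED_KEYS = ("name", "score")
MIN_SCORE = 0.8  # same cutoff as the javascript assertion above


def _fail(reason: str) -> dict:
    return {"pass": False, "score": 0.0, "reason": reason}


def get_assert(output, context):
    """Pass only if output is a JSON object with name and a score >= MIN_SCORE."""
    try:
        data = json.loads(output)
    except (TypeError, ValueError) as exc:
        return _fail(f"Output is not valid JSON: {exc}")
    if not isinstance(data, dict):
        return _fail("Output is JSON but not an object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        return _fail(f"Missing required keys: {missing}")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return _fail("score is not a number")
    if score < MIN_SCORE:
        return _fail(f"score {score} is below {MIN_SCORE}")
    return {"pass": True, "score": 1.0, "reason": f"score {score} meets {MIN_SCORE}"}
```

Returning a `reason` makes failures readable in the report without opening the raw output.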
Transform pattern (preprocess output before assertions): When models wrap JSON in markdown fences or add preamble text, use `options.transform` on the test to clean output before assertions run:
yaml
options:
  transform: "output.replace(/```json\\n?|```/g, '').trim()"
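When the cleanup needs more than a one-line expression, the transform can live in a file. The sketch below assumes promptfoo accepts `transform: file://transform.py` (a hypothetical filename) and calls a function named `get_transform(output, context)`; verify that name against the promptfoo docs before relying on it. Unlike the inline regex, it also trims preamble text around an unfenced JSON object.

```python
# transform.py (hypothetical)
import re

FENCE = "`" * 3  # a markdown code fence, built here to avoid a literal fence
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def get_transform(output, context):
    """Strip markdown fences and surrounding chatter from JSON-ish model output."""
    if not isinstance(output, str):
        return output  # already structured; leave untouched
    match = FENCED_BLOCK.search(output)
    if match:
        return match.group(1).strip()
    # No fence: keep the span from the first "{" to the last "}".
    start, end = output.find("{"), output.rfind("}")
    if start != -1 and end > start:
        return output[start : end + 1]
    return output.strip()
```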
Use `defaultTest` for assertions shared across all tests (cost limits, format checks, etc.).

6. Validate and run

Before finishing, validate and provide run commands. Always use `--no-cache` during development to avoid stale results. Only run eval if credentials are available and safe to call.
bash
npx promptfoo@latest validate config -c <config>
npx promptfoo@latest eval -c <config> -o output.json --no-cache --no-share
For CI/non-UI workflows, prefer the `-o output.json` command and inspect `success`, `score`, and `error` fields.
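Inspecting those fields can be scripted. This is a sketch under an assumption about the file layout: it expects the per-test rows under `results.results` (falling back to a top-level `results` list), each carrying `success`, `score`, `error`, and optionally `testCase.description`. Check a real `output.json` from your promptfoo version before depending on it.

```python
import json


def summarize(report: dict) -> dict:
    """Collect rows that failed or errored from a parsed output.json."""
    rows = report.get("results", [])
    if isinstance(rows, dict):  # assumed nesting: {"results": {"results": [...]}}
        rows = rows.get("results", [])
    failures = []
    for index, row in enumerate(rows):
        if row.get("success") and not row.get("error"):
            continue
        description = (row.get("testCase") or {}).get("description") or f"row {index}"
        failures.append(
            {
                "description": description,
                "score": row.get("score"),
                "error": row.get("error"),
            }
        )
    return {"total": len(rows), "failed": len(failures), "failures": failures}


def load_and_summarize(path: str = "output.json") -> dict:
    """Read the eval output file written by -o and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize(json.load(handle))
```

In CI, a non-zero `failed` count from `load_and_summarize()` is a reasonable signal to fail the job.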
If working in the promptfoo repo itself, prefer the local build:
bash
source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c <config>
npm run local -- eval -c <config> -o output.json --no-cache --no-share
Add `--env-file .env` only when the eval needs local credentials and that file exists.
Do not run `npm run local -- view` unless explicitly asked.

Common mistakes

```yaml
# ❌ WRONG: shell-style env vars don't work in YAML configs
apiKey: $OPENAI_API_KEY

# ✅ CORRECT: use Nunjucks syntax with quotes
apiKey: '{{env.OPENAI_API_KEY}}'
```

```yaml
# ❌ WRONG: rubric references "the article" but grader can't see it
- type: llm-rubric
  value: 'Only contains info from the original article'

# ✅ CORRECT: inline the source so the grader can compare
- type: llm-rubric
  value: |
    Only states facts from: "{{article}}"
```

Output contract

When done, state:
  • What the suite evaluates (1-3 bullets)
  • Files created/modified (paths)
  • How to run (copy-pastable commands)
  • Required env vars
  • TODOs left behind (only if unavoidable)