promptfoo-evals

Writing Promptfoo Evals

You produce maintainable promptfoo eval suites: clear test cases, deterministic assertions where possible, model-graded only when needed.

See `references/cheatsheet.md` for the full assertion and provider reference. For deep questions about promptfoo features, consult https://www.promptfoo.dev/llms-full.txt

Inputs (infer from repo context if not provided)

- What is being evaluated (prompt, agent, endpoint, RAG pipeline)?
- What are the inputs and outputs (text, JSON, multi-turn chat, tool calls)?
- What does "good" look like (acceptance criteria, failure modes)?

If context is insufficient, scaffold with TODO markers and starter tests.

Workflow

1. Find or create the eval suite

Search for existing configs: `promptfooconfig.yaml`, `promptfooconfig.yml`, or any `promptfoo` or `evals` folder. Extend existing suites when possible.

For new suites, use this layout (unless the repo uses another convention):

text

evals/<suite-name>/
  promptfooconfig.yaml
  prompts/
  tests/

Always add `# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json` at the top of config files.

2. Write prompts

- Put prompts in `prompts/*.txt` (plain) or `prompts/*.json` (chat format)
- Reference via `file://prompts/main.txt`
- Use `{{variable}}` for test inputs
- If the app builds prompts dynamically, use a JS/Python provider instead of duplicating logic
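
For the dynamic-prompt case, a Python provider can call the app's own prompt builder so the eval exercises the same logic as production. The sketch below assumes the `call_api(prompt, options, context)` entry point that promptfoo's Python providers use; `build_prompt`, `run_model`, and the `question` and `tone` vars are hypothetical stand-ins, not part of promptfoo or of any real app.

```python
# provider.py: a minimal sketch of a promptfoo Python provider.
from typing import Any


def build_prompt(question: str, tone: str = "neutral") -> str:
    """Hypothetical stand-in for the app's own prompt-building logic."""
    return f"Answer in a {tone} tone.\n\nQuestion: {question}"


def run_model(prompt: str) -> str:
    """Hypothetical stand-in for the real model call; swap in the app's client."""
    return f"[stub completion for {len(prompt)} prompt characters]"


def call_api(prompt: str, options: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
    """Entry point called once per test case; test vars arrive in context['vars']."""
    test_vars = context.get("vars", {})
    try:
        final_prompt = build_prompt(
            question=test_vars.get("question", prompt),
            tone=test_vars.get("tone", "neutral"),
        )
        return {"output": run_model(final_prompt)}
    except Exception as exc:
        # Report the failure to promptfoo instead of crashing the run.
        return {"error": str(exc)}
```

A config would then point at it with `file://provider.py`, and the prompt file can stay a thin `{{question}}` passthrough.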

3. Choose providers

Pick the simplest option that matches the real system:

| Scenario | Provider pattern |
| --- | --- |
| Compare models | `openai:chat:gpt-4.1-mini`, `anthropic:messages:claude-sonnet-4-6` |
| Test an HTTP API | `id: https` with `config.url`, `config.body`, and `transformResponse` |
| Test local code | `file://provider.py` or `file://provider.js` |
| Echo/passthrough | `echo` (returns prompt as-is, useful for testing assertions) |

Keep provider count small: 1 for regression, 2 for comparison.

For JSON output, add `response_format` to the provider config:

yaml

config:
  temperature: 0
  response_format:
    type: json_object

4. Write tests

Use file-based tests so they scale:

tests: file://tests/*.yaml

For larger suites, use dataset-backed tests:

yaml

tests: file://tests.csv

or

tests: file://generate_tests.py:create_tests


Every test should have:

- `description` - short, specific
- `vars` - the inputs
- `assert` - validations (when automatable)

Cover: happy paths, edge cases, known regressions, safety/refusal checks,
output format compliance.
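
A generator referenced as `file://generate_tests.py:create_tests` might look like the sketch below. It assumes promptfoo accepts a list of test-case dicts from the named function; the `CASES` data and the `question` var are hypothetical, and the optional `config` parameter is there only so the function works whether or not a config argument is passed.

```python
# generate_tests.py: a sketch of a test generator returning promptfoo test cases.
from typing import Any, Optional

# Hypothetical sample data; a real suite would load this from a dataset or fixtures.
CASES = [
    {"name": "happy path: capital lookup", "question": "What is the capital of France?", "expected": "Paris"},
    {"name": "edge case: empty input", "question": "", "expected": None},
]


def create_tests(config: Optional[dict[str, Any]] = None) -> list[dict[str, Any]]:
    """Build one test per case with description, vars, and (when automatable) assert."""
    tests = []
    for case in CASES:
        test: dict[str, Any] = {
            "description": case["name"],
            "vars": {"question": case["question"]},
        }
        # Only attach an assertion when there is something deterministic to check.
        if case["expected"] is not None:
            test["assert"] = [{"type": "icontains", "value": case["expected"]}]
        tests.append(test)
    return tests
```

Cases without an automatable check are still emitted, so they show up in reports and can be given assertions later.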

5. Add assertions

Deterministic first (fast, reliable, free): `equals`, `contains`, `icontains`, `regex`, `is-json`, `contains-json`, `starts-with`, `cost`, `latency`, `javascript`, `python`

Model-graded sparingly (slow, costs money, non-deterministic): `llm-rubric`, `factuality`, `answer-relevance`, `context-faithfulness`

Assertions support optional `weight` (for scoring relative importance) and `metric` (named score in reports). `threshold` is assertion-specific: for graded assertions it is usually a minimum score (0-1), while for assertions like `cost` and `latency` it is a maximum allowed value.

For model-graded assertions, explicitly set the grader provider so grading is stable across runs:

yaml

defaultTest:
  options:
    provider: openai:gpt-5-mini

tests:
  - description: 'Model-graded quality check'
    assert:
      - type: llm-rubric
        value: 'Accurate and concise'
        # Optional per-assertion override:
        # provider: anthropic:messages:claude-sonnet-4-6

Hallucination / faithfulness pattern: When checking that output is grounded in source material, include the source in the rubric so the grader can compare. Use `context-faithfulness` when you have a context var, or inline the source in the `llm-rubric` value:

yaml

assert:
  - type: llm-rubric
    value: |
      The summary only states facts from this source article:
      "{{article}}"
      It does not add, infer, or fabricate any claims.

JSON output pattern:

yaml

assert:
  - type: is-json
    value: # optional JSON Schema
      type: object
      required: [name, score]
  - type: javascript
    value: 'JSON.parse(output).score >= 0.8'
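
When the JSON check grows past a one-liner, the same rule can live in a `python` assertion file. The sketch below assumes promptfoo's file-based Python assertions call `get_assert(output, context)` and accept a dict with `pass`, `score`, and `reason`; the file name `assert_score.py` and the `MIN_SCORE` constant are illustrative choices.

```python
# assert_score.py: a sketch of a file-based python assertion for the JSON pattern.
import json
from typing import Any

MIN_SCORE = 0.8  # same threshold as the inline javascript assertion


def _fail(reason: str) -> dict[str, Any]:
    return {"pass": False, "score": 0.0, "reason": reason}


def get_assert(output: str, context: dict[str, Any]) -> dict[str, Any]:
    """Check that output is a JSON object with name and a numeric score >= MIN_SCORE."""
    try:
        data = json.loads(output)
    except (TypeError, ValueError) as exc:
        return _fail(f"Output is not valid JSON: {exc}")
    if not isinstance(data, dict):
        return _fail("Output JSON is not an object")
    missing = [key for key in ("name", "score") if key not in data]
    if missing:
        return _fail(f"Missing required keys: {missing}")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return _fail("score is not a number")
    if score < MIN_SCORE:
        return _fail(f"score {score} is below {MIN_SCORE}")
    return {"pass": True, "score": 1.0, "reason": "Valid JSON with score at or above threshold"}
```

Returning a `reason` makes failures easier to triage in reports than a bare boolean.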

Transform pattern (preprocess output before assertions): When models wrap JSON in markdown fences or add preamble text, use `options.transform` on the test to clean output before assertions run:

yaml

options:
  transform: "output.replace(/```json\\n?|```/g, '').trim()"
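
The inline expression above strips fences only. For preamble text as well, a file-based transform is one option. The sketch below assumes promptfoo loads a Python transform via `file://transform.py` and calls a function named `get_transform(output, context)`; check that convention against the promptfoo docs before relying on it. The extraction strategy (take the first parseable JSON value) is an illustrative choice, not promptfoo behavior.

```python
# transform.py: a sketch that removes markdown fences and leading/trailing prose.
import json
import re
from typing import Any

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example
FENCE_RE = re.compile(re.escape(FENCE) + r"(?:json)?")


def get_transform(output: str, context: dict[str, Any]) -> str:
    """Return the first JSON value found in output, re-serialized; else the cleaned text."""
    cleaned = FENCE_RE.sub("", output).strip()
    decoder = json.JSONDecoder()
    for start, char in enumerate(cleaned):
        if char in "{[":
            try:
                value, _ = decoder.raw_decode(cleaned, start)
            except json.JSONDecodeError:
                continue  # not JSON at this position; keep scanning
            return json.dumps(value)
    return cleaned
```

Note that bracketed prose which happens to parse as JSON (for example `[1]` in a preamble) would be picked up first, so this works best when the model's preamble is plain sentences.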

Use `defaultTest` for assertions shared across all tests (cost limits, format checks, etc.).

6. Validate and run

Before finishing, validate and provide run commands. Always use `--no-cache` during development to avoid stale results. Only run eval if credentials are available and safe to call.

bash

npx promptfoo@latest validate config -c <config>
npx promptfoo@latest eval -c <config> -o output.json --no-cache --no-share

For CI/non-UI workflows, prefer writing results with the `-o output.json` option and inspect the `success`, `score`, and `error` fields of each result.
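
A small script can turn those fields into a pass/fail summary for CI. The sketch below assumes result rows sit under `results.results` (or directly under `results`) in the output file, each carrying `success`, `score`, and `error`; that layout is an assumption about the JSON export and may differ between promptfoo versions, so inspect a real `output.json` first.

```python
# summarize_results.py: a sketch for reading success, score, and error from output.json.
import json
from typing import Any


def extract_rows(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Find the per-test result rows, tolerating a nested or flat 'results' key."""
    results = report.get("results", [])
    if isinstance(results, dict):
        results = results.get("results", [])
    return [row for row in results if isinstance(row, dict)]


def summarize(report: dict[str, Any]) -> dict[str, Any]:
    """Count failures, average numeric scores, and collect error messages."""
    rows = extract_rows(report)
    scores = [row["score"] for row in rows if isinstance(row.get("score"), (int, float))]
    return {
        "total": len(rows),
        "failed": sum(1 for row in rows if not row.get("success")),
        "mean_score": sum(scores) / len(scores) if scores else None,
        "errors": [row["error"] for row in rows if row.get("error")],
    }


def summarize_file(path: str = "output.json") -> dict[str, Any]:
    """Load a results file written with -o and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize(json.load(handle))
```

In CI, calling `summarize_file()` and exiting non-zero when `failed` is greater than 0 gives a simple gate without opening the UI.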

If working in the promptfoo repo itself, prefer the local build:

bash

source ~/.nvm/nvm.sh && nvm use
npm run local -- validate config -c <config>
npm run local -- eval -c <config> -o output.json --no-cache --no-share

Add `--env-file .env` only when the eval needs local credentials and that file exists.

Do not run `npm run local -- view` unless explicitly asked.

Common mistakes

```yaml
# ❌ WRONG: shell-style env vars don't work in YAML configs
apiKey: $OPENAI_API_KEY

# ✅ CORRECT: use Nunjucks syntax with quotes
apiKey: '{{env.OPENAI_API_KEY}}'
```

```yaml
# ❌ WRONG: rubric references "the article" but grader can't see it
- type: llm-rubric
  value: 'Only contains info from the original article'

# ✅ CORRECT: inline the source so the grader can compare
- type: llm-rubric
  value: |
    Only states facts from: "{{article}}"
```

Output contract

When done, state:

- What the suite evaluates (1-3 bullets)
- Files created/modified (paths)
- How to run (copy-pastable commands)
- Required env vars
- TODOs left behind (only if unavoidable)
