llm-evaluation


LLM Evaluation & Testing

Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.

Quick Start


```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

# Run security scan
npx promptfoo@latest redteam run
```

Core Concepts


Why Evaluate?


LLM outputs are non-deterministic. "It looks good" isn't testing. You need:
  • Regression detection: Catch quality drops before production
  • Security scanning: Find jailbreaks, injections, PII leaks
  • A/B comparison: Compare prompts/models side-by-side
  • CI/CD gates: Block bad changes from merging

Evaluation Types


| Type | Purpose | Assertions |
|------|---------|------------|
| Functional | Does it work? | `contains`, `equals`, `is-json` |
| Semantic | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| Performance | Is it fast/cheap? | `cost`, `latency` |
| Security | Is it safe? | `redteam`, `moderation`, `pii-detection` |

Configuration

Basic promptfooconfig.yaml

```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01

  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```

With Environment Variables


```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```

Assertions Reference


Basic Assertions


```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}"  # Date pattern

  # JSON validation
  - type: is-json
  - type: is-valid-json-schema
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```
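A note on the `regex` assertion above: the backslashes are doubled because YAML double-quoted strings consume one level of escaping, so `"\\d{4}-\\d{2}-\\d{2}"` yields the regex `\d{4}-\d{2}-\d{2}`. A quick Python check of the resulting pattern (illustrative only, not part of Promptfoo):

```python
import re

# The pattern that remains after YAML unescaping the assertion value above
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

print(bool(pattern.search("Released on 2024-06-01")))  # contains an ISO-style date
print(bool(pattern.search("no date here")))            # no match
```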

Semantic Assertions


```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8  # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      1. Be factually accurate
      2. Be under 100 words
      3. Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```
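To build intuition for the `similar` threshold, here is a toy cosine-similarity computation over bag-of-words vectors. Promptfoo scores with real provider embeddings, so treat this as a sketch of the scoring idea, not the actual implementation:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    # Cosine similarity over simple bag-of-words counts
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[t] * wb[t] for t in wa)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb)

score = cosine("The capital of France is Paris", "Paris is the capital of France")
print(score >= 0.8)  # same words, different order: clears the 0.8 threshold
```

Real embedding models also capture paraphrases with different words, which this toy vectorizer cannot.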

Performance Assertions


```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05  # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000  # Max 2 seconds
```
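The `cost` check reduces to arithmetic over token counts and per-token prices. A sketch of that calculation (the rates below are placeholders, not current provider pricing):

```python
# Placeholder per-token rates in USD -- check your provider's price sheet
INPUT_RATE = 0.15 / 1_000_000
OUTPUT_RATE = 0.60 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Total request cost = prompt cost + completion cost
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

cost = request_cost(1200, 400)
print(cost < 0.05)  # $0.00042, well under the $0.05 budget above
```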

Security Assertions


```yaml
assert:
  # Content moderation
  - type: moderation
    value: violence

  # PII detection
  - type: not-contains
    value: "{{email}}"  # From test vars
```

CI/CD Integration


GitHub Action


```yaml
name: 'Prompt Evaluation'
on:
  pull_request:
    paths: ['prompts/**', 'src/**/*prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}  # Or other provider keys
```

Quality Gates


```yaml
# promptfooconfig.yaml
evaluateOptions:
  # Fail if any assertion fails
  maxConcurrency: 5
  # Or set pass threshold
  threshold: 0.9  # 90% of tests must pass
```
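The pass threshold reduces to a pass-rate comparison over the test suite. A sketch of the arithmetic (not Promptfoo internals):

```python
def gate(passed: int, total: int, threshold: float = 0.9) -> str:
    # A run passes when the fraction of passing tests meets the threshold
    return "PASS" if passed / total >= threshold else "FAIL"

print(gate(27, 30))  # 27/30 = 0.90, exactly at the 90% threshold
print(gate(26, 30))  # 26/30 is roughly 0.87, below threshold
```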

Output to JSON (for custom CI)

```bash
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ "$(jq '.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```
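If the CI image doesn't ship with jq, the same gate can be expressed in Python. The snippet assumes only that results.json carries a `.stats.failures` count, as in the jq check; the sample object below is made up for illustration:

```python
import json

# Stand-in for results.json produced by `promptfoo eval -o results.json`
sample = '{"stats": {"successes": 9, "failures": 1}}'

stats = json.loads(sample)["stats"]
if stats["failures"] > 0:
    print(f"Evaluation failed! {stats['failures']} failing test(s)")
```

In a real pipeline, read the file with `open("results.json")` and `sys.exit(1)` on failure instead of printing.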

Security Testing (Red Team)


Quick Scan


```bash
# Run red team against your prompts
npx promptfoo@latest redteam run

# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```

Configuration


```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
    - jailbreak
    - prompt-injection
  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```

OWASP Top 10 Coverage


```yaml
redteam:
  plugins:
    # 1. Prompt Injection
    - prompt-injection
    # 2. Insecure Output Handling
    - harmful:privacy
    # 3. Training Data Poisoning (N/A for evals)
    # 4. Model Denial of Service
    - excessive-agency
    # 5. Supply Chain (N/A for evals)
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
    # 8. Excessive Agency
    - excessive-agency
    # 9. Overreliance (use factuality checks)
    # 10. Model Theft (N/A for evals)
```

RAG Evaluation


Context-Aware Testing


```yaml
prompts:
  - |
    Context: {{context}}
    Question: {{question}}
    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900"  # Common hallucination
```

Retrieval Quality


```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know"  # Shouldn't punt on this query
```

Comparing Models/Prompts


A/B Testing


```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```

Model Comparison


```yaml
# Compare multiple models
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
  - openrouter:google/gemini-flash-1.5

# Run:  npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
```

Best Practices


1. Golden Test Cases


Maintain a set of critical test cases that must always pass:

```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
        options:
          critical: true  # Fail entire suite if this fails
```

2. Regression Suite Structure


```
prompts/
├── production.txt          # Current production prompt
├── candidate.txt           # New prompt being tested
tests/
├── golden/                 # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/             # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/               # Red team tests
    └── redteam.yaml
```

3. Test Categories


```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to"  # Should refuse
```

References


  • references/promptfoo-guide.md - Detailed setup and configuration
  • references/evaluation-metrics.md - Metrics deep dive
  • references/ci-cd-integration.md - CI/CD patterns
  • references/alternatives.md - Braintrust, DeepEval, LangSmith comparison

Templates


Copy-paste ready templates:
  • templates/promptfooconfig.yaml - Basic config
  • templates/github-action-eval.yml - GitHub Action
  • templates/regression-test-suite.yaml - Full regression suite