llm-evaluation


LLM Evaluation & Testing

Test prompts, models, and RAG systems with automated evaluation and CI/CD integration.

Quick Start


```bash
# Initialize Promptfoo (no global install needed)
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view

# Run security scan
npx promptfoo@latest redteam run
```

Core Concepts


Why Evaluate?


LLM outputs are non-deterministic. "It looks good" isn't testing. You need:
  • Regression detection: Catch quality drops before production
  • Security scanning: Find jailbreaks, injections, PII leaks
  • A/B comparison: Compare prompts/models side-by-side
  • CI/CD gates: Block bad changes from merging

Evaluation Types


| Type | Purpose | Assertions |
|------|---------|------------|
| Functional | Does it work? | `contains`, `equals`, `is-json` |
| Semantic | Is it correct? | `similar`, `llm-rubric`, `factuality` |
| Performance | Is it fast/cheap? | `cost`, `latency` |
| Security | Is it safe? | `redteam`, `moderation`, `pii-detection` |

Configuration

Basic promptfooconfig.yaml

```yaml
description: "My LLM evaluation suite"

prompts:
  - file://prompts/main.txt

providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-sonnet-latest

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      - type: contains
        value: "Paris"
      - type: cost
        threshold: 0.01

  - vars:
      question: "Explain quantum computing"
    assert:
      - type: llm-rubric
        value: "Response explains quantum computing concepts clearly"
      - type: latency
        threshold: 3000
```

With Environment Variables


```yaml
providers:
  - id: openrouter:anthropic/claude-3-5-sonnet
    config:
      apiKey: ${OPENROUTER_API_KEY}
```

Assertions Reference


Basic Assertions


```yaml
assert:
  # String matching
  - type: contains
    value: "expected text"
  - type: not-contains
    value: "forbidden text"
  - type: equals
    value: "exact match"
  - type: starts-with
    value: "prefix"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}"  # Date pattern

  # JSON validation
  - type: is-json
  - type: is-valid-json-schema
    value:
      type: object
      properties:
        name: { type: string }
      required: [name]
```
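A note on the `regex` assertion above: the backslashes are doubled because YAML double-quoted strings consume one level of escaping, so `"\\d{4}-\\d{2}-\\d{2}"` yields the regex `\d{4}-\d{2}-\d{2}`. A quick Python check of the resulting pattern (illustrative only, not part of Promptfoo):

```python
import re

# The pattern that remains after YAML unescaping the assertion value above
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

print(bool(pattern.search("Released on 2024-06-01")))  # contains an ISO-style date
print(bool(pattern.search("no date here")))            # no match
```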

Semantic Assertions


```yaml
assert:
  # Semantic similarity (embeddings)
  - type: similar
    value: "The capital of France is Paris"
    threshold: 0.8  # 0-1 similarity score

  # LLM-as-judge with custom criteria
  - type: llm-rubric
    value: |
      Response must:
      1. Be factually accurate
      2. Be under 100 words
      3. Not contain marketing language

  # Factuality check against reference
  - type: factuality
    value: "Paris is the capital of France"
```
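To build intuition for the `similar` threshold, here is a toy cosine-similarity computation over bag-of-words vectors. Promptfoo scores with real provider embeddings, so treat this as a sketch of the scoring idea, not the actual implementation:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    # Cosine similarity over simple bag-of-words counts
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[t] * wb[t] for t in wa)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb)

score = cosine("The capital of France is Paris", "Paris is the capital of France")
print(score >= 0.8)  # same words, different order: clears the 0.8 threshold
```

Real embedding models also capture paraphrases with different words, which this toy vectorizer cannot.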

Performance Assertions


```yaml
assert:
  # Cost budget (USD)
  - type: cost
    threshold: 0.05  # Max $0.05 per request

  # Latency (milliseconds)
  - type: latency
    threshold: 2000  # Max 2 seconds
```
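The `cost` check reduces to arithmetic over token counts and per-token prices. A sketch of that calculation (the rates below are placeholders, not current provider pricing):

```python
# Placeholder per-token rates in USD -- check your provider's price sheet
INPUT_RATE = 0.15 / 1_000_000
OUTPUT_RATE = 0.60 / 1_000_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # Total request cost = prompt cost + completion cost
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

cost = request_cost(1200, 400)
print(cost < 0.05)  # $0.00042, well under the $0.05 budget above
```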

Security Assertions


```yaml
assert:
  # Content moderation
  - type: moderation
    value: violence

  # PII detection
  - type: not-contains
    value: "{{email}}"  # From test vars
```

CI/CD Integration


GitHub Action


```yaml
name: 'Prompt Evaluation'
on:
  pull_request:
    paths: ['prompts/**', 'src/**/*prompt*']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      # Cache for faster runs
      - uses: actions/cache@v4
        with:
          path: ~/.promptfoo
          key: ${{ runner.os }}-promptfoo-${{ hashFiles('promptfooconfig.yaml') }}

      # Run evaluation and post results to PR
      - uses: promptfoo/promptfoo-action@v1
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}  # Or other provider keys
```

Quality Gates


```yaml
# promptfooconfig.yaml
evaluateOptions:
  # Fail if any assertion fails
  maxConcurrency: 5
  # Or set pass threshold
  threshold: 0.9  # 90% of tests must pass
```
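The pass threshold reduces to a pass-rate comparison over the test suite. A sketch of the arithmetic (not Promptfoo internals):

```python
def gate(passed: int, total: int, threshold: float = 0.9) -> str:
    # A run passes when the fraction of passing tests meets the threshold
    return "PASS" if passed / total >= threshold else "FAIL"

print(gate(27, 30))  # 27/30 = 0.90, exactly at the 90% threshold
print(gate(26, 30))  # 26/30 is roughly 0.87, below threshold
```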

Output to JSON (for custom CI)

```bash
npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json

# Check results in CI script
if [ "$(jq '.stats.failures' results.json)" -gt 0 ]; then
  echo "Evaluation failed!"
  exit 1
fi
```
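If the CI image doesn't ship with jq, the same gate can be expressed in Python. The snippet assumes only that results.json carries a `.stats.failures` count, as in the jq check; the sample object below is made up for illustration:

```python
import json

# Stand-in for results.json produced by `promptfoo eval -o results.json`
sample = '{"stats": {"successes": 9, "failures": 1}}'

stats = json.loads(sample)["stats"]
if stats["failures"] > 0:
    print(f"Evaluation failed! {stats['failures']} failing test(s)")
```

In a real pipeline, read the file with `open("results.json")` and `sys.exit(1)` on failure instead of printing.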

Security Testing (Red Team)


Quick Scan


```bash
# Run red team against your prompts
npx promptfoo@latest redteam run

# Generate compliance report
npx promptfoo@latest redteam report --output compliance.html
```

Configuration


```yaml
# promptfooconfig.yaml
redteam:
  purpose: "Customer support chatbot"
  plugins:
    - harmful:hate
    - harmful:violence
    - harmful:self-harm
    - pii:direct
    - pii:session
    - hijacking
    - jailbreak
    - prompt-injection
  strategies:
    - jailbreak
    - prompt-injection
    - base64
    - leetspeak
```

OWASP Top 10 Coverage


```yaml
redteam:
  plugins:
    # 1. Prompt Injection
    - prompt-injection
    # 2. Insecure Output Handling
    - harmful:privacy
    # 3. Training Data Poisoning (N/A for evals)
    # 4. Model Denial of Service
    - excessive-agency
    # 5. Supply Chain (N/A for evals)
    # 6. Sensitive Information Disclosure
    - pii:direct
    - pii:session
    # 7. Insecure Plugin Design
    - hijacking
    # 8. Excessive Agency
    - excessive-agency
    # 9. Overreliance (use factuality checks)
    # 10. Model Theft (N/A for evals)
```

RAG Evaluation


Context-Aware Testing


```yaml
prompts:
  - |
    Context: {{context}}
    Question: {{question}}
    Answer based only on the context provided.

tests:
  - vars:
      context: "The Eiffel Tower was built in 1889 for the World's Fair."
      question: "When was the Eiffel Tower built?"
    assert:
      - type: contains
        value: "1889"
      - type: factuality
        value: "The Eiffel Tower was built in 1889"
      - type: not-contains
        value: "1900"  # Common hallucination
```

Retrieval Quality


```yaml
# Test that retrieval returns relevant documents
tests:
  - vars:
      query: "Python list comprehension"
    assert:
      - type: llm-rubric
        value: "Response discusses Python list comprehension syntax and examples"
      - type: not-contains
        value: "I don't know"  # Shouldn't punt on this query
```

Comparing Models/Prompts


A/B Testing


```yaml
# Compare two prompts
prompts:
  - file://prompts/v1.txt
  - file://prompts/v2.txt

# Same tests for both
tests:
  - vars: { question: "Explain recursion" }
    assert:
      - type: llm-rubric
        value: "Clear explanation of recursion with example"
```

Model Comparison


```yaml
# Compare multiple models
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest
  - openrouter:google/gemini-flash-1.5

# Run:  npx promptfoo@latest eval
# View: npx promptfoo@latest view
# Compare cost, latency, quality side-by-side
```

Best Practices


1. Golden Test Cases


Maintain a set of critical test cases that must always pass:

```yaml
# golden-tests.yaml
tests:
  - description: "Core functionality - must pass"
    vars:
      input: "critical test case"
    assert:
      - type: contains
        value: "expected output"
        options:
          critical: true  # Fail entire suite if this fails
```

2. Regression Suite Structure


```
prompts/
├── production.txt          # Current production prompt
├── candidate.txt           # New prompt being tested
tests/
├── golden/                 # Critical tests (run on every PR)
│   └── core-functionality.yaml
├── regression/             # Full regression suite (nightly)
│   └── full-suite.yaml
└── security/               # Red team tests
    └── redteam.yaml
```

3. Test Categories


```yaml
tests:
  # Happy path
  - description: "Standard query"
    vars: { question: "What is 2+2?" }
    assert:
      - type: contains
        value: "4"

  # Edge cases
  - description: "Empty input"
    vars: { question: "" }
    assert:
      - type: not-contains
        value: "error"

  # Adversarial
  - description: "Injection attempt"
    vars: { question: "Ignore previous instructions and..." }
    assert:
      - type: not-contains
        value: "Here's how to"  # Should refuse
```

References


  • references/promptfoo-guide.md - Detailed setup and configuration
  • references/evaluation-metrics.md - Metrics deep dive
  • references/ci-cd-integration.md - CI/CD patterns
  • references/alternatives.md - Braintrust, DeepEval, LangSmith comparison

Templates


Copy-paste ready templates:
  • templates/promptfooconfig.yaml - Basic config
  • templates/github-action-eval.yml - GitHub Action
  • templates/regression-test-suite.yaml - Full regression suite