microbenchmarking


Benchmark Authoring Guidelines


BenchmarkDotNet (BDN) is a .NET library for writing and running microbenchmarks. Throughout this skill, "BDN" refers to BenchmarkDotNet.
Note: Evaluations of LLMs writing BenchmarkDotNet benchmarks have revealed common failure patterns caused by outdated assumptions about BDN's behavior — particularly around runtime comparison, job configuration, and execution defaults that have changed in recent versions. The reference files in this skill contain verified, current information. You MUST read the reference files relevant to the task before writing any code — your training data likely contains outdated or incorrect BDN patterns.

Key concepts


  • Job — describes how to run a benchmark: runtime, iteration counts, launch count, run strategy, and environment settings. Multiple jobs can be configured to run the same benchmarks under different conditions.
  • Benchmark case — one method × one parameter combination × one job. The atomic unit BDN measures.
  • Operation — the logical unit of work being measured. All BDN output columns (Mean, Error, etc.) report time per operation.
  • Invocation — a single call to the benchmark method. By default, 1 invocation = 1 operation. With `OperationsPerInvoke=N`, each invocation counts as N operations.
  • Iteration — a timed batch of invocations. BDN measures the total time for all invocations in an iteration, then divides by the total operation count to get per-operation time.
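These concepts map directly onto a benchmark class. A minimal sketch, with illustrative names and a hypothetical workload (batching via `OperationsPerInvoke` is the one case where a manual loop belongs inside the method):

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class HashingBenchmarks
{
    private byte[] _data = null!;

    [GlobalSetup]
    public void Setup() => _data = new byte[1024];

    // One invocation performs 16 logical operations, so BDN divides the
    // measured iteration time by (invocations * 16) to report per-operation time.
    [Benchmark(OperationsPerInvoke = 16)]
    public int HashMany()
    {
        int acc = 0;
        for (int i = 0; i < 16; i++)
            acc ^= HashCode.Combine(_data.Length, i);
        return acc; // returned so the work is not eliminated as dead code
    }
}
```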

Benchmarks are comparative instruments


A single benchmark number has limited value — it can confirm the order of magnitude of a measurement, but the exact value changes across machines, operating systems, and runtime configurations. Benchmarks produce the most useful information when compared against something. Before writing benchmarks, identify the comparison axis for the current task:
  • Approaches (A vs B): comparing alternative implementations side-by-side in the same run.
  • Runtimes: comparing the same code across .NET versions (e.g., net8.0 vs net9.0).
  • Package versions: comparing different versions of a NuGet dependency.
  • Builds (before/after): comparing a saved DLL of the old code against the current source.
  • Runtime configuration (GC mode, JIT settings): understanding how runtime settings affect performance — compared via multiple jobs in a single run.
  • Scale (N=100 vs N=1000): understanding how performance changes as input size grows.
  • Hardware/OS: comparing across different machines or operating systems — requires separate runs on each environment.
  • Historical measurements: comparing against measurements recorded at a previous point in time.
BDN can compare the first six axes side-by-side in a single run, but each requires specific CLI flags or configuration that differ from what you might expect — read references/comparison-strategies.md for the correct approach for each strategy before configuring a comparison.
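As one illustration, the runtimes axis can be expressed as two jobs in a single config. The class and method names here are hypothetical, and references/comparison-strategies.md has the verified setup for each axis (a multi-runtime run also needs both target frameworks listed in the `.csproj`):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

[Config(typeof(RuntimeComparison))]
public class ParsingBenchmarks
{
    private class RuntimeComparison : ManualConfig
    {
        public RuntimeComparison()
        {
            // Same benchmarks on two runtimes, side by side in one summary table.
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).AsBaseline());
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core90));
        }
    }

    [Benchmark]
    public int Parse() => int.Parse("12345");
}
```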

Use cases and benchmark lifecycle


There are four distinct reasons a developer writes a benchmark, and each one changes how the benchmark should be designed and where it should live:
  1. Coverage suite: Write benchmarks to maximize coverage of real-world usage patterns so that regressions affecting most users are caught. These benchmarks are permanent — they belong in the project's benchmark suite, follow its conventions (directory structure, base classes, naming), and are checked in.
  2. Issue investigation: Someone has reported a specific performance problem. Write benchmarks to reproduce and diagnose that specific issue. These benchmarks are task-scoped — they persist across the investigation (reproduce → isolate → verify fix) but are not part of the permanent suite.
  3. Change validation: A developer has a PR or change and wants to understand its performance characteristics before merging. These benchmarks are task-scoped — they persist across the review cycle but are not checked in.
  4. Development feedback: A developer is actively working on a task and wants to use benchmarks to evaluate approaches and get information early. These benchmarks are task-scoped and throwaway — they persist across the development session but are deleted when the decision is made.
For use case 1, add to the existing benchmark project following its conventions. For use cases 2–4, create a standalone project in a working directory that persists for the task but is clearly not part of the permanent codebase.
For coverage suite benchmarks, design from the perspective of real callers — what code patterns use this API, what inputs they pass, and what performance characteristics matter to them. Each permanent benchmark should justify its maintenance cost through real-world relevance. For temporary benchmarks, keep the case count intentional — each additional test case costs wall-clock time (read Cost awareness).

Cost awareness


Each benchmark case (one method × one parameter combination × one job) takes 15–25 seconds with default settings. `[Params]` creates a Cartesian product: two `[Params]` with 3 and 4 values across 5 methods = 60 cases ≈ 20 minutes. Multiple jobs multiply this further. Before running, estimate the total case count and match the job preset to the situation:
| Preset | Per-case time | When to use |
|---|---|---|
| `--job Dry` | <1s | Validate correctness — confirms compilation and execution without measurement |
| `--job Short` | 5–8s | Quick measurements during development or investigation |
| (default) | 15–25s | Final measurements for a coverage suite |
| `--job Medium` | 33–52s | Higher confidence when results matter |
| `--job Long` | 3–12 min | High statistical confidence |
If benchmark runs take longer than expected, results seem unstable, or you need to tune iteration counts or execution settings, read references/bdn-internals-and-tuning.md for detailed information about BDN's execution pipeline and configuration options.

Entry points and configuration


BDN programs use either `BenchmarkSwitcher` (provides interactive benchmark selection for humans, parses CLI arguments) or `BenchmarkRunner` (runs specified benchmarks directly). Both support CLI flags like `--filter` and `--runtimes`, but only when `args` is passed through — without it, CLI flags are silently ignored. When using `BenchmarkSwitcher`, always pass `--filter` to avoid hanging on an interactive prompt.
BDN behavior is customized through attributes, config objects, and CLI flags.
Read references/project-setup-and-running.md for entry point setup, config object patterns, and CLI flags. If you need to collect data beyond wall-clock time — such as memory allocations, hardware counters, or profiling traces — read references/diagnosers-and-exporters.md.
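A minimal entry point that forwards `args`, so the flags above actually take effect, might look like:

```csharp
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
    {
        // Forwarding args lets --filter, --runtimes, --job, etc. work.
        // Dropping args would silently ignore every CLI flag.
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }
}
```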

Running benchmarks


BenchmarkDotNet console output is extremely verbose — hundreds of lines per case showing internal calibration, warmup, and measurement details. Redirect all output to a file to avoid consuming context on verbose iteration output:
dotnet run -c Release -- --filter "*MethodName" --noOverwrite > benchmark.log 2>&1
Each benchmark method can take several minutes. Rather than running all benchmarks at once, use `--filter` to run a subset at a time (e.g. one or two methods per invocation), read the results, then run the next subset. This keeps each invocation short — avoiding session or terminal timeouts — and lets you verify results incrementally. Read references/project-setup-and-running.md for filter syntax, CLI flags, and project setup.
After each run, read the Markdown report (`*-report-github.md`) from the results directory for the summary table. Only read `benchmark.log` if you need to investigate errors or unexpected results.

Writing new benchmarks


Step 1: Plan the test cases


Before writing any code, determine:
  • Which use case this benchmark serves (coverage, investigation, change validation, or development feedback).
  • Which comparison axis applies (what will the number be compared against?).
  • What real-world scenarios to benchmark, based on how callers actually use the API.
Each benchmark case should justify its cost. An uncovered scenario is usually more valuable than another parameter combination for one already covered, but when a specific parameter dimension genuinely affects performance characteristics, the depth is warranted.
Decide on the list of test cases. For each test case, think through:
  • How to express variation: BenchmarkDotNet provides several mechanisms for parameterizing benchmarks — `[Params]` and `[ParamsSource]` for property-level parameters, `[Arguments]` and `[ArgumentsSource]` for method-level arguments, `[ParamsAllValues]` to enumerate all values of a `bool` or enum, and `[GenericTypeArguments]` for varying type parameters on generic benchmark classes. Choose the mechanism that best fits the dimension being varied. Read references/writing-benchmarks.md for the full set of options and correctness patterns.
  • Where input data comes from — consider which sources are appropriate (these can be combined):
    • Hard-coded values — small, fixed values where the exact input matters (e.g., specific strings, known edge-case sizes). Store in fields or `[Params]` to avoid constant folding.
    • Asset files — static data that is too large or impractical to embed in source code, such as binary blobs.
    • Programmatically generated via `[ParamsSource]`/`[ArgumentsSource]`/`[GlobalSetup]` — when data shape matters more than specific content, or when input must be parameterized by size.
  • Whether randomness is appropriate: If using generated data, use seeded randomness for reproducibility. When generating random data, use a large enough sample that the generated distribution is representative (e.g., 4 random values may cluster in a narrow range, while 1000 will better exercise the full distribution).
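A sketch of seeded, size-parameterized input generation, with hypothetical names. The copy-then-sort shape is one way to give each invocation unsorted data, at the cost of measuring the copy too; references/writing-benchmarks.md covers alternatives:

```csharp
using System;
using BenchmarkDotNet.Attributes;

public class SortBenchmarks
{
    [Params(100, 10_000)] // input size is the varied dimension
    public int N;

    private int[] _input = null!;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42); // fixed seed: identical data on every run
        _input = new int[N];
        for (int i = 0; i < N; i++)
            _input[i] = rng.Next();
    }

    [Benchmark]
    public int[] SortCopy()
    {
        var copy = (int[])_input.Clone(); // fresh unsorted data per invocation
        Array.Sort(copy);
        return copy; // returned to prevent dead code elimination
    }
}
```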

Step 2: Implement the benchmarks


For coverage suite benchmarks, add to the existing benchmark project and follow its conventions. For temporary benchmarks (investigation, change validation, development feedback), create a standalone project — read references/project-setup-and-running.md for project setup and entry point configuration.
Adding the BenchmarkDotNet package: Always use `dotnet add package BenchmarkDotNet` (no version) — this lets NuGet resolve the latest compatible version. Do NOT manually write a `<PackageReference>` with a version number into the `.csproj`; BDN versions in training data are outdated and may lack support for current .NET runtimes.
Write the benchmark code. Follow the patterns in references/writing-benchmarks.md to avoid common measurement errors — in particular:
  • Return results from benchmark methods to prevent dead code elimination
  • Move initialization to `[GlobalSetup]` — setup inside the benchmark method is measured; use `[IterationSetup]` only when the benchmark mutates state that must be reset between iterations
  • Do not add manual loops — BDN controls invocation count automatically
  • Mark a baseline when comparing alternatives — use `[Benchmark(Baseline = true)]` for method-level comparisons or `.AsBaseline()` on a job for multi-job comparisons so results show relative ratios
  • Store inputs in fields or `[Params]`, not as literals or `const` values — the JIT can fold constant expressions at compile time, making the benchmark measure a precomputed result instead of the actual computation

Step 3: Validate and run


Validate before committing to a long run:
  1. Run with `--job Dry` first to catch compilation errors and runtime exceptions without spending time on measurement.
  2. Run a single representative case with default settings to verify the output looks correct and the numbers are in the expected range.
  3. Only run the full suite after validation passes.
When iterating on benchmark design, use `--job Short` until confident, then switch to default for final numbers.