pgo

PGO (Profile-Guided Optimisation)

Purpose

Guide agents through the full PGO workflow: instrument build → representative workload → collect profile → optimised build, covering both GCC and Clang, plus BOLT for post-link optimisation.

Triggers

- "How do I use PGO to speed up my binary?"
- "What is profile-guided optimization and when should I use it?"
- "How do I use `-fprofile-generate` and `-fprofile-use`?"
- "My `-O3` build isn't fast enough. What next?"
- "How does BOLT differ from PGO?"
- "How do I collect representative profile data?"

Workflow

1. When to use PGO

```text
Is -O3 -march=native already applied?
  no  → apply standard optimisation first
  yes → is the workload branch-heavy, or does it have irregular call patterns?
          yes → PGO will likely help (roughly 5-30%)
          no  → PGO may not help; profile first with linux-perf
```

PGO helps most with:

- Large binaries with many cold/hot code paths (compilers, databases, servers)
- Branch-heavy code where static prediction is wrong
- Function call-heavy code where inlining decisions improve with profile data
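
One rough way to judge whether static prediction is going wrong is the branch miss rate from `perf stat -e branches,branch-misses`. The helper below is a minimal sketch that turns the two counters into a percentage. The function name `branch_miss_rate` and the counter values are hypothetical, and what counts as a "high" rate depends on the workload.

```shell
#!/bin/sh
# Hypothetical helper: percentage of executed branches that were mispredicted.
# Usage: branch_miss_rate <branches> <branch-misses>
branch_miss_rate() {
    awk -v total="$1" -v missed="$2" 'BEGIN {
        if (total <= 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * missed / total
    }'
}

# Made-up counter values, as they might be copied from:
#   perf stat -e branches,branch-misses ./prog < workload.input
branch_miss_rate 2000000 90000   # prints 4.50
```

A rate that stays low across workloads suggests the branches are already well predicted, so layout and inlining gains are the more likely benefit from PGO.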

2. GCC PGO workflow

```bash
# Step 1: Build with instrumentation
gcc -O2 -fprofile-generate -fprofile-dir=./pgo-data \
    prog.c -o prog_instr

# Step 2: Run with representative workload(s)
./prog_instr < workload1.input
./prog_instr < workload2.input
# Generates .gcda files in ./pgo-data/

# Step 3: Build optimised binary using profile
gcc -O2 -fprofile-use -fprofile-dir=./pgo-data \
    -fprofile-correction \
    prog.c -o prog_pgo
```

`-fprofile-correction`: smooths out profile count inconsistencies, which typically come from missed counter updates in multithreaded or otherwise nondeterministic runs. Without it, GCC reports an error when it detects an inconsistent profile, so it is worth including by default.

Recent GCC releases may derive the `.gcda` file name from the `-o` name when compiling and linking in a single command. If step 3 warns that profile data was not found, compile with `-c` to the same object file name in both builds and link separately.
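
The three steps can be wrapped in one function when several workloads are involved. The sketch below is an assumption-laden illustration, not part of the original workflow: the names `gcc_pgo` and `pgo_run`, the `CC` and `DRY_RUN` variables, and the intermediate object file are all hypothetical. It compiles to a fixed object name in both builds, which keeps the profile file name stable regardless of the final binary name.

```shell
#!/bin/sh
# Sketch: GCC PGO for one source file and several workloads.
# Hypothetical names: gcc_pgo, pgo_run, CC, DRY_RUN, and the <name>.o object.

# Print the command when DRY_RUN is set, otherwise execute it.
pgo_run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: gcc_pgo <source.c> <name> <workload-file>...
gcc_pgo() {
    src=$1
    name=$2
    shift 2
    cc=${CC:-gcc}
    obj="$name.o"
    prof="-fprofile-dir=./pgo-data"

    # Step 1: instrumented object, then link. The link step also needs
    # -fprofile-generate so that the profiling runtime is pulled in.
    pgo_run "$cc" -O2 -fprofile-generate "$prof" -c "$src" -o "$obj" || return 1
    pgo_run "$cc" -fprofile-generate "$obj" -o "${name}_instr" || return 1

    # Step 2: one run per workload; counts are merged into the same .gcda files.
    for w in "$@"; do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "./${name}_instr < $w"
        else
            "./${name}_instr" < "$w" || return 1
        fi
    done

    # Step 3: recompile to the same object name using the profile, then link.
    pgo_run "$cc" -O2 -fprofile-use "$prof" -fprofile-correction -c "$src" -o "$obj" || return 1
    pgo_run "$cc" "$obj" -o "${name}_pgo"
}

# Dry run: show the commands for two workloads without executing anything.
DRY_RUN=1
gcc_pgo prog.c prog workload1.input workload2.input
```

Unset `DRY_RUN` to execute the commands. Delete `./pgo-data` before a fresh collection, since counts from earlier runs would otherwise be merged in.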

3. Clang PGO workflow (IR-based, preferred)

```bash
# Step 1: Instrument build (IR-level instrumentation)
clang -O2 -fprofile-generate prog.c -o prog_instr

# Step 2: Run workload (writes default_%m.profraw unless LLVM_PROFILE_FILE is set)
./prog_instr < workload.input
LLVM_PROFILE_FILE="prog-%p.profraw" ./prog_instr   # per-PID files for parallel runs

# Step 3: Merge raw profiles
llvm-profdata merge -output=prog.profdata *.profraw

# Step 4: Optimised build
clang -O2 -fprofile-use=prog.profdata prog.c -o prog_pgo
```

`-fprofile-generate` and `-fprofile-use=` select Clang's IR-level instrumentation, which the Clang documentation recommends for optimisation. `-fprofile-instr-generate` and `-fprofile-instr-use=` select front-end instrumentation, which is aimed mainly at source-based code coverage. Clang also supports `SamplePGO` (sampling-based, no instrumentation overhead).
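
When some workloads matter more than others, `llvm-profdata merge` can scale each input with `-weighted-input=<weight>,<file>`. The sketch below only assembles and prints such a command. The function name `merge_cmd`, the weights, and the `.profraw` file names are hypothetical, and file names containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper: print an llvm-profdata merge command for weighted inputs.
# Usage: merge_cmd <output.profdata> <weight,file>...
merge_cmd() {
    out=$1
    shift
    cmd="llvm-profdata merge -output=$out"
    for pair in "$@"; do
        # Each pair is passed through unchanged as weight,filename.
        cmd="$cmd -weighted-input=$pair"
    done
    echo "$cmd"
}

# Made-up example: count the interactive workload three times as heavily as the batch one.
merge_cmd prog.profdata 3,interactive.profraw 1,batch.profraw
```

The printed line can be pasted into a build script, and `llvm-profdata show prog.profdata` gives a quick summary of the merged result.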

4. Clang SamplePGO (sampling, no instrumentation)

```bash
# Step 1: Build with frame pointers for accurate stacks, plus line-table debug info
# (sample profiles are matched back to the source through debug info)
clang -O2 -fno-omit-frame-pointer -gline-tables-only prog.c -o prog

# Step 2: Sample with perf (-b records branch stacks and needs hardware support such as Intel LBR)
perf record -b -e cycles:u ./prog < workload.input
perf script -F ip,brstack > perf.script

# Step 3: Convert perf data (llvm-profgen can also read perf.data directly via --perfdata)
llvm-profgen --binary=./prog --perfscript=perf.script \
    --output=prog.profdata

# Step 4: Optimised build
clang -O2 -fprofile-sample-use=prog.profdata prog.c -o prog_spgo
```

SamplePGO is well suited to profiling in production: the binary is not instrumented, so the only overhead is perf's sampling.

5. CMake integration

```cmake
option(PGO_INSTRUMENT "Build with PGO instrumentation" OFF)
option(PGO_USE "Build with PGO profile data" OFF)

# Clang front-end instrumentation flags; for IR-level PGO, use
# -fprofile-generate and -fprofile-use=<file> instead.
if(PGO_INSTRUMENT)
    add_compile_options(-fprofile-instr-generate)
    add_link_options(-fprofile-instr-generate)
endif()

if(PGO_USE)
    add_compile_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
    add_link_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
endif()
```

Build script:

```bash
# Phase 1: instrument
cmake -S . -B build-pgo-instr -DPGO_INSTRUMENT=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo-instr -j$(nproc)

# Collect profile
./build-pgo-instr/prog < workload.input
llvm-profdata merge -output=prog.profdata *.profraw

# Phase 2: optimised
cmake -S . -B build-pgo -DPGO_USE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo -j$(nproc)
```
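
Phase 2 depends on `prog.profdata` already existing in the source directory. A small guard can fail early with a clearer message when the collection step was skipped. This is only a sketch, and the function name `require_profile` is hypothetical.

```shell
#!/bin/sh
# Hypothetical guard: refuse to start the optimised build without a usable profile.
# Usage: require_profile <path-to-profdata>
require_profile() {
    # -s is true only if the file exists and is not empty.
    if [ ! -s "$1" ]; then
        echo "error: profile '$1' is missing or empty; run the instrumented build and merge step first" >&2
        return 1
    fi
    echo "using profile $1"
}

# Typical use before Phase 2, with the paths from the build script:
#   require_profile prog.profdata && \
#       cmake -S . -B build-pgo -DPGO_USE=ON -DCMAKE_BUILD_TYPE=Release
```

The check does not detect a stale profile, so it is still worth regenerating `prog.profdata` after significant source changes.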

6. BOLT (post-link binary optimisation)

BOLT reorders functions and basic blocks in the final binary based on profile data, improving instruction cache locality. Applied after PGO, it can give an additional 5-15%.

```bash
# Step 1: Build with relocation support
clang -O2 -Wl,--emit-relocs prog.c -o prog

# Step 2: Collect profile with perf
perf record -e cycles:u -b ./prog < workload.input
perf2bolt prog -p perf.data -o prog.fdata

# Or use instrumented BOLT
llvm-bolt prog -instrument -o prog.instr
./prog.instr < workload.input
# Generates /tmp/prof.fdata (pass that file to -data in step 3)

# Step 3: Apply BOLT optimisation
llvm-bolt prog -data prog.fdata -o prog.bolt \
    -reorder-blocks=ext-tsp \
    -reorder-functions=hfsort \
    -split-functions \
    -split-all-cold \
    -dyno-stats
```
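
Before running BOLT it can help to confirm that the binary really was linked with `--emit-relocs`, since such binaries carry a relocation section for `.text`. The sketch below filters a section listing such as the output of `readelf -S`. The function name `has_text_relocs` is hypothetical, and the two sample lines are made up to stand in for real `readelf` output.

```shell
#!/bin/sh
# Hypothetical check: does a section listing on stdin mention relocations for .text?
# Intended use: readelf -S ./prog | has_text_relocs
has_text_relocs() {
    grep -q -e '\.rela\.text' -e '\.rel\.text'
}

# Demonstration on made-up section lines (real input would come from readelf -S).
printf '%s\n' '  [14] .text       PROGBITS' '  [15] .rela.text  RELA' \
    | has_text_relocs && echo "relocations present"
```

If the check fails, rebuild with `-Wl,--emit-relocs` as in step 1 before collecting a profile.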

7. Verifying PGO impact

```bash
# Compare perf of baseline vs PGO build
perf stat ./prog_baseline < workload.input
perf stat ./prog_pgo < workload.input

# Check which functions are hot in each
perf record ./prog_pgo < workload.input
perf report --stdio | head -30
```
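
To turn two `perf stat` wall-clock readings into a single figure, a small helper like the one below can be used. It is only a sketch: `time_saved_pct` is a hypothetical name and the two timings are made-up values.

```shell
#!/bin/sh
# Hypothetical helper: percentage of run time saved by the optimised build.
# Usage: time_saved_pct <baseline-seconds> <optimised-seconds>
time_saved_pct() {
    awk -v base="$1" -v opt="$2" 'BEGIN {
        if (base <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * (base - opt) / base
    }'
}

# Made-up "seconds time elapsed" values from two perf stat runs.
time_saved_pct 10.0 8.0   # prints 20.0
```

Single runs are noisy, so it is safer to feed in averages, for example from `perf stat -r 5`, and to use the same workload input for both binaries.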


For full workflow details and Clang vs GCC profile format notes, see [references/pgo-workflow.md](references/pgo-workflow.md).


Related skills
