pgo

PGO (Profile-Guided Optimisation)

Purpose

Guide agents through the full PGO workflow: instrument build → representative workload → collect profile → optimised build, covering both GCC and Clang, plus BOLT for post-link optimisation.

Triggers

  • "How do I use PGO to speed up my binary?"
  • "What is profile-guided optimization and when should I use it?"
  • "How do I use `-fprofile-generate` and `-fprofile-use`?"
  • "My `-O3` build isn't fast enough, what next?"
  • "How does BOLT differ from PGO?"
  • "How do I collect representative profile data?"

Workflow

1. When to use PGO

Is -O3 -march=native already applied?
  no  → apply standard optimisation first
  yes → is workload branch-heavy or has irregular call patterns?
          yes → PGO will likely help 5-30%
          no  → PGO may not help; profile first with linux-perf

PGO helps most with:
  • Large binaries with many cold/hot code paths (compilers, databases, servers)
  • Branch-heavy code where static prediction is wrong
  • Function call-heavy code where inlining decisions improve with profile data
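
One rough way to judge whether a workload is branch-heavy with poor static prediction is to look at its branch-miss rate. The sketch below is illustrative only: the helper name `branch_miss_pct` is hypothetical, and it assumes the comma-separated layout that `perf stat -x,` writes (`<count>,<unit>,<event>,...`). No threshold is implied; treat the number as a hint to investigate further.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the branch-miss percentage from `perf stat -x,` CSV.
# Assumption: each row is "<count>,<unit>,<event>,..." as written by perf stat -x,

branch_miss_pct() {
    # Reads CSV on stdin; prints misses/branches as a percentage (2 decimals).
    # Lines that do not name either event (comments, other counters) are ignored.
    awk -F, '
        $3 ~ /^branch-misses/ { misses = $1 }
        $3 ~ /^branches/      { branches = $1 }
        END {
            if (branches > 0) printf "%.2f\n", 100 * misses / branches
            else { print "n/a"; exit 1 }
        }'
}
```

A possible invocation, assuming `perf` is installed and `./prog` is the `-O3` baseline: `perf stat -x, -o stat.csv -e branches,branch-misses ./prog < workload.input`, followed by `branch_miss_pct < stat.csv`.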

2. GCC PGO workflow

Step 1: Build with instrumentation

gcc -O2 -fprofile-generate -fprofile-dir=./pgo-data \
    prog.c -o prog_instr

Step 2: Run with representative workload(s)

./prog_instr < workload1.input
./prog_instr < workload2.input

Generates .gcda files in ./pgo-data/

Step 3: Build optimised binary using profile

gcc -O2 -fprofile-use -fprofile-dir=./pgo-data \
    -fprofile-correction \
    prog.c -o prog_pgo

`-fprofile-correction`: smooths out profile count inconsistencies from multithreaded or nondeterministic runs. It is safe to include by default.
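
If the instrumented runs wrote nothing into `./pgo-data`, the `-fprofile-use` build still tends to succeed, with GCC only warning about missing profile data, so the result is quietly unoptimised. The following is a minimal guard under that assumption; `have_gcda_profiles` is a hypothetical helper name, not part of GCC.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only if at least one .gcda file exists under a directory.

have_gcda_profiles() {
    local dir=$1
    # find lists matching files anywhere below $dir; head keeps only the first hit
    [ -d "$dir" ] && [ -n "$(find "$dir" -type f -name '*.gcda' | head -n 1)" ]
}
```

It could be placed before Step 3, for example `have_gcda_profiles ./pgo-data || { echo "no profile data" >&2; exit 1; }`.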

3. Clang PGO workflow (instrumentation-based, preferred)

Step 1: Instrument build

clang -O2 -fprofile-instr-generate prog.c -o prog_instr

Step 2: Run workload (generates default.profraw)

./prog_instr < workload.input
LLVM_PROFILE_FILE="prog-%p.profraw" ./prog_instr  # per-PID files for parallel runs

Step 3: Merge raw profiles

llvm-profdata merge -output=prog.profdata *.profraw

Step 4: Optimised build

clang -O2 -fprofile-instr-use=prog.profdata prog.c -o prog_pgo

`-fprofile-instr-generate` selects Clang's front-end (source-level) instrumentation. The IR-level variant uses `-fprofile-generate` and `-fprofile-use=prog.profdata` with the same `llvm-profdata merge` step. Clang also supports `SamplePGO` (sampling-based, no instrumentation overhead).
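
When several workloads are run in sequence, each run needs its own raw profile file, otherwise later runs overwrite `default.profraw`. The sketch below shows one way to do that by setting `LLVM_PROFILE_FILE` per run. The function name `collect_profraw` and the per-input file naming are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run an instrumented binary once per workload file,
# writing one .profraw per input so that no run overwrites another.

collect_profraw() {
    local bin=$1 outdir=$2
    shift 2
    mkdir -p "$outdir" || return 1
    local input
    for input in "$@"; do
        # LLVM_PROFILE_FILE is read by the profiling runtime linked into $bin
        LLVM_PROFILE_FILE="$outdir/$(basename "$input").profraw" "$bin" < "$input" || return 1
    done
}
```

For example, `collect_profraw ./prog_instr profiles workload1.input workload2.input`, followed by `llvm-profdata merge -output=prog.profdata profiles/*.profraw`.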

4. Clang SamplePGO (sampling, no instrumentation)

Step 1: Build with debug line tables (to map samples back to source) and frame pointers

clang -O2 -gline-tables-only -fno-omit-frame-pointer prog.c -o prog

Step 2: Sample with perf

perf record -b -e cycles:u ./prog < workload.input
perf script -F ip,brstack > perf.script  # perf2bolt fills this role for BOLT (section 6)

Step 3: Convert perf data

llvm-profgen --binary=./prog --perfscript=perf.script \
    --output=prog.profdata

Step 4: Optimised build

clang -O2 -fprofile-sample-use=prog.profdata prog.c -o prog_spgo

SamplePGO suits profiling in production because it adds no instrumentation overhead.

5. CMake integration

option(PGO_INSTRUMENT "Build with PGO instrumentation" OFF)
option(PGO_USE "Build with PGO profile data" OFF)

if(PGO_INSTRUMENT)
    add_compile_options(-fprofile-instr-generate)
    add_link_options(-fprofile-instr-generate)
endif()

if(PGO_USE)
    add_compile_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
    add_link_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
endif()

Build script:

Phase 1: instrument

cmake -S . -B build-pgo-instr -DPGO_INSTRUMENT=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo-instr -j$(nproc)

Collect profile

./build-pgo-instr/prog < workload.input
llvm-profdata merge -output=prog.profdata *.profraw

Phase 2: optimised

cmake -S . -B build-pgo -DPGO_USE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo -j$(nproc)

6. BOLT (post-link binary optimisation)

BOLT reorders functions and basic blocks in the final binary based on profile data, which improves instruction cache locality. Applied after PGO, it can yield an additional 5-15%.

Step 1: Build with relocation support

clang -O2 -Wl,--emit-relocs prog.c -o prog

Step 2: Collect profile with perf

perf record -e cycles:u -b ./prog < workload.input
perf2bolt prog -p perf.data -o prog.fdata

Or use instrumented BOLT

llvm-bolt prog -instrument -o prog.instr
./prog.instr < workload.input

Generates /tmp/prof.fdata by default; pass that file to -data in Step 3 in place of prog.fdata

Step 3: Apply BOLT optimisation

llvm-bolt prog -data prog.fdata -o prog.bolt \
    -reorder-blocks=ext-tsp \
    -reorder-functions=hfsort \
    -split-functions \
    -split-all-cold \
    -dyno-stats

7. Verifying PGO impact

Compare performance of the baseline vs PGO build

perf stat ./prog_baseline < workload.input
perf stat ./prog_pgo < workload.input

Check which functions are hot in each

perf record ./prog_pgo < workload.input
perf report --stdio | head -30

For full workflow details and Clang vs GCC profile format notes, see [references/pgo-workflow.md](references/pgo-workflow.md).
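
To turn the two `perf stat` runs into a single number, the elapsed times can be compared directly. This is a minimal sketch: `speedup_pct` is a hypothetical helper, and it assumes you have already read the elapsed seconds for each build off the `perf stat` output (ideally averaged over several runs, for instance with `perf stat -r 5`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of run time saved by the optimised build.
#   $1 = baseline elapsed seconds, $2 = optimised elapsed seconds

speedup_pct() {
    awk -v base="$1" -v opt="$2" 'BEGIN {
        if (base <= 0) exit 1              # avoid dividing by zero
        printf "%.1f\n", 100 * (base - opt) / base
    }'
}
```

For example, `speedup_pct 10 8` reports a 20.0% reduction in run time. A negative result means the PGO build was slower on that workload, which usually suggests the training workload was not representative.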

Related skills

  • Use `skills/compilers/gcc` for GCC flag context
  • Use `skills/compilers/clang` for Clang PGO and SamplePGO setup
  • Use `skills/profilers/linux-perf` for collecting SamplePGO perf data
  • Use `skills/profilers/flamegraphs` to identify hot paths before applying PGO