pgo
PGO (Profile-Guided Optimisation)
Purpose
Guide agents through the full PGO workflow: instrument build → representative workload → collect profile → optimised build, covering both GCC and Clang, plus BOLT for post-link optimisation.
Triggers
- "How do I use PGO to speed up my binary?"
- "What is profile-guided optimization and when should I use it?"
- "How do I use `-fprofile-generate` and `-fprofile-use`?"
- "My `-O3` build isn't fast enough; what next?"
- "How does BOLT differ from PGO?"
- "How do I collect representative profile data?"
Workflow
1. When to use PGO
text
Is -O3 -march=native already applied?
  no  → apply standard optimisation first
  yes → is the workload branch-heavy, or does it have irregular call patterns?
          yes → PGO will likely help 5-30%
          no  → PGO may not help; profile first with linux-perf

PGO helps most with:
- Large binaries with many cold/hot code paths (compilers, databases, servers)
- Branch-heavy code where static prediction is wrong
- Function call-heavy code where inlining decisions improve with profile data
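
One way to judge whether a workload is branch-heavy in the sense above is to look at its branch misprediction rate. The sketch below assumes Linux `perf` is available and that you copy the two counts from `perf stat` by hand; `branch_miss_pct` is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# branch_miss_pct BRANCHES MISSES
# Prints the branch misprediction rate as a percentage (two decimals).
# Hypothetical helper; take the counts from something like:
#   perf stat -e branches,branch-misses ./prog < workload.input
branch_miss_pct() {
    awk -v b="$1" -v m="$2" 'BEGIN {
        if (b <= 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.2f\n", (m / b) * 100
    }'
}

# Example: 1000 branches, 50 of them mispredicted
branch_miss_pct 1000 50   # prints 5.00
```

A comparatively high rate hints that static prediction is doing poorly, which is the situation where profile data tends to help; what counts as "high" depends on the workload and hardware.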
2. GCC PGO workflow
bash
# Step 1: Build with instrumentation
gcc -O2 -fprofile-generate -fprofile-dir=./pgo-data \
    prog.c -o prog_instr

# Step 2: Run with representative workload(s)
./prog_instr < workload1.input
./prog_instr < workload2.input
# Generates .gcda files in ./pgo-data/

# Step 3: Build optimised binary using profile
gcc -O2 -fprofile-use -fprofile-dir=./pgo-data \
    -fprofile-correction \
    prog.c -o prog_pgo

`-fprofile-correction`: handles profile count inconsistencies from parallel or nondeterministic runs. Always include it.
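
Step 3 only benefits from PGO if the instrumented runs actually wrote profile data. A minimal sketch of a guard that could sit between Step 2 and Step 3 is shown here; `count_gcda` is a hypothetical helper name, and `./pgo-data` is the directory used in the commands above.

```shell
#!/bin/sh
# count_gcda DIR
# Prints how many .gcda profile files exist anywhere under DIR
# (prints 0 if DIR does not exist).
count_gcda() {
    find "$1" -type f -name '*.gcda' 2>/dev/null | wc -l | tr -d ' '
}

# Usage sketch: stop before the -fprofile-use build if nothing was collected.
# if [ "$(count_gcda ./pgo-data)" -eq 0 ]; then
#     echo "no .gcda files found; rerun the instrumented binary" >&2
#     exit 1
# fi
```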

3. Clang PGO workflow (IR-based, preferred)
bash
# Step 1: Instrument build (-fprofile-generate selects IR-level instrumentation)
clang -O2 -fprofile-generate prog.c -o prog_instr

# Step 2: Run workload (writes default_%m.profraw unless LLVM_PROFILE_FILE is set)
./prog_instr < workload.input
LLVM_PROFILE_FILE="prog-%p.profraw" ./prog_instr   # per-PID files for parallel runs

# Step 3: Merge raw profiles
llvm-profdata merge -output=prog.profdata *.profraw

# Step 4: Optimised build
clang -O2 -fprofile-use=prog.profdata prog.c -o prog_pgo

The Clang documentation recommends IR-based instrumentation (`-fprofile-generate`) for PGO; front-end instrumentation (`-fprofile-instr-generate`) has better source correlation and is intended for coverage. Clang also supports `SamplePGO` (sampling-based, no instrumentation overhead).
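
When the workload runs as several processes, the merge in Step 3 is the point where a missing or misplaced `.profraw` file usually goes unnoticed. The following is a sketch of a wrapper that refuses to call `llvm-profdata` when there is nothing to merge; `merge_profraw` is a hypothetical name and the directory layout is an assumption.

```shell
#!/bin/sh
# merge_profraw DIR OUT
# Merges every DIR/*.profraw into OUT using llvm-profdata.
# Returns 1 without invoking llvm-profdata when no raw profiles exist.
merge_profraw() {
    dir=$1
    out=$2
    set -- "$dir"/*.profraw      # expand the glob into "$@"
    if [ ! -e "$1" ]; then       # an unmatched glob is left as a literal pattern
        echo "no .profraw files in $dir" >&2
        return 1
    fi
    llvm-profdata merge -output="$out" "$@"
}

# Usage sketch:
# merge_profraw . prog.profdata
```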

4. Clang SamplePGO (sampling, no instrumentation)
bash
# Step 1: Build with line tables (so samples map back to source)
# and frame pointers for accurate stacks
clang -O2 -gline-tables-only -fno-omit-frame-pointer prog.c -o prog

# Step 2: Sample with perf (-b records branch stacks)
perf record -b -e cycles:u ./prog < workload.input
perf script -F ip,brstack > perf.script

# Step 3: Convert perf data
# (llvm-profgen can also read perf.data directly via --perfdata=perf.data)
llvm-profgen --binary=./prog --perfscript=perf.script \
    --output=prog.profdata

# Step 4: Optimised build
clang -O2 -fprofile-sample-use=prog.profdata prog.c -o prog_spgo

SamplePGO is ideal for production profiling without instrumentation overhead.
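
The SamplePGO steps span three separate tools, so a quick preflight check can save a failed run halfway through. This is a small sketch only; `missing_tools` is a hypothetical helper, and the tool names in the usage comment are the ones used in the steps above.

```shell
#!/bin/sh
# missing_tools TOOL...
# Prints the name of each TOOL that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage sketch for the SamplePGO steps:
# missing=$(missing_tools clang perf llvm-profgen)
# [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```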

5. CMake integration
cmake
option(PGO_INSTRUMENT "Build with PGO instrumentation" OFF)
option(PGO_USE "Build with PGO profile data" OFF)

if(PGO_INSTRUMENT)
    add_compile_options(-fprofile-instr-generate)
    add_link_options(-fprofile-instr-generate)
endif()

if(PGO_USE)
    add_compile_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
    add_link_options(-fprofile-instr-use=${CMAKE_SOURCE_DIR}/prog.profdata)
endif()

Build script:
bash
# Phase 1: instrument (the flags above are Clang flags, so configure with Clang)
cmake -S . -B build-pgo-instr -DPGO_INSTRUMENT=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo-instr -j$(nproc)

# Collect profile (run from the source root so prog.profdata lands in CMAKE_SOURCE_DIR)
./build-pgo-instr/prog < workload.input
llvm-profdata merge -output=prog.profdata *.profraw

# Phase 2: optimised
cmake -S . -B build-pgo -DPGO_USE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-pgo -j$(nproc)

6. BOLT (post-link binary optimisation)
BOLT reorders functions and basic blocks in the final binary based on profile data, improving instruction cache locality. Applied after PGO, it can yield an additional 5-15%.
bash
# Step 1: Build with relocation support
clang -O2 -Wl,--emit-relocs prog.c -o prog

# Step 2: Collect profile with perf
perf record -e cycles:u -b ./prog < workload.input
perf2bolt prog -p perf.data -o prog.fdata

# Or use instrumented BOLT
llvm-bolt prog -instrument -o prog.instr
./prog.instr < workload.input
# Generates /tmp/prof.fdata (pass that path to -data in Step 3 if you took this route)

# Step 3: Apply BOLT optimisation
llvm-bolt prog -data prog.fdata -o prog.bolt \
    -reorder-blocks=ext-tsp \
    -reorder-functions=hfsort \
    -split-functions \
    -split-all-cold \
    -dyno-stats
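
Before collecting a profile, it can be worth confirming that Step 1 really kept relocations in the binary. The sketch below assumes an x86-64 ELF binary, where `--emit-relocs` typically leaves a `.rela.text` section visible in `readelf -S` output; `has_text_relocs` is a hypothetical helper name.

```shell
#!/bin/sh
# has_text_relocs
# Reads a section listing on stdin (for example from `readelf -S prog`)
# and succeeds if a .rela.text section appears in it.
# Assumption: x86-64 ELF, where --emit-relocs keeps .rela.text.
has_text_relocs() {
    grep -q '\.rela\.text'
}

# Usage sketch:
# readelf -S prog | has_text_relocs \
#     || echo "no .rela.text found; relink with -Wl,--emit-relocs" >&2
```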

7. Verifying PGO impact
bash
# Compare performance of baseline vs PGO build
perf stat ./prog_baseline < workload.input
perf stat ./prog_pgo < workload.input

# Check which functions are hot in each
perf record ./prog_pgo < workload.input
perf report --stdio | head -30

For full workflow details and Clang vs GCC profile format notes, see [references/pgo-workflow.md](references/pgo-workflow.md).
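
To turn two `perf stat` runs into a single figure, the elapsed times can be compared directly. This is a minimal sketch; `speedup_pct` is a hypothetical helper, and the two timings are assumed to be copied from the "seconds time elapsed" lines that `perf stat` prints.

```shell
#!/bin/sh
# speedup_pct BASELINE_SECONDS PGO_SECONDS
# Prints the percentage reduction in run time (one decimal place).
# A negative result means the PGO build was slower.
speedup_pct() {
    awk -v base="$1" -v pgo="$2" 'BEGIN {
        if (base <= 0) { print "n/a"; exit }   # avoid dividing by zero
        printf "%.1f\n", (base - pgo) / base * 100
    }'
}

# Example: baseline takes 10.0 s, PGO build takes 8.0 s
speedup_pct 10.0 8.0   # prints 20.0
```

Repeating each measurement several times and comparing medians gives a steadier number than a single pair of runs.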

Related skills
- Use `skills/compilers/gcc` for GCC flag context
- Use `skills/compilers/clang` for Clang PGO and SamplePGO setup
- Use `skills/profilers/linux-perf` for collecting SamplePGO perf data
- Use `skills/profilers/flamegraphs` to identify hot paths before applying PGO