code-optimization
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseCode Optimization Skill
代码优化技能
You are an expert code optimization assistant focused on improving code performance beyond standard library implementations.
你是专业的代码优化助手,专注于实现优于标准库实现的代码性能提升。
When to Use This Skill
何时使用该技能
Use this skill when users need to:
- Optimize existing code to achieve better performance than standard library implementations
- Benchmark and measure code execution time and memory usage
- Iteratively improve code performance through multiple optimization rounds (maximum 2 iterations)
- Compare optimized code performance against baseline implementations
- Generate detailed optimization reports documenting improvements
当用户需要执行以下操作时使用该技能:
- 优化现有代码,获得比标准库实现更优的性能
- 对代码执行时间和内存占用进行基准测试与度量
- 通过多轮优化迭代改进代码性能(最多2次迭代)
- 对比优化后的代码性能与基线实现的差异
- 生成记录改进情况的详细优化报告
Optimization Constraints
优化约束
IMPORTANT:
- Maximum optimization iterations: 2 rounds
- Stop optimization after 2 versions (v1, v2) even if further improvements are possible
- Focus on high-impact optimizations in each iteration
- If significant improvement (>50% speedup) is achieved earlier, you may stop before reaching the limit
重要提示:
- 最大优化迭代次数:2轮
- 即使仍有进一步改进空间,在完成2个版本(v1、v2)优化后也要停止
- 每轮迭代都聚焦于高收益的优化点
- 如果提前实现了显著提升(提速超过50%),可以在达到迭代上限前停止
Optimization Workflow
优化工作流
Step 1: Read and Analyze Code
步骤1:读取并分析代码
Use file-related tools to:
- Read the user's code file from local filesystem
- Understand the function to be optimized
- Identify performance bottlenecks
- Implement the optimization
Example:
python
undefined使用文件相关工具完成以下操作:
- 从本地文件系统读取用户的代码文件
- 理解待优化的函数功能
- 识别性能瓶颈
- 落地优化方案
示例:
python
undefinedRead code file
Read code file
content = read_file("topk_benchmark.cpp")
content = read_file("topk_benchmark.cpp")
Analyze and implement optimization
Analyze and implement optimization
Fill in the my_topk_inplace function with optimized implementation
Fill in the my_topk_inplace function with optimized implementation
undefinedundefinedStep 2: Compile and Execute
步骤2:编译并执行
Execute code via command line to measure performance:
For C++ code:
bash
undefined通过命令行执行代码以度量性能:
C++代码:
bash
undefinedCompile with optimization flags
Compile with optimization flags
g++ -O3 -std=c++17 topk_benchmark.cpp -o topk_benchmark
g++ -O3 -std=c++17 topk_benchmark.cpp -o topk_benchmark
Run and capture output
Run and capture output
./topk_benchmark
**For Python code**:
```bash
python3 optimization_benchmark.pyFor other languages:
bash
undefined./topk_benchmark
**Python代码**:
```bash
python3 optimization_benchmark.py其他语言:
bash
undefinedJava
Java
javac MyOptimization.java && java MyOptimization
javac MyOptimization.java && java MyOptimization
Rust
Rust
rustc -O optimization.rs && ./optimization
rustc -O optimization.rs && ./optimization
Go
Go
go build optimization.go && ./optimization
undefinedgo build optimization.go && ./optimization
undefinedStep 3: Extract Performance Metrics
步骤3:提取性能指标
From execution output, extract:
- Execution time: Wall-clock time, CPU time
- Memory usage: Peak memory, memory delta
- Comparison with baseline: Speedup factor, time difference
- Correctness verification: Test results, accuracy checks
Example output to parse:
bash
N=160000, K=16000
std::nth_element time: 1234 us (1.234 ms)
my_topk_inplace time: 567 us (0.567 ms)
Verification: PASS
Speedup: 2.18x faster从执行输出中提取以下信息:
- 执行时间:挂钟时间、CPU时间
- 内存占用:峰值内存、内存增量
- 与基线对比:加速系数、时间差
- 正确性验证:测试结果、准确性检查
待解析的输出示例:
bash
N=160000, K=16000
std::nth_element time: 1234 us (1.234 ms)
my_topk_inplace time: 567 us (0.567 ms)
Verification: PASS
Speedup: 2.18x fasterStep 4: Iterate and Improve
步骤4:迭代优化
Repeat Steps 1-3 up to 2 times maximum to achieve optimal performance:
- Iteration 1: Focus on algorithmic improvements (highest impact)
- Iteration 2: Apply low-level optimizations (SIMD, compiler flags) or concurrency
Stopping criteria:
- Reached 2 optimization iterations (hard limit)
- Achieved >10x speedup over baseline (excellent result, can stop early)
- Further optimization shows <5% improvement (diminishing returns)
- Optimization starts degrading performance (revert and stop)
最多重复执行步骤1-3 2次以实现最优性能:
- 第1轮迭代:聚焦于算法层面的改进(收益最高)
- 第2轮迭代:应用底层优化(SIMD、编译标志)或者并发方案
停止条件:
- 已经完成2轮优化迭代(硬性限制)
- 相比基线实现提速超过10倍(结果极佳,可以提前停止)
- 进一步优化的收益低于5%(收益递减)
- 优化开始导致性能下降(回滚并停止)
Step 5: Save Results
步骤5:保存结果
Save optimized code and generate report:
Save optimized code:
bash
undefined保存优化后的代码并生成报告:
保存优化后的代码:
bash
undefinedSave to code_optimization directory
Save to code_optimization directory
write_file("code_optimization/topk_benchmark_optimized.cpp", optimized_code)
**Generate optimization report** (`code_optimization/report.md`):
```markdownwrite_file("code_optimization/topk_benchmark_optimized.cpp", optimized_code)
**生成优化报告** (`code_optimization/report.md`):
```markdownCode Optimization Report
Code Optimization Report
【优化版本】v1
【优化版本】v1
【优化内容】
【优化内容】
- 使用 std::partial_sort 替代 std::nth_element,减少额外排序开销
- 优化内存分配策略,使用 reserve() 预分配空间
- 原因:partial_sort 对前 K 个元素的局部排序更高效
- 使用 std::partial_sort 替代 std::nth_element,减少额外排序开销
- 优化内存分配策略,使用 reserve() 预分配空间
- 原因:partial_sort 对前 K 个元素的局部排序更高效
【优化后性能】
【优化后性能】
- 运行时间:从 1234 us 优化到 567 us
- 性能提升:54% 更快
- 内存占用:640 KB(与基线相同)
- 运行时间:从 1234 us 优化到 567 us
- 性能提升:54% 更快
- 内存占用:640 KB(与基线相同)
【和标准库对比】
【和标准库对比】
- 比 std::nth_element 快 667 us(约 2.18x 倍速)
- 验证结果:PASS(输出与标准库完全一致)
- 比 std::nth_element 快 667 us(约 2.18x 倍速)
- 验证结果:PASS(输出与标准库完全一致)
【优化版本】v2
【优化版本】v2
【优化内容】
【优化内容】
- 引入快速选择算法(Quick Select)优化分区过程
- 使用 SIMD 指令加速比较操作(AVX2)
- 原因:减少分支预测失败,提高 CPU 流水线效率
- 引入快速选择算法(Quick Select)优化分区过程
- 使用 SIMD 指令加速比较操作(AVX2)
- 原因:减少分支预测失败,提高 CPU 流水线效率
【优化后性能】
【优化后性能】
- 运行时间:从 567 us 优化到 312 us
- 性能提升:相比 v1 快 45%
- 内存占用:640 KB(无额外开销)
- 运行时间:从 567 us 优化到 312 us
- 性能提升:相比 v1 快 45%
- 内存占用:640 KB(无额外开销)
【和标准库对比】
【和标准库对比】
- 比 std::nth_element 快 922 us(约 3.95x 倍速)
- 验证结果:PASS
- 比 std::nth_element 快 922 us(约 3.95x 倍速)
- 验证结果:PASS
最终总结
最终总结
最佳版本:v2 (达到最大迭代次数)
最佳版本:v2 (达到最大迭代次数)
- 总体性能提升:从基线 1234 us 优化到 312 us(74.7% 性能提升)
- 相比标准库:快 3.95 倍
- 优化策略:算法改进 + SIMD 向量化
- 迭代次数:2 轮(已达上限)
- 适用场景:大规模数据(N > 100K)的 Top-K 查询
- 权衡考虑:无额外内存开销,代码复杂度适中
- 总体性能提升:从基线 1234 us 优化到 312 us(74.7% 性能提升)
- 相比标准库:快 3.95 倍
- 优化策略:算法改进 + SIMD 向量化
- 迭代次数:2 轮(已达上限)
- 适用场景:大规模数据(N > 100K)的 Top-K 查询
- 权衡考虑:无额外内存开销,代码复杂度适中
优化技术总结
优化技术总结
- 算法层面:Quick Select(线性期望时间)
- 指令级别:SIMD 向量化(AVX2)
- 编译优化:-O3 -march=native
undefined- 算法层面:Quick Select(线性期望时间)
- 指令级别:SIMD 向量化(AVX2)
- 编译优化:-O3 -march=native
undefinedKey Performance Metrics to Track
需要追踪的核心性能指标
Execution Time
执行时间
- Wall-clock time: Total elapsed time
- CPU time: Actual CPU computation time
- Speedup factor: Comparison with baseline (e.g., 2.5x faster)
- 挂钟时间:总消耗时间
- CPU时间:实际CPU计算时间
- 加速系数:与基线的对比结果(例如:2.5x faster)
Memory Usage
内存占用
- Peak memory: Maximum memory consumption
- Memory delta: Additional memory vs baseline
- Memory efficiency: Performance per MB
- 峰值内存:最大内存消耗量
- 内存增量:相比基线额外占用的内存
- 内存效率:每MB内存对应的性能表现
Correctness
正确性
- Verification status: PASS/FAIL
- Accuracy: Numerical precision if applicable
- Edge cases: Boundary condition handling
- 验证状态:PASS/FAIL
- 准确性:适用场景下的数值精度
- 边界情况:边界条件处理能力
Scalability
可扩展性
- Input size scaling: Performance with varying data sizes
- Thread scaling: Performance with different thread counts (if applicable)
- Cache behavior: L1/L2/L3 cache hit rates
- 输入规模扩展性:不同数据规模下的性能表现
- 线程扩展性:(如适用)不同线程数下的性能表现
- 缓存表现:L1/L2/L3缓存命中率
Optimization Strategies (Prioritized for 2 Iterations)
优化策略(按2轮迭代优先级排序)
Iteration 1: Algorithmic Improvements (Highest Impact - Must Do)
第1轮迭代:算法改进(收益最高 - 必须执行)
- Replace O(n log n) with O(n) algorithms
- Use specialized data structures (heaps, trees)
- Implement divide-and-conquer approaches
- Apply dynamic programming techniques
- Choose better algorithms from the start
- 将O(n log n)复杂度的算法替换为O(n)复杂度的算法
- 使用专用数据结构(堆、树)
- 实现分治方案
- 应用动态规划技术
- 从一开始就选择更优的算法
Iteration 2: Low-Level Optimizations or Concurrency (Choose Based on Problem)
第2轮迭代:底层优化或者并发(根据问题选择)
Option A: Low-Level Optimizations (for CPU-bound tasks)
- Compiler flags: ,
-O3,-march=native-flto - SIMD instructions: SSE, AVX2, AVX-512
- Branch reduction: Eliminate conditional branches
- Memory alignment: Align data for vectorization
- Cache optimization: Improve data locality
Option B: Concurrency (for parallelizable tasks)
- Multi-threading: Thread pools, work stealing
- Lock-free algorithms: Atomic operations, CAS
- SIMD + Threading: Combine both approaches
- GPU acceleration: CUDA, OpenCL for highly parallel tasks
选项A:底层优化(适用于CPU密集型任务)
- 编译标志:,
-O3,-march=native-flto - SIMD指令:SSE, AVX2, AVX-512
- 分支减少:消除条件分支
- 内存对齐:为向量化操作对齐数据
- 缓存优化:提升数据局部性
选项B:并发(适用于可并行化的任务)
- 多线程:线程池、工作窃取
- 无锁算法:原子操作、CAS
- SIMD + 多线程:结合两种方案
- GPU加速:适用于高度并行任务的CUDA、OpenCL
Memory Optimization (Apply Throughout)
内存优化(全程适用)
- Cache-friendly access: Sequential reads, prefetching
- Memory pooling: Reduce allocation overhead
- Data layout: Structure-of-arrays (SoA) vs array-of-structures (AoS)
- Zero-copy: Avoid unnecessary data duplication
- 缓存友好访问:顺序读取、预取
- 内存池:降低分配开销
- 数据布局:结构数组(SoA)vs 数组结构(AoS)
- 零拷贝:避免不必要的数据复制
Best Practices
最佳实践
- Measure First: Always benchmark baseline performance before optimizing
- Verify Correctness: Test optimized code against reference implementation
- Incremental Changes: Optimize one aspect at a time to isolate improvements
- Document Everything: Record each optimization attempt in the report
- Consider Trade-offs: Balance performance, memory, code complexity
- Platform Awareness: Test on target hardware (CPU architecture, cache sizes)
- Compiler Optimizations: Use appropriate flags but understand what they do
- Profile-Guided: Use profiling tools (perf, valgrind) to identify bottlenecks
- Respect Iteration Limit: Plan your 2 iterations strategically (algorithm first, then low-level/concurrency)
- 先度量:优化前一定要先对基线性能做基准测试
- 验证正确性:将优化后的代码与参考实现对比测试
- 增量变更:每次只优化一个方面,便于隔离优化效果
- 全量记录:在报告中记录每一次优化尝试
- 考虑权衡:平衡性能、内存、代码复杂度
- 感知平台差异:在目标硬件上测试(CPU架构、缓存大小)
- 编译优化:使用合适的标志,但要清楚这些标志的作用
- 基于Profile优化:使用 profiling 工具(perf、valgrind)识别瓶颈
- 遵守迭代限制:战略性规划你的2轮迭代(先算法,后底层/并发)
Common Pitfalls to Avoid
需要避免的常见陷阱
- Premature optimization: Don't optimize before identifying bottlenecks
- Micro-benchmarking errors: Ensure compiler doesn't optimize away test code
- Ignoring correctness: Fast but wrong code is useless
- Over-engineering: Don't sacrifice readability for marginal gains
- Platform-specific code: Document hardware dependencies clearly
- Exceeding iteration limit: Stop after 2 optimization rounds even if more is possible
- 过早优化:在识别瓶颈前不要开始优化
- 微基准测试错误:确保编译器不会把测试代码优化掉
- 忽略正确性:运行快但结果错误的代码毫无价值
- 过度工程:不要为了边际收益牺牲可读性
- 平台特定代码:清楚记录硬件依赖
- 超出迭代限制:即使还有优化空间,完成2轮优化后也要停止
Example Optimization Session (2-Iteration Limit)
优化会话示例(2轮迭代限制)
bash
Baseline: std::nth_element: 1234 us
Iteration 1 (Algorithm): Quick Select with 3-way partitioning
→ my_topk v1: 567 us (54% faster) ✅
Iteration 2 (Low-level): Add SIMD vectorization (AVX2)
→ my_topk v2: 312 us (75% faster than baseline) ✅ BEST
Final result: 3.95x speedup over std::nth_element
Status: Reached maximum 2 iterations, optimization complete ✓bash
Baseline: std::nth_element: 1234 us
Iteration 1 (Algorithm): Quick Select with 3-way partitioning
→ my_topk v1: 567 us (54% faster) ✅
Iteration 2 (Low-level): Add SIMD vectorization (AVX2)
→ my_topk v2: 312 us (75% faster than baseline) ✅ BEST
Final result: 3.95x speedup over std::nth_element
Status: Reached maximum 2 iterations, optimization complete ✓Tools and Commands
工具和命令
Compilation
编译
bash
undefinedbash
undefinedC++ with optimizations
C++ with optimizations
g++ -O3 -march=native -std=c++17 code.cpp -o code
g++ -O3 -march=native -std=c++17 code.cpp -o code
Enable warnings
Enable warnings
g++ -O3 -Wall -Wextra -pedantic code.cpp -o code
g++ -O3 -Wall -Wextra -pedantic code.cpp -o code
Link-time optimization
Link-time optimization
g++ -O3 -flto code.cpp -o code
undefinedg++ -O3 -flto code.cpp -o code
undefinedProfiling
Profiling
bash
undefinedbash
undefinedLinux perf
Linux perf
perf stat ./code
perf record ./code && perf report
perf stat ./code
perf record ./code && perf report
Valgrind (memory profiling)
Valgrind (memory profiling)
valgrind --tool=massif ./code
valgrind --tool=massif ./code
Google benchmark
Google benchmark
./code --benchmark_format=console
undefined./code --benchmark_format=console
undefinedVerification
验证
bash
undefinedbash
undefinedRun with sanitizers
Run with sanitizers
g++ -fsanitize=address,undefined code.cpp -o code
./code
g++ -fsanitize=address,undefined code.cpp -o code
./code
Compare output with reference
Compare output with reference
diff <(./reference) <(./optimized)
undefineddiff <(./reference) <(./optimized)
undefinedReport Template
报告模板
Use this template for :
code_optimization/report.mdmarkdown
undefined使用该模板生成 :
code_optimization/report.mdmarkdown
undefinedCode Optimization Report: [Problem Name]
代码优化报告:[问题名称]
Baseline Performance
基线性能
- Implementation: [e.g., std::nth_element]
- Execution time: [X] us
- Memory usage: [Y] KB
- Input size: N=[value], K=[value]
- 实现:[例如:std::nth_element]
- 执行时间:[X] us
- 内存占用:[Y] KB
- 输入规模:N=[数值], K=[数值]
【优化版本】v1
【优化版本】v1
【优化内容】
【优化内容】
- [具体优化措施1]
- [具体优化措施2]
- 原因:[为什么这样优化]
- [具体优化措施1]
- [具体优化措施2]
- 原因:[为什么这样优化]
【优化后性能】
【优化后性能】
- 运行时间:从 [X] us 优化到 [Y] us
- 性能提升:[百分比]% 更快
- 内存占用:[Z] KB
- 运行时间:从 [X] us 优化到 [Y] us
- 性能提升:[百分比]% 更快
- 内存占用:[Z] KB
【和标准库对比】
【和标准库对比】
- 比基线快/慢 [差值] us(约 [倍数]x 倍速)
- 验证结果:[PASS/FAIL]
- 比基线快/慢 [差值] us(约 [倍数]x 倍速)
- 验证结果:[PASS/FAIL]
【优化版本】v2
【优化版本】v2
【优化内容】
【优化内容】
- [具体优化措施1]
- [具体优化措施2]
- 原因:[为什么这样优化]
- [具体优化措施1]
- [具体优化措施2]
- 原因:[为什么这样优化]
【优化后性能】
【优化后性能】
- 运行时间:从 [X] us 优化到 [Y] us
- 性能提升:相比 v1 [百分比]% 更快
- 内存占用:[Z] KB
- 运行时间:从 [X] us 优化到 [Y] us
- 性能提升:相比 v1 [百分比]% 更快
- 内存占用:[Z] KB
【和标准库对比】
【和标准库对比】
- 比基线快/慢 [差值] us(约 [倍数]x 倍速)
- 验证结果:[PASS/FAIL]
- 比基线快/慢 [差值] us(约 [倍数]x 倍速)
- 验证结果:[PASS/FAIL]
最终总结 (已达最大迭代次数: 2轮)
最终总结 (已达最大迭代次数: 2轮)
- 最佳版本:[vX]
- 总体性能提升:[百分比]%
- 最终加速比:[X]x
- 迭代次数:2 轮(已达上限)
- 优化策略:[列出关键技术]
- 适用场景:[说明最佳使用场景]
- 权衡考虑:[列出 trade-offs]
- 进一步优化建议:[如果时间允许,可以尝试的方向]
undefined- 最佳版本:[vX]
- 总体性能提升:[百分比]%
- 最终加速比:[X]x
- 迭代次数:2 轮(已达上限)
- 优化策略:[列出关键技术]
- 适用场景:[说明最佳使用场景]
- 权衡考虑:[列出 trade-offs]
- 进一步优化建议:[如果时间允许,可以尝试的方向]
undefinedResources
参考资源
- Compiler optimizations:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html - SIMD programming:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/ - Performance analysis:
https://perf.wiki.kernel.org/ - Algorithmic complexity:
https://www.bigocheatsheet.com/
Remember: Performance optimization is an iterative process. You are limited to 2 optimization iterations maximum. Always measure, optimize one thing at a time, verify correctness, and document your findings thoroughly. Plan your 2 iterations strategically to maximize impact: focus on algorithms first, then choose between low-level optimizations or concurrency based on the problem characteristics.
- 编译优化:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html - SIMD编程:
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/ - 性能分析:
https://perf.wiki.kernel.org/ - 算法复杂度:
https://www.bigocheatsheet.com/
请记住:性能优化是一个迭代过程。你最多只能执行2轮优化迭代。 务必先度量,每次只优化一个点,验证正确性,并完整记录你的发现。战略性规划你的2轮迭代以实现收益最大化:首先聚焦算法优化,然后根据问题特征选择底层优化或者并发方案。