write-compressor
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseWrite Compressor
编写压缩器
Overview
概述
This skill provides strategies for implementing encoders that produce output compatible with existing decoders. It applies to tasks involving compression algorithms, arithmetic coding, entropy encoding, or any scenario requiring the construction of an encoder as the mathematical inverse of a decoder.
本技能提供了实现可生成与现有解码器兼容输出的编码器的策略。适用于涉及压缩算法、arithmetic coding、熵编码,或是任何需要构建作为解码器数学逆运算的编码器的场景。
Core Principle: Encoder as Decoder Inverse
核心原则:编码器作为解码器的逆运算
When implementing an encoder for an existing decoder, the encoder must be the exact mathematical inverse of the decoder operations. Every decoder operation has a corresponding encoder operation that must produce the exact values the decoder expects to read.
当为现有解码器实现编码器时,编码器必须是解码器操作的精确数学逆运算。每个解码器操作都有对应的编码器操作,且该操作必须生成解码器期望读取的精确值。
Mathematical Derivation First
先进行数学推导
Before writing any code:
- Document the decoder's state machine - Identify all state variables (e.g., ,
low,high,range) and how they evolvefraction - Derive encoder operations algebraically - For each decoder read operation, derive what the encoder must write
- Verify the inverse relationship on paper - Prove mathematically that encoder output → decoder input produces the original data
在编写任何代码之前:
- 记录解码器的状态机 - 确定所有状态变量(如、
low、high、range)及其演变方式fraction - 通过代数方法推导编码器操作 - 针对每个解码器读取操作,推导编码器必须输出的内容
- 在纸上验证逆运算关系 - 通过数学方法证明:编码器输出→解码器输入可还原原始数据
Recommended Approach
推荐实现步骤
Phase 1: Understand the Decoder Completely
阶段1:全面理解解码器
- Read the entire decoder implementation thoroughly
- Trace through the decoder manually with simple inputs
- Document every state variable and its valid ranges
- Identify the bit/byte reading patterns and what values they produce
- Map out conditional branches and what triggers each path
- 通读整个解码器实现代码
- 手动跟踪解码器处理简单输入的过程
- 记录每个状态变量及其有效范围
- 识别比特/字节读取模式及其生成的值
- 梳理条件分支及触发每个分支的条件
Phase 2: Start with Minimal Cases
阶段2:从最简案例入手
Build the encoder incrementally, verifying each step:
- Zero operations - Encode an empty/minimal input and verify it decodes correctly
- Single simple operation - Encode one basic element (e.g., one literal character)
- Two operations - Verify state carries correctly between operations
- Gradually increase complexity - Only after simpler cases work
逐步构建编码器,每一步都进行验证:
- 零操作 - 对空/最小化输入进行编码,验证解码结果正确
- 单个简单操作 - 编码一个基础元素(如一个字面量字符)
- 两个操作 - 验证操作之间的状态传递是否正确
- 逐步增加复杂度 - 仅在简单案例验证通过后再推进
Phase 3: Side-by-Side State Verification
阶段3:并行状态验证
Create a verification harness that:
- Runs encoder and decoder in parallel (or simulates this)
- Compares state variables after every single operation
- Immediately identifies the first point of divergence
- Prints both encoder and decoder state at each step for debugging
创建一个验证工具,实现以下功能:
- 并行运行编码器和解码器(或模拟并行运行)
- 每次操作后对比状态变量
- 立即识别首次出现状态分歧的位置
- 每一步都打印编码器和解码器的状态,便于调试
Phase 4: Full Implementation
阶段4:完整实现
Only after phases 1-3 succeed, proceed to full file encoding.
仅在阶段1-3全部完成且验证通过后,再进行完整文件编码的开发。
Verification Strategies
验证策略
Unit Testing Individual Components
组件单元测试
For arithmetic coding or similar algorithms, test each component independently:
- Bit encoding/decoding in isolation
- Integer encoding/decoding in isolation
- Symbol encoding/decoding in isolation
- Back-reference or special token encoding in isolation
对于arithmetic coding或类似算法,独立测试每个组件:
- 独立测试比特编码/解码
- 独立测试整数编码/解码
- 独立测试符号编码/解码
- 独立测试反向引用或特殊令牌编码
Round-Trip Testing
往返测试
original_data → encoder → compressed → decoder → recovered_data
assert original_data == recovered_dataRun round-trip tests at each complexity level before proceeding.
original_data → encoder → compressed → decoder → recovered_data
assert original_data == recovered_data在推进到更高复杂度之前,在每个复杂度级别都运行往返测试。
State Trace Comparison
状态轨迹对比
Build a debugging mode that outputs encoder state at each step. Feed the compressed output to the decoder with similar tracing. Compare traces to find divergence.
构建调试模式,输出编码器每一步的状态。将压缩后的输入提供给开启了类似跟踪功能的解码器,对比轨迹以找出分歧点。
Common Pitfalls
常见陷阱
1. Renormalization Formula Errors
1. 重归一化公式错误
In arithmetic coding, the renormalization step is critical. The formula for outputting bytes during renormalization must exactly match how the decoder reconstructs the fraction from bytes.
Prevention: Trace through specific numeric examples by hand. If the decoder reads bytes as , derive exactly what the encoder must output.
fraction += read_byte() - 1在arithmetic coding中,重归一化步骤至关重要。重归一化过程中输出字节的公式必须与解码器从字节重构fraction的方式完全匹配。
预防措施:手动跟踪具体数值示例。如果解码器以的方式读取字节,精确推导编码器必须输出的内容。
fraction += read_byte() - 12. Off-by-One Errors
2. 差一错误
Common in:
- Range calculations
- Byte output values (e.g., ,
+1,-1adjustments)% 256 - Loop bounds for flush/finalization
Prevention: Use concrete numeric examples with known expected outputs.
常见于:
- 范围计算
- 字节输出值(如、
+1、-1调整)% 256 - 刷新/收尾的循环边界
预防措施:使用已知预期输出的具体数值示例进行测试。
3. Flushing/Finalization Errors
3. 刷新/收尾错误
The final bytes to flush the encoder state are often implemented incorrectly.
Prevention: Test the flush procedure separately with known encoder states.
用于刷新编码器状态的最终字节通常会被错误实现。
预防措施:使用已知的编码器状态单独测试刷新流程。
4. Premature Optimization
4. 过早优化
Worrying about output size before achieving correctness.
Prevention: First make it work, then make it small. A working 3KB output is infinitely better than a broken 2KB output.
在确保正确性之前就担心输出大小。
预防措施:先确保功能正常,再优化大小。一个可正常工作的3KB输出远比一个损坏的2KB输出有价值。
5. Trial-and-Error Implementation
5. 试错式实现
Making random changes to formulas hoping something works.
Prevention: Every change should be justified by mathematical reasoning about why the previous version was wrong and why the new version is correct.
随意修改公式,寄希望于某个修改能生效。
预防措施:每一次修改都必须有数学依据,说明之前的版本为何错误,新版本为何正确。
6. Parallel Implementation Attempts
6. 并行尝试多个实现
Creating multiple encoder files (, , ) spreads effort thin.
encoder.pyencoder2.pyencoder_v3.pyPrevention: Work on one implementation. Use version control to track changes. Debug deeply rather than rewriting from scratch.
创建多个编码器文件(如、、)会分散精力。
encoder.pyencoder2.pyencoder_v3.py预防措施:专注于一个实现。使用版本控制跟踪变更。深入调试,而非从头重写。
Debugging Strategy
调试策略
When the decoder crashes or produces wrong output:
- Identify the first failure point - Where exactly does decoding first go wrong?
- Compare states at that point - What did the encoder think the state was vs. what the decoder computed?
- Trace backward - Find the operation that caused the divergence
- Fix with mathematical justification - Don't just try random changes
当解码器崩溃或输出错误结果时:
- 确定首次失败点 - 解码过程具体在哪个位置首次出错?
- 对比该点的状态 - 编码器认为的状态与解码器计算出的状态有何不同?
- 反向跟踪 - 找出导致分歧的操作
- 基于数学依据修复 - 不要随意尝试修改
For Segmentation Faults in Decoder
针对解码器中的段错误
A segfault typically means:
- Invalid memory access from corrupted indices
- The compressed stream is structurally invalid
- The encoder produced bytes the decoder interprets as impossible values
Debug by:
- Adding bounds checking to the decoder (temporarily)
- Printing decoder state before the crash
- Identifying what impossible state was reached
- Tracing back to what encoder output caused this
段故障通常意味着:
- 损坏的索引导致无效内存访问
- 压缩流结构无效
- 编码器生成了解码器视为不可能值的字节
调试方法:
- 临时为解码器添加边界检查
- 在崩溃前打印解码器状态
- 确定到达了哪个不可能的状态
- 反向跟踪导致该状态的编码器输出
Decision Checklist
验收检查清单
Before claiming the encoder is complete:
- Does the simplest possible input (empty/zero) encode and decode correctly?
- Does a single-element input encode and decode correctly?
- Have edge cases been tested (empty strings, maximum values, boundary conditions)?
- Has a side-by-side state trace been performed for at least one non-trivial input?
- Does the full input encode and decode correctly?
- If there are size constraints, does the output meet them?
在宣称编码器完成之前:
- 最简单的输入(空/零值)是否能正确编码和解码?
- 单个元素的输入是否能正确编码和解码?
- 是否测试了边缘案例(空字符串、最大值、边界条件)?
- 是否至少对一个非平凡输入执行了并行状态轨迹对比?
- 完整输入是否能正确编码和解码?
- 如果有大小限制,输出是否满足要求?