write-compressor

Write Compressor

Overview

This skill provides strategies for implementing encoders that produce output compatible with existing decoders. It applies to tasks involving compression algorithms, arithmetic coding, entropy encoding, or any scenario requiring the construction of an encoder as the mathematical inverse of a decoder.

Core Principle: Encoder as Decoder Inverse

When implementing an encoder for an existing decoder, the encoder must be the exact mathematical inverse of the decoder operations. Every decoder operation has a corresponding encoder operation that must produce the exact values the decoder expects to read.

Mathematical Derivation First

Before writing any code:

Document the decoder's state machine - Identify all state variables (e.g., `low`, `high`, `range`, `fraction`) and how they evolve
Derive encoder operations algebraically - For each decoder read operation, derive what the encoder must write
Verify the inverse relationship on paper - Prove mathematically that encoder output → decoder input produces the original data
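
The three steps above can be illustrated on a deliberately small case. The sketch below assumes a hypothetical decoder routine, `decode_uint`, that reads a variable-length integer (7 payload bits per byte, LEB128-style); it is not taken from any particular decoder. The encoder `encode_uint` is then derived by reversing each decoder operation.

```python
def decode_uint(data, pos):
    """Hypothetical decoder: 7 payload bits per byte, low bits first.
    A set high bit (0x80) means another byte follows."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # decoder state: value, shift
        shift += 7
        if byte & 0x80 == 0:
            return value, pos


def encode_uint(value):
    """Inverse of decode_uint, derived step by step:
    the decoder ORs in 7 bits at increasing shifts, so the encoder
    peels off the low 7 bits first and sets 0x80 while bits remain."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


# Paper check: 300 = 2 * 128 + 44, so the bytes are (44 | 0x80), 2.
assert encode_uint(300) == bytes([0xAC, 0x02])
assert decode_uint(encode_uint(300), 0) == (300, 2)
```

The same pattern (list the decoder's state, invert each update, confirm with one hand-computed value) applies to the larger state machines of arithmetic coders.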

Recommended Approach

Verification Strategies

Unit Testing Individual Components

For arithmetic coding or similar algorithms, test each component independently:

Bit encoding/decoding in isolation
Integer encoding/decoding in isolation
Symbol encoding/decoding in isolation
Back-reference or special token encoding in isolation
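
As an illustration of testing one layer in isolation, here is a minimal sketch for the bit layer only. The helpers `pack_bits` and `unpack_bits` are hypothetical and assume MSB-first bit order; a real decoder may order bits differently, in which case the expected bytes change accordingly.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, MSB first, zero-padded."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i >> 3] |= 0x80 >> (i & 7)
    return bytes(out)


def unpack_bits(data, count):
    """Read back the first `count` bits, MSB first."""
    return [(data[i >> 3] >> (7 - (i & 7))) & 1 for i in range(count)]


def test_known_bytes():
    # 1010 0001 fills the first byte; the ninth bit starts a padded second byte.
    assert pack_bits([1, 0, 1, 0, 0, 0, 0, 1, 1]) == bytes([0xA1, 0x80])


def test_bit_round_trip():
    bits = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
    assert unpack_bits(pack_bits(bits), len(bits)) == bits
```

Only once tests like these pass does it make sense to stack integer, symbol, and back-reference encoding on top, each with its own isolated tests.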

Round-Trip Testing

original_data → encoder → compressed → decoder → recovered_data
assert original_data == recovered_data

Run round-trip tests at each complexity level before proceeding.
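
A round-trip harness over graded inputs might look like the sketch below. The run-length codec (`rle_encode`, `rle_decode`) is only a stand-in so the example runs on its own; in practice `encode` and `decode` would wrap the real encoder and the existing decoder, and the level names are illustrative.

```python
def rle_encode(data):
    """Toy stand-in encoder: (run length 1..255, byte value) pairs."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)


def rle_decode(blob):
    """Toy stand-in decoder for rle_encode."""
    out = bytearray()
    for k in range(0, len(blob), 2):
        out += bytes([blob[k + 1]]) * blob[k]
    return bytes(out)


# Inputs ordered from trivial to harder; stop at the first failing level.
LEVELS = [
    ("empty", b""),
    ("single byte", b"A"),
    ("one short run", b"AAAA"),
    ("run at the cap", b"A" * 255),
    ("run past the cap", b"A" * 256),
    ("mixed", b"AAABCCCCD" * 40),
]


def run_round_trips(encode, decode, levels):
    for name, original in levels:
        recovered = decode(encode(original))
        assert recovered == original, f"round trip failed at level: {name}"
    return len(levels)


run_round_trips(rle_encode, rle_decode, LEVELS)
```

Because the levels are ordered, the first failing name points at the smallest feature the encoder gets wrong.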

State Trace Comparison

Build a debugging mode that outputs the encoder state at each step. Feed the compressed output to the decoder with equivalent tracing enabled. Compare the two traces to find the first step at which they diverge.
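
One possible shape for the comparison, assuming each trace is a list of dictionaries with one entry per step (the field names `low` and `range` below are placeholders for whatever state the real coder carries):

```python
def first_divergence(encoder_trace, decoder_trace):
    """Return (step, {field: (encoder_value, decoder_value)}) for the first
    step where the traces disagree, or None if they match completely."""
    for step, (enc, dec) in enumerate(zip(encoder_trace, decoder_trace)):
        diff = {
            key: (enc.get(key), dec.get(key))
            for key in sorted(enc.keys() | dec.keys())
            if enc.get(key) != dec.get(key)
        }
        if diff:
            return step, diff
    if len(encoder_trace) != len(decoder_trace):
        shorter = min(len(encoder_trace), len(decoder_trace))
        return shorter, {"length": (len(encoder_trace), len(decoder_trace))}
    return None


enc = [{"low": 0, "range": 255}, {"low": 10, "range": 100}, {"low": 12, "range": 40}]
dec = [{"low": 0, "range": 255}, {"low": 10, "range": 100}, {"low": 12, "range": 41}]
assert first_divergence(enc, dec) == (2, {"range": (40, 41)})
```

Reporting only the first differing step keeps attention on the root cause; every later difference is usually a consequence of it.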

Common Pitfalls

1. Renormalization Formula Errors

In arithmetic coding, the renormalization step is critical. The formula for outputting bytes during renormalization must exactly match how the decoder reconstructs the fraction from bytes.

Prevention: Trace through specific numeric examples by hand. If the decoder reads bytes as `fraction += read_byte() - 1`, derive exactly what the encoder must output.
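
A hand trace of this kind can be written down as a small check. The sketch below rests on two assumptions that are not stated in the text and must be confirmed against the actual decoder: that the decoder multiplies `fraction` by a radix of 255 before adding, and that digits 0..254 are therefore stored as bytes 1..255. Under those assumptions the `- 1` on the decoder side forces a `+ 1` on the encoder side.

```python
RADIX = 255  # assumption: digits 0..254, stored as bytes 1..255


def decoder_absorb(fraction, byte):
    """Assumed decoder step: shift in one base-RADIX digit stored with a +1 offset."""
    return fraction * RADIX + (byte - 1)


def encoder_emit(digit):
    """Inverse of the decoder's '- 1': add 1 before writing the byte."""
    if not 0 <= digit < RADIX:
        raise ValueError(f"digit {digit} out of range")
    return digit + 1


def to_digits(value, count):
    """Split value into `count` base-RADIX digits, most significant first."""
    digits = []
    for _ in range(count):
        value, digit = divmod(value, RADIX)
        digits.append(digit)
    return digits[::-1]


# Hand trace: 1000 = 3 * 255 + 235, so the digits are 3, 235 and the bytes 4, 236.
value = 1000
stream = [encoder_emit(d) for d in to_digits(value, 2)]
assert stream == [4, 236]

fraction = 0
for byte in stream:
    fraction = decoder_absorb(fraction, byte)
assert fraction == value
```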

2. Off-by-One Errors

Common in:

Range calculations
Byte output values (e.g., `+1`, `-1`, `% 256` adjustments)
Loop bounds for flush/finalization

Prevention: Use concrete numeric examples with known expected outputs.
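
For range calculations, a concrete example with known outputs could look like this. It assumes an integer coder whose `low` and `high` are inclusive bounds; a decoder that uses a half-open range or a `low`/`range` pair needs different `+1`/`-1` placements, which is exactly what such a check exposes. The function name `narrow` is illustrative.

```python
def narrow(low, high, cum_lo, cum_hi, total):
    """Narrow the inclusive interval [low, high] to the sub-interval of a
    symbol whose cumulative frequency span is [cum_lo, cum_hi) out of total."""
    width = high - low + 1                          # +1: both ends are inclusive
    new_low = low + (width * cum_lo) // total
    new_high = low + (width * cum_hi) // total - 1  # -1: back to an inclusive end
    return new_low, new_high


# Two equally likely symbols on [0, 99] must split it with no gap and no overlap.
assert narrow(0, 99, 0, 1, 2) == (0, 49)
assert narrow(0, 99, 1, 2, 2) == (50, 99)
```

Dropping either the `+ 1` or the `- 1` makes one of these assertions fail immediately, long before the error would surface as a corrupted stream.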

3. Flushing/Finalization Errors

The final bytes to flush the encoder state are often implemented incorrectly.

Prevention: Test the flush procedure separately with known encoder states.
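
A minimal sketch of testing a flush on its own, using a hypothetical bit-level encoder state (`acc` holding `nbits` pending bits, MSB first, padded with zeros). An arithmetic coder's flush involves different state, but the approach is the same: pick a state, compute the expected final bytes by hand, and assert on them.

```python
def flush_bits(acc, nbits):
    """Emit the final byte for `nbits` pending bits (0 <= nbits < 8),
    left-aligned and zero-padded; emit nothing if no bits are pending."""
    if not 0 <= nbits < 8:
        raise ValueError(f"invalid pending bit count: {nbits}")
    if nbits == 0:
        return b""
    return bytes([(acc << (8 - nbits)) & 0xFF])


# Known states with hand-computed results.
assert flush_bits(0b101, 3) == bytes([0b10100000])
assert flush_bits(0b1, 1) == bytes([0x80])
assert flush_bits(0, 0) == b""
```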

4. Premature Optimization

Worrying about output size before achieving correctness.

Prevention: First make it work, then make it small. A working 3 KB output is usable; a broken 2 KB output is not.

5. Trial-and-Error Implementation

Making random changes to formulas hoping something works.

Prevention: Every change should be justified by mathematical reasoning about why the previous version was wrong and why the new version is correct.

6. Parallel Implementation Attempts

Creating multiple encoder files (`encoder.py`, `encoder2.py`, `encoder_v3.py`) spreads effort thin.

Prevention: Work on one implementation. Use version control to track changes. Debug deeply rather than rewriting from scratch.

Debugging Strategy

When the decoder crashes or produces wrong output:

Identify the first failure point - Where exactly does decoding first go wrong?
Compare states at that point - What did the encoder think the state was vs. what the decoder computed?
Trace backward - Find the operation that caused the divergence
Fix with mathematical justification - Don't just try random changes
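
One way to locate the first failure point is to shrink the input until the failure is as small as possible. The sketch below scans prefixes of the input and returns the shortest one that fails; `roundtrip_ok` is a caller-supplied predicate, and `toy_roundtrip_ok` is a hypothetical stand-in for a codec that mishandles the byte `0xFF`.

```python
def shortest_failing_prefix(data, roundtrip_ok):
    """Return the length of the shortest prefix of `data` that fails the
    round trip, or None if every prefix passes. A linear scan is used
    because failures are not guaranteed to be monotone in prefix length."""
    for n in range(len(data) + 1):
        if not roundtrip_ok(data[:n]):
            return n
    return None


def toy_roundtrip_ok(chunk):
    """Hypothetical stand-in: pretend the codec breaks on any 0xFF byte."""
    return b"\xff" not in chunk


# The fourth byte is the first one the toy codec cannot handle.
assert shortest_failing_prefix(b"abc\xffdef", toy_roundtrip_ok) == 4
```

With the failing prefix in hand, the encoder and decoder states only need to be compared over a handful of steps instead of the whole stream.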

For Segmentation Faults in Decoder

A segfault typically means:

Invalid memory access from corrupted indices
The compressed stream is structurally invalid
The encoder produced bytes the decoder interprets as impossible values

Debug by:

Adding bounds checking to the decoder (temporarily)
Printing decoder state before the crash
Identifying what impossible state was reached
Tracing back to what encoder output caused this
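
The temporary bounds check can be sketched as a small wrapper around table lookups. This assumes a Python model or port of the decoder; in a C decoder the equivalent would be an explicit range check before each array access. The names `checked_index` and `DecoderStateError` are illustrative.

```python
class DecoderStateError(Exception):
    """Raised when the decoder reaches a state that valid input cannot produce."""


def checked_index(table, index, **state):
    """Look up table[index], but fail loudly with a dump of the decoder state
    instead of reading out of bounds. Negative indices are rejected too,
    since Python would otherwise silently wrap them around."""
    if not 0 <= index < len(table):
        raise DecoderStateError(
            f"index {index} outside 0..{len(table) - 1}; decoder state: {state}"
        )
    return table[index]


symbols = [10, 20, 30]
assert checked_index(symbols, 1, pos=0) == 20
```

The state printed in the exception message is the "impossible state" to trace back to a specific encoder output.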
