Large-Scale Text Editing

Overview

This skill provides guidance for efficiently transforming large text files containing thousands to millions of lines. It covers strategies for understanding transformation requirements, designing efficient solutions (particularly with Vim macros), testing approaches, and verification techniques.

When to Use This Skill

- Transforming CSV, TSV, or other delimited files at scale
- Applying repetitive edits across files with millions of rows
- Working within keystroke or operation count constraints
- Using Vim macros, sed, awk, or similar batch processing tools
- Pattern-based text transformations requiring regex

Approach Strategy

Phase 1: Understand the Transformation

Before writing any transformation logic:

Assess file size first - Check the file size with `ls -lh` or `wc -l` before attempting to read it. Avoid reading multi-million line files directly.

Sample strategically - Extract samples from multiple locations:

Beginning:
```
head -n 100 input.csv > sample_head.csv
```

Middle:
```
sed -n '500000,500100p' input.csv > sample_middle.csv
```

End:
```
tail -n 100 input.csv > sample_tail.csv
```

Compare input and expected output - Identify all transformations needed:
- Column reordering or removal
- Delimiter changes
- Case transformations
- Whitespace handling
- Value appending or prepending
- Format conversions
Verify structural assumptions:
- Consistent column count across all rows
- Presence of header rows
- Empty lines or malformed rows
- Special characters that might break regex patterns
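
The structural checks above can be sketched with awk and grep. This is a minimal illustration, not a definitive validator: the sample data, the `workdir` and `INPUT` names, and the expected field count of 3 are assumptions, and splitting on a bare comma does not account for quoted fields that contain commas.

```shell
# Hypothetical sample data; in practice point INPUT at the real file.
workdir=$(mktemp -d)
INPUT="$workdir/input.csv"
printf 'id,name,city\n1,alice,paris\n\n2,bob\n3,carol,rome\n' > "$INPUT"

# Histogram of field counts per line. More than one bucket means the
# column count is not consistent across rows.
field_histogram=$(awk -F',' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$INPUT" | sort -n)
echo "$field_histogram"

# Count empty lines, then list the line numbers of non-empty rows whose
# field count differs from the expected one (assumed to be 3 here).
expected_fields=3
empty_lines=$(grep -c '^$' "$INPUT")
bad_rows=$(awk -F',' -v want="$expected_fields" 'NF != 0 && NF != want { print NR }' "$INPUT")
echo "empty lines: $empty_lines; malformed rows at lines: $bad_rows"
```

Each check is a single streaming pass, so the same commands can be pointed at the full file rather than a sample.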

Phase 2: Design the Solution

When designing transformations:

Break complex transformations into discrete steps - Each step should handle one logical transformation. This improves debuggability and allows independent testing.
Choose the right tool for the scale:
- Vim macros: Excellent for complex, multi-step transformations, and a good fit when keystroke counts are constrained
- sed: Fast for simple substitutions across large files
- awk: Powerful for column manipulation and conditional logic
- Perl/Python: For complex logic that exceeds regex capabilities
Design for efficiency:
- Minimize the number of passes through the file
- Use line-based operations (`:%normal!` in Vim) rather than iterating with explicit loops
- Leverage built-in commands (e.g., `gU` for uppercase in Vim) over manual character manipulation
Document design decisions - Record why specific approaches were chosen, especially when multiple valid alternatives exist.
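
As one way to minimize passes, several logical steps can be folded into a single awk invocation. The sketch below assumes a hypothetical three-column file (name, age, city) and an illustrative target format; neither comes from a real task.

```shell
# Hypothetical input: name,age,city
workdir=$(mktemp -d)
printf 'alice,30,paris\nbob,25,rome\n' > "$workdir/input.csv"

# One pass over the file: swap the first two columns, uppercase the
# name, and switch the delimiter from comma to semicolon.
awk 'BEGIN { FS = ","; OFS = ";" } { print $2, toupper($1), $3 }' \
    "$workdir/input.csv" > "$workdir/output.csv"

cat "$workdir/output.csv"
```

While designing, each step can still be developed and tested as its own command, then merged once every step is known to be correct.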

Phase 3: Test Incrementally

Create a test sample - Use a small subset (100-1000 lines) for initial testing:
```
head -n 100 input.csv > test_input.csv
head -n 100 expected.csv > test_expected.csv
```

Test each transformation independently - Verify each macro or command produces correct output before combining.
Verify with diff - Compare the actual and expected output exactly; `diff` prints nothing and exits 0 when the files are identical:
```
diff test_output.csv test_expected.csv
```
Check for edge cases in test output:
- First and last lines transformed correctly
- Lines with varying content lengths handled
- Special characters preserved or transformed as expected
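
The test loop described above might look like the following sketch. The `transform` function is a hypothetical stand-in for whatever macro or command is under test, and the two-line sample files are made up for illustration.

```shell
workdir=$(mktemp -d)
printf 'a,b\nc,d\n' > "$workdir/test_input.csv"
printf 'a;b\nc;d\n' > "$workdir/test_expected.csv"

# Hypothetical transformation under test: replace commas with semicolons.
transform() { sed 's/,/;/g' "$1"; }

transform "$workdir/test_input.csv" > "$workdir/test_output.csv"

# diff exits 0 when the files are identical and 1 when they differ.
if diff "$workdir/test_output.csv" "$workdir/test_expected.csv" > "$workdir/diff_report.txt"; then
    test_status="PASS"
else
    test_status="FAIL"
    head -n 20 "$workdir/diff_report.txt"   # show only the first differences
fi
echo "$test_status"
```

Capping the report with `head` keeps a systematic failure from flooding the terminal with thousands of differing lines.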

Phase 4: Execute with Safeguards

Create backups before in-place modifications:
```
cp input.csv input.csv.backup
```
Set appropriate timeouts - For million-row files, allow sufficient processing time (e.g., 2-5 minutes depending on complexity).
Monitor progress when possible - Use tools that show progress or check intermediate output.
Verify final output:
- Confirm row count matches: `wc -l output.csv`
- Run diff against expected output
- Spot-check samples from different file locations
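
A guarded run along these lines could be sketched as follows. The sed substitution is a placeholder transformation, the 300-second limit is an arbitrary illustrative value, and `timeout` is assumed to be the GNU coreutils utility, which is not installed on every system.

```shell
workdir=$(mktemp -d)
printf 'a,b\nc,d\ne,f\n' > "$workdir/input.csv"

# 1. Back up before touching anything.
cp "$workdir/input.csv" "$workdir/input.csv.backup"

# 2. Run the (hypothetical) transformation under a time limit.
#    Fall back to an unlimited run if `timeout` is not available.
if command -v timeout >/dev/null 2>&1; then
    timeout 300 sed 's/,/;/g' "$workdir/input.csv" > "$workdir/output.csv"
else
    sed 's/,/;/g' "$workdir/input.csv" > "$workdir/output.csv"
fi
run_status=$?

# 3. A line-by-line transformation must leave the row count unchanged.
rows_in=$(wc -l < "$workdir/input.csv" | tr -d ' ')
rows_out=$(wc -l < "$workdir/output.csv" | tr -d ' ')
if [ "$run_status" -eq 0 ] && [ "$rows_in" -eq "$rows_out" ]; then
    echo "OK: $rows_out rows"
else
    echo "FAILED: status=$run_status rows_in=$rows_in rows_out=$rows_out" >&2
fi
```

Writing to a separate output file rather than editing in place leaves the input usable for a retry even if the backup step is skipped.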

Vim-Specific Guidance

Macro Design Principles

- Register allocation: Use distinct registers (a, b, c) for different transformation stages
- Keystroke efficiency: Prefer built-in commands over character-by-character operations
- Regex patterns: Use non-greedy patterns (`\{-}` in Vim) and explicit delimiters such as `[^,]*` to avoid over-matching
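
The over-matching risk can be demonstrated outside Vim as well. The sketch below uses sed, which has no non-greedy quantifier, so the explicit delimiter class is the only option there; the sample line is made up.

```shell
line='alice,30,paris'

# Greedy: `.*,` runs to the LAST comma, so this strips two fields.
greedy=$(printf '%s\n' "$line" | sed 's/^.*,//')

# Explicit delimiter class: `[^,]*,` stops at the FIRST comma.
bounded=$(printf '%s\n' "$line" | sed 's/^[^,]*,//')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

The bounded form also behaves the same in Vim, sed, and awk, which makes it easier to move a pattern between tools.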

Common Vim Patterns for Large Files

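| Task | Approach |
| --- | --- |
| Apply macro to all lines | `:%normal! @a` |
| Uppercase transformation | `gU` motion or `\U` in substitution |
| Column manipulation | Capture groups with `\(\)` and backreferences `\1`, `\2` |
| Delimiter replacement | `:s/old_delim/new_delim/g` (current line only; use `:%s` for every line) |
| Whitespace removal | `:s/\s\+//g` (current line only; use `:%s` for every line) |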
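
Two of these patterns can be applied non-interactively by running Vim in silent Ex mode from a shell. This is a sketch under assumptions: it requires a `vim` binary on the `PATH` (the block skips itself otherwise), and the file name and contents are hypothetical.

```shell
workdir=$(mktemp -d)
printf 'alice,30\nbob,25\n' > "$workdir/people.csv"

if command -v vim >/dev/null 2>&1; then
    # -N: nocompatible, -u NONE: skip vimrc, -n: no swap file,
    # -es: silent Ex mode, so no terminal is needed.
    vim -N -u NONE -n -es \
        -c '%s/^[^,]*/\U&/' \
        -c '%s/,/;/g' \
        -c 'wq' "$workdir/people.csv"
    cat "$workdir/people.csv"
else
    echo "vim not found; skipping" >&2
fi
```

The first `-c` command uppercases the first field with `\U`, the second replaces the delimiter on every line, and `wq` writes the file in place, so a backup should exist before this is run on real data.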

Escaping in Vim Scripts

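When using `setreg()` for macro definitions:

- Escape backslashes: write `\\` for a literal backslash inside a double-quoted string
- Use `\r` for carriage return (the Enter key in a macro)
- Special characters may need double-escaping, for example when the Vim command is itself wrapped in a shell string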

Verification Checklist

Common Pitfalls

| Pitfall | Prevention |
| --- | --- |
| Reading large files directly | Always check file size first; use head/tail/sed for sampling |
| No backup before in-place edit | Create backup copy before any modification |
| Testing only on first few lines | Sample from multiple file locations |
| Assuming uniform structure | Verify structure with samples from different positions |
| Regex over-matching | Use explicit delimiters and non-greedy quantifiers |
| Insufficient timeout | Calculate expected processing time for file size |
| Not verifying exit codes | Check tool exit status after operations |
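
For the exit-code pitfall, a small wrapper is one possible way to make failures visible. The `run_step` helper and the file names below are hypothetical, introduced only for illustration.

```shell
workdir=$(mktemp -d)
printf 'a,b\n' > "$workdir/input.csv"

# Run a command, report a failure with its exit status, and propagate it.
run_step() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "step failed (exit $status): $*" >&2
    fi
    return "$status"
}

# A step that succeeds on an existing file.
run_step sed -n '1p' "$workdir/input.csv" > /dev/null
good_status=$?

# A step that fails because its input file does not exist.
run_step sed -n '1p' "$workdir/missing.csv" > /dev/null 2>&1
bad_status=$?

echo "valid input -> exit $good_status; missing input -> exit $bad_status"
```

The exact non-zero value for the missing file differs between sed implementations, so checks should test for non-zero rather than a specific number.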

Efficiency Considerations

When keystroke or operation counts matter:

- Count accurately - Understand what constitutes a "keystroke" in the specific context (escape sequences, special keys)
- Combine operations - A single regex substitution may replace multiple simpler operations
- Use built-in commands - Native commands are typically more efficient than manual equivalents
- Minimize redundancy - Avoid repeated file reads or redundant transformations
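
To illustrate combining operations, the sketch below performs a column swap, a delimiter change, and a value append first as three substitutions and then as one. The sample line and the `;ok` suffix are made up for the example.

```shell
line='alice,30'

# Three separate substitutions: swap the columns, change the delimiter,
# then append a value. Each one is a separate pass over the data.
step_by_step=$(printf '%s\n' "$line" \
    | sed 's/^\([^,]*\),\(.*\)$/\2,\1/' \
    | sed 's/,/;/' \
    | sed 's/$/;ok/')

# One substitution that does the same work in a single pass.
combined=$(printf '%s\n' "$line" | sed 's/^\([^,]*\),\(.*\)$/\2;\1;ok/')

echo "step by step: $step_by_step"
echo "combined:     $combined"
```

The same capture-group pattern works in a Vim `:s` command, where the single substitution also costs fewer keystrokes than three.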
