large-scale-text-editing
Large-Scale Text Editing
Overview
This skill provides guidance for efficiently transforming large text files containing thousands to millions of lines. It covers strategies for understanding transformation requirements, designing efficient solutions (particularly with Vim macros), testing approaches, and verification techniques.
When to Use This Skill
- Transforming CSV, TSV, or other delimited files at scale
- Applying repetitive edits across files with millions of rows
- Working within keystroke or operation count constraints
- Using Vim macros, sed, awk, or similar batch processing tools
- Pattern-based text transformations requiring regex
Approach Strategy
Phase 1: Understand the Transformation
Before writing any transformation logic:
- Assess file size first - Check file size with `ls -lh` or `wc -l` before attempting to read. Avoid reading multi-million line files directly.
- Sample strategically - Extract samples from multiple locations:
  - Beginning: `head -n 100 input.csv > sample_head.csv`
  - Middle: `sed -n '500000,500100p' input.csv > sample_middle.csv`
  - End: `tail -n 100 input.csv > sample_tail.csv`
- Compare input and expected output - Identify all transformations needed:
  - Column reordering or removal
  - Delimiter changes
  - Case transformations
  - Whitespace handling
  - Value appending or prepending
  - Format conversions
- Verify structural assumptions:
  - Consistent column count across all rows
  - Presence of header rows
  - Empty lines or malformed rows
  - Special characters that might break regex patterns
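The sampling and structural checks above can be sketched as a short shell session. The file name `input.csv`, the comma delimiter, and the generated 1000-line demo data are assumptions made for illustration; the column tally also assumes no quoted fields containing commas.

```shell
# Demo setup: build a 1000-line CSV in a temp dir (stand-in for a real large file).
workdir=$(mktemp -d)
input="$workdir/input.csv"
awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "id%d,name%d,%d\n", i, i, i * 2 }' > "$input"

# 1. Check size and line count without opening the file in an editor.
ls -lh "$input"
total=$(wc -l < "$input")

# 2. Sample the beginning, middle, and end (5 lines each in this demo).
mid=$((total / 2))
head -n 5 "$input" > "$workdir/sample_head.csv"
sed -n "${mid},$((mid + 4))p" "$input" > "$workdir/sample_middle.csv"
tail -n 5 "$input" > "$workdir/sample_tail.csv"

# 3. Tally column counts; more than one distinct count means the structure is not uniform.
awk -F',' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' \
    "$input" > "$workdir/column_counts.txt"
cat "$workdir/column_counts.txt"

# 4. Count empty lines (grep -c exits non-zero when the count is 0, hence "|| true").
empty=$(grep -c '^$' "$input" || true)
echo "empty lines: $empty"
```

A single line in `column_counts.txt` suggests every row has the same number of fields; several lines point to malformed rows worth inspecting before any transformation is designed.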
Phase 2: Design the Solution
When designing transformations:
- Break complex transformations into discrete steps - Each step should handle one logical transformation. This improves debuggability and allows independent testing.
- Choose the right tool for the scale:
  - Vim macros: Excellent for complex, multi-step transformations; economical in keystrokes
  - sed: Fast for simple substitutions across large files
  - awk: Powerful for column manipulation and conditional logic
  - Perl/Python: For complex logic that exceeds regex capabilities
- Design for efficiency:
  - Minimize the number of passes through the file
  - Use line-based operations (`:%normal!` in Vim) rather than iterating with explicit loops
  - Leverage built-in commands (e.g., `gU` for uppercase in Vim) over manual character manipulation
- Document design decisions - Record why specific approaches were chosen, especially when multiple valid alternatives exist.
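As one illustration of minimizing passes, the sketch below performs a column reorder, a case change, and a delimiter change in a single awk pass. The three-column layout and the target format are hypothetical, chosen only to keep the example small.

```shell
workdir=$(mktemp -d)
printf 'alice,30,paris\nbob,25,berlin\n' > "$workdir/input.csv"

# One pass over the file: swap columns 1 and 3, uppercase the new first column,
# and switch the delimiter from comma to semicolon.
awk 'BEGIN { FS = ","; OFS = ";" } { print toupper($3), $2, $1 }' \
    "$workdir/input.csv" > "$workdir/output.csv"

cat "$workdir/output.csv"
```

The same result could be reached with three chained sed commands, but each would reread the data; on million-row files the single pass is usually the cheaper design.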
Phase 3: Test Incrementally
- Create a test sample - Use a small subset (100-1000 lines) for initial testing:

  ```bash
  head -n 100 input.csv > test_input.csv
  head -n 100 expected.csv > test_expected.csv
  ```
- Test each transformation independently - Verify each macro or command produces correct output before combining.
- Verify with diff - Use byte-for-byte comparison:

  ```bash
  diff test_output.csv test_expected.csv
  ```
- Check for edge cases in test output:
  - First and last lines transformed correctly
  - Lines with varying content lengths handled
  - Special characters preserved or transformed as expected
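The incremental workflow might look like the following sketch, in which each step writes its own intermediate file so it can be inspected on its own. The two transformation steps (delimiter change, then uppercasing) and the tiny demo files are assumptions for illustration.

```shell
workdir=$(mktemp -d)
# Stand-ins for the real input and expected output (3 lines each).
printf 'a,b\nc,d\ne,f\n' > "$workdir/input.csv"
printf 'A;B\nC;D\nE;F\n' > "$workdir/expected.csv"

# Take the same leading slice of both files.
head -n 2 "$workdir/input.csv"    > "$workdir/test_input.csv"
head -n 2 "$workdir/expected.csv" > "$workdir/test_expected.csv"

# Step 1 alone: delimiter change. Inspect step1.csv before adding the next step.
sed 's/,/;/g' "$workdir/test_input.csv" > "$workdir/step1.csv"

# Step 2 alone, applied to step 1's output: uppercase.
tr '[:lower:]' '[:upper:]' < "$workdir/step1.csv" > "$workdir/test_output.csv"

# diff prints nothing and exits 0 when the files are identical.
if diff "$workdir/test_output.csv" "$workdir/test_expected.csv"; then
    result="match"
else
    result="mismatch"
fi
echo "$result"
```

Keeping `step1.csv` around makes it clear which step introduced a difference when the final diff is not empty.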
Phase 4: Execute with Safeguards
- Create backups before in-place modifications:

  ```bash
  cp input.csv input.csv.backup
  ```
- Set appropriate timeouts - For million-row files, allow sufficient processing time (e.g., 2-5 minutes depending on complexity).
- Monitor progress when possible - Use tools that show progress or check intermediate output.
- Verify final output:
  - Confirm row count matches: `wc -l output.csv`
  - Run diff against expected output
  - Spot-check samples from different file locations
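A guarded run could be sketched as below. The 300-second limit, the sed command standing in for the real transformation, and the generated demo file are all assumptions; `timeout` comes from GNU coreutils and may be absent on some systems, so the sketch falls back to an unguarded run.

```shell
workdir=$(mktemp -d)
input="$workdir/input.csv"
output="$workdir/output.csv"
awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "id%d,%d\n", i, i }' > "$input"

# Back up before modifying anything.
cp "$input" "$input.backup"

# Run the transformation with a time limit when `timeout` is available.
if command -v timeout >/dev/null 2>&1; then
    timeout 300 sed 's/,/;/g' "$input" > "$output"
else
    sed 's/,/;/g' "$input" > "$output"
fi
status=$?   # exit status of the command that ran inside the if/else

# Verify the exit code and that no rows were lost or added.
in_rows=$(wc -l < "$input")
out_rows=$(wc -l < "$output")
if [ "$status" -eq 0 ] && [ "$in_rows" -eq "$out_rows" ]; then
    echo "ok: $out_rows rows"
else
    echo "failed: status=$status, rows $in_rows -> $out_rows" >&2
fi
```

Writing to a separate output file, as here, leaves the input untouched; the backup matters most when the tool edits in place.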
Vim-Specific Guidance
Macro Design Principles
- Register allocation: Use distinct registers (a, b, c) for different transformation stages
- Keystroke efficiency: Prefer built-in commands over character-by-character operations
- Regex patterns: Use non-greedy patterns and explicit delimiters to avoid over-matching
Common Vim Patterns for Large Files
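| Task | Approach |
|---|---|
| Apply macro to all lines | e.g. `:%normal! @a` (macro stored in register `a`) |
| Uppercase transformation | `gU` with a motion, e.g. `gUU` for the whole line |
| Column manipulation | Capture groups with `\(...\)` in a `:s` substitution |
| Delimiter replacement | e.g. `:%s/,/;/g` (delimiters are examples) |
| Whitespace removal | e.g. `:%s/\s\+$//` (trailing whitespace) |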
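For batch use, patterns like these can be run without opening an interactive session. The sketch below is one possible invocation, assuming `vim` is installed and a hypothetical two-column file; it swaps the columns with capture groups, uppercases one of them, and changes the delimiter in a single substitution.

```shell
workdir=$(mktemp -d)
printf 'alice,paris\nbob,berlin\n' > "$workdir/data.csv"

if command -v vim >/dev/null 2>&1; then
    # -es: silent Ex mode; -u NONE / -i NONE: skip vimrc and viminfo; -N: nocompatible.
    # \v enables "very magic" syntax so ( ) group without backslashes.
    # \U...\E uppercases the second captured column in the replacement.
    vim -N -u NONE -i NONE -es \
        -c '%s/\v^([^,]*),([^,]*)$/\U\2\E;\1/' \
        -c 'wq' "$workdir/data.csv"
    cat "$workdir/data.csv"
else
    echo "vim not found; skipping" >&2
fi
```

Because the edit is written back in place by `wq`, a backup of the original file is advisable before running this on real data.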
Escaping in Vim Scripts
When using `setreg()` for macro definitions:
- Escape backslashes: `\\` for literal backslash
- Use `\r` for carriage return
- Special characters may need double-escaping
Verification Checklist
Before considering the task complete:
- Output file exists and is non-empty
- Row count matches expected count
- Byte-for-byte diff passes against expected output (if available)
- Spot-check samples from beginning, middle, and end of file
- Any constraints (keystroke limits, command restrictions) are satisfied
- Tool exited with success code (exit code 0)
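Several of the checklist items lend themselves to a scripted check. The helper name `verify_output` and the demo files below are hypothetical; the sketch covers only the existence, row-count, and byte-comparison items, and leaves spot-checks and constraint checks to manual review.

```shell
# verify_output ACTUAL EXPECTED: returns 0 only if every check passes.
verify_output() {
    actual=$1
    expected=$2
    # Output file exists and is non-empty.
    [ -s "$actual" ] || { echo "missing or empty: $actual" >&2; return 1; }
    # Row count matches the expected count.
    a_rows=$(wc -l < "$actual")
    e_rows=$(wc -l < "$expected")
    [ "$a_rows" -eq "$e_rows" ] || { echo "row count $a_rows != $e_rows" >&2; return 1; }
    # Byte-for-byte comparison.
    cmp -s "$actual" "$expected" || { echo "content differs" >&2; return 1; }
    return 0
}

workdir=$(mktemp -d)
printf 'A;B\nC;D\n' > "$workdir/expected.csv"
printf 'A;B\nC;D\n' > "$workdir/good.csv"
printf 'A;B\nC;X\n' > "$workdir/bad.csv"

good_status=0
verify_output "$workdir/good.csv" "$workdir/expected.csv" || good_status=$?
bad_status=0
verify_output "$workdir/bad.csv" "$workdir/expected.csv" || bad_status=$?
echo "good=$good_status bad=$bad_status"
```

Ordering the checks from cheapest to most expensive means a missing or truncated output is reported before the full comparison runs.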
Common Pitfalls
| Pitfall | Prevention |
|---|---|
| Reading large files directly | Always check file size first; use head/tail/sed for sampling |
| No backup before in-place edit | Create backup copy before any modification |
| Testing only on first few lines | Sample from multiple file locations |
| Assuming uniform structure | Verify structure with samples from different positions |
| Regex over-matching | Use explicit delimiters and non-greedy quantifiers |
| Insufficient timeout | Calculate expected processing time for file size |
| Not verifying exit codes | Check tool exit status after operations |
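The over-matching pitfall is easy to reproduce. In this sketch the goal is to drop only the first comma-separated field from a made-up line; POSIX sed has no non-greedy quantifier, so the fix shown uses an explicit delimiter class instead (in Vim, the non-greedy form is `\{-}`).

```shell
line='alpha,beta,gamma'

# Greedy: ".*," runs to the LAST comma, so two fields disappear.
greedy=$(printf '%s\n' "$line" | sed 's/^.*,//')

# Explicit delimiter class: "[^,]*," cannot cross a comma, so it stops at the FIRST one.
bounded=$(printf '%s\n' "$line" | sed 's/^[^,]*,//')

echo "greedy:  $greedy"
echo "bounded: $bounded"
```

The greedy pattern leaves only `gamma`, while the bounded pattern leaves `beta,gamma`, which is the intended result.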
Efficiency Considerations
When keystroke or operation counts matter:
- Count accurately - Understand what constitutes a "keystroke" in the specific context (escape sequences, special keys)
- Combine operations - A single regex substitution may replace multiple simpler operations
- Use built-in commands - Native commands are typically more efficient than manual equivalents
- Minimize redundancy - Avoid repeated file reads or redundant transformations
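To illustrate combining operations, the sketch below compares three simple substitutions against one substitution that does the same work. The input format (spaces around commas, to be normalized to semicolons) is an assumption chosen for the demo.

```shell
workdir=$(mktemp -d)
printf 'alice , 30\nbob ,25\n' > "$workdir/input.csv"

# Three simple operations, three passes over the data:
# strip spaces before commas, strip spaces after commas, then change the delimiter.
sed 's/ *,/,/g' "$workdir/input.csv" | sed 's/, */,/g' | sed 's/,/;/g' > "$workdir/multi.csv"

# One substitution covering all three.
sed 's/ *, */;/g' "$workdir/input.csv" > "$workdir/single.csv"

# cmp exits 0 and prints nothing when the two results are byte-identical.
cmp "$workdir/multi.csv" "$workdir/single.csv" && echo "identical"
cat "$workdir/single.csv"
```

Where keystrokes or operations are counted, the single pattern is both shorter to type and cheaper to run, at the cost of being slightly harder to read.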