large-scale-text-editing

Large-Scale Text Editing

Overview

This skill provides guidance for efficiently transforming large text files containing thousands to millions of lines. It covers strategies for understanding transformation requirements, designing efficient solutions (particularly with Vim macros), testing approaches, and verification techniques.

When to Use This Skill

  • Transforming CSV, TSV, or other delimited files at scale
  • Applying repetitive edits across files with millions of rows
  • Working within keystroke or operation count constraints
  • Using Vim macros, sed, awk, or similar batch processing tools
  • Pattern-based text transformations requiring regex

Approach Strategy

Phase 1: Understand the Transformation

Before writing any transformation logic:
  1. Assess file size first - Check file size with `ls -lh` or `wc -l` before attempting to read. Avoid reading multi-million line files directly.
  2. Sample strategically - Extract samples from multiple locations:
    • Beginning:
      head -n 100 input.csv > sample_head.csv
    • Middle:
      sed -n '500000,500100p' input.csv > sample_middle.csv
    • End:
      tail -n 100 input.csv > sample_tail.csv
  3. Compare input and expected output - Identify all transformations needed:
    • Column reordering or removal
    • Delimiter changes
    • Case transformations
    • Whitespace handling
    • Value appending or prepending
    • Format conversions
  4. Verify structural assumptions:
    • Consistent column count across all rows
    • Presence of header rows
    • Empty lines or malformed rows
    • Special characters that might break regex patterns
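
The sampling and structure checks above can be sketched as follows. This is a minimal illustration, not a definitive procedure: the demo file, its three-column layout, and the sample file names are assumptions made for the example, and the comma split is naive (a quoted field containing a comma would be miscounted).

```shell
# Hypothetical demo input: 1 header + 1000 data rows, 3 comma-separated columns.
workdir=$(mktemp -d)
input="$workdir/input.csv"
{
  echo "id,name,city"
  awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "%d,user%d,town%d\n", i, i, i }'
} > "$input"

# Size first, then samples from three regions of the file.
# tr strips the padding that some wc implementations add.
total=$(wc -l < "$input" | tr -d ' ')
mid=$((total / 2))
head -n 5 "$input" > "$workdir/sample_head.csv"
sed -n "${mid},$((mid + 4))p" "$input" > "$workdir/sample_middle.csv"
tail -n 5 "$input" > "$workdir/sample_tail.csv"

# Structural check: how many distinct field counts occur across all rows?
# A result of 1 means every row has the same number of columns.
distinct_widths=$(awk -F, '{ print NF }' "$input" | sort -u | wc -l | tr -d ' ')
echo "rows=$total distinct_widths=$distinct_widths"
```

A `distinct_widths` value above 1 is a cue to look at the offending rows before designing any transformation.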

Phase 2: Design the Solution

When designing transformations:
  1. Break complex transformations into discrete steps - Each step should handle one logical transformation. This improves debuggability and allows independent testing.
  2. Choose the right tool for the scale:
    • Vim macros: Excellent for complex, multi-step transformations; economical when keystrokes are counted
    • sed: Fast for simple substitutions across large files
    • awk: Powerful for column manipulation and conditional logic
    • Perl/Python: For complex logic that exceeds regex capabilities
  3. Design for efficiency:
    • Minimize the number of passes through the file
    • Use line-based operations (`:%normal!` in Vim) rather than iterating with explicit loops
    • Leverage built-in commands (e.g., `gU` for uppercase in Vim) over manual character manipulation
  4. Document design decisions - Record why specific approaches were chosen, especially when multiple valid alternatives exist.
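
As one illustration of minimizing passes, the sketch below does three things in a single awk pass: it reorders columns, uppercases one field, and changes the delimiter. The three-column `id,name,city` layout and the target format are assumptions chosen for the example.

```shell
# Hypothetical rows in the form id,name,city.
# One pass: swap the first and third columns, uppercase the city,
# and switch the delimiter from "," to ";".
result=$(printf '1,alice,paris\n2,bob,oslo\n' |
  awk -F, 'BEGIN { OFS = ";" } { print toupper($3), $2, $1 }')
printf '%s\n' "$result"
# prints:
# PARIS;alice;1
# OSLO;bob;2
```

The same result could be reached with three separate commands, but each would read the whole file again, which matters at millions of rows.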

Phase 3: Test Incrementally

  1. Create a test sample - Use a small subset (100-1000 lines) for initial testing:
    head -n 100 input.csv > test_input.csv
    head -n 100 expected.csv > test_expected.csv
  2. Test each transformation independently - Verify each macro or command produces correct output before combining.
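  3. Verify with diff - Compare the output against the expected file. `diff` reports line-level differences; `cmp` confirms a byte-for-byte match:
    diff test_output.csv test_expected.csv
    cmp test_output.csv test_expected.csv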
  4. Check for edge cases in test output:
    • First and last lines transformed correctly
    • Lines with varying content lengths handled
    • Special characters preserved or transformed as expected
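
A small test loop along these lines might look like the sketch below. The two-row sample, the expected output, and the awk transformation under test are all placeholders for whatever the real task requires.

```shell
workdir=$(mktemp -d)
printf 'a,b\nc,d\n' > "$workdir/test_input.csv"       # hypothetical sample
printf 'A;B\nC;D\n' > "$workdir/test_expected.csv"    # hypothetical target

# Candidate transformation under test: "," becomes ";", then uppercase.
awk '{ gsub(/,/, ";"); print toupper($0) }' \
  "$workdir/test_input.csv" > "$workdir/test_output.csv"

# cmp -s is silent and byte-exact; diff is only run to show what went wrong.
if cmp -s "$workdir/test_output.csv" "$workdir/test_expected.csv"; then
  status="match"
else
  status="differ"
  diff "$workdir/test_output.csv" "$workdir/test_expected.csv" | head -n 20
fi
echo "$status"
```

Limiting the diff to the first 20 lines keeps a systematic failure from flooding the terminal.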

Phase 4: Execute with Safeguards

  1. Create backups before in-place modifications:
    cp input.csv input.csv.backup
  2. Set appropriate timeouts - For million-row files, allow sufficient processing time (e.g., 2-5 minutes depending on complexity).
  3. Monitor progress when possible - Use tools that show progress or check intermediate output.
  4. Verify final output:
    • Confirm row count matches:
      wc -l output.csv
    • Run diff against expected output
    • Spot-check samples from different file locations
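
One way to combine the backup and verification steps is to write the result to a temporary file and swap it in only after it passes a check. The sketch below assumes a hypothetical three-row file and a delimiter swap as the transformation; `count_lines` is a helper introduced for this example only.

```shell
# count_lines FILE: line count without the padding some wc builds add.
count_lines() { wc -l < "$1" | tr -d ' '; }

workdir=$(mktemp -d)
input="$workdir/input.csv"            # hypothetical demo file
printf 'a,b\nc,d\ne,f\n' > "$input"

cp "$input" "$input.backup"           # backup before touching the original
tmp="$input.tmp"

# Transform into a temp file; replace the original only if the command
# succeeded and the row count is unchanged.
if sed 's/,/;/g' "$input" > "$tmp" &&
   [ "$(count_lines "$input")" -eq "$(count_lines "$tmp")" ]; then
  mv "$tmp" "$input"
  result="replaced"
else
  rm -f "$tmp"
  result="kept original"
fi
echo "$result"
```

A matching row count is only a necessary condition; it does not show that the content is right, so the diff and spot checks above still apply.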

Vim-Specific Guidance

Macro Design Principles

  • Register allocation: Use distinct registers (a, b, c) for different transformation stages
  • Keystroke efficiency: Prefer built-in commands over character-by-character operations
  • Regex patterns: Use non-greedy patterns (`\{-}` in Vim) and explicit delimiters to avoid over-matching
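
The over-matching risk is easy to see on a single line. The sketch below uses sed rather than Vim, so it shows the explicit-delimiter form (`[^,]*`) instead of a non-greedy quantifier; the input line is made up for the example.

```shell
line='a,b,c'

# Greedy: ".*," runs to the LAST comma, swallowing two fields.
greedy=$(printf '%s\n' "$line" | sed 's/.*,/X,/')

# Bounded: "[^,]*," cannot cross a comma, so it stops at the first field.
bounded=$(printf '%s\n' "$line" | sed 's/[^,]*,/X,/')

echo "greedy:  $greedy"     # greedy:  X,c
echo "bounded: $bounded"    # bounded: X,b,c
```

The bounded form is usually the safer choice for delimited data because it behaves the same no matter how many fields follow.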

Common Vim Patterns for Large Files

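| Task | Approach |
| --- | --- |
| Apply macro to all lines | `:%normal! @a` |
| Uppercase transformation | `gU` motion, or `\U` in a substitution |
| Column manipulation | Capture groups with `\(\)` and backreferences `\1`, `\2` |
| Delimiter replacement | `:s/old_delim/new_delim/g` |
| Whitespace removal | `:s/\s\+//g` |

Note: `:s` acts on the current line only, which suits a macro applied per line. Use `:%s` to run a substitution over the whole file directly.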

Escaping in Vim Scripts

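When using `setreg()` for macro definitions:
  • Escape backslashes: inside a double-quoted string, `\\` yields one literal backslash
  • Use `\r` for a carriage return, which acts as Enter when the macro runs
  • Special characters may need double-escaping, for example when the Vim script is itself embedded in a shell string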
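
The escaping rules above can be tried out with a small script file. The sketch below is an illustration under assumptions: the file names and the macro are hypothetical, Vim is assumed to be built with expression support (`+eval`), and the script is written through a quoted heredoc so that the shell leaves the backslashes alone and only Vim's own string escaping applies.

```shell
workdir=$(mktemp -d)
printf 'a,b\nc,d\n' > "$workdir/data.csv"     # hypothetical demo data

# The quoted delimiter ('EOF') stops the shell from interpreting "\r",
# which avoids one layer of double-escaping.
cat > "$workdir/macro.vim" <<'EOF'
" In a double-quoted Vim string, "\r" is a carriage return (Enter).
" The macro runs a substitution on the current line, then uppercases it.
call setreg('a', ":s/,/;/g\rgUU")
%normal! @a
wq
EOF

# Run non-interactively only if Vim is installed; -u NONE skips user config.
if command -v vim >/dev/null 2>&1; then
  vim -Nu NONE -n -es -S "$workdir/macro.vim" "$workdir/data.csv" </dev/null ||
    echo "vim exited with status $?"
  cat "$workdir/data.csv"
fi
```

Keeping the macro in a script file rather than on the command line also makes it easier to review and to count its keystrokes.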

Verification Checklist

Before considering the task complete:
  • Output file exists and is non-empty
  • Row count matches expected count
  • Byte-for-byte diff passes against expected output (if available)
  • Spot-check samples from beginning, middle, and end of file
  • Any constraints (keystroke limits, command restrictions) are satisfied
  • Tool exited with success code (exit code 0)
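
Several items on this checklist can be scripted. The function below is a hypothetical helper, not part of any tool: it checks that the output is non-empty, that the row count matches, and, when an expected file is supplied, that the two files are byte-identical. The demo files are made up for the example.

```shell
# verify_output OUTPUT EXPECTED_ROWS [EXPECTED_FILE]
# Returns 0 when all checks pass, 1 on the first failure.
verify_output() {
  out=$1; want_rows=$2; expected=${3:-}
  [ -s "$out" ] || { echo "FAIL: output missing or empty"; return 1; }
  rows=$(wc -l < "$out" | tr -d ' ')
  [ "$rows" -eq "$want_rows" ] ||
    { echo "FAIL: $rows rows, expected $want_rows"; return 1; }
  if [ -n "$expected" ]; then
    cmp -s "$out" "$expected" ||
      { echo "FAIL: output differs from expected"; return 1; }
  fi
  echo "OK"
}

workdir=$(mktemp -d)
printf 'A;B\nC;D\n' > "$workdir/output.csv"
printf 'A;B\nC;D\n' > "$workdir/expected.csv"

verify_output "$workdir/output.csv" 2 "$workdir/expected.csv"
check_status=$?
```

Spot checks and constraint checks such as keystroke limits still need a human or task-specific step; the function covers only the mechanical items.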

Common Pitfalls

| Pitfall | Prevention |
| --- | --- |
| Reading large files directly | Always check file size first; use head/tail/sed for sampling |
| No backup before in-place edit | Create backup copy before any modification |
| Testing only on first few lines | Sample from multiple file locations |
| Assuming uniform structure | Verify structure with samples from different positions |
| Regex over-matching | Use explicit delimiters and non-greedy quantifiers |
| Insufficient timeout | Calculate expected processing time for file size |
| Not verifying exit codes | Check tool exit status after operations |

Efficiency Considerations

When keystroke or operation counts matter:
  1. Count accurately - Understand what constitutes a "keystroke" in the specific context (escape sequences, special keys)
  2. Combine operations - A single regex substitution may replace multiple simpler operations
  3. Use built-in commands - Native commands are typically more efficient than manual equivalents
  4. Minimize redundancy - Avoid repeated file reads or redundant transformations