reshard-c4-data
Reshard C4 Data
Overview
This skill provides guidance for data resharding tasks where files must be reorganized across a directory structure while respecting constraints such as maximum file size limits and maximum items per directory. These tasks require careful consideration of how constraints apply recursively to all levels of the output structure.
Critical Concept: Recursive Constraint Application
When a constraint states "maximum N files or folders in each directory," this applies to ALL directories in the output structure, including:
- The root output directory
- Any intermediate grouping directories
- The final shard/leaf directories
Common Mistake: Interpreting directory constraints as applying only to leaf directories while ignoring the root and intermediate levels.
Mathematical Validation
Before implementing, perform constraint arithmetic:
- Calculate the total number of output items (files, shards, or groups)
- If items exceed the per-directory limit, hierarchical nesting is required
- Recursively apply this calculation at each level
Example: For 9,898 files with a 30-item-per-directory limit:
- Files per shard: 30 → ceil(9,898 / 30) = 330 shards needed
- Shards per group: 30 → ceil(330 / 30) = 11 groups needed
- Groups in root: 11 → compliant (under 30)
This yields a three-level hierarchy:
output/group_XXXX/shard_XXXX/files
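The arithmetic above can be checked directly (a sketch; the single 30-item limit is applied at every level):

```python
import math

total_files = 9898
max_items = 30  # applies at every directory level

shards = math.ceil(total_files / max_items)  # files per shard capped at 30
groups = math.ceil(shards / max_items)       # shards per group capped at 30

print(shards, groups)  # 330 shards, 11 groups; 11 < 30, so the root complies
```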
Approach for Resharding Tasks
Step 1: Analyze Input Data
- Inventory all input files (count, sizes, directory structure)
- Identify files exceeding size limits that require splitting
- Calculate total storage requirements
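A minimal inventory pass over the steps above might look like this (function name and the size-limit parameter are assumptions):

```python
import os

def inventory(input_dir, max_file_bytes):
    """Count files and total bytes; collect files that need splitting."""
    count, total_bytes, oversized = 0, 0, []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            count += 1
            total_bytes += size
            if size > max_file_bytes:
                oversized.append(path)
    return count, total_bytes, oversized
```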
Step 2: Validate Constraint Mathematics
For each constraint, verify compliance at all directory levels:
total_files = count(input_files) + count(split_chunks)
shards_needed = ceil(total_files / max_files_per_shard)
if shards_needed > max_items_per_directory:
    groups_needed = ceil(shards_needed / max_items_per_directory)
    # Continue nesting until root directory complies
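The same check can be made runnable as a loop that keeps nesting until the top level fits (the function name is an assumption):

```python
import math

def plan_hierarchy(total_files, max_files_per_shard, max_items_per_directory):
    """Return item counts per level, from leaf shards up toward the root."""
    levels = [math.ceil(total_files / max_files_per_shard)]
    while levels[-1] > max_items_per_directory:
        levels.append(math.ceil(levels[-1] / max_items_per_directory))
    return levels
```

For the 9,898-file example, `plan_hierarchy(9898, 30, 30)` returns `[330, 11]`: 330 shards grouped into 11 directories, and 11 fits in the root.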
Step 3: Design Hierarchical Output Structure
Create a directory hierarchy that satisfies constraints at every level:
output/
  .metadata.json          # Reconstruction metadata
  group_0000/
    shard_0000/
      file_001.txt
      file_002.txt
      ... (up to max_files_per_shard)
    shard_0001/
    ... (up to max_items_per_directory shards)
  group_0001/
  ... (up to max_items_per_directory groups)
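Mapping a flat file index into this layout is simple arithmetic (a sketch; the naming scheme mirrors the tree above):

```python
def shard_path(file_index, files_per_shard=30, shards_per_group=30):
    """Map a flat file index to its group/shard directory."""
    shard = file_index // files_per_shard  # global shard number
    group, local_shard = divmod(shard, shards_per_group)
    return f"group_{group:04d}/shard_{local_shard:04d}"
```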
Step 4: Implement File Distribution
- Split oversized files into chunks that comply with size limits
- Distribute files/chunks across shards evenly
- Track original file mappings in metadata for reconstruction
- Use checksums to verify data integrity
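Splitting with per-chunk checksums might be sketched as follows (the function name is an assumption; chunk placement and metadata recording are left to the caller):

```python
import hashlib

def split_file(path, max_bytes):
    """Yield (chunk_bytes, sha256_hex) pairs, each chunk at most max_bytes."""
    with open(path, "rb") as f:
        while chunk := f.read(max_bytes):
            yield chunk, hashlib.sha256(chunk).hexdigest()
```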
Step 5: Generate Reconstruction Metadata
Include metadata that enables reversing the resharding:
- Original file paths and their shard locations
- Split file chunk mappings
- Checksums for integrity verification
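One possible per-file metadata entry covering these three items (the schema and field names here are assumptions, not a fixed format):

```python
import hashlib

def sha256_of(path):
    """Streaming SHA-256 of a file, for integrity verification."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

def metadata_entry(original_path, chunk_locations):
    """Record where an original file's chunks landed, plus its checksum."""
    return {
        "original_path": original_path,
        "chunks": chunk_locations,  # e.g. ["group_0000/shard_0000/f.part0"]
        "sha256": sha256_of(original_path),
    }
```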
Verification Strategies
Constraint Verification Checklist
Execute these checks before declaring success:
- Root directory item count: `ls <output_dir> | wc -l` must be ≤ limit
- All intermediate directories: Recursively verify each directory level
- All leaf directories: Verify shard contents
- File size limits: Check no single file exceeds the maximum
- Data integrity: Verify checksums match originals
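The structural checks above can be run in one pass (a sketch covering item counts and file sizes; checksum verification additionally needs the metadata from Step 5):

```python
import os

def verify_structure(output_dir, max_items, max_file_bytes):
    """Return a list of constraint violations across every directory level."""
    problems = []
    for root, dirs, files in os.walk(output_dir):
        if len(dirs) + len(files) > max_items:
            problems.append(f"{root}: {len(dirs) + len(files)} items")
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) > max_file_bytes:
                problems.append(f"{path}: {os.path.getsize(path)} bytes")
    return problems
```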
Automated Verification Script Pattern
```python
import os

def verify_directory_constraints(path, max_items):
    """Recursively verify all directories comply with item limit."""
    for root, dirs, files in os.walk(path):
        item_count = len(dirs) + len(files)
        if item_count > max_items:
            return False, f"{root} has {item_count} items (max: {max_items})"
    return True, "All directories compliant"
```

Common Verification Failures
| Symptom | Likely Cause | Solution |
|---|---|---|
| Root directory exceeds limit | Flat shard structure | Add grouping hierarchy |
| Checksums don't match | File corruption during split | Re-implement split logic with verification |
| Missing files in reconstruction | Incomplete metadata | Audit metadata generation |
Common Pitfalls
Pitfall 1: Partial Constraint Interpretation
Mistake: Applying "max N items per directory" only to leaf directories.
Prevention: Explicitly verify every directory level in the output structure.
Pitfall 2: Ignoring Metadata in Item Counts
Mistake: Forgetting that metadata files (e.g., .metadata.json) count toward directory limits.
Prevention: Include metadata files in constraint calculations.
Pitfall 3: False Confidence from Partial Testing
Mistake: Concluding success after verifying data integrity but not structural constraints.
Prevention: Create separate verification steps for each constraint type.
Pitfall 4: Underestimating Scale
Mistake: Testing with small datasets that don't trigger hierarchical nesting requirements.
Prevention: Calculate expected output structure mathematically before implementation.
Testing Recommendations
- Unit test constraint logic with edge cases (exactly at limits, one over, etc.)
- Test with representative scale that triggers all hierarchy levels
- Verify round-trip integrity by reconstructing and comparing checksums
- Check all constraint types independently before integration testing
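Edge cases at the boundary are where ceil-based planning usually breaks; a few assertions catch them (the helper name is an assumption):

```python
import math

def shards_needed(total_files, per_shard):
    return math.ceil(total_files / per_shard)

# Exactly at the limit: 30 files fit in one 30-file shard.
assert shards_needed(30, 30) == 1
# One over the limit forces a second shard.
assert shards_needed(31, 30) == 2
# Empty input needs no shards.
assert shards_needed(0, 30) == 0
```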