reshard-c4-data
Reshard C4 Data
Overview
This skill provides guidance for data resharding tasks where files must be reorganized across a directory structure while respecting constraints such as maximum file size limits and maximum items per directory. These tasks require careful consideration of how constraints apply recursively to all levels of the output structure.
Critical Concept: Recursive Constraint Application
When a constraint states "maximum N files or folders in each directory," this applies to ALL directories in the output structure, including:
- The root output directory
- Any intermediate grouping directories
- The final shard/leaf directories
Common Mistake: Interpreting directory constraints as applying only to leaf directories while ignoring the root and intermediate levels.
Mathematical Validation
Before implementing, perform constraint arithmetic:
- Calculate the total number of output items (files, shards, or groups)
- If items exceed the per-directory limit, hierarchical nesting is required
- Recursively apply this calculation at each level
Example: For 9,898 files with a 30-item-per-directory limit:
- Files per shard: 30 → ceil(9,898 / 30) = 330 shards needed
- Shards per group: 30 → ceil(330 / 30) = 11 groups needed
- Groups in root: 11 → compliant (under 30)
This yields a three-level hierarchy:
output/group_XXXX/shard_XXXX/files
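The arithmetic above can be checked directly (a sketch; the single 30-item limit is applied at every level):

```python
import math

total_files = 9898
max_items = 30  # applies at every directory level

shards = math.ceil(total_files / max_items)  # files per shard capped at 30
groups = math.ceil(shards / max_items)       # shards per group capped at 30

print(shards, groups)  # 330 shards, 11 groups; 11 < 30, so the root complies
```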
Approach for Resharding Tasks
Step 1: Analyze Input Data
- Inventory all input files (count, sizes, directory structure)
- Identify files exceeding size limits that require splitting
- Calculate total storage requirements
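A minimal inventory pass over the steps above might look like this (function name and the size-limit parameter are assumptions):

```python
import os

def inventory(input_dir, max_file_bytes):
    """Count files and total bytes; collect files that need splitting."""
    count, total_bytes, oversized = 0, 0, []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            count += 1
            total_bytes += size
            if size > max_file_bytes:
                oversized.append(path)
    return count, total_bytes, oversized
```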
Step 2: Validate Constraint Mathematics
For each constraint, verify compliance at all directory levels:
total_files = count(input_files) + count(split_chunks)
shards_needed = ceil(total_files / max_files_per_shard)
if shards_needed > max_items_per_directory:
    groups_needed = ceil(shards_needed / max_items_per_directory)
    # Continue nesting until root directory complies
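The same check can be made runnable as a loop that keeps nesting until the top level fits (the function name is an assumption):

```python
import math

def plan_hierarchy(total_files, max_files_per_shard, max_items_per_directory):
    """Return item counts per level, from leaf shards up toward the root."""
    levels = [math.ceil(total_files / max_files_per_shard)]
    while levels[-1] > max_items_per_directory:
        levels.append(math.ceil(levels[-1] / max_items_per_directory))
    return levels
```

For the 9,898-file example, `plan_hierarchy(9898, 30, 30)` returns `[330, 11]`: 330 shards grouped into 11 directories, and 11 fits in the root.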
Step 3: Design Hierarchical Output Structure
Create a directory hierarchy that satisfies constraints at every level:
output/
  .metadata.json          # Reconstruction metadata
  group_0000/
    shard_0000/
      file_001.txt
      file_002.txt
      ... (up to max_files_per_shard)
    shard_0001/
    ... (up to max_items_per_directory shards)
  group_0001/
  ... (up to max_items_per_directory groups)
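Mapping a flat file index into this layout is simple arithmetic (a sketch; the naming scheme mirrors the tree above):

```python
def shard_path(file_index, files_per_shard=30, shards_per_group=30):
    """Map a flat file index to its group/shard directory."""
    shard = file_index // files_per_shard  # global shard number
    group, local_shard = divmod(shard, shards_per_group)
    return f"group_{group:04d}/shard_{local_shard:04d}"
```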
Step 4: Implement File Distribution
- Split oversized files into chunks that comply with size limits
- Distribute files/chunks across shards evenly
- Track original file mappings in metadata for reconstruction
- Use checksums to verify data integrity
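Splitting with per-chunk checksums might be sketched as follows (the function name is an assumption; chunk placement and metadata recording are left to the caller):

```python
import hashlib

def split_file(path, max_bytes):
    """Yield (chunk_bytes, sha256_hex) pairs, each chunk at most max_bytes."""
    with open(path, "rb") as f:
        while chunk := f.read(max_bytes):
            yield chunk, hashlib.sha256(chunk).hexdigest()
```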
Step 5: Generate Reconstruction Metadata
Include metadata that enables reversing the resharding:
- Original file paths and their shard locations
- Split file chunk mappings
- Checksums for integrity verification
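One possible per-file metadata entry covering these three items (the schema and field names here are assumptions, not a fixed format):

```python
import hashlib

def sha256_of(path):
    """Streaming SHA-256 of a file, for integrity verification."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

def metadata_entry(original_path, chunk_locations):
    """Record where an original file's chunks landed, plus its checksum."""
    return {
        "original_path": original_path,
        "chunks": chunk_locations,  # e.g. ["group_0000/shard_0000/f.part0"]
        "sha256": sha256_of(original_path),
    }
```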
Verification Strategies
Constraint Verification Checklist
Execute these checks before declaring success:
- Root directory item count: `ls <output_dir> | wc -l` must be ≤ limit
- All intermediate directories: Recursively verify each directory level
- All leaf directories: Verify shard contents
- File size limits: Check no single file exceeds the maximum
- Data integrity: Verify checksums match originals
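The structural checks above can be run in one pass (a sketch covering item counts and file sizes; checksum verification additionally needs the metadata from Step 5):

```python
import os

def verify_structure(output_dir, max_items, max_file_bytes):
    """Return a list of constraint violations across every directory level."""
    problems = []
    for root, dirs, files in os.walk(output_dir):
        if len(dirs) + len(files) > max_items:
            problems.append(f"{root}: {len(dirs) + len(files)} items")
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) > max_file_bytes:
                problems.append(f"{path}: {os.path.getsize(path)} bytes")
    return problems
```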
Automated Verification Script Pattern
```python
import os

def verify_directory_constraints(path, max_items):
    """Recursively verify all directories comply with item limit."""
    for root, dirs, files in os.walk(path):
        item_count = len(dirs) + len(files)
        if item_count > max_items:
            return False, f"{root} has {item_count} items (max: {max_items})"
    return True, "All directories compliant"
```

Common Verification Failures
| Symptom | Likely Cause | Solution |
|---|---|---|
| Root directory exceeds limit | Flat shard structure | Add grouping hierarchy |
| Checksums don't match | File corruption during split | Re-implement split logic with verification |
| Missing files in reconstruction | Incomplete metadata | Audit metadata generation |
Common Pitfalls
Pitfall 1: Partial Constraint Interpretation
Mistake: Applying "max N items per directory" only to leaf directories.
Prevention: Explicitly verify every directory level in the output structure.
Pitfall 2: Ignoring Metadata in Item Counts
Mistake: Forgetting that metadata files (e.g., .metadata.json) count toward directory limits.
Prevention: Include metadata files in constraint calculations.
Pitfall 3: False Confidence from Partial Testing
Mistake: Concluding success after verifying data integrity but not structural constraints.
Prevention: Create separate verification steps for each constraint type.
Pitfall 4: Underestimating Scale
Mistake: Testing with small datasets that don't trigger hierarchical nesting requirements.
Prevention: Calculate expected output structure mathematically before implementation.
Testing Recommendations
- Unit test constraint logic with edge cases (exactly at limits, one over, etc.)
- Test with representative scale that triggers all hierarchy levels
- Verify round-trip integrity by reconstructing and comparing checksums
- Check all constraint types independently before integration testing
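Edge cases at the boundary are where ceil-based planning usually breaks; a few assertions catch them (the helper name is an assumption):

```python
import math

def shards_needed(total_files, per_shard):
    return math.ceil(total_files / per_shard)

# Exactly at the limit: 30 files fit in one 30-file shard.
assert shards_needed(30, 30) == 1
# One over the limit forces a second shard.
assert shards_needed(31, 30) == 2
# Empty input needs no shards.
assert shards_needed(0, 30) == 0
```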