rtl-document-translation

RTL Document Translation Skill

Translate structured business documents to right-to-left (RTL) languages while maintaining pixel-perfect formatting, colors, table structures, and professional appearance.

When to Use This Skill

Invoke this skill when the user requests:

Translating DOCX files to Arabic, Hebrew, Urdu, or other RTL languages
Preserving exact document structure (tables, sections, formatting)
Maintaining colors, backgrounds, and visual styling
Converting business/financial documents to RTL formats
Creating RTL versions that match English originals exactly

Do NOT use for:

Simple text translation (use translation APIs directly)
Creating new documents from scratch
PDF-only workflows (this skill works with DOCX)

Core Methodology

1. Phased Approach (Critical)

Phase 1: Analysis → Phase 2: Translation Dictionary → Phase 3: Document Generation → Phase 4: Verification

Never skip directly to generation. Structure analysis prevents catastrophic errors like:

Splitting multi-line cells into multiple rows
Missing table dimensions
Incorrect section orientations

2. RTL Formatting (3 Levels)

RTL documents require THREE distinct formatting levels:

Level 1 - Text Direction:

python

paragraph.paragraph_format.bidi = True
run.font.rtl = True
run.font.complex_script = True

Level 2 - Text Alignment:

python

from docx.enum.text import WD_ALIGN_PARAGRAPH
paragraph.alignment = WD_ALIGN_PARAGRAPH.RIGHT

Level 3 - Layout Direction:

For data/financial tables, keep columns in LEFT-TO-RIGHT order:

Temporal sequences (Month 1, 2, 3...) progress L→R
Row labels stay in same positions as English
Only TEXT WITHIN cells is RTL

Example: Month headers should be:

[الشهر] [1] [2] [3] [4]  ← Correct (columns L→R, text RTL)
[4] [3] [2] [1] [الشهر]  ← Wrong (mirrored columns)
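
The Level 3 rule can be sketched without python-docx: translate each header in place and leave the column order alone. The helper name `translate_header_row` and the one-entry dictionary are illustrative assumptions, not part of the skill.

```python
def translate_header_row(headers, translation_dict):
    """Translate header text cell by cell; column order is never changed."""
    # Labels with no dictionary entry (e.g. numerals) pass through unchanged.
    return [translation_dict.get(h, h) for h in headers]


english_headers = ["Month", "1", "2", "3", "4"]
sample_dict = {"Month": "الشهر"}  # hypothetical one-entry dictionary

arabic_headers = translate_header_row(english_headers, sample_dict)
# Index 0 is still the leftmost column, exactly as in the English table.
print(arabic_headers)
```

Reversing the list before writing it back would produce the mirrored layout marked as wrong in the example.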

Implementation Patterns

Pattern 1: Background Color Detection

Problem: Simple attribute access fails

Solution: Use XML traversal

python

from docx.oxml.ns import qn

def get_cell_background(cell):
    """Reliably extract cell background color"""
    tc = cell._element
    tcPr = tc.tcPr if hasattr(tc, 'tcPr') and tc.tcPr is not None else None

    if tcPr is None:
        return None

    # CRITICAL: Use findall(), not direct attribute access
    shd_list = tcPr.findall(qn('w:shd'))
    for shd in shd_list:
        fill = shd.get(qn('w:fill'))
        if fill and fill != 'auto':
            return fill.upper()

    return None

Why: tcPr.shading doesn't work consistently. Reading the w:shd element through XML traversal is reliable.
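
To see what the traversal actually reads, here is a minimal sketch that parses a hand-written w:tcPr fragment with the standard library instead of python-docx. The fragment and the helper name `fill_from_tcpr_xml` are assumptions made for illustration; the namespace URI is the standard WordprocessingML one.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written cell properties: a width element plus a shading element.
SAMPLE_TCPR = (
    f'<w:tcPr xmlns:w="{W_NS}">'
    '<w:tcW w:w="2400" w:type="dxa"/>'
    '<w:shd w:val="clear" w:color="auto" w:fill="cc0029"/>'
    "</w:tcPr>"
)


def fill_from_tcpr_xml(tcpr_xml):
    """Return the first explicit w:fill value in upper case, or None."""
    root = ET.fromstring(tcpr_xml)
    for shd in root.findall(f"{{{W_NS}}}shd"):
        fill = shd.get(f"{{{W_NS}}}fill")
        if fill and fill != "auto":
            return fill.upper()
    return None


print(fill_from_tcpr_xml(SAMPLE_TCPR))
```

The loop mirrors get_cell_background: it looks for w:shd children, ignores a fill of auto, and normalizes the hex string to upper case.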

Pattern 2: Set Cell Background

python

from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def set_cell_background(cell, rgb_hex):
    """Set cell background color (e.g., 'CC0029' for red)"""
    tc = cell._element
    tcPr = tc.get_or_add_tcPr()

    # Remove existing shading
    for shd in tcPr.findall(qn('w:shd')):
        tcPr.remove(shd)

    # Add new shading
    shd = OxmlElement('w:shd')
    shd.set(qn('w:fill'), rgb_hex)
    tcPr.append(shd)

Pattern 3: Quote Normalization

Problem: DOCX files contain curly quotes (U+201C, U+201D) that break dictionary lookups

Solution: Multi-pass normalization

python

import re

def normalize_text(text):
    """Normalize quotes and unicode spaces for reliable matching"""
    # Convert curly quotes → straight quotes
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    text = text.replace('\u2018', "'").replace('\u2019', "'")

    # Normalize unicode spaces → regular spaces
    text = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', text)

    return text.strip()
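
A short usage sketch shows the kind of lookup this rescues. It restates the function with `str.translate`, an equivalent formulation assumed here rather than the author's code, and uses a made-up dictionary entry with a placeholder value.

```python
import re

# Map curly quotes to their straight equivalents in one pass.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})
# En space, em space, thin space, hair space, no-break space.
_SPACES = re.compile(r"[\u2002\u2003\u2009\u200a\u00a0]+")


def normalize_text(text):
    """Straighten quotes, collapse unicode spaces, strip the ends."""
    return _SPACES.sub(" ", text.translate(_QUOTES)).strip()


# Hypothetical entry keyed with straight quotes and plain spaces.
translation_dict = {'Owner\'s "net" cash': "<Arabic translation>"}

# The same phrase as it might come out of a DOCX file.
raw = "Owner\u2019s\u00a0\u201cnet\u201d cash "

assert raw not in translation_dict          # exact lookup misses
print(translation_dict[normalize_text(raw)])  # normalized lookup hits
```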

Pattern 4: Multi-Pass Translation Matching

Problem: Exact string matches fail due to whitespace variations, quotes, formatting

Solution: Progressive fallback strategy

python

import re

def translate_text(text, translation_dict):
    """Multi-pass translation with normalization fallbacks"""
    if not text or not text.strip():
        return text

    # Pass 1: Exact match
    if text in translation_dict:
        return translation_dict[text]

    # Pass 2: Stripped
    if text.strip() in translation_dict:
        return translation_dict[text.strip()]

    # Pass 3: Normalized quotes
    normalized_quotes = text.replace('\u201c', '"').replace('\u201d', '"')
    normalized_quotes = normalized_quotes.replace('\u2018', "'").replace('\u2019', "'")
    if normalized_quotes in translation_dict:
        return translation_dict[normalized_quotes]

    # Pass 4: Stripped + normalized
    if normalized_quotes.strip() in translation_dict:
        return translation_dict[normalized_quotes.strip()]

    # Pass 5: Unicode spaces
    cleaned = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', text).strip()
    if cleaned in translation_dict:
        return translation_dict[cleaned]

    # Pass 6: Combined (quotes + spaces)
    cleaned_quotes = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', normalized_quotes).strip()
    if cleaned_quotes in translation_dict:
        return translation_dict[cleaned_quotes]

    # Pass 7: Normalized whitespace (collapse multiple spaces)
    normalized_ws = ' '.join(text.split())
    if normalized_ws in translation_dict:
        return translation_dict[normalized_ws]

    # No match found - return as-is
    return text

Success Rate: 95%+ vs 60% with exact-match-only
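
The rates quoted above will depend on the documents involved. One rough way to measure them on your own files is to count how many source strings resolve under each lookup strategy. The helper `match_rate`, the two toy strategies, and the sample data are assumptions for illustration; `translate_text` itself can be passed as the `lookup` argument.

```python
def match_rate(texts, translation_dict, lookup):
    """Return the fraction of non-empty texts that lookup() resolves."""
    candidates = [t for t in texts if t and t.strip()]
    if not candidates:
        return 0.0
    # A lookup that returns its input unchanged is counted as a miss, so
    # entries whose translation equals the source are undercounted.
    hits = sum(1 for t in candidates if lookup(t, translation_dict) != t)
    return hits / len(candidates)


def exact_only(text, translation_dict):
    """Baseline: exact dictionary match, input returned on a miss."""
    return translation_dict.get(text, text)


def stripped_fallback(text, translation_dict):
    """Exact match first, then the whitespace-stripped form."""
    if text in translation_dict:
        return translation_dict[text]
    return translation_dict.get(text.strip(), text)


# Toy data: the value is a placeholder, not a real translation.
toy_dict = {"Revenue": "<ar:Revenue>"}
texts = ["Revenue", " Revenue ", "Costs", "   "]

print(match_rate(texts, toy_dict, exact_only))
print(match_rate(texts, toy_dict, stripped_fallback))
```

Strings that still miss after every pass are the ones worth adding to the dictionary by hand.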

Pattern 5: RTL Cell Formatting

python

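from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor

def apply_rtl_to_cell(cell, arabic_text, font_size=10, bold=False, text_color=None):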
    """Apply complete RTL formatting to table cell"""
    # Clear cell
    cell.text = ''

    # Add paragraph with Arabic text
    paragraph = cell.paragraphs[0]
    run = paragraph.add_run(arabic_text)

    # RTL text direction (Level 1)
    paragraph.paragraph_format.bidi = True
    run.font.rtl = True
    run.font.complex_script = True

    # Right alignment (Level 2)
    paragraph.alignment = WD_ALIGN_PARAGRAPH.RIGHT

    # Font settings
    run.font.name = 'Simplified Arabic'  # or 'Times New Roman' for formal docs
    run._element.rPr.rFonts.set(qn('w:ascii'), 'Simplified Arabic')
    run._element.rPr.rFonts.set(qn('w:hAnsi'), 'Simplified Arabic')
    run._element.rPr.rFonts.set(qn('w:cs'), 'Simplified Arabic')
    run.font.size = Pt(font_size)

    if bold:
        run.font.bold = True

    if text_color:
        run.font.color.rgb = RGBColor(*text_color)

    return cell

Pattern 6: Auto-Correct White Text on Dark Backgrounds

Problem: Text becomes invisible on dark backgrounds

Solution: Auto-detect and correct

python

def apply_colors_to_cell(cell, eng_cell, ar_text, font_size=10, bold=False):
    """Apply colors with auto-correction for visibility"""
    # Get background color
    bg_color = get_cell_background(eng_cell)

    # Get text color from English
    text_color = None
    if eng_cell.paragraphs and eng_cell.paragraphs[0].runs:
        for run in eng_cell.paragraphs[0].runs:
            if run.font.color and run.font.color.rgb:
                rgb = run.font.color.rgb
                text_color = (rgb[0], rgb[1], rgb[2])
                break

    # AUTO-CORRECTION: Set white text for dark backgrounds
    if bg_color and bg_color in ['CC0029', 'C00000', '000000']:  # Red/black
        text_color = (255, 255, 255)  # White

    # Apply formatting
    apply_rtl_to_cell(cell, ar_text, font_size, bold, text_color)

    # Set background
    if bg_color:
        set_cell_background(cell, bg_color)
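
The hard-coded list covers only three fills. One possible generalization, swapped in here as an assumption rather than the skill's own method, is to choose white or black text by WCAG 2.x contrast ratio against the background. `pick_text_color` is a hypothetical helper; it returns an RGB tuple in the form apply_rtl_to_cell expects for text_color.

```python
def _linear(channel_8bit):
    """Convert an 8-bit sRGB channel to its linear value (WCAG 2.x formula)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb_hex):
    """Relative luminance of a six-digit hex color such as 'CC0029'."""
    r, g, b = (int(rgb_hex[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def pick_text_color(bg_hex):
    """Return white or black, whichever contrasts more with the background."""
    lum = relative_luminance(bg_hex)
    contrast_with_white = 1.05 / (lum + 0.05)
    contrast_with_black = (lum + 0.05) / 0.05
    if contrast_with_white >= contrast_with_black:
        return (255, 255, 255)
    return (0, 0, 0)


for fill in ("CC0029", "C00000", "000000", "FFFFFF"):
    print(fill, pick_text_color(fill))
```

Applied unconditionally this would override the source text color, so it probably fits best as a fallback when the English cell has no explicit color.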

Pattern 7: Nested Table Content Extraction ⭐

Problem: The cell.text property doesn't include text from nested tables within the cell. This causes cells with forms, checklists, or complex layouts to appear empty.

Detection:

python

if cell.tables:
    print(f"Cell contains {len(cell.tables)} nested table(s)")

Solution: Extract content from nested tables using the cell.tables property

python

def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.

    Handles Word documents that use nested tables for:
    - Checklists with options
    - Forms with checkboxes
    - Complex multi-row cell layouts
    """
    text_parts = []

    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)

    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # Extract text from first column only (skip checkbox/form columns)
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Filter out checkbox characters
                    if first_col_text and first_col_text not in ['⁮', '☐', '☑', '☒']:
                        text_parts.append(first_col_text)

    return '\n'.join(text_parts) if text_parts else ''

Usage in Translation Workflow:

python

# Instead of:
eng_text = eng_cell.text  # ❌ Misses nested table content

# Use:
eng_text = extract_cell_content_with_nested_tables(eng_cell)  # ✓ Gets all content
ar_text = translate_text(eng_text, translation_dict)

Why This Matters:

Government forms often use nested tables for checkbox grids
Evaluation forms use nested tables for rating scales
Business checklists embed options in nested tables
Without this, translated documents have empty cells


Font Recommendations by Document Type

Document Type	Recommended Font	Rationale
Financial/Business	Simplified Arabic	Better number/table rendering
Academic/Formal	Times New Roman	Traditional, paragraph-friendly
Technical	Arial Unicode MS	Wide character support
Avoid	Arial	Poor Arabic rendering quality

Complete Workflow

Step 1: Structure Analysis

python

from docx import Document

def analyze_document(docx_path):
    doc = Document(docx_path)

    structure = {
        'sections': [],
        'tables': [],
        'paragraphs': len(doc.paragraphs),
        'colors': {'text': {}, 'backgrounds': {}},
        'fonts': {}
    }

    # Analyze sections
    for idx, section in enumerate(doc.sections):
        structure['sections'].append({
            'index': idx,
            'orientation': 'portrait' if section.page_width < section.page_height else 'landscape',
            'width': section.page_width.inches,
            'height': section.page_height.inches
        })

    # Analyze tables
    for idx, table in enumerate(doc.tables):
        table_info = {
            'index': idx,
            'rows': len(table.rows),
            'cols': len(table.columns),
            'multiline_cells': []
        }

        # Detect multi-line cells
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if '\n' in cell.text:
                    table_info['multiline_cells'].append({
                        'row': r_idx,
                        'col': c_idx,
                        'content': cell.text
                    })

        structure['tables'].append(table_info)

    return structure
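
Once both the English source and the generated document have been analyzed, the two structure dictionaries can be compared to catch the errors listed under the phased approach: extra rows, missing table dimensions, wrong orientation. The helper `diff_structures` is a hypothetical addition; it reads only the keys that analyze_document produces.

```python
def diff_structures(eng, ar):
    """Return a list of human-readable mismatches between two structures."""
    issues = []

    if len(eng["sections"]) != len(ar["sections"]):
        issues.append(
            f"Section count: {len(eng['sections'])} vs {len(ar['sections'])}"
        )
    for e, a in zip(eng["sections"], ar["sections"]):
        if e["orientation"] != a["orientation"]:
            issues.append(
                f"Section {e['index']}: {e['orientation']} vs {a['orientation']}"
            )

    if len(eng["tables"]) != len(ar["tables"]):
        issues.append(f"Table count: {len(eng['tables'])} vs {len(ar['tables'])}")
    for e, a in zip(eng["tables"], ar["tables"]):
        if (e["rows"], e["cols"]) != (a["rows"], a["cols"]):
            issues.append(
                f"Table {e['index']}: {e['rows']}x{e['cols']} "
                f"vs {a['rows']}x{a['cols']}"
            )

    return issues


# Hand-built structures standing in for two analyze_document() results.
eng = {
    "sections": [{"index": 0, "orientation": "portrait"}],
    "tables": [{"index": 0, "rows": 5, "cols": 4}],
}
ar = {
    "sections": [{"index": 0, "orientation": "landscape"}],
    "tables": [{"index": 0, "rows": 7, "cols": 4}],
}
for issue in diff_structures(eng, ar):
    print(issue)
```

A row count that grew, as in the sample, is the typical symptom of a multi-line cell having been split into several rows.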

Step 2: Translation Dictionary Creation

python

def create_translation_dictionary(docx_files, target_language='arabic'):
    """Extract unique texts and create translation map"""
    unique_texts = set()

    for docx_path in docx_files:
        doc = Document(docx_path)

        # Extract from paragraphs
        for para in doc.paragraphs:
            if para.text.strip():
                unique_texts.add(para.text.strip())

        # Extract from tables
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    if cell.text.strip():
                        unique_texts.add(cell.text.strip())

    # Create translation map
    translations = {}
    for text in unique_texts:
        # translate_via_api is a placeholder (not defined here):
        # call a translation API or load from file
        arabic_text = translate_via_api(text, target_language)
        translations[text] = arabic_text

        # Also add normalized versions
        normalized = normalize_text(text)
        if normalized != text:
            translations[normalized] = arabic_text

    return translations
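
translate_via_api is not defined in this document. For the "load from file" branch mentioned in the comment, a minimal sketch might read a reviewed dictionary from JSON. The function name `load_translations`, the flat source-to-target JSON layout, and the `normalize` parameter are all assumptions; normalize_text can be passed as `normalize`.

```python
import json


def load_translations(json_path, normalize=lambda s: s):
    """Load a {source: target} JSON object and add normalized keys."""
    with open(json_path, encoding="utf-8") as f:
        raw = json.load(f)

    translations = {}
    for source, target in raw.items():
        translations[source] = target
        # Keep an existing exact key if the normalized form collides with it.
        translations.setdefault(normalize(source), target)
    return translations
```

Keeping the dictionary in a file also makes Phase 2 reviewable by a human translator before any document is generated.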

Step 3: Document Generation

See REFERENCE.md for complete implementation example.

Step 4: Verification

python

def verify_arabic_document(ar_docx_path, eng_docx_path, translation_dict):
    """Comprehensive verification checks"""
    ar_doc = Document(ar_docx_path)
    eng_doc = Document(eng_docx_path)

    results = {
        'structure': 'PASS',
        'alignment': 'PASS',
        'english_scan': 'PASS',
        'colors': 'PASS',
        'issues': []
    }

    # 1. Structure match
    if len(ar_doc.sections) != len(eng_doc.sections):
        results['structure'] = 'FAIL'
        results['issues'].append(f"Section count mismatch")

    if len(ar_doc.tables) != len(eng_doc.tables):
        results['structure'] = 'FAIL'
        results['issues'].append(f"Table count mismatch")

    # 2. Alignment check
    total_cells = 0
    right_aligned = 0
    for table in ar_doc.tables:
        for row in table.rows:
            for cell in row.cells:
                total_cells += 1
                if cell.paragraphs[0].alignment == WD_ALIGN_PARAGRAPH.RIGHT:
                    right_aligned += 1

    if right_aligned != total_cells:
        results['alignment'] = 'FAIL'
        results['issues'].append(f"Only {right_aligned}/{total_cells} cells right-aligned")

    # 3. English word scan (get_allowed_english and scan_for_english
    #    are helpers that are not defined in this document)
    allowed_english = get_allowed_english(translation_dict)
    unauthorized = scan_for_english(ar_doc, allowed_english)

    if unauthorized:
        results['english_scan'] = 'FAIL'
        results['issues'].extend([f"English found: {w}" for w in unauthorized])

    return results
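
The English scan relies on helpers that this document does not define. One way such a scan could work is sketched below on plain strings rather than on a Document object, so the name and signature (`scan_texts_for_english`, taking an iterable of strings) are assumptions and differ from the call above. With python-docx the input could be gathered from paragraph and cell text.

```python
import re

# A run of Latin letters, optionally containing apostrophes or hyphens.
LATIN_WORD = re.compile(r"[A-Za-z][A-Za-z'\-]*")


def scan_texts_for_english(texts, allowed_english=frozenset()):
    """Return Latin-script words not in the allow-list, in first-seen order."""
    allowed = {w.lower() for w in allowed_english}
    seen = set()
    found = []
    for text in texts:
        for word in LATIN_WORD.findall(text):
            key = word.lower()
            if key not in allowed and key not in seen:
                seen.add(key)
                found.append(word)
    return found


sample_texts = [
    "التدفق النقدي forecast",   # mixed Arabic/English
    "USD 100",                  # currency code allowed below
    "توقعات التدفق النقدي",      # fully translated
]
print(scan_texts_for_english(sample_texts, allowed_english={"USD"}))
```

The allow-list is where brand names, currency codes, and other terms that should stay in Latin script would go.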

Common Pitfalls and Solutions

Pitfall 1: Splitting Multi-Line Cells

Wrong:

python

# Treats "A\n\nEstimated costs" as multiple rows
lines = cell.text.split('\n')
for line in lines:
    new_row = table.add_row()  # ❌ Creates extra rows

Right:

python

# Preserves multi-line content in single cell
ar_cell.text = translate_text(eng_cell.text, translation_dict)  # ✓ Keeps \n intact

Pitfall 2: Partial Translation

Wrong: "التدفق النقدي forecast" (mixed Arabic/English)

Right: "توقعات التدفق النقدي" (fully translated)

Cause: Dictionary missing compound phrases

Solution: Extract full phrases, not word-by-word

Pitfall 3: Forgetting RTL for New Cells

Wrong:

python

new_para = doc.add_paragraph(arabic_text)  # ❌ Missing RTL

Right:

python

new_para = doc.add_paragraph()
run = new_para.add_run(arabic_text)
new_para.paragraph_format.bidi = True
run.font.rtl = True
new_para.alignment = WD_ALIGN_PARAGRAPH.RIGHT  # ✓ Complete RTL
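
Whether the bidi assignment above takes effect may depend on the python-docx version, since the library's documented ParagraphFormat API does not list a bidi property. A version-independent check is to look for w:bidi in the paragraph XML. The sketch below does that with the standard library on hand-written fragments; `paragraph_is_bidi` and the sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_is_bidi(p_xml):
    """True if the w:p fragment carries an enabled w:bidi in its w:pPr."""
    p = ET.fromstring(p_xml)
    bidi = p.find(f"{{{W_NS}}}pPr/{{{W_NS}}}bidi")
    if bidi is None:
        return False
    # w:bidi is an on/off element: absent w:val means "on".
    return bidi.get(f"{{{W_NS}}}val", "true") not in ("0", "false", "off")


RTL_PARAGRAPH = (
    f'<w:p xmlns:w="{W_NS}"><w:pPr><w:bidi/><w:jc w:val="right"/></w:pPr></w:p>'
)
LTR_PARAGRAPH = f'<w:p xmlns:w="{W_NS}"><w:pPr><w:jc w:val="right"/></w:pPr></w:p>'

print(paragraph_is_bidi(RTL_PARAGRAPH))
print(paragraph_is_bidi(LTR_PARAGRAPH))
```

If a generated paragraph fails this check, the w:bidi element likely has to be written into the paragraph properties at the XML level, in the same spirit as set_cell_background does for w:shd.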

Pitfall 4: Not Checking Visual Output

Problem: Automated checks pass but visual appearance is wrong

Solution: Always generate comparison images:

python

import subprocess

# Convert to PDF then images
subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', ar_docx])
subprocess.run(['pdftoppm', '-png', 'output.pdf', 'comparison'])
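
LibreOffice names the converted PDF after the input file, so the hard-coded output.pdf only matches when the DOCX is called output.docx. A small sketch that derives the paths instead is shown below; `build_render_commands` and its parameters are hypothetical, and it only builds the argument lists so they can be inspected before anything is run.

```python
from pathlib import Path


def build_render_commands(docx_path, out_dir, image_prefix="comparison"):
    """Return (to_pdf, to_png) argument lists for soffice and pdftoppm."""
    docx_path = Path(docx_path)
    out_dir = Path(out_dir)
    # soffice writes <input stem>.pdf into the --outdir directory.
    pdf_path = out_dir / (docx_path.stem + ".pdf")

    to_pdf = [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(docx_path),
    ]
    # pdftoppm appends page numbers and .png to the given prefix.
    to_png = ["pdftoppm", "-png", str(pdf_path), str(out_dir / image_prefix)]
    return to_pdf, to_png


to_pdf, to_png = build_render_commands("report_ar.docx", "out")
print(to_pdf)
print(to_png)
```

Each list can then be passed to subprocess.run with check=True so that a failed conversion raises instead of silently producing no images.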

Quick Reference: Essential Functions

python

# 1. Get cell background
bg = get_cell_background(cell)

# 2. Set cell background
set_cell_background(cell, 'CC0029')

# 3. Normalize text
normalized = normalize_text(text)

# 4. Multi-pass translation
arabic = translate_text(english, translation_dict)

# 5. Apply RTL to cell
apply_rtl_to_cell(cell, arabic_text, font_size=10, bold=False)

# 6. Apply colors with auto-correction
apply_colors_to_cell(cell, eng_cell, ar_text)

# 7. Verify document (takes file paths, not Document objects)
results = verify_arabic_document(ar_docx_path, eng_docx_path, translation_dict)

Success Criteria

Additional Resources

See REFERENCE.md for:

Complete code examples
Real-world document templates
Troubleshooting guide
Advanced patterns
