rtl-document-translation
RTL Document Translation Skill
Translate structured business documents to right-to-left (RTL) languages while maintaining pixel-perfect formatting, colors, table structures, and professional appearance.
When to Use This Skill
Invoke this skill when the user requests:
- Translating DOCX files to Arabic, Hebrew, Urdu, or other RTL languages
- Preserving exact document structure (tables, sections, formatting)
- Maintaining colors, backgrounds, and visual styling
- Converting business/financial documents to RTL formats
- Creating RTL versions that match English originals exactly
Do NOT use for:
- Simple text translation (use translation APIs directly)
- Creating new documents from scratch
- PDF-only workflows (this skill works with DOCX)
Core Methodology
1. Phased Approach (Critical)
Phase 1: Analysis → Phase 2: Translation Dictionary → Phase 3: Document Generation → Phase 4: Verification
Never skip directly to generation. Structure analysis prevents catastrophic errors like:
- Splitting multi-line cells into multiple rows
- Missing table dimensions
- Incorrect section orientations
2. RTL Formatting (3 Levels)
RTL documents require THREE distinct formatting levels:
Level 1 - Text Direction:
```python
from docx.oxml import OxmlElement
# python-docx does not appear to expose a paragraph-level bidi property, so add <w:bidi/> via XML
paragraph._p.get_or_add_pPr().append(OxmlElement('w:bidi'))
run.font.rtl = True
run.font.complex_script = True
```
Level 2 - Text Alignment:
```python
from docx.enum.text import WD_ALIGN_PARAGRAPH
paragraph.alignment = WD_ALIGN_PARAGRAPH.RIGHT
```
Level 3 - Layout Direction:
For data/financial tables: Keep columns in LEFT-TO-RIGHT order
- Temporal sequences (Month 1, 2, 3...) progress L→R
- Row labels stay in the same positions as in the English original
- Only TEXT WITHIN cells is RTL
Example: Month headers should be:
[الشهر] [1] [2] [3] [4] ← Correct (columns L→R, text RTL)
[4] [3] [2] [1] [الشهر] ← Wrong (mirrored columns)
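The Level 3 rule can be sketched on plain data, assuming a header row is held as a list of strings. `translate_row` and the sample dictionary are hypothetical names used only for illustration; they are not part of the skill.

```python
def translate_row(cells, translations):
    """Translate each cell's text while keeping the column order unchanged."""
    return [translations.get(text, text) for text in cells]


# Hypothetical sample data: one label column followed by month columns.
header = ["Month", "1", "2", "3", "4"]
translations = {"Month": "الشهر"}

# The label stays in column 0 and the months still run left to right.
rtl_header = translate_row(header, translations)
```

Only the cell text changes here; mirroring the list would produce the "Wrong" layout shown above.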
Implementation Patterns
Pattern 1: Background Color Detection
Problem: Simple attribute access fails
Solution: Use XML traversal
```python
from docx.oxml.ns import qn

def get_cell_background(cell):
    """Reliably extract cell background color"""
    tc = cell._element
    tcPr = getattr(tc, 'tcPr', None)
    if tcPr is None:
        return None
    # CRITICAL: Use findall(), not direct attribute access
    shd_list = tcPr.findall(qn('w:shd'))
    for shd in shd_list:
        fill = shd.get(qn('w:fill'))
        if fill and fill != 'auto':
            return fill.upper()
    return None
```
Why: `tcPr.shading` doesn't work consistently. Reading the `w:shd` element through XML traversal is more reliable.
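To show what `get_cell_background` is traversing, here is a standard-library sketch that reads the same `w:fill` attribute from a raw `w:tcPr` fragment. The sample XML and the name `fill_from_tcpr_xml` are illustrative assumptions, not taken from a real document.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical cell-properties fragment, shaped like Word's table-cell XML.
SAMPLE_TCPR = (
    f'<w:tcPr xmlns:w="{W_NS}">'
    '<w:tcW w:w="2000" w:type="dxa"/>'
    '<w:shd w:val="clear" w:color="auto" w:fill="cc0029"/>'
    '</w:tcPr>'
)


def fill_from_tcpr_xml(tcpr_xml):
    """Return the upper-cased w:fill of the first usable w:shd, or None."""
    root = ET.fromstring(tcpr_xml)
    for shd in root.findall(f"{{{W_NS}}}shd"):
        fill = shd.get(f"{{{W_NS}}}fill")
        if fill and fill != "auto":
            return fill.upper()
    return None


color = fill_from_tcpr_xml(SAMPLE_TCPR)
```

The background lives in an attribute of a child element, which is why the pattern searches for `w:shd` elements rather than reading a property of the cell.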
Pattern 2: Set Cell Background
```python
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def set_cell_background(cell, rgb_hex):
    """Set cell background color (e.g., 'CC0029' for red)"""
    tc = cell._element
    tcPr = tc.get_or_add_tcPr()
    # Remove existing shading
    for shd in tcPr.findall(qn('w:shd')):
        tcPr.remove(shd)
    # Add new shading
    shd = OxmlElement('w:shd')
    shd.set(qn('w:val'), 'clear')  # w:val is required on w:shd; 'clear' = no pattern, show the fill
    shd.set(qn('w:fill'), rgb_hex)
    tcPr.append(shd)
```
Pattern 3: Quote Normalization
Problem: DOCX files contain curly quotes (U+201C, U+201D, U+2018, U+2019) and non-standard spaces that break dictionary lookups
Solution: Multi-pass normalization
```python
import re

def normalize_text(text):
    """Normalize quotes and unicode spaces for reliable matching"""
    # Convert curly quotes → straight quotes
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    text = text.replace('\u2018', "'").replace('\u2019', "'")
    # Normalize unicode spaces → regular spaces
    text = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', text)
    return text.strip()
```
Pattern 4: Multi-Pass Translation Matching
Problem: Exact string matches fail because of whitespace variations, quote styles, and formatting
Solution: Progressive fallback strategy
```python
import re

def translate_text(text, translation_dict):
    """Multi-pass translation with normalization fallbacks"""
    if not text or not text.strip():
        return text
    # Pass 1: Exact match
    if text in translation_dict:
        return translation_dict[text]
    # Pass 2: Stripped
    if text.strip() in translation_dict:
        return translation_dict[text.strip()]
    # Pass 3: Normalized quotes
    normalized_quotes = text.replace('\u201c', '"').replace('\u201d', '"')
    normalized_quotes = normalized_quotes.replace('\u2018', "'").replace('\u2019', "'")
    if normalized_quotes in translation_dict:
        return translation_dict[normalized_quotes]
    # Pass 4: Stripped + normalized
    if normalized_quotes.strip() in translation_dict:
        return translation_dict[normalized_quotes.strip()]
    # Pass 5: Unicode spaces
    cleaned = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', text).strip()
    if cleaned in translation_dict:
        return translation_dict[cleaned]
    # Pass 6: Combined (quotes + spaces)
    cleaned_quotes = re.sub(r'[\u2002\u2003\u2009\u200A\u00A0]+', ' ', normalized_quotes).strip()
    if cleaned_quotes in translation_dict:
        return translation_dict[cleaned_quotes]
    # Pass 7: Normalized whitespace (collapse multiple spaces)
    normalized_ws = ' '.join(text.split())
    if normalized_ws in translation_dict:
        return translation_dict[normalized_ws]
    # No match found - return as-is
    return text
```
Success Rate: 95%+ vs 60% with exact-match-only
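The same fallback order can also be written as a list of candidate keys tried in turn. This is only a compact sketch of the idea, not a replacement for `translate_text`; `candidate_keys` and `lookup` are hypothetical names.

```python
import re

_SPACES = re.compile(r'[\u2002\u2003\u2009\u200A\u00A0]+')
_QUOTES = str.maketrans({'\u201c': '"', '\u201d': '"', '\u2018': "'", '\u2019': "'"})


def candidate_keys(text):
    """Yield lookup keys from least to most normalized, skipping repeats."""
    quotes = text.translate(_QUOTES)
    variants = [
        text,                               # exact
        text.strip(),                       # stripped
        quotes,                             # straight quotes
        quotes.strip(),                     # straight quotes + stripped
        _SPACES.sub(' ', text).strip(),     # unicode spaces
        _SPACES.sub(' ', quotes).strip(),   # quotes + unicode spaces
        ' '.join(text.split()),             # collapsed whitespace
    ]
    seen = set()
    for key in variants:
        if key not in seen:
            seen.add(key)
            yield key


def lookup(text, translation_dict):
    """Return the first matching translation, or the text unchanged."""
    if not text or not text.strip():
        return text
    for key in candidate_keys(text):
        if key in translation_dict:
            return translation_dict[key]
    return text
```

Keeping the variants in one list makes the pass order explicit and makes it easy to add or reorder a pass.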
Pattern 5: RTL Cell Formatting
```python
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor

def apply_rtl_to_cell(cell, arabic_text, font_size=10, bold=False, text_color=None):
    """Apply complete RTL formatting to table cell"""
    # Clear cell
    cell.text = ''
    # Add paragraph with Arabic text
    paragraph = cell.paragraphs[0]
    run = paragraph.add_run(arabic_text)
    # RTL text direction (Level 1). <w:bidi/> is added via XML because
    # python-docx does not appear to expose a paragraph-level bidi property.
    paragraph._p.get_or_add_pPr().append(OxmlElement('w:bidi'))
    run.font.rtl = True
    run.font.complex_script = True
    # Right alignment (Level 2)
    paragraph.alignment = WD_ALIGN_PARAGRAPH.RIGHT
    # Font settings
    run.font.name = 'Simplified Arabic'  # or 'Times New Roman' for formal docs
    rFonts = run._element.rPr.rFonts
    for attr in ('w:ascii', 'w:hAnsi', 'w:cs'):
        rFonts.set(qn(attr), 'Simplified Arabic')
    run.font.size = Pt(font_size)
    if bold:
        run.font.bold = True
    if text_color:
        run.font.color.rgb = RGBColor(*text_color)
    return cell
```
Pattern 6: Auto-Correct White Text on Dark Backgrounds
Problem: Text becomes invisible on dark backgrounds
Solution: Auto-detect and correct
```python
def apply_colors_to_cell(cell, eng_cell, ar_text, font_size=10, bold=False):
    """Apply colors with auto-correction for visibility"""
    # Get background color
    bg_color = get_cell_background(eng_cell)
    # Get text color from English
    text_color = None
    if eng_cell.paragraphs and eng_cell.paragraphs[0].runs:
        for run in eng_cell.paragraphs[0].runs:
            if run.font.color and run.font.color.rgb:
                rgb = run.font.color.rgb
                text_color = (rgb[0], rgb[1], rgb[2])
                break
    # AUTO-CORRECTION: Set white text for dark backgrounds
    if bg_color and bg_color in ['CC0029', 'C00000', '000000']:  # Red/black
        text_color = (255, 255, 255)  # White
    # Apply formatting
    apply_rtl_to_cell(cell, ar_text, font_size, bold, text_color)
    # Set background
    if bg_color:
        set_cell_background(cell, bg_color)
```
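The hard-coded list in `apply_colors_to_cell` only covers three fills. One possible generalization, offered as an assumption rather than as part of the skill, is to decide from the colour itself using a common perceived-brightness weighting. `is_dark_background` and its `threshold` default are hypothetical.

```python
def is_dark_background(hex_color, threshold=128):
    """Return True when an RRGGBB hex fill is dark enough to need light text."""
    value = hex_color.lstrip('#')
    if len(value) != 6:
        raise ValueError(f"expected RRGGBB, got {hex_color!r}")
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    # Weighted RGB sum on a 0-255 scale; green contributes most to perceived brightness.
    brightness = (299 * r + 587 * g + 114 * b) / 1000
    return brightness < threshold
```

With this check, the condition `bg_color in [...]` could become `is_dark_background(bg_color)`. The threshold may need tuning for a given document palette.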
Pattern 7: Nested Table Content Extraction ⭐
Problem: The `cell.text` property doesn't include text from nested tables within the cell. This causes cells with forms, checklists, or complex layouts to appear empty.
Detection:
```python
if cell.tables:
    print(f"Cell contains {len(cell.tables)} nested table(s)")
```
Solution: Extract content from nested tables using the `cell.tables` property
```python
def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.
    Handles Word documents that use nested tables for:
    - Checklists with options
    - Forms with checkboxes
    - Complex multi-row cell layouts
    """
    text_parts = []
    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)
    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # Extract text from first column only (skip checkbox/form columns)
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Filter out checkbox characters
                    if first_col_text and first_col_text not in ['', '☐', '☑', '☒']:
                        text_parts.append(first_col_text)
    return '\n'.join(text_parts) if text_parts else ''
```
Usage in Translation Workflow:
```python
# Instead of:
eng_text = eng_cell.text  # ❌ Misses nested table content
# Use:
eng_text = extract_cell_content_with_nested_tables(eng_cell)  # ✓ Gets all content
ar_text = translate_text(eng_text, translation_dict)
```
**Why This Matters:**
- Government forms often use nested tables for checkbox grids
- Evaluation forms use nested tables for rating scales
- Business checklists embed options in nested tables
- Without this, translated documents have empty cells
Font Recommendations by Document Type
| Document Type | Recommended Font | Rationale |
|---|---|---|
| Financial/Business | Simplified Arabic | Better number/table rendering |
| Academic/Formal | Times New Roman | Traditional, paragraph-friendly |
| Technical | Arial Unicode MS | Wide character support |
| Avoid | Arial | Poor Arabic rendering quality |
Complete Workflow
Step 1: Structure Analysis
```python
from docx import Document

def analyze_document(docx_path):
    doc = Document(docx_path)
    structure = {
        'sections': [],
        'tables': [],
        'paragraphs': len(doc.paragraphs),
        'colors': {'text': {}, 'backgrounds': {}},
        'fonts': {}
    }
    # Analyze sections
    for idx, section in enumerate(doc.sections):
        structure['sections'].append({
            'index': idx,
            'orientation': 'portrait' if section.page_width < section.page_height else 'landscape',
            'width': section.page_width.inches,
            'height': section.page_height.inches
        })
    # Analyze tables
    for idx, table in enumerate(doc.tables):
        table_info = {
            'index': idx,
            'rows': len(table.rows),
            'cols': len(table.columns),
            'multiline_cells': []
        }
        # Detect multi-line cells
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if '\n' in cell.text:
                    table_info['multiline_cells'].append({
                        'row': r_idx,
                        'col': c_idx,
                        'content': cell.text
                    })
        structure['tables'].append(table_info)
    return structure
```
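Once both documents have been analyzed, the two structure dictionaries can be compared directly. The sketch below assumes the dictionary shape returned by `analyze_document`; `diff_structure` and the sample values are hypothetical.

```python
def diff_structure(eng, ar):
    """Return human-readable mismatches between two analyze_document() results."""
    issues = []
    if len(eng['sections']) != len(ar['sections']):
        issues.append(f"section count: {len(eng['sections'])} vs {len(ar['sections'])}")
    for e, a in zip(eng['sections'], ar['sections']):
        if e['orientation'] != a['orientation']:
            issues.append(
                f"section {e['index']}: orientation {e['orientation']} vs {a['orientation']}"
            )
    if len(eng['tables']) != len(ar['tables']):
        issues.append(f"table count: {len(eng['tables'])} vs {len(ar['tables'])}")
    for e, a in zip(eng['tables'], ar['tables']):
        if (e['rows'], e['cols']) != (a['rows'], a['cols']):
            issues.append(
                f"table {e['index']}: {e['rows']}x{e['cols']} vs {a['rows']}x{a['cols']}"
            )
    return issues


# Hypothetical analysis results: the Arabic table gained a row (a split multi-line cell).
eng_structure = {
    'sections': [{'index': 0, 'orientation': 'landscape'}],
    'tables': [{'index': 0, 'rows': 5, 'cols': 4}],
}
ar_structure = {
    'sections': [{'index': 0, 'orientation': 'landscape'}],
    'tables': [{'index': 0, 'rows': 6, 'cols': 4}],
}
issues = diff_structure(eng_structure, ar_structure)
```

An empty list means the section and table dimensions agree; it does not check colors or fonts.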
Step 2: Translation Dictionary Creation
```python
from docx import Document

def create_translation_dictionary(docx_files, target_language='arabic'):
    """Extract unique texts and create translation map"""
    unique_texts = set()
    for docx_path in docx_files:
        doc = Document(docx_path)
        # Extract from paragraphs
        for para in doc.paragraphs:
            if para.text.strip():
                unique_texts.add(para.text.strip())
        # Extract from tables
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    if cell.text.strip():
                        unique_texts.add(cell.text.strip())
    # Create translation map
    translations = {}
    for text in unique_texts:
        # Call translation API or load from file
        # (translate_via_api is a placeholder; it is not defined in this document)
        arabic_text = translate_via_api(text, target_language)
        translations[text] = arabic_text
        # Also add normalized versions (normalize_text is defined in Pattern 3)
        normalized = normalize_text(text)
        if normalized != text:
            translations[normalized] = arabic_text
    return translations
```
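The "load from file" option mentioned in the code comment could look like the sketch below, which assumes the dictionary is stored as a UTF-8 JSON object. The function names and the file layout are assumptions for illustration.

```python
import json
from pathlib import Path


def load_translations(path):
    """Load a source-text -> translation map, or return {} if the file is absent."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding='utf-8'))


def save_translations(translations, path):
    """Write the map as readable JSON, keeping Arabic characters unescaped."""
    Path(path).write_text(
        json.dumps(translations, ensure_ascii=False, indent=2, sort_keys=True),
        encoding='utf-8',
    )


def missing_texts(unique_texts, translations):
    """Return the source texts that still need a translation, sorted for stable output."""
    return sorted(t for t in unique_texts if t not in translations)
```

Persisting the dictionary lets a reviewer correct entries by hand and avoids re-translating text that is already covered.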
Step 3: Document Generation
See REFERENCE.md for a complete implementation example.
Step 4: Verification
```python
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH

def verify_arabic_document(ar_docx_path, eng_docx_path, translation_dict):
    """Comprehensive verification checks"""
    ar_doc = Document(ar_docx_path)
    eng_doc = Document(eng_docx_path)
    results = {
        'structure': 'PASS',
        'alignment': 'PASS',
        'english_scan': 'PASS',
        'colors': 'PASS',  # note: no color check is performed in this snippet
        'issues': []
    }
    # 1. Structure match
    if len(ar_doc.sections) != len(eng_doc.sections):
        results['structure'] = 'FAIL'
        results['issues'].append("Section count mismatch")
    if len(ar_doc.tables) != len(eng_doc.tables):
        results['structure'] = 'FAIL'
        results['issues'].append("Table count mismatch")
    # 2. Alignment check
    total_cells = 0
    right_aligned = 0
    for table in ar_doc.tables:
        for row in table.rows:
            for cell in row.cells:
                total_cells += 1
                if cell.paragraphs[0].alignment == WD_ALIGN_PARAGRAPH.RIGHT:
                    right_aligned += 1
    if right_aligned != total_cells:
        results['alignment'] = 'FAIL'
        results['issues'].append(f"Only {right_aligned}/{total_cells} cells right-aligned")
    # 3. English word scan
    # (get_allowed_english and scan_for_english are helpers not defined in this document)
    allowed_english = get_allowed_english(translation_dict)
    unauthorized = scan_for_english(ar_doc, allowed_english)
    if unauthorized:
        results['english_scan'] = 'FAIL'
        results['issues'].extend([f"English found: {w}" for w in unauthorized])
    return results
```
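The English word scan relies on helpers that are not shown here. As a rough sketch of the idea, the function below works on plain strings rather than on a document object; `find_english_words` and its allow-list parameter are hypothetical and may differ from the real helpers.

```python
import re

# A run of Latin letters, optionally with internal apostrophes or hyphens.
_LATIN_WORD = re.compile(r"[A-Za-z][A-Za-z'-]*")


def find_english_words(texts, allowed=()):
    """Return sorted Latin-script words found in `texts` that are not allowed."""
    allowed_lower = {w.lower() for w in allowed}
    found = set()
    for text in texts:
        for word in _LATIN_WORD.findall(text):
            if word.lower() not in allowed_lower:
                found.add(word)
    return sorted(found)
```

The allow-list is where brand names, acronyms, and units that should stay in Latin script would go. Digits are ignored because the pattern matches letters only.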
Common Pitfalls and Solutions
Pitfall 1: Splitting Multi-Line Cells
Wrong:
```python
# Treats "A\n\nEstimated costs" as multiple rows
lines = cell.text.split('\n')
for line in lines:
    new_row = table.add_row()  # ❌ Creates extra rows
```
Right:
```python
# Preserves multi-line content in a single cell
ar_cell.text = translate_text(eng_cell.text, translation_dict)  # ✓ Keeps \n intact
```
Pitfall 2: Partial Translation
Wrong: "التدفق النقدي forecast" (mixed Arabic/English)
Right: "توقعات التدفق النقدي" (fully translated)
Cause: Dictionary missing compound phrases
Solution: Extract full phrases, not word-by-word
Pitfall 3: Forgetting RTL for New Cells
Wrong:
```python
new_para = doc.add_paragraph(arabic_text)  # ❌ Missing RTL
```
Right:
```python
from docx.oxml import OxmlElement

new_para = doc.add_paragraph()
run = new_para.add_run(arabic_text)
new_para._p.get_or_add_pPr().append(OxmlElement('w:bidi'))  # paragraph-level RTL, set via XML
run.font.rtl = True
new_para.alignment = WD_ALIGN_PARAGRAPH.RIGHT  # ✓ Complete RTL
```
Pitfall 4: Not Checking Visual Output
Problem: Automated checks pass but visual appearance is wrong
Solution: Always generate comparison images:
```python
import subprocess
from pathlib import Path

# Convert to PDF, then to PNG images (one per page)
subprocess.run(['soffice', '--headless', '--convert-to', 'pdf', ar_docx], check=True)
pdf_path = Path(ar_docx).with_suffix('.pdf').name  # soffice writes <stem>.pdf to the current directory
subprocess.run(['pdftoppm', '-png', pdf_path, 'comparison'], check=True)
```
Quick Reference: Essential Functions
```python
# 1. Get cell background
bg = get_cell_background(cell)
# 2. Set cell background
set_cell_background(cell, 'CC0029')
# 3. Normalize text
normalized = normalize_text(text)
# 4. Multi-pass translation
arabic = translate_text(english, translation_dict)
# 5. Apply RTL to cell
apply_rtl_to_cell(cell, arabic_text, font_size=10, bold=False)
# 6. Apply colors with auto-correction
apply_colors_to_cell(cell, eng_cell, ar_text)
# 7. Verify document (takes file paths, not Document objects)
results = verify_arabic_document(ar_docx_path, eng_docx_path, translation_dict)
```
Success Criteria
Before considering translation complete:
- Structure matches exactly (sections, tables, dimensions)
- All text right-aligned and RTL-formatted
- No unauthorized English words found
- All colors/backgrounds preserved
- Visual comparison shows matching layout
- Multi-line cells preserved (not split)
- PDF generated successfully
Additional Resources
See REFERENCE.md for:
- Complete code examples
- Real-world document templates
- Troubleshooting guide
- Advanced patterns