pandoc-pdf-generation

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Pandoc PDF Generation Best Practices

Pandoc PDF生成最佳实践

Overview

概述

This skill documents lessons learned from generating PDF documents from markdown using Pandoc, drawing from experiences with MkDocs HTML generation and applying systematic validation approaches.

本文档记录了使用Pandoc从Markdown生成PDF文档的经验总结,借鉴了MkDocs生成HTML的实践,并应用了系统化的验证方法。

Critical Differences: Pandoc vs Python-Markdown

关键差异:Pandoc vs Python-Markdown

Supported Features

支持的特性

FeaturePython-Markdown (MkDocs)Pandoc (PDF)
Roman numerals (
i.
,
ii.
)
❌ Not supported✅ Supported
Grid tables⚠️ Needs extension✅ Native support
LaTeX commands (
\pagebreak
)
❌ Renders as text✅ Native support
Nested list indent4 spaces (strict)More flexible
Footnotes continuation4-space indent requiredMore flexible
Key Insight: Pandoc is MORE capable than Python-Markdown, but this means markdown that works for PDF might break in MkDocs!
特性Python-Markdown(MkDocs)Pandoc(PDF)
罗马数字(
i.
ii.
❌ 不支持✅ 支持
网格表格⚠️ 需要扩展✅ 原生支持
LaTeX命令(
\pagebreak
❌ 以文本形式渲染✅ 原生支持
嵌套列表缩进4个空格(严格要求)更灵活
脚注续行需要4个空格缩进更灵活
核心要点: Pandoc的功能比Python-Markdown更丰富,但这意味着适用于PDF的Markdown在MkDocs中可能无法正常渲染!

Cross-Renderer Compatibility ✅

跨渲染器兼容性 ✅

Good News: Some formatting rules work consistently across both renderers!
Blank Line Rules (Universal):
  • ✅ Blank line after bold labels before lists - Works in both MkDocs and Pandoc
  • ✅ Blank line after plain text labels before lists - Works in both MkDocs and Pandoc
  • ✅ Blank line after HTML anchors before headers - Works in both MkDocs and Pandoc
  • ✅ Blank lines between consecutive metadata fields - Works in both MkDocs and Pandoc
Validation Method:
bash
undefined
好消息: 部分格式规则在两种渲染器中都能一致生效!
通用空行规则:
  • ✅ 加粗标签与列表之间需空一行 - 在MkDocs和Pandoc中均生效
  • ✅ 普通文本标签与列表之间需空一行 - 在MkDocs和Pandoc中均生效
  • ✅ HTML锚点与标题之间需空一行 - 在MkDocs和Pandoc中均生效
  • ✅ 连续元数据字段之间需空一行 - 在MkDocs和Pandoc中均生效
验证方法:
bash
undefined

Generate both outputs

Generate both outputs

mkdocs build --clean ./scripts/generate-pdf.sh
mkdocs build --clean ./scripts/generate-pdf.sh

Check MkDocs HTML rendering

Check MkDocs HTML rendering

grep -A 5 "For complete details, see:" site/soc2-type1/index.html
grep -A 5 "For complete details, see:" site/soc2-type1/index.html

Should show: <ul><li>...</li></ul>

Should show: <ul><li>...</li></ul>

Check Pandoc PDF rendering

Check Pandoc PDF rendering

pdftotext output/Documentation.pdf - | grep -A 5 "For complete details, see:"
pdftotext output/Documentation.pdf - | grep -A 5 "For complete details, see:"

Should show: • Bullet point

Should show: • Bullet point


**Implication:** Fix markdown once, works for both HTML and PDF! This makes maintaining shared source files much easier.

---

**意义:** 只需修复一次Markdown,即可同时适配HTML和PDF!这大大简化了共享源文件的维护工作。

---

Shared Markdown Source Strategy

共享Markdown源文件策略

The Challenge

挑战

When using same markdown files for both MkDocs (HTML) and Pandoc (PDF):
Option 1: Optimize for MkDocs (Current Approach)
  • ✅ Clean HTML rendering
  • ⚠️ PDF might have issues
  • 3-space indents, no
    \pagebreak
    , etc.
Option 2: Optimize for Pandoc
  • ✅ Perfect PDF output
  • ❌ MkDocs rendering breaks
Option 3: Separate Sources (Best for large projects)
  • Maintain
    docs/
    for MkDocs
  • Maintain
    pdf-source/
    for PDF
  • Use scripts to sync common content
Option 4: Conditional Formatting (Advanced)
  • Use Pandoc filters to handle differences
  • Use MkDocs plugins for HTML-specific needs
  • Keep single source, transform during build

当使用同一Markdown文件同时生成MkDocs(HTML)和Pandoc(PDF)输出时:
方案1:优先适配MkDocs(当前方案)
  • ✅ HTML渲染效果整洁
  • ⚠️ PDF可能出现问题
  • 使用3空格缩进,不使用
    \pagebreak
方案2:优先适配Pandoc
  • ✅ PDF输出完美
  • ❌ MkDocs渲染失效
方案3:分离源文件(大型项目最佳选择)
  • 维护
    docs/
    目录用于MkDocs
  • 维护
    pdf-source/
    目录用于PDF
  • 使用脚本同步通用内容
方案4:条件格式化(进阶方案)
  • 使用Pandoc过滤器处理差异
  • 使用MkDocs插件满足HTML特定需求
  • 保留单一源文件,在构建阶段进行转换

PDF Generation Testing Workflow

PDF生成测试工作流

Phase 1: Generate PDF (2 minutes)

阶段1:生成PDF(2分钟)

bash
./scripts/generate-pdf.sh
Check for errors:
  • LaTeX errors (process exits non-zero)
  • Missing file errors
  • Font warnings (informational, not critical)
bash
./scripts/generate-pdf.sh
错误检查:
  • LaTeX错误(进程返回非零值)
  • 文件缺失错误
  • 字体警告(仅信息提示,非严重问题)

Phase 2: Visual Inspection (10 minutes)

阶段2:视觉检查(10分钟)

CRITICAL: Actually open and read the PDF!
bash
open output/Documentation.pdf
Checklist:
  • Cover page renders correctly
  • TOC is accurate and complete
  • All section headers are styled as headers (not plain text or literal
    ##
    )
  • Bullet lists render as bullets (not inline text with dashes)
  • Numbered lists render correctly (not inline text)
  • Bold labels before lists have proper spacing
  • Plain text labels before lists have proper spacing
  • Metadata fields appear on separate lines (not run together)
  • Tables fit on pages (no overflow)
  • Code blocks are formatted correctly
  • Page breaks are reasonable (not mid-paragraph)
  • Footnotes work (if applicable)
  • No missing content
  • Font rendering acceptable
  • Total page count reasonable
关键步骤:务必打开并阅读PDF!
bash
open output/Documentation.pdf
检查清单:
  • 封面渲染正确
  • 目录准确完整
  • 所有章节标题均以标题样式呈现(非普通文本或显示字面
    ##
  • 项目符号列表以项目符号格式渲染(非带短横线的行内文本)
  • 编号列表渲染正确(非行内文本)
  • 加粗标签与列表之间间距正确
  • 普通文本标签与列表之间间距正确
  • 元数据字段显示在单独行(未合并)
  • 表格完整显示在页面内(无溢出)
  • 代码块格式正确
  • 分页合理(未在段落中间分页)
  • 脚注正常工作(若有)
  • 无内容缺失
  • 字体渲染效果可接受
  • 总页数合理

Phase 3: Specific Checks (5 minutes)

阶段3:专项检查(5分钟)

Check specific sections user mentioned:
For example, if user says "these should be bullet points":
  1. Find the section in PDF
  2. Compare to markdown source
  3. Verify markdown has proper bullets:
    markdown
    **Access Removal:**
    - Item one
    - Item two
  4. Check PDF rendering matches markdown intent
检查用户提及的特定章节:
例如,若用户反馈“这些应该是项目符号列表”:
  1. 在PDF中找到对应章节
  2. 与Markdown源文件对比
  3. 验证Markdown中的项目符号格式正确:
    markdown
    **Access Removal:**
    - Item one
    - Item two
  4. 确认PDF渲染效果符合Markdown的预期

Phase 4: Commit (only if passes)

阶段4:提交(仅当通过所有检查时)

bash
git add output/Documentation.pdf
git commit -m "docs: regenerate PDF with [specific improvements]"

bash
git add output/Documentation.pdf
git commit -m "docs: regenerate PDF with [specific improvements]"

Common PDF Issues and Solutions

常见PDF问题及解决方案

Issue 1: Headers Render as Plain Text

问题1:标题以普通文本形式渲染

Symptom: Text that should be headers (H2, H3) appears as regular paragraphs in PDF.
Root Cause: Markdown not properly formatted for Pandoc.
Check markdown:
markdown
undefined
症状: 应作为标题(H2、H3)的文本在PDF中显示为普通段落。
根本原因: Markdown格式不符合Pandoc要求。
检查Markdown:
markdown
undefined

✅ CORRECT - Header

✅ 正确格式 - 标题

User Identification and Authentication

User Identification and Authentication

❌ WRONG - Plain text

❌ 错误格式 - 普通文本

User Identification and Authentication

**Solution:** Ensure headers have `##` prefix, blank line before and after.

---
User Identification and Authentication

**解决方案:** 确保标题带有`##`前缀,前后均有空行。

---

Issue 2: Bullets Render as Plain Text

问题2:项目符号以普通文本形式渲染

Symptom: Text shows dashes/bullets as characters, not formatted lists.
Root Cause:
  1. Missing blank line before list
  2. Incorrect indentation
  3. Markdown not recognized as list
Check markdown:
markdown
undefined
症状: 文本中的短横线/项目符号以字符形式显示,未格式化为列表。
根本原因:
  1. 列表前缺少空行
  2. 缩进错误
  3. Markdown未被识别为列表
检查Markdown:
markdown
undefined

✅ CORRECT

✅ 正确格式

Access Removal:
  • Termination: Immediate revocation
  • Role change: Adjusted within 5 days
Access Removal:
  • Termination: Immediate revocation
  • Role change: Adjusted within 5 days

❌ WRONG - No blank line

❌ 错误格式 - 缺少空行

Access Removal:
  • Termination: Immediate revocation

**Solution:**
1. Add blank line before list
2. Verify proper indentation (0 spaces for root-level)
3. Use consistent markers (`-` or `*`)

---
Access Removal:
  • Termination: Immediate revocation

**解决方案:**
1. 在列表前添加空行
2. 验证缩进正确(根级别列表缩进为0空格)
3. 使用统一的列表标记符(`-`或`*`)

---

Issue 3: Font Warnings for Unicode Characters

问题3:Unicode字符的字体警告

Symptom:
[WARNING] Missing character: There is no ├ (U+251C) in font [lmmono10-regular]
Root Cause: Default LaTeX font doesn't support all Unicode characters (box-drawing, emojis, etc.)
Solutions:
Option 1: Change Font
yaml
undefined
症状:
[WARNING] Missing character: There is no ├ (U+251C) in font [lmmono10-regular]
根本原因: 默认LaTeX字体不支持部分Unicode字符(如方框绘图字符、表情符号等)
解决方案:
方案1:更换字体
yaml
undefined

In pandoc command

In pandoc command

--pdf-engine=xelatex --variable mainfont="DejaVu Sans"

**Option 2: Remove Special Characters**
```bash
--pdf-engine=xelatex --variable mainfont="DejaVu Sans"

**方案2:移除特殊字符**
```bash

Replace tree diagrams with ASCII

Replace tree diagrams with ASCII

sed -i '' 's/├/+/g' file.md sed -i '' 's/─/-/g' file.md

**Option 3: Accept Warnings**
- If characters are cosmetic (tree diagrams)
- If they don't affect content comprehension
- Document as "known limitation"

---
sed -i '' 's/├/+/g' file.md sed -i '' 's/─/-/g' file.md

**方案3:接受警告**
- 若字符仅为装饰性(如树形图)
- 若不影响内容理解
- 记录为“已知限制”

---

Issue 4: Tables Don't Fit on Page

问题4:表格超出页面范围

Symptom: Tables overflow page width, text cut off.
Solutions:
Option 1: Rotate Table (Landscape)
markdown
\begin{landscape}
| Col 1 | Col 2 | Col 3 |
|-------|-------|-------|
| Data  | Data  | Data  |
\end{landscape}
Option 2: Smaller Font in Table
markdown
\small
| Col 1 | Col 2 | Col 3 |
|-------|-------|-------|
| Data  | Data  | Data  |
\normalsize
Option 3: Redesign Table
  • Split into multiple tables
  • Use abbreviations
  • Rotate headers vertically

症状: 表格宽度超出页面,文本被截断。
解决方案:
方案1:旋转表格(横向)
markdown
\begin{landscape}
| Col 1 | Col 2 | Col 3 |
|-------|-------|-------|
| Data  | Data  | Data  |
\end{landscape}
方案2:表格使用更小字体
markdown
\small
| Col 1 | Col 2 | Col 3 |
|-------|-------|-------|
| Data  | Data  | Data  |
\normalsize
方案3:重新设计表格
  • 拆分为多个表格
  • 使用缩写
  • 垂直旋转表头

Issue 5: Bad Page Breaks

问题5:分页不合理

Symptom: Headers at bottom of page, orphaned content.
Solutions:
Option 1: Manual Page Breaks
markdown
\pagebreak
症状: 标题出现在页面底部,内容孤立。
解决方案:
方案1:手动分页
markdown
\pagebreak

Next Section

Next Section


**Option 2: Pandoc Variables**
```bash
--variable pagestyle=headings
--variable geometry:margin=1in
Option 3: LaTeX Penalties
latex
\widowpenalty=10000
\clubpenalty=10000


**方案2:Pandoc变量配置**
```bash
--variable pagestyle=headings
--variable geometry:margin=1in
方案3:LaTeX惩罚参数
latex
\widowpenalty=10000
\clubpenalty=10000

Issue 6: Bold Labels Before Lists Render Inline

问题6:加粗标签后的列表行内显示

Symptom: Bold labels followed by lists render as inline text instead of separate formatted list.
Example in PDF:
Technology Changes: - New system implementations - Software upgrades - Infrastructure modifications
Root Cause: Pandoc requires blank line after bold labels (format:
**Label:**
) before lists.
Check markdown:
markdown
undefined
症状: 加粗标签后的列表以行内文本形式显示,而非独立的格式化列表。
PDF示例:
Technology Changes: - New system implementations - Software upgrades - Infrastructure modifications
根本原因: Pandoc要求加粗标签(格式:
**Label:**
)与列表之间需空一行。
检查Markdown:
markdown
undefined

❌ WRONG - No blank line

❌ 错误格式 - 缺少空行

Technology Changes:
  • New system implementations
  • Software upgrades
Technology Changes:
  • New system implementations
  • Software upgrades

✅ CORRECT - Blank line after label

✅ 正确格式 - 标签后有空行

Technology Changes:
  • New system implementations
  • Software upgrades

**Solution:** Add blank line between bold label and list.

**Automated Detection:**
```bash
Technology Changes:
  • New system implementations
  • Software upgrades

**解决方案:** 在加粗标签与列表之间添加空行。

**自动检测:**
```bash

Find all bold labels immediately followed by lists

Find all bold labels immediately followed by lists

grep -n '^**[^]:**$' file.md | while read line; do num=$(echo $line | cut -d: -f1) next=$((num + 1)) nextline=$(sed -n "${next}p" file.md) if [[ $nextline =~ ^[-*] ]]; then echo "Line $num: Missing blank line after bold label" fi done

**Automated Fix:** Use `fix_pandoc_lists.py` script (see Automation section below).

---
grep -n '^**[^]:**$' file.md | while read line; do num=$(echo $line | cut -d: -f1) next=$((num + 1)) nextline=$(sed -n "${next}p" file.md) if [[ $nextline =~ ^[-*] ]]; then echo "Line $num: Missing blank line after bold label" fi done

**自动修复:** 使用`fix_pandoc_lists.py`脚本(见下文自动化章节)。

---

Issue 7: Headers Show Literal
##
Characters

问题7:标题显示字面
##
字符

Symptom: Headers render as plain text with literal
##
characters visible.
Example in PDF:
undefined
症状: 标题以普通文本形式渲染,显示字面
##
字符。
PDF示例:
undefined

Fraud Risk Assessment

Fraud Risk Assessment


**Root Cause:** Pandoc requires blank line after HTML anchor tags before markdown headers.

**Check markdown:**
```markdown

**根本原因:** Pandoc要求HTML锚点标签与Markdown标题之间需空一行。

**检查Markdown:**
```markdown

❌ WRONG - No blank line after anchor

❌ 错误格式 - 锚点后缺少空行

<a name="fraud-risk"></a>
<a name="fraud-risk"></a>

Fraud Risk Assessment

Fraud Risk Assessment

✅ CORRECT - Blank line after anchor

✅ 正确格式 - 锚点后有空行

<a name="fraud-risk"></a>
<a name="fraud-risk"></a>

Fraud Risk Assessment

Fraud Risk Assessment


**Why This Happens:** Pandoc treats HTML and markdown as separate contexts. Without blank line, it doesn't recognize the `##` as a markdown header.

**Solution:** Add blank line between HTML anchor and header.

**Automated Detection:**
```bash

**解决方案:** 在HTML锚点与标题之间添加空行。

**自动检测:**
```bash

Find anchors immediately followed by headers

Find anchors immediately followed by headers

grep -n '^<a name=' file.md | while read line; do num=$(echo $line | cut -d: -f1) next=$((num + 1)) nextline=$(sed -n "${next}p" file.md) if [[ $nextline =~ ^## ]]; then echo "Line $num: Missing blank line after anchor" fi done

**Automated Fix:** Use `fix_pandoc_anchors.py` script (see Automation section below).

---
grep -n '^<a name=' file.md | while read line; do num=$(echo $line | cut -d: -f1) next=$((num + 1)) nextline=$(sed -n "${next}p" file.md) if [[ $nextline =~ ^## ]]; then echo "Line $num: Missing blank line after anchor" fi done

**自动修复:** 使用`fix_pandoc_anchors.py`脚本(见下文自动化章节)。

---

Issue 8: Metadata Fields Run Together

问题8:元数据字段合并显示

Symptom: Consecutive metadata fields render on single line instead of separate lines.
Example in PDF:
Title: Report Name Author: Your Name Date: January 2025
Root Cause: Pandoc requires blank lines between consecutive paragraphs. Without them, it merges lines into continuous text flow.
Check markdown:
markdown
undefined
症状: 连续的元数据字段显示在同一行,而非单独行。
PDF示例:
Title: Report Name Author: Your Name Date: January 2025
根本原因: Pandoc要求连续段落类元素之间需空行,否则会将行合并为连续文本流。
检查Markdown:
markdown
undefined

❌ WRONG - No blank lines between

❌ 错误格式 - 之间缺少空行

Organization: Example Corp Audit Type: SOC 2 Type 1 Scope: Security (CC1-CC9)
Organization: Example Corp Audit Type: SOC 2 Type 1 Scope: Security (CC1-CC9)

✅ CORRECT - Blank lines between each

✅ 正确格式 - 每个字段之间有空行

Organization: Example Corp
Audit Type: SOC 2 Type 1
Scope: Security (CC1-CC9)

**Solution:** Add blank lines between consecutive bold label lines.

**Automated Detection:**
```bash
Organization: Example Corp
Audit Type: SOC 2 Type 1
Scope: Security (CC1-CC9)

**解决方案:** 在连续的加粗标签行之间添加空行。

**自动检测:**
```bash

Find consecutive bold label lines

Find consecutive bold label lines

grep -n '^**[^]:** ' file.md |
awk 'NR > 1 && $1 == prev+1 {print "Lines " prev "-" $1 ": Consecutive bold labels"} {prev=$1}'

**Automated Fix:** Use `fix_pandoc_metadata.py` script (see Automation section below).

---
grep -n '^**[^]:** ' file.md |
awk 'NR > 1 && $1 == prev+1 {print "Lines " prev "-" $1 ": Consecutive bold labels"} {prev=$1}'

**自动修复:** 使用`fix_pandoc_metadata.py`脚本(见下文自动化章节)。

---

Issue 9: Plain Text Labels Before Lists Render Inline

问题9:普通文本标签后的列表行内显示

Symptom: Plain text (not bold) ending with colon followed by list renders inline.
Example in PDF:
The security program aligns with: - SOC 2 - ISO 27001 - NIST Framework
Root Cause: Same as Issue 6, but for plain text labels instead of bold.
Check markdown:
markdown
undefined
症状: 以冒号结尾的普通文本(非加粗)后的列表以行内形式显示。
PDF示例:
The security program aligns with: - SOC 2 - ISO 27001 - NIST Framework
根本原因: 与问题6相同,但针对普通文本标签而非加粗标签。
检查Markdown:
markdown
undefined

❌ WRONG - No blank line

❌ 错误格式 - 缺少空行

The security program aligns with:
  • SOC 2 Trust Services Criteria
  • ISO 27001 control framework
The security program aligns with:
  • SOC 2 Trust Services Criteria
  • ISO 27001 control framework

✅ CORRECT - Blank line after plain text label

✅ 正确格式 - 普通文本标签后有空行

The security program aligns with:
  • SOC 2 Trust Services Criteria
  • ISO 27001 control framework

**Solution:** Add blank line after any text ending with `:` when followed by list.

**Automated Fix:** Enhanced `fix_pandoc_lists.py` handles both bold and plain text labels.

---
The security program aligns with:
  • SOC 2 Trust Services Criteria
  • ISO 27001 control framework

**解决方案:** 当以冒号结尾的文本后跟随列表时,添加空行。

**自动修复:** 增强版`fix_pandoc_lists.py`可同时处理加粗和普通文本标签。

---

Automation: Fix Scripts

自动化:修复脚本

Script 1: fix_pandoc_lists.py

脚本1:fix_pandoc_lists.py

Purpose: Fix bold and plain text labels before lists.
Usage:
bash
python3 fix_pandoc_lists.py
What it fixes:
  • Bold labels before lists:
    **Label:**
    → blank line → list
  • Plain text labels before lists:
    Text:
    → blank line → list
Example output:
Processing 03-risk-assessment.md...
  Line 186: Added blank line after '**Technology Changes:**'
  Line 265: Added blank line after 'The security program aligns with:'
  ✅ Fixed 03-risk-assessment.md
Script location: Project root directory

用途: 修复加粗和普通文本标签后的列表格式问题。
使用方法:
bash
python3 fix_pandoc_lists.py
修复内容:
  • 加粗标签与列表之间:
    **Label:**
    → 空行 → 列表
  • 普通文本标签与列表之间:
    Text:
    → 空行 → 列表
输出示例:
Processing 03-risk-assessment.md...
  Line 186: Added blank line after '**Technology Changes:**'
  Line 265: Added blank line after 'The security program aligns with:'
  ✅ Fixed 03-risk-assessment.md
脚本位置: 项目根目录

Script 2: fix_pandoc_anchors.py

脚本2:fix_pandoc_anchors.py

Purpose: Fix HTML anchors before headers.
Usage:
bash
python3 fix_pandoc_anchors.py
What it fixes:
  • <a name="..."></a>
    → blank line →
    ## Header
Example output:
Processing 03-risk-assessment.md...
  Line 141: Added blank line after '<a name="fraud-risk"></a>'
  ✅ Fixed 03-risk-assessment.md
Script location: Project root directory

用途: 修复HTML锚点后的标题格式问题。
使用方法:
bash
python3 fix_pandoc_anchors.py
修复内容:
  • <a name="..."></a>
    → 空行 →
    ## Header
输出示例:
Processing 03-risk-assessment.md...
  Line 141: Added blank line after '<a name="fraud-risk"></a>'
  ✅ Fixed 03-risk-assessment.md
脚本位置: 项目根目录

Script 3: fix_pandoc_metadata.py

脚本3:fix_pandoc_metadata.py

Purpose: Fix consecutive bold label metadata fields.
Usage:
bash
python3 fix_pandoc_metadata.py
What it fixes:
  • Consecutive
    **Label:** value
    lines → add blank lines between them
Example output:
Processing index.md...
  Line 3: Added blank line after '**Organization:** Example Corp'
  Line 4: Added blank line after '**Audit Type:** SOC 2 Type 1'
  ✅ Fixed index.md
Script location: Project root directory

用途: 修复连续的加粗标签元数据字段格式问题。
使用方法:
bash
python3 fix_pandoc_metadata.py
修复内容:
  • 连续的
    **Label:** value
    行 → 在行之间添加空行
输出示例:
Processing index.md...
  Line 3: Added blank line after '**Organization:** Example Corp'
  Line 4: Added blank line after '**Audit Type:** SOC 2 Type 1'
  ✅ Fixed index.md
脚本位置: 项目根目录

Running All Fix Scripts

运行所有修复脚本

Complete fix workflow:
bash
undefined
完整修复工作流:
bash
undefined

Fix all Pandoc formatting issues

Fix all Pandoc formatting issues

python3 fix_pandoc_lists.py # Lists after labels python3 fix_pandoc_anchors.py # Anchors before headers python3 fix_pandoc_metadata.py # Consecutive metadata
python3 fix_pandoc_lists.py # Lists after labels python3 fix_pandoc_anchors.py # Anchors before headers python3 fix_pandoc_metadata.py # Consecutive metadata

Regenerate PDF

Regenerate PDF

./scripts/generate-pdf.sh
./scripts/generate-pdf.sh

Visual verification

Visual verification

open output/Documentation.pdf

**When to run:**
- After adding new content with lists
- After modifying metadata sections
- After adding HTML anchors
- Before committing PDFs
- When user reports inline rendering issues

---
open output/Documentation.pdf

**运行时机:**
- 添加带列表的新内容后
- 修改元数据章节后
- 添加HTML锚点后
- 提交PDF前
- 用户反馈行内渲染问题时

---

Pandoc Command Reference

Pandoc命令参考

Basic PDF Generation

基础PDF生成

bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex
bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex

With TOC and Sections

带目录和章节编号

bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --toc \
  --toc-depth=3 \
  --number-sections
bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --toc \
  --toc-depth=3 \
  --number-sections

With Metadata

带元数据

bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --metadata title="Document Title" \
  --metadata author="Author Name" \
  --metadata date="$(date +%Y-%m-%d)"
bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --metadata title="Document Title" \
  --metadata author="Author Name" \
  --metadata date="$(date +%Y-%m-%d)"

With Custom Template

自定义模板

bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --template=custom-template.tex

bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --template=custom-template.tex

Testing Checklist Template

测试清单模板

Copy this checklist for each PDF generation:
markdown
undefined
复制以下清单用于每次PDF生成测试:
markdown
undefined

PDF Generation Test - [DATE]

PDF Generation Test - [DATE]

Generation Phase

Generation Phase

  • Script runs without errors
  • PDF file created
  • File size reasonable (< 10MB for typical docs)
  • Script runs without errors
  • PDF file created
  • File size reasonable (< 10MB for typical docs)

Visual Inspection Phase

Visual Inspection Phase

  • Opened PDF and scrolled through ALL pages
  • Cover page correct
  • TOC complete and accurate
  • All headers styled correctly (no literal
    ##
    )
  • All bullets formatted as lists (not inline)
  • All numbered lists formatted correctly (not inline)
  • Bold/plain labels before lists properly spaced
  • Metadata fields on separate lines (not run together)
  • All tables fit on pages
  • No obviously bad page breaks
  • No missing content
  • Font rendering acceptable
  • Opened PDF and scrolled through ALL pages
  • Cover page correct
  • TOC complete and accurate
  • All headers styled correctly (no literal
    ##
    )
  • All bullets formatted as lists (not inline)
  • All numbered lists formatted correctly (not inline)
  • Bold/plain labels before lists properly spaced
  • Metadata fields on separate lines (not run together)
  • All tables fit on pages
  • No obviously bad page breaks
  • No missing content
  • Font rendering acceptable

Specific Checks (from user feedback)

Specific Checks (from user feedback)

  • [Specific section] renders correctly
  • [Specific formatting] matches intent
  • [Specific issue] is fixed
  • [Specific section] renders correctly
  • [Specific formatting] matches intent
  • [Specific issue] is fixed

Final Validation

Final Validation

  • PDF matches markdown source intent
  • All user-reported issues addressed
  • Ready for commit
Issues Found: [List any issues] Next Steps: [What needs fixing]

---
  • PDF matches markdown source intent
  • All user-reported issues addressed
  • Ready for commit
Issues Found: [List any issues] Next Steps: [What needs fixing]

---

Automation: PDF Testing Script

自动化:PDF测试脚本

Create:
scripts/test-pdf.sh
bash
#!/bin/bash
创建:
scripts/test-pdf.sh
bash
#!/bin/bash

Test PDF generation and basic quality checks

Test PDF generation and basic quality checks

set -e
set -e

Generate PDF

Generate PDF

./scripts/generate-pdf.sh
PDF="output/Documentation.pdf"
./scripts/generate-pdf.sh
PDF="output/Documentation.pdf"

Check file exists

Check file exists

if [ ! -f "$PDF" ]; then echo "❌ PDF not generated" exit 1 fi
if [ ! -f "$PDF" ]; then echo "❌ PDF not generated" exit 1 fi

Check file size (should be between 100KB and 10MB)

Check file size (should be between 100KB and 10MB)

SIZE=$(stat -f%z "$PDF" 2>/dev/null || stat -c%s "$PDF") if [ $SIZE -lt 100000 ]; then echo "⚠️ WARNING: PDF seems too small ($SIZE bytes)" elif [ $SIZE -gt 10000000 ]; then echo "⚠️ WARNING: PDF seems too large ($SIZE bytes)" else echo "✅ PDF size OK: $(numfmt --to=iec-i --suffix=B $SIZE)" fi
SIZE=$(stat -f%z "$PDF" 2>/dev/null || stat -c%s "$PDF") if [ $SIZE -lt 100000 ]; then echo "⚠️ WARNING: PDF seems too small ($SIZE bytes)" elif [ $SIZE -gt 10000000 ]; then echo "⚠️ WARNING: PDF seems too large ($SIZE bytes)" else echo "✅ PDF size OK: $(numfmt --to=iec-i --suffix=B $SIZE)" fi

Check page count (using pdfinfo if available)

Check page count (using pdfinfo if available)

if command -v pdfinfo &> /dev/null; then PAGES=$(pdfinfo "$PDF" | grep "Pages:" | awk '{print $2}') echo "📄 Pages: $PAGES"
if [ $PAGES -lt 50 ]; then
    echo "⚠️  WARNING: Expected ~89 pages, got $PAGES"
fi
fi
echo "" echo "✅ Basic checks passed!" echo "📋 Next: Open PDF and visually inspect" echo " open $PDF"

---
if command -v pdfinfo &> /dev/null; then PAGES=$(pdfinfo "$PDF" | grep "Pages:" | awk '{print $2}') echo "📄 Pages: $PAGES"
if [ $PAGES -lt 50 ]; then
    echo "⚠️  WARNING: Expected ~89 pages, got $PAGES"
fi
fi
echo "" echo "✅ Basic checks passed!" echo "📋 Next: Open PDF and visually inspect" echo " open $PDF"

---

Key Takeaways

核心要点总结

  1. Different renderers = different rules - Pandoc ≠ Python-Markdown
  2. Visual inspection required - Terminal success ≠ correct PDF
  3. Blank lines are critical - Pandoc needs blank lines between different markdown elements
  4. Test locally before committing - Generate, open, review
  5. Same workflow as MkDocs - Systematic testing, not assumptions
  6. Font limitations are real - Accept or configure around them
  7. Markdown intent matters - Source should express desired structure
  8. Create testing checklists - Catch issues systematically
  9. Automate fixes - Create scripts for common formatting issues
  10. HTML and markdown need separation - Always blank line after HTML elements

  1. 不同渲染器 = 不同规则 - Pandoc ≠ Python-Markdown
  2. 必须进行视觉检查 - 终端执行成功 ≠ PDF正确
  3. 空行至关重要 - Pandoc需要在不同Markdown元素之间添加空行
  4. 提交前本地测试 - 生成、打开、审核
  5. 与MkDocs工作流一致 - 系统化测试,而非主观假设
  6. 字体限制真实存在 - 接受限制或配置规避
  7. Markdown的意图很重要 - 源文件应清晰表达预期结构
  8. 创建测试清单 - 系统化发现问题
  9. 自动化修复 - 针对常见格式问题编写脚本
  10. HTML与Markdown需分离 - HTML元素后务必添加空行

Common Pandoc Gotchas Summary

Pandoc常见陷阱总结

The "Blank Line Rule": Pandoc requires blank lines in these situations:
  • After bold/plain text labels before lists
  • After HTML tags before markdown headers
  • Between consecutive paragraph-like elements
  • Before and after headers
Quick Check Commands:
bash
undefined
“空行规则”: Pandoc在以下场景需要空行:
  • 加粗/普通文本标签与列表之间
  • HTML标签与Markdown标题之间
  • 连续段落类元素之间
  • 标题前后
快速检查命令:
bash
undefined

Check for labels before lists (no blank line)

Check for labels before lists (no blank line)

grep -B1 '^[-*] ' file.md | grep ':$' | grep -v '^--$'
grep -B1 '^[-*] ' file.md | grep ':$' | grep -v '^--$'

Check for anchors before headers (no blank line)

Check for anchors before headers (no blank line)

grep -A1 '^<a name=' file.md | grep '^##'
grep -A1 '^<a name=' file.md | grep '^##'

Check for consecutive bold labels

Check for consecutive bold labels

grep '^**[^]:** ' file.md | uniq -c | grep -v '^ *1 '

**When in doubt:** Add a blank line. Pandoc almost never complains about too many blank lines.

---
grep '^**[^]:** ' file.md | uniq -c | grep -v '^ *1 '

**存疑时:** 添加空行。Pandoc几乎不会因空行过多而报错。

---

Real-World Example

实际案例

Project: Large documentation set Files: 15 markdown files Issues Found: 469 formatting problems across 4 categories
Fixes Applied:
  • 376 labels before lists (Issues 6 & 9)
  • 45 anchors before headers (Issue 7)
  • 62 consecutive metadata fields (Issue 8)
Time Investment:
  • Discovery: ~2 hours (user feedback + testing)
  • Script development: ~1 hour (3 scripts)
  • Execution: ~5 minutes (automated)
  • Verification: ~10 minutes (visual PDF review)
ROI: 3 hours invested, automated solution for future. All issues fixed in 5 minutes.

项目: 大型文档集 文件: 15个Markdown文件 发现问题: 4类共469个格式问题
修复内容:
  • 376个标签与列表格式问题(问题6和9)
  • 45个锚点与标题格式问题(问题7)
  • 62个连续元数据字段格式问题(问题8)
时间投入:
  • 问题发现:约2小时(用户反馈+测试)
  • 脚本开发:约1小时(3个脚本)
  • 执行修复:约5分钟(自动化)
  • 验证:约10分钟(PDF视觉审核)
投资回报: 投入3小时,获得可复用的自动化解决方案,后续所有问题可在5分钟内修复。

References

参考资料