pandoc-pdf-generation
Pandoc PDF Generation Best Practices
Overview
This skill documents lessons learned from generating PDF documents from Markdown with Pandoc. It draws on experience with MkDocs HTML generation and applies the same systematic validation approach.
Critical Differences: Pandoc vs Python-Markdown
Supported Features

| Feature | Python-Markdown (MkDocs) | Pandoc (PDF) |
|---|---|---|
| Roman numeral list markers | ❌ Not supported | ✅ Supported |
| Grid tables | ⚠️ Needs extension | ✅ Native support |
| LaTeX commands (such as `\pagebreak`) | ❌ Renders as text | ✅ Native support |
| Nested list indent | 4 spaces (strict) | More flexible |
| Footnote continuation | 4-space indent required | More flexible |

**Key Insight:** Pandoc is MORE capable than Python-Markdown, which means Markdown that works for the PDF might break in MkDocs!
Cross-Renderer Compatibility ✅
**Good News:** Some formatting rules work consistently across both renderers!

**Blank Line Rules (Universal):**

- ✅ Blank line after bold labels before lists - Works in both MkDocs and Pandoc
- ✅ Blank line after plain text labels before lists - Works in both MkDocs and Pandoc
- ✅ Blank line after HTML anchors before headers - Works in both MkDocs and Pandoc
- ✅ Blank lines between consecutive metadata fields - Works in both MkDocs and Pandoc

**Validation Method:**

```bash
# Generate both outputs
mkdocs build --clean
./scripts/generate-pdf.sh

# Check MkDocs HTML rendering
grep -A 5 "For complete details, see:" site/soc2-type1/index.html
# Should show: <ul><li>...</li></ul>

# Check Pandoc PDF rendering
pdftotext output/Documentation.pdf - | grep -A 5 "For complete details, see:"
# Should show: • Bullet point
```

**Implication:** Fix the Markdown once and it works for both HTML and PDF. This makes maintaining shared source files much easier.
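
The HTML half of the validation above can also be scripted. The following is a minimal sketch, not part of the original workflow: the function name `label_precedes_list` and the 200-character search window are assumptions chosen for illustration. It only checks that a list element opens shortly after a label in the rendered HTML.

```python
import re


def label_precedes_list(html: str, label: str, window: int = 200) -> bool:
    """Return True if a <ul> or <ol> opens within `window` characters after `label`."""
    start = html.find(label)
    if start == -1:
        return False  # label not present at all
    tail = html[start + len(label): start + len(label) + window]
    return re.search(r"<(ul|ol)\b", tail) is not None


# Hypothetical snippets of rendered HTML, for illustration only
good = "<p>For complete details, see:</p>\n<ul>\n<li>Policy A</li>\n</ul>"
bad = "<p>For complete details, see: - Policy A - Policy B</p>"

print(label_precedes_list(good, "For complete details, see:"))
print(label_precedes_list(bad, "For complete details, see:"))
```

In practice the `html` argument would be the contents of a file under `site/`, read with `pathlib.Path(...).read_text()`.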
---
Shared Markdown Source Strategy
The Challenge
When using the same Markdown files for both MkDocs (HTML) and Pandoc (PDF):

**Option 1: Optimize for MkDocs (Current Approach)**

- ✅ Clean HTML rendering
- ⚠️ PDF might have issues
- 3-space indents, no `\pagebreak`, etc.

**Option 2: Optimize for Pandoc**

- ✅ Perfect PDF output
- ❌ MkDocs rendering breaks

**Option 3: Separate Sources (Best for large projects)**

- Maintain `docs/` for MkDocs
- Maintain `pdf-source/` for PDF
- Use scripts to sync common content

**Option 4: Conditional Formatting (Advanced)**

- Use Pandoc filters to handle differences
- Use MkDocs plugins for HTML-specific needs
- Keep single source, transform during build
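
One way to picture the "transform during build" idea in Option 4 is a small preprocessing step that removes PDF-only LaTeX commands before the MkDocs build. This is a sketch under stated assumptions: the function name `strip_pdf_only_commands` and the list of commands treated as PDF-only are hypothetical, and a real project might prefer a Pandoc filter or an MkDocs plugin instead.

```python
import re

FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable

# Assumed set of commands that only make sense for the PDF build
PDF_ONLY = re.compile(r"^\s*\\(pagebreak|newpage|clearpage)\s*$")


def strip_pdf_only_commands(markdown: str) -> str:
    """Drop lines that consist solely of a PDF-only command, outside code fences."""
    kept = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on opening and closing fences
        if not in_fence and PDF_ONLY.match(line):
            continue  # skip the command line in normal text
        kept.append(line)
    return "\n".join(kept)
```

Commands inside fenced code blocks are left alone so that documentation showing `\pagebreak` as an example still renders correctly.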
PDF Generation Testing Workflow
Phase 1: Generate PDF (2 minutes)

```bash
./scripts/generate-pdf.sh
```

**Check for errors:**

- LaTeX errors (process exits non-zero)
- Missing file errors
- Font warnings (informational, not critical)
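
The three kinds of messages listed above can be separated automatically once the script output has been captured. The sketch below is illustrative only: `classify_pandoc_log` is a hypothetical helper, and the line prefixes it looks for (`!` for LaTeX errors, `Error producing PDF`, `[WARNING]`) are assumptions about typical Pandoc and LaTeX output rather than a complete specification.

```python
def classify_pandoc_log(log: str) -> dict:
    """Sort captured build output into errors, font warnings, and other warnings."""
    result = {"errors": [], "font_warnings": [], "other_warnings": []}
    for line in log.splitlines():
        text = line.strip()
        if text.startswith(("!", "Error producing PDF")):
            result["errors"].append(text)  # critical: the PDF is likely missing
        elif text.startswith("[WARNING] Missing character"):
            result["font_warnings"].append(text)  # informational
        elif text.startswith("[WARNING]"):
            result["other_warnings"].append(text)
    return result
```

The log text could come from `subprocess.run(["./scripts/generate-pdf.sh"], capture_output=True, text=True).stderr`.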
Phase 2: Visual Inspection (10 minutes)
**CRITICAL:** Actually open and read the PDF!

```bash
open output/Documentation.pdf
```

**Checklist:**

- Cover page renders correctly
- TOC is accurate and complete
- All section headers are styled as headers (not plain text or literal `##`)
- Bullet lists render as bullets (not inline text with dashes)
- Numbered lists render correctly (not inline text)
- Bold labels before lists have proper spacing
- Plain text labels before lists have proper spacing
- Metadata fields appear on separate lines (not run together)
- Tables fit on pages (no overflow)
- Code blocks are formatted correctly
- Page breaks are reasonable (not mid-paragraph)
- Footnotes work (if applicable)
- No missing content
- Font rendering acceptable
- Total page count reasonable
Phase 3: Specific Checks (5 minutes)
**Check the specific sections the user mentioned.**

For example, if the user says "these should be bullet points":

1. Find the section in the PDF
2. Compare it to the Markdown source
3. Verify the Markdown has proper bullets:

    ```markdown
    **Access Removal:**

    - Item one
    - Item two
    ```

4. Check that the PDF rendering matches the Markdown intent
Phase 4: Commit (only if passes)
```bash
git add output/Documentation.pdf
git commit -m "docs: regenerate PDF with [specific improvements]"
```

Common PDF Issues and Solutions
Issue 1: Headers Render as Plain Text
**Symptom:** Text that should be headers (H2, H3) appears as regular paragraphs in the PDF.

**Root Cause:** The Markdown is not properly formatted for Pandoc.

**Check markdown:**

```markdown
<!-- ✅ CORRECT - Header -->

## User Identification and Authentication

<!-- ❌ WRONG - Plain text -->

User Identification and Authentication
```

**Solution:** Ensure headers have the `##` prefix, with a blank line before and after.

---

Issue 2: Bullets Render as Plain Text
**Symptom:** Text shows dashes/bullets as characters, not formatted lists.

**Root Cause:**

- Missing blank line before list
- Incorrect indentation
- Markdown not recognized as list

**Check markdown:**

```markdown
<!-- ✅ CORRECT -->

**Access Removal:**

- Termination: Immediate revocation
- Role change: Adjusted within 5 days

<!-- ❌ WRONG - No blank line -->

**Access Removal:**
- Termination: Immediate revocation
```

**Solution:**

1. Add blank line before list
2. Verify proper indentation (0 spaces for root-level)
3. Use consistent markers (`-` or `*`)

---

Issue 3: Font Warnings for Unicode Characters
**Symptom:**

```
[WARNING] Missing character: There is no ├ (U+251C) in font [lmmono10-regular]
```

**Root Cause:** The default LaTeX font doesn't support all Unicode characters (box-drawing, emojis, etc.).

**Solutions:**

**Option 1: Change Font**

```bash
# In pandoc command
--pdf-engine=xelatex
--variable mainfont="DejaVu Sans"
# The warning above names a monospace font, so code blocks also need monofont
--variable monofont="DejaVu Sans Mono"
```

**Option 2: Remove Special Characters**

```bash
# Replace tree diagrams with ASCII
# (sed -i '' is the BSD/macOS form; GNU sed takes -i without the empty argument)
sed -i '' 's/├/+/g' file.md
sed -i '' 's/─/-/g' file.md
```

**Option 3: Accept Warnings**

- If characters are cosmetic (tree diagrams)
- If they don't affect content comprehension
- Document as "known limitation"
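
Option 2 can also be done portably in Python, which avoids the difference between BSD and GNU `sed`. This is a small sketch: the mapping beyond `├` and `─` (the two characters in the commands above) is an assumption about which box-drawing characters a tree diagram typically contains, and `asciify_tree` is a hypothetical name.

```python
# Box-drawing characters mapped to ASCII stand-ins.
# Only the first and last entries come from the sed commands above; the rest are assumed.
BOX_TO_ASCII = str.maketrans({
    "├": "+",
    "└": "+",
    "│": "|",
    "─": "-",
})


def asciify_tree(text: str) -> str:
    """Replace box-drawing characters with ASCII so the default LaTeX fonts can render them."""
    return text.translate(BOX_TO_ASCII)


print(asciify_tree("├── docs"))
```

Because `str.translate` works on a whole string at once, the same function can be applied to an entire file's contents before it is written back.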
---

Issue 4: Tables Don't Fit on Page
**Symptom:** Tables overflow the page width and text is cut off.

**Solutions:**

**Option 1: Rotate Table (Landscape)**

```markdown
\begin{landscape}
| Col 1 | Col 2 | Col 3 |
|-------|-------|-------|
| Data  | Data  | Data  |
\end{landscape}
```

The `landscape` environment comes from a LaTeX package such as `pdflscape`, which must be loaded by the template or header. Pandoc also passes material between `\begin{...}` and `\end{...}` through as raw LaTeX instead of parsing it as Markdown, so check in the PDF that the table inside still renders as a table.

**Option 2: Smaller Font in Table**

```markdown
\small
| Col 1 | Col 2 | Col 3 |
|-------|-------|-------|
| Data  | Data  | Data  |
\normalsize
```

**Option 3: Redesign Table**

- Split into multiple tables
- Use abbreviations
- Rotate headers vertically
Issue 5: Bad Page Breaks
**Symptom:** Headers at the bottom of a page, orphaned content.

**Solutions:**

**Option 1: Manual Page Breaks**

```markdown
\pagebreak

## Next Section
```

**Option 2: Pandoc Variables**

```bash
--variable pagestyle=headings
--variable geometry:margin=1in
```

**Option 3: LaTeX Penalties**

```latex
\widowpenalty=10000
\clubpenalty=10000
```

Issue 6: Bold Labels Before Lists Render Inline
**Symptom:** Bold labels followed by lists render as inline text instead of a separate formatted list.

**Example in PDF:**

```
Technology Changes: - New system implementations - Software upgrades - Infrastructure modifications
```

**Root Cause:** Pandoc requires a blank line after bold labels (format: `**Label:**`) before lists.

**Check markdown:**

```markdown
<!-- ❌ WRONG - No blank line -->

**Technology Changes:**
- New system implementations
- Software upgrades

<!-- ✅ CORRECT - Blank line after label -->

**Technology Changes:**

- New system implementations
- Software upgrades
```

**Solution:** Add a blank line between the bold label and the list.

**Automated Detection:**

```bash
# Find all bold labels immediately followed by lists
# (asterisks are escaped so grep matches the literal ** markers)
grep -n '^\*\*[^*]*:\*\*$' file.md | while IFS=: read -r num _; do
  next=$((num + 1))
  nextline=$(sed -n "${next}p" file.md)
  if [[ $nextline =~ ^[-*] ]]; then
    echo "Line $num: Missing blank line after bold label"
  fi
done
```

**Automated Fix:** Use the `fix_pandoc_lists.py` script (see the Automation section below).
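
The same detection can be expressed in Python, which makes it easier to cover both bold and plain text labels in one pass. This is a sketch under assumptions, not the project's actual script: `find_labels_before_lists` is a hypothetical name, and a "label" is taken to be any non-list line ending in `:` or `:**`.

```python
import re

# A list item: optional indent, then -, *, + or "1." followed by whitespace
LIST_ITEM = re.compile(r"^\s*([-*+]|\d+\.)\s+")


def is_label(line: str) -> bool:
    """Treat a non-list line ending in ':' or ':**' as a label (assumed heuristic)."""
    return line.rstrip().endswith((":", ":**")) and not LIST_ITEM.match(line)


def find_labels_before_lists(text: str) -> list:
    """Return 1-based line numbers of labels directly followed by a list item."""
    lines = text.splitlines()
    return [
        i + 1
        for i in range(len(lines) - 1)
        if is_label(lines[i]) and LIST_ITEM.match(lines[i + 1])
    ]
```

List items that themselves end in a colon are skipped on purpose, since a nested list directly under a list item does not need a blank line.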
---

Issue 7: Headers Show Literal `##` Characters

**Symptom:** Headers render as plain text with the literal `##` characters visible.

**Example in PDF:**

```
## Fraud Risk Assessment
```

**Root Cause:** Pandoc requires a blank line after HTML anchor tags before Markdown headers.

**Check markdown:**

```markdown
<!-- ❌ WRONG - No blank line after anchor -->

<a name="fraud-risk"></a>
## Fraud Risk Assessment

<!-- ✅ CORRECT - Blank line after anchor -->

<a name="fraud-risk"></a>

## Fraud Risk Assessment
```

**Why This Happens:** Pandoc's Markdown requires a blank line before a header (the `blank_before_header` extension, enabled by default). The anchor line starts a paragraph, so a `##` on the very next line is read as a continuation of that paragraph instead of as a header.

**Solution:** Add a blank line between the HTML anchor and the header.

**Automated Detection:**

```bash
# Find anchors immediately followed by headers
grep -n '^<a name=' file.md | while IFS=: read -r num _; do
  next=$((num + 1))
  nextline=$(sed -n "${next}p" file.md)
  if [[ $nextline =~ ^## ]]; then
    echo "Line $num: Missing blank line after anchor"
  fi
done
```

**Automated Fix:** Use the `fix_pandoc_anchors.py` script (see the Automation section below).
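
For illustration, the fix itself is only a few lines. The sketch below is not the contents of `fix_pandoc_anchors.py`, which this document does not show; the function name `separate_anchors_from_headers` and the exact anchor pattern are assumptions.

```python
import re

# Assumed anchor form: a line holding only <a name="..."></a>
ANCHOR = re.compile(r'^<a name="[^"]*"></a>\s*$')
HEADER = re.compile(r"^#{1,6}\s")


def separate_anchors_from_headers(text: str) -> str:
    """Insert a blank line wherever an anchor line is directly followed by a header."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        following = lines[i + 1] if i + 1 < len(lines) else ""
        if ANCHOR.match(line) and HEADER.match(following):
            out.append("")  # the missing blank line
    return "\n".join(out)
```

Running the function a second time changes nothing, because the anchor is then followed by a blank line instead of a header.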
---

Issue 8: Metadata Fields Run Together
**Symptom:** Consecutive metadata fields render on a single line instead of separate lines.

**Example in PDF:**

```
Title: Report Name Author: Your Name Date: January 2025
```

**Root Cause:** Pandoc requires blank lines between consecutive paragraphs. Without them, it merges the lines into one continuous paragraph.

**Check markdown:**

```markdown
<!-- ❌ WRONG - No blank lines between -->

**Organization:** Example Corp
**Audit Type:** SOC 2 Type 1
**Scope:** Security (CC1-CC9)

<!-- ✅ CORRECT - Blank lines between each -->

**Organization:** Example Corp

**Audit Type:** SOC 2 Type 1

**Scope:** Security (CC1-CC9)
```

**Solution:** Add blank lines between consecutive bold label lines.

**Automated Detection:**

```bash
# Find consecutive bold label lines
# (-F: makes awk read the grep line number as field 1)
grep -n '^\*\*[^*]*:\*\* ' file.md | \
  awk -F: 'NR > 1 && $1 == prev+1 {print "Lines " prev "-" $1 ": Consecutive bold labels"} {prev=$1}'
```

**Automated Fix:** Use the `fix_pandoc_metadata.py` script (see the Automation section below).
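
A Python version of the same check might look like the sketch below. The name `find_consecutive_metadata` and the exact pattern for a metadata line (`**Label:**` followed by a space and a value) are assumptions made for this example.

```python
import re

# Assumed metadata form: **Label:** followed by a space and a value
META = re.compile(r"^\*\*[^*]+:\*\* \S")


def find_consecutive_metadata(text: str) -> list:
    """Return (line, next_line) pairs, 1-based, where two metadata lines are adjacent."""
    lines = text.splitlines()
    return [
        (i + 1, i + 2)
        for i in range(len(lines) - 1)
        if META.match(lines[i]) and META.match(lines[i + 1])
    ]
```

Each returned pair marks a place where a blank line is needed between the two lines.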
---

Issue 9: Plain Text Labels Before Lists Render Inline

**Symptom:** Plain text (not bold) ending with a colon and followed by a list renders inline.

**Example in PDF:**

```
The security program aligns with: - SOC 2 - ISO 27001 - NIST Framework
```

**Root Cause:** Same as Issue 6, but for plain text labels instead of bold.

**Check markdown:**

```markdown
<!-- ❌ WRONG - No blank line -->

The security program aligns with:
- SOC 2 Trust Services Criteria
- ISO 27001 control framework

<!-- ✅ CORRECT - Blank line after plain text label -->

The security program aligns with:

- SOC 2 Trust Services Criteria
- ISO 27001 control framework
```

**Solution:** Add a blank line after any text ending with `:` when it is followed by a list.

**Automated Fix:** The enhanced `fix_pandoc_lists.py` handles both bold and plain text labels.

---

Automation: Fix Scripts
Script 1: fix_pandoc_lists.py
**Purpose:** Fix bold and plain text labels before lists.

**Usage:**

```bash
python3 fix_pandoc_lists.py
```

**What it fixes:**

- Bold labels before lists: `**Label:**` → blank line → list
- Plain text labels before lists: `Text:` → blank line → list

**Example output:**

```
Processing 03-risk-assessment.md...
  Line 186: Added blank line after '**Technology Changes:**'
  Line 265: Added blank line after 'The security program aligns with:'
✅ Fixed 03-risk-assessment.md
```

**Script location:** Project root directory
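
The script itself is not reproduced in this document. As a rough idea of what its core could look like, here is a minimal sketch: the function `fix_labels_before_lists`, its label heuristic, and its return shape are all assumptions, and only the message format is modeled on the example output above. Unlike a production script, it does not skip fenced code blocks.

```python
import re

LIST_ITEM = re.compile(r"^\s*([-*+]|\d+\.)\s+")


def fix_labels_before_lists(text: str):
    """Return (fixed_text, messages) with a blank line added after each label before a list."""
    lines = text.splitlines()
    fixed, messages = [], []
    for i, line in enumerate(lines):
        fixed.append(line)
        following = lines[i + 1] if i + 1 < len(lines) else ""
        # Assumed heuristic: a label is a non-list line ending in ':' or ':**'
        is_label = line.rstrip().endswith((":", ":**")) and not LIST_ITEM.match(line)
        if is_label and LIST_ITEM.match(following):
            fixed.append("")
            messages.append(f"Line {i + 1}: Added blank line after '{line.strip()}'")
    return "\n".join(fixed), messages
```

A wrapper would read each Markdown file, call the function, print the messages, and write the file back only when at least one message was produced.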
Script 2: fix_pandoc_anchors.py
**Purpose:** Fix HTML anchors before headers.

**Usage:**

```bash
python3 fix_pandoc_anchors.py
```

**What it fixes:**

- `<a name="..."></a>` → blank line → `## Header`

**Example output:**

```
Processing 03-risk-assessment.md...
  Line 141: Added blank line after '<a name="fraud-risk"></a>'
✅ Fixed 03-risk-assessment.md
```

**Script location:** Project root directory
Script 3: fix_pandoc_metadata.py
**Purpose:** Fix consecutive bold label metadata fields.

**Usage:**

```bash
python3 fix_pandoc_metadata.py
```

**What it fixes:**

- Consecutive `**Label:** value` lines → add blank lines between them

**Example output:**

```
Processing index.md...
  Line 3: Added blank line after '**Organization:** Example Corp'
  Line 4: Added blank line after '**Audit Type:** SOC 2 Type 1'
✅ Fixed index.md
```

**Script location:** Project root directory
Running All Fix Scripts
**Complete fix workflow:**

```bash
# Fix all Pandoc formatting issues
python3 fix_pandoc_lists.py      # Lists after labels
python3 fix_pandoc_anchors.py    # Anchors before headers
python3 fix_pandoc_metadata.py   # Consecutive metadata

# Regenerate PDF
./scripts/generate-pdf.sh

# Visual verification
open output/Documentation.pdf
```

**When to run:**

- After adding new content with lists
- After modifying metadata sections
- After adding HTML anchors
- Before committing PDFs
- When user reports inline rendering issues

---

Pandoc Command Reference
Basic PDF Generation
```bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex
```

With TOC and Sections

```bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --toc \
  --toc-depth=3 \
  --number-sections
```

With Metadata

```bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --metadata title="Document Title" \
  --metadata author="Author Name" \
  --metadata date="$(date +%Y-%m-%d)"
```

With Custom Template

```bash
pandoc file.md -o output.pdf \
  --from markdown \
  --to pdf \
  --pdf-engine=xelatex \
  --template=custom-template.tex
```
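
When the command is assembled from a build script, the variants above can be composed from one helper. The following is a sketch only: `build_pandoc_command` and its parameters are hypothetical, and it uses no Pandoc flags beyond the ones already shown in this section.

```python
def build_pandoc_command(source, output, toc=False, number_sections=False,
                         metadata=None, template=None):
    """Assemble a pandoc argument list from the options shown in this section."""
    cmd = [
        "pandoc", source, "-o", output,
        "--from", "markdown",
        "--to", "pdf",
        "--pdf-engine=xelatex",
    ]
    if toc:
        cmd += ["--toc", "--toc-depth=3"]
    if number_sections:
        cmd.append("--number-sections")
    for key, value in (metadata or {}).items():
        cmd += ["--metadata", f"{key}={value}"]
    if template:
        cmd.append(f"--template={template}")
    return cmd


print(build_pandoc_command("file.md", "output.pdf", toc=True))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list instead of a shell string avoids quoting problems with metadata values that contain spaces.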
Testing Checklist Template
Copy this checklist for each PDF generation:
```markdown
# PDF Generation Test - [DATE]

## Generation Phase

- Script runs without errors
- PDF file created
- File size reasonable (< 10MB for typical docs)

## Visual Inspection Phase

- Opened PDF and scrolled through ALL pages
- Cover page correct
- TOC complete and accurate
- All headers styled correctly (no literal `##`)
- All bullets formatted as lists (not inline)
- All numbered lists formatted correctly (not inline)
- Bold/plain labels before lists properly spaced
- Metadata fields on separate lines (not run together)
- All tables fit on pages
- No obviously bad page breaks
- No missing content
- Font rendering acceptable

## Specific Checks (from user feedback)

- [Specific section] renders correctly
- [Specific formatting] matches intent
- [Specific issue] is fixed

## Final Validation

- PDF matches markdown source intent
- All user-reported issues addressed
- Ready for commit

**Issues Found:** [List any issues]

**Next Steps:** [What needs fixing]
```

---

Automation: PDF Testing Script
Create `scripts/test-pdf.sh`:

```bash
#!/bin/bash
# Test PDF generation and basic quality checks

set -e

# Generate PDF
./scripts/generate-pdf.sh

PDF="output/Documentation.pdf"

# Check file exists
if [ ! -f "$PDF" ]; then
    echo "❌ PDF not generated"
    exit 1
fi

# Check file size (should be between 100KB and 10MB)
# stat -f%z is the BSD/macOS form; stat -c%s is the GNU/Linux fallback
SIZE=$(stat -f%z "$PDF" 2>/dev/null || stat -c%s "$PDF")
if [ "$SIZE" -lt 100000 ]; then
    echo "⚠️  WARNING: PDF seems too small ($SIZE bytes)"
elif [ "$SIZE" -gt 10000000 ]; then
    echo "⚠️  WARNING: PDF seems too large ($SIZE bytes)"
else
    # numfmt is GNU coreutils; fall back to raw bytes where it is missing
    echo "✅ PDF size OK: $(numfmt --to=iec-i --suffix=B "$SIZE" 2>/dev/null || echo "$SIZE bytes")"
fi

# Check page count (using pdfinfo if available)
if command -v pdfinfo &> /dev/null; then
    PAGES=$(pdfinfo "$PDF" | grep "Pages:" | awk '{print $2}')
    echo "📄 Pages: $PAGES"
    if [ "$PAGES" -lt 50 ]; then
        echo "⚠️  WARNING: Expected ~89 pages, got $PAGES"
    fi
fi

echo ""
echo "✅ Basic checks passed!"
echo "📋 Next: Open PDF and visually inspect"
echo "   open $PDF"
```
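
The size check depends on `stat` and `numfmt` variants that differ between macOS and Linux. A portable alternative is to do the same check in Python, sketched below. The helper names `classify_size` and `human_size` are hypothetical; the thresholds are the 100KB and 10MB bounds from the script above.

```python
def classify_size(num_bytes, min_bytes=100_000, max_bytes=10_000_000):
    """Apply the same size bounds as the shell script and name the outcome."""
    if num_bytes < min_bytes:
        return "too small"
    if num_bytes > max_bytes:
        return "too large"
    return "ok"


def human_size(num_bytes):
    """Format a byte count with binary units, similar in spirit to numfmt --to=iec-i."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f}{unit}"
        size /= 1024


print(classify_size(2_500_000), human_size(2_500_000))
```

The byte count would come from `os.path.getsize("output/Documentation.pdf")`, which behaves the same on both platforms.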
---

Key Takeaways
- Different renderers = different rules - Pandoc ≠ Python-Markdown
- Visual inspection required - Terminal success ≠ correct PDF
- Blank lines are critical - Pandoc needs blank lines between different markdown elements
- Test locally before committing - Generate, open, review
- Same workflow as MkDocs - Systematic testing, not assumptions
- Font limitations are real - Accept or configure around them
- Markdown intent matters - Source should express desired structure
- Create testing checklists - Catch issues systematically
- Automate fixes - Create scripts for common formatting issues
- HTML and markdown need separation - Always blank line after HTML elements
Common Pandoc Gotchas Summary
**The "Blank Line Rule":**

Pandoc requires blank lines in these situations:

- After bold/plain text labels before lists
- After HTML tags before markdown headers
- Between consecutive paragraph-like elements
- Before and after headers

**Quick Check Commands:**

```bash
# Check for labels before lists (no blank line)
# (matches lines ending in ":" or ":**" so bold labels are included)
grep -B1 '^[-*] ' file.md | grep -E ':(\*\*)?$' | grep -v '^--$'

# Check for anchors before headers (no blank line)
grep -A1 '^<a name=' file.md | grep '^##'

# Check for consecutive bold labels (adjacent line numbers)
grep -n '^\*\*[^*]*:\*\* ' file.md | \
  awk -F: 'NR > 1 && $1 == prev+1 {print "Lines " prev "-" $1} {prev=$1}'
```

**When in doubt:** Add a blank line. Pandoc almost never complains about too many blank lines.

---

Real-World Example
Project: Large documentation set
Files: 15 markdown files
Issues Found: 469 formatting problems across 4 categories
Fixes Applied:
- 376 labels before lists (Issues 6 & 9)
- 45 anchors before headers (Issue 7)
- 62 consecutive metadata fields (Issue 8)
Time Investment:
- Discovery: ~2 hours (user feedback + testing)
- Script development: ~1 hour (3 scripts)
- Execution: ~5 minutes (automated)
- Verification: ~10 minutes (visual PDF review)
ROI: 3 hours invested, automated solution for future. All issues fixed in 5 minutes.
Status: Production-ready with automation scripts