document-conversion

Document Conversion


Convert documents into useful reading copies without weakening provenance. The native document remains the source of truth; Markdown is a derived convenience layer for search, synthesis, and LLM reading.


Read First


`references/repository-contract.md`
`references/document-conversion-policy.md`
`references/pdf-markdown-policy.md`
`references/source-ledger.md`


Workflow


Identify document type, provenance, identifiers, language, and whether OCR is needed.
Preserve the native file under `sources/pdfs/`, `reports/`, or `data/raw/`.
Pick the least lossy converter available: Docling/Nutrient-style PDF tools, `pdftotext`, PyMuPDF, OCR, Pandoc, or a project-specific parser.
Save Markdown to `sources/markdown/<source_id>.md` when it is a research source.
Save extracted figures/tables/assets under `sources/assets/<source_id>/`.
Append a row to `sources/conversion-ledger.csv`.
Update `sources/source-ledger.csv` and `wiki/sources/<source_id>.md`.
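The ledger step in the workflow above could be sketched as follows. This is a minimal illustration, not the repository's actual tooling: the function name `append_conversion_row` and the column set in `LEDGER_COLUMNS` are assumptions, since the real ledger schema is defined in the referenced policy files rather than here.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical column set; the real schema is defined by the repository's
# policy documents, not by this sketch.
LEDGER_COLUMNS = [
    "source_id",
    "native_path",
    "markdown_path",
    "converter",
    "quality",
    "converted_on",
]


def append_conversion_row(repo_root, source_id, native_path, converter, quality):
    """Append one row to sources/conversion-ledger.csv.

    A header row is written only when the ledger file does not exist yet.
    Existing rows are never rewritten because the file is opened in append mode.
    """
    ledger = Path(repo_root) / "sources" / "conversion-ledger.csv"
    ledger.parent.mkdir(parents=True, exist_ok=True)
    is_new = not ledger.exists()

    row = {
        "source_id": source_id,
        "native_path": str(native_path),
        # Path pattern taken from the workflow: sources/markdown/<source_id>.md
        "markdown_path": f"sources/markdown/{source_id}.md",
        "converter": converter,
        "quality": quality,
        "converted_on": date.today().isoformat(),
    }

    with ledger.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LEDGER_COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Calling it once per converted document keeps the ledger append-only, so earlier conversion records stay intact when a document is converted again with a different tool.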


Quality Gate


Before using converted text as evidence, sample-check:

title, authors, year, DOI/arXiv/PMID
headings and section order
tables, equations, captions, and footnotes
page numbers for exact claims
OCR errors in names, numbers, formulas, and citations

Mark the conversion as `good`, `usable-with-checks`, or `poor`. Exact quotes, numbers, metrics, and equations must be verified against the native file.
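One way to support the check above is to look for an exact quote in per-page text taken from the native file and report the pages where it appears. The sketch below is illustrative only: `find_quote_pages`, `require_quality_label`, and the `native_pages` mapping (page number to text) are hypothetical names, and how the page text is obtained is left outside the example. A match narrows down which page to open; it does not replace reading the native page itself.

```python
QUALITY_LABELS = ("good", "usable-with-checks", "poor")


def normalize(text):
    """Collapse all runs of whitespace so line wrapping does not hide a match."""
    return " ".join(text.split())


def find_quote_pages(native_pages, quote):
    """Return the sorted page numbers whose text contains the quote.

    native_pages is assumed to map page number -> text extracted from the
    native file. Matching is exact apart from whitespace normalization.
    """
    needle = normalize(quote)
    return [
        page
        for page, text in sorted(native_pages.items())
        if needle in normalize(text)
    ]


def require_quality_label(label):
    """Reject any quality label other than the three named in the quality gate."""
    if label not in QUALITY_LABELS:
        raise ValueError(f"unknown quality label: {label!r}")
    return label
```

An empty result is a signal to inspect the native file by hand, for example because the number was mangled by OCR in the Markdown copy.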


Do Not


Delete or overwrite the native source.
Treat OCR text as authoritative.
Mix extracted assets from multiple documents without source-specific folders.
Use a converted Markdown file for citation-sensitive evidence without a quality note.
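The first rule can be enforced mechanically when a native file is copied into the repository. The following is a minimal sketch, assuming a hypothetical helper named `preserve_native`; it relies on Python's exclusive-creation mode (`"x"`), which fails if the destination already exists instead of silently replacing it.

```python
import shutil
from pathlib import Path


def preserve_native(src, dest_dir):
    """Copy a native file into dest_dir, refusing to overwrite an existing copy.

    Opening the destination with mode "xb" raises FileExistsError if a file
    with the same name is already present, so a stored source is never replaced.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name

    with src.open("rb") as fin, dest.open("xb") as fout:
        shutil.copyfileobj(fin, fout)
    # Carry over timestamps and permission bits from the original file.
    shutil.copystat(src, dest)
    return dest
```

If a corrected native file genuinely needs to be stored, giving it a new name or `source_id` keeps both versions traceable rather than replacing the earlier one.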
