document-conversion

Document Conversion

Convert documents into useful reading copies without weakening provenance. The native document remains the source of truth; Markdown is a derived convenience layer for search, synthesis, and LLM reading.

Read First

  • references/repository-contract.md
  • references/document-conversion-policy.md
  • references/pdf-markdown-policy.md
  • references/source-ledger.md

Workflow

  1. Identify document type, provenance, identifiers, language, and whether OCR is needed.
  2. Preserve the native file under sources/pdfs/, reports/, or data/raw/.
  3. Pick the least lossy converter available: Docling/Nutrient-style PDF tools, pdftotext, PyMuPDF, OCR, Pandoc, or a project-specific parser.
  4. Save Markdown to sources/markdown/<source_id>.md when it is a research source.
  5. Save extracted figures/tables/assets under sources/assets/<source_id>/.
  6. Append a row to sources/conversion-ledger.csv.
  7. Update sources/source-ledger.csv and wiki/sources/<source_id>.md.
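
The file-handling steps in the workflow (preserve the native file, derive the Markdown and asset paths, append a ledger row) can be sketched as follows. This is a minimal illustration, not the project's tooling: the function names and the ledger column names are hypothetical, and the real schema is whatever references/source-ledger.md and the repository contract specify.

```python
import csv
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column names; the source does not define the ledger schema.
LEDGER_FIELDS = ["source_id", "native_path", "markdown_path",
                 "converter", "sha256", "converted_at", "quality"]


def preserve_native(src, root, subdir="sources/pdfs"):
    """Copy the native file into the repository, refusing to overwrite a copy."""
    dest_dir = Path(root) / subdir
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(src).name
    if dest.exists():
        raise FileExistsError(f"refusing to overwrite native source: {dest}")
    shutil.copy2(src, dest)
    return dest


def derived_paths(root, source_id):
    """Return (markdown_path, assets_dir) for one source, per the workflow layout."""
    root = Path(root)
    markdown_path = root / "sources" / "markdown" / f"{source_id}.md"
    assets_dir = root / "sources" / "assets" / source_id
    return markdown_path, assets_dir


def file_sha256(path):
    """Hash the native file so later readers can confirm it is unchanged."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def append_ledger_row(root, row):
    """Append one row to sources/conversion-ledger.csv, writing a header if new."""
    ledger = Path(root) / "sources" / "conversion-ledger.csv"
    ledger.parent.mkdir(parents=True, exist_ok=True)
    is_new = not ledger.exists()
    with ledger.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=LEDGER_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return ledger


def ledger_row(root, source_id, native_path, converter, quality):
    """Assemble a ledger row; the timestamp is recorded in UTC."""
    markdown_path, _ = derived_paths(root, source_id)
    return {
        "source_id": source_id,
        "native_path": str(native_path),
        "markdown_path": str(markdown_path),
        "converter": converter,
        "sha256": file_sha256(native_path),
        "converted_at": datetime.now(timezone.utc).isoformat(),
        "quality": quality,
    }
```

Appending rather than rewriting the ledger keeps earlier conversion attempts visible, and the overwrite check in preserve_native keeps the native file from being replaced by accident.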

Quality Gate

Before using converted text as evidence, sample-check:
  • title, authors, year, DOI/arXiv/PMID
  • headings and section order
  • tables, equations, captions, and footnotes
  • page numbers for exact claims
  • OCR errors in names, numbers, formulas, and citations
Mark the conversion as good, usable-with-checks, or poor. Exact quotes, numbers, metrics, and equations must be verified against the native file.
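
As a rough aid for the page-number and exact-quote checks, the sketch below validates the quality label and reports which pages of the native text contain a given quote. It assumes the caller already has the native file's text split by page (for example from pdftotext or PyMuPDF); the function names are hypothetical. A match only narrows down where to look: the quote still has to be read against the native file itself.

```python
import re

# The three quality labels named in the quality gate.
QUALITY_LABELS = ("good", "usable-with-checks", "poor")


def validate_quality(label):
    """Return the label unchanged, or raise if it is not one of the three allowed."""
    if label not in QUALITY_LABELS:
        raise ValueError(f"unknown quality label: {label!r}")
    return label


def normalize(text):
    """Collapse whitespace and case so line breaks in extracted text do not block a match."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def locate_quote(pages, quote):
    """Return 1-based page numbers whose text contains the quote after normalization."""
    needle = normalize(quote)
    if not needle:
        return []
    return [number for number, page in enumerate(pages, start=1)
            if needle in normalize(page)]
```

An empty result does not prove the quote is wrong; OCR errors in names, numbers, or formulas can prevent a match, which is itself a signal to check that passage by hand.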

Do Not

  • Delete or overwrite the native source.
  • Treat OCR text as authoritative.
  • Mix extracted assets from multiple documents without source-specific folders.
  • Use a converted Markdown file for citation-sensitive evidence without a quality note.
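
The last rule can be enforced mechanically with a small guard. The sketch below assumes the conversion ledger has source_id and quality columns; those column names and the function names are hypothetical, since the ledger schema is not given here. It only checks that a recognized quality note exists and does not decide whether a poor conversion is acceptable for a given use.

```python
import csv
import io
from typing import Optional

QUALITY_LABELS = {"good", "usable-with-checks", "poor"}


def quality_note(ledger_csv: str, source_id: str) -> Optional[str]:
    """Return the most recent recognized quality label for source_id, or None."""
    label = None
    for row in csv.DictReader(io.StringIO(ledger_csv)):
        if row.get("source_id") == source_id and row.get("quality") in QUALITY_LABELS:
            label = row["quality"]  # later rows override earlier ones
    return label


def has_quality_note(ledger_csv: str, source_id: str) -> bool:
    """True only when the ledger records a quality note for this source."""
    return quality_note(ledger_csv, source_id) is not None
```

A caller would read sources/conversion-ledger.csv into a string and refuse to use the converted Markdown as citation-sensitive evidence when has_quality_note returns False.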