document-conversion
Document Conversion
Convert documents into useful reading copies without weakening provenance. The
native document remains the source of truth; Markdown is a derived convenience
layer for search, synthesis, and LLM reading.
Read First
- `references/repository-contract.md`
- `references/document-conversion-policy.md`
- `references/pdf-markdown-policy.md`
- `references/source-ledger.md`
Workflow
- Identify document type, provenance, identifiers, language, and whether OCR is needed.
- Preserve the native file under `sources/pdfs/`, `reports/`, or `data/raw/`.
- Pick the least lossy converter available: Docling/Nutrient-style PDF tools, `pdftotext`, PyMuPDF, OCR, Pandoc, or a project-specific parser.
- Save Markdown to `sources/markdown/<source_id>.md` when it is a research source.
- Save extracted figures/tables/assets under `sources/assets/<source_id>/`.
- Append a row to `sources/conversion-ledger.csv`.
- Update `sources/source-ledger.csv` and `wiki/sources/<source_id>.md`.
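The save-and-record steps of the workflow can be sketched in Python as follows. This is a minimal illustration, not the project's actual tooling: the function name `record_conversion` and the ledger columns in `LEDGER_FIELDS` are hypothetical, since the workflow names the ledger file but not its schema. The conversion itself is assumed to have already happened, so the Markdown text is passed in as a string.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical column set: the workflow does not define the ledger schema.
LEDGER_FIELDS = ["source_id", "native_path", "markdown_path",
                 "converter", "quality", "converted_on"]


def record_conversion(root, source_id, native_path, markdown_text,
                      converter, quality):
    """Save the derived Markdown, create the per-source asset folder,
    and append one row to sources/conversion-ledger.csv."""
    root = Path(root)
    md_path = root / "sources" / "markdown" / f"{source_id}.md"
    assets_dir = root / "sources" / "assets" / source_id
    ledger = root / "sources" / "conversion-ledger.csv"

    md_path.parent.mkdir(parents=True, exist_ok=True)
    assets_dir.mkdir(parents=True, exist_ok=True)
    md_path.write_text(markdown_text, encoding="utf-8")

    # Append only; write the header once, when the ledger is first created.
    write_header = not ledger.exists()
    with ledger.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=LEDGER_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "source_id": source_id,
            "native_path": str(native_path),
            "markdown_path": str(md_path.relative_to(root)),
            "converter": converter,
            "quality": quality,
            "converted_on": date.today().isoformat(),
        })
    return md_path
```

Opening the ledger in append mode keeps earlier rows untouched, which matches the "append a row" wording. Updating the source ledger and the wiki page is left out because their formats are not described here.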
Quality Gate
Before using converted text as evidence, sample-check:
- title, authors, year, DOI/arXiv/PMID
- headings and section order
- tables, equations, captions, and footnotes
- page numbers for exact claims
- OCR errors in names, numbers, formulas, and citations
Mark the conversion as `good`, `usable-with-checks`, or `poor`. Exact quotes,
numbers, metrics, and equations must be verified against the native file.
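One way to keep the quality mark consistent is to store it as a small validated record. The sketch below is illustrative only: the `QualityNote` class, the check names in `CHECKS`, and the rule that a `good` conversion may not list failed checks are assumptions, not part of the policy above. Only the three status values come from the text.

```python
from dataclasses import dataclass, field

# The three status values named in the quality gate.
ALLOWED_STATUSES = ("good", "usable-with-checks", "poor")

# Hypothetical short names for the five sample checks listed above.
CHECKS = ("metadata", "headings", "tables_equations_captions_footnotes",
          "page_numbers", "ocr_errors")


@dataclass
class QualityNote:
    source_id: str
    status: str
    failed_checks: list = field(default_factory=list)

    def __post_init__(self):
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        unknown = set(self.failed_checks) - set(CHECKS)
        if unknown:
            raise ValueError(f"unknown checks: {sorted(unknown)}")
        # Assumed rule: 'good' means every sampled check passed.
        if self.status == "good" and self.failed_checks:
            raise ValueError("a 'good' conversion cannot list failed checks")

    def summary(self) -> str:
        failed = ", ".join(self.failed_checks) or "none"
        return f"{self.source_id}: {self.status} (failed checks: {failed})"
```

For example, `QualityNote("smith2020", "usable-with-checks", ["ocr_errors"]).summary()` produces a one-line note that could accompany the converted file. The record does not replace verifying exact quotes, numbers, metrics, and equations against the native file.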
Do Not
- Delete or overwrite the native source.
- Treat OCR text as authoritative.
- Mix extracted assets from multiple documents without source-specific folders.
- Use a converted Markdown file for citation-sensitive evidence without a quality note.
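Two of these rules lend themselves to simple guards in code. The following is a minimal sketch under stated assumptions: `preserve_native` and `asset_path` are hypothetical helper names, and the `sources/assets/<source_id>/` layout is taken from the workflow section.

```python
import shutil
from pathlib import Path


def preserve_native(src, dest_dir):
    """Copy a native file into dest_dir, refusing to overwrite an existing copy."""
    src = Path(src)
    dest = Path(dest_dir) / src.name
    if dest.exists():
        raise FileExistsError(f"refusing to overwrite native source: {dest}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)  # copy2 also preserves file metadata where possible
    return dest


def asset_path(root, source_id, filename):
    """Return a per-source location for an extracted asset, so that assets
    from different documents never share a folder."""
    # Path(filename).name drops any directory part of the supplied name.
    return Path(root) / "sources" / "assets" / source_id / Path(filename).name
```

Copying rather than moving leaves the file at its original location untouched, and raising on an existing destination makes an accidental overwrite an explicit error instead of a silent one.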