mineru-extract
MinerU Extract (official API)
Use MinerU as an upstream “content normalizer”: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown.
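The "poll for completion" step can be sketched as a small loop with a deadline. This is an illustration only, not the skill's actual code: `fetch_status` is a hypothetical callable that returns the task status as a dict, and the `state` key with the values `"done"` and `"failed"` is an assumption about the response shape. The defaults mirror the `--timeout 600` and `--poll-interval 2` parameters.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict],
    timeout: float = 600.0,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() until the task reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        state = status.get("state")  # assumed field name
        if state == "done":  # assumed terminal success value
            return status
        if state == "failed":  # assumed terminal failure value
            raise RuntimeError(f"MinerU task failed: {status.get('err_msg')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task not finished after {timeout} seconds")
        sleep(poll_interval)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.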
Quick start (MCP-aligned)
We align to the MinerU MCP mental model, but we do not run an MCP server.
- Primary script (MCP-style): `scripts/mineru_parse_documents.py`
  - Input: `--file-sources` (comma/newline-separated)
  - Output: JSON contract on stdout: `{ ok, items, errors }`
- Low-level script (single URL): `scripts/mineru_extract.py`

Auth:
- Set `MINERU_TOKEN` (Bearer token from mineru.net)

Default model heuristic:
- URLs ending with `.pdf/.doc/.ppt/.png/.jpg` → `pipeline`
- Otherwise → `MinerU-HTML` (best for HTML pages like WeChat articles)
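The default model heuristic could look roughly like the sketch below. The function name `pick_default_model` is hypothetical, and matching against the URL path (ignoring the query string and letter case) is an assumption; the real script may compare against the raw URL instead.

```python
from urllib.parse import urlparse

# Extensions listed in the heuristic above.
PIPELINE_EXTENSIONS = (".pdf", ".doc", ".ppt", ".png", ".jpg")


def pick_default_model(url: str) -> str:
    """Return 'pipeline' for document/image URLs, otherwise 'MinerU-HTML'."""
    # Assumption: compare only the path, so "?download=1" does not hide the extension.
    path = urlparse(url).path.lower()
    return "pipeline" if path.endswith(PIPELINE_EXTENSIONS) else "MinerU-HTML"
```

For example, `pick_default_model("https://example.com/report.pdf")` yields `pipeline`, while a WeChat article URL falls through to `MinerU-HTML`.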
1) Configure token (skill-local)
Put secrets in the skill root `.env` file (do not paste into chat outputs):
```bash
# In the mineru-extract skill directory: .env
MINERU_TOKEN=your_token_here
MINERU_API_BASE=https://mineru.net
```
2) Parse URL(s) → Markdown (recommended)
MCP-style wrapper (returns JSON, optionally includes markdown text):
```bash
python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL1>\n<URL2>" \
  --language ch \
  --enable-ocr \
  --model-version MinerU-HTML
```
If you want the markdown content inline in the JSON (can be large):
```bash
python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL>" \
  --model-version MinerU-HTML \
  --emit-markdown --max-chars 20000
```
Low-level (single URL, print markdown to stdout):
```bash
python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md
```
Output
The script always downloads + extracts the MinerU result zip to:
`~/.openclaw/workspace/mineru/<task_id>/`
It writes:
- `result.zip`
- extracted files (Markdown + JSON + assets)

It prints a JSON summary to stderr with paths:
- `task_id`, `full_zip_url`, `out_dir`, `markdown_path`
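The "extract the zip, then locate the main Markdown" step might be sketched as follows. The function name is hypothetical, and the selection rule (prefer a file named `full.md`, otherwise take the largest `.md` file) is an assumption about the archive layout rather than the script's confirmed logic.

```python
import zipfile
from pathlib import Path
from typing import Optional


def extract_and_find_markdown(zip_path: Path, out_dir: Path) -> Optional[Path]:
    """Extract zip_path into out_dir and return the most likely main Markdown file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    candidates = sorted(out_dir.rglob("*.md"))
    if not candidates:
        return None
    # Assumption: the main document is named full.md when present.
    for path in candidates:
        if path.name == "full.md":
            return path
    # Fallback heuristic: the largest Markdown file is probably the main one.
    return max(candidates, key=lambda p: p.stat().st_size)
```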
Parameters (common)
- `--model`: `pipeline | vlm | MinerU-HTML` (HTML requires `MinerU-HTML`)
- `--ocr/--no-ocr`: enable OCR (effective for `pipeline`/`vlm`)
- `--table/--no-table`: table recognition
- `--formula/--no-formula`: formula recognition
- `--language ch|en|...`
- `--page-ranges "2,4-6"` (non-HTML)
- `--timeout 600` / `--poll-interval 2`
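When calling the low-level script from other Python code, the flags above can be assembled into an argument list for `subprocess.run`. This is a sketch under the assumption that all listed flags are accepted by `scripts/mineru_extract.py`; `build_extract_args` is a hypothetical helper, and rejecting `--page-ranges` for `MinerU-HTML` is this sketch's own choice based on the "(non-HTML)" note.

```python
import sys
from typing import List, Optional


def build_extract_args(
    url: str,
    model: str = "MinerU-HTML",
    ocr: Optional[bool] = None,
    table: Optional[bool] = None,
    formula: Optional[bool] = None,
    language: Optional[str] = None,
    page_ranges: Optional[str] = None,
    timeout: int = 600,
    poll_interval: int = 2,
) -> List[str]:
    """Build an argv list for mineru_extract.py; None means 'use the script default'."""
    if model not in ("pipeline", "vlm", "MinerU-HTML"):
        raise ValueError(f"unknown model: {model}")
    args = [sys.executable, "mineru-extract/scripts/mineru_extract.py", url, "--model", model]
    # Boolean toggles map to --name / --no-name pairs.
    for name, value in (("ocr", ocr), ("table", table), ("formula", formula)):
        if value is not None:
            args.append(f"--{name}" if value else f"--no-{name}")
    if language:
        args += ["--language", language]
    if page_ranges:
        if model == "MinerU-HTML":
            raise ValueError("--page-ranges applies to non-HTML input only")
        args += ["--page-ranges", page_ranges]
    args += ["--timeout", str(timeout), "--poll-interval", str(poll_interval)]
    return args
```

The resulting list can be passed directly to `subprocess.run(args, capture_output=True, text=True)`, which avoids shell quoting issues with URLs.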
Failure modes & fallbacks
- MinerU may fail to fetch some URLs (anti-bot / geo / login).
- Fallback: provide an HTML file or a PDF/long screenshot; then implement “upload + parse” flow with MinerU batch upload endpoints.
- Always report the failing URL + MinerU `err_msg` and keep an original-source link in outputs.
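The reporting rule can be sketched against the wrapper's `{ ok, items, errors }` contract. The shape of each entry in `errors` is not documented here, so the `source` and `err_msg` keys below are assumptions, and `summarize_failures` is a hypothetical helper.

```python
import json
from typing import List


def summarize_failures(stdout_text: str) -> List[str]:
    """Turn the wrapper's JSON contract into one report line per failed source."""
    contract = json.loads(stdout_text)
    lines: List[str] = []
    for err in contract.get("errors", []):
        # Assumed keys: "source" (the failing URL) and "err_msg" (MinerU's message).
        source = err.get("source", "<unknown URL>")
        err_msg = err.get("err_msg", "<no err_msg>")
        lines.append(f"MinerU failed for {source}: {err_msg} (original source: {source})")
    return lines
```

Keeping the original URL in every line means a reader can still open the source by hand when MinerU cannot fetch it.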
References
- MinerU API docs: https://mineru.net/apiManage/docs
- MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/