feishu-doc-scraper
Feishu Doc Scraper
Extract a Feishu/Lark source into faithful local Markdown. Prefer the lark-cli API — it extracts the body programmatically (no model paraphrasing), follows a collection's reference graph, and reads permission boundaries from error codes instead of guessing. Treat the rendered browser page as a fallback, not the source of truth: in real collection-scraping work the API path consistently does the whole job while the browser path is never needed.
Scope (read this first)
This skill's contract is faithful per-source Markdown + a record of what was extracted. It does not decide how the resulting files are named, indexed, deduplicated against existing notes, or organized into a knowledge base — that belongs to the host PKM / the user's own conventions. Stopping at faithful extraction keeps this skill orthogonal and reusable. When the user wants the output filed into a vault, extract first, then hand the clean Markdown to their organizing workflow.
Choose the path
Is the source a Feishu/Lark URL (wiki / docx / sheets / minutes / base)?
├── YES → is lark-cli installed and authenticated to that tenant?
│ ├── YES → PATH A: lark-cli API extraction (primary — start here)
│ │ └── hit code 131006 / 99991679 (permission denied)?
│ │ └── PATH B: owner-exported .docx → faithful Markdown
│ └── NO → install/auth lark-cli first (it is worth it); only if
│ truly impossible → PATH D: browser DOM fallback
├── the URL is a Minutes / 妙记 link, or a doc references one → PATH C: Minutes transcript
└── you were handed an exported .docx (not a URL) → PATH B

A collection/hub is just a docx whose body references other docs. Path A handles it by recursively following the reference graph, not by visiting pages in a browser.
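The decision tree above can be sketched as a small pure function. This is an illustrative reading of the tree, not part of the skill: `choose_path` and its parameters are hypothetical names, and checking for a Minutes link before the lark-cli check is an assumption about branch order.

```python
from urllib.parse import urlparse


def choose_path(source: str, cli_ready: bool, permission_denied: bool = False) -> str:
    """Return "A", "B", "C", or "D" for a source, following the tree above.

    cli_ready: lark-cli is installed and authenticated to the tenant.
    permission_denied: a previous attempt hit code 131006 / 99991679.
    """
    # A local exported file (not a URL) goes straight to the docx path.
    if "://" not in source and source.lower().endswith(".docx"):
        return "B"
    # Minutes links use the transcript path (assumed to take priority).
    if "/minutes/" in urlparse(source).path:
        return "C"
    # Browser fallback only when lark-cli truly cannot be set up.
    if not cli_ready:
        return "D"
    # API extraction is the default; a permission wall redirects to Path B.
    return "B" if permission_denied else "A"
```

Keeping the choice in one function makes the "start with Path A" default explicit and leaves Path D reachable only through the `cli_ready` flag.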
Path A — lark-cli API extraction (primary)
Full command catalog, recursion engine, cross-tenant and personal-space nuances: references/lark-cli-api-extraction.md. The essentials for the common case:
**1. Disable the proxy for Feishu domestic domains.** Feishu's `*.feishu.cn` endpoints are direct-connect in mainland China; routing them through a local proxy leaks credentials through the proxy and gets DNS-hijacked. lark-cli itself warns about this. Always:
```bash
export LARK_CLI_NO_PROXY=1
```
This does not conflict with any "Claude/Anthropic domains must use the proxy" rule: Feishu is a different host and is direct.
**2. Classify the URL, then resolve to a fetchable doc token.**
- `…/wiki/<node_token>`: a wiki node token is not a doc token. Resolve it first:
  ```bash
  lark-cli wiki spaces get_node --params '{"token":"<node_token>"}'
  # → .data.node.obj_token and .data.node.obj_type (e.g. "docx")
  ```
- `…/docx/<doc_token>`: already a doc token, fetch directly.
- `…/sheets/<token>`: spreadsheet, use the sheets commands (see reference).
- `…/minutes/<token>`: Minutes, go to Path C.
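The URL classification in step 2 can be sketched as follows. This is a minimal illustration that assumes the token is the path segment immediately after the type segment; `FeishuRef` and `classify_feishu_url` are hypothetical names, not part of lark-cli or the bundled scripts.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class FeishuRef(NamedTuple):
    kind: str   # "wiki", "docx", "sheets", or "minutes"
    token: str  # for "wiki" this is a node token, not yet a doc token


_KINDS = ("wiki", "docx", "sheets", "minutes")


def classify_feishu_url(url: str) -> Optional[FeishuRef]:
    """Split a Feishu/Lark URL into (kind, token), or None if unrecognized."""
    # urlparse drops the query string and fragment from .path.
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Look for a known type segment that is followed by a token segment.
    for i, segment in enumerate(segments[:-1]):
        if segment in _KINDS:
            return FeishuRef(segment, segments[i + 1])
    return None
```

A `wiki` result still has to go through `lark-cli wiki spaces get_node` before fetching; only a `docx` result is directly fetchable.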
**3. Fetch the body as Markdown, programmatically, never via the model.**
```bash
lark-cli docs +fetch --doc <obj_token> --format json > /tmp/fetch.json 2> /tmp/fetch.err
# body is .data.markdown — extract with jq, do NOT retype or summarize it
jq -r '.data.markdown' /tmp/fetch.json > source.md
```
Keep stdout and stderr separate. A harmless `[deprecated] docs +fetch with v1 API is deprecated` goes to stderr; piping `2>/dev/null` *and* `jq` together produced a false `Exit code 5` in practice — redirect to files and inspect, don't blind-pipe. The body must reach disk without passing through the model (paraphrasing silently corrupts source text — this is the single most important fidelity rule).
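For callers that drive the fetch from a script rather than a shell, the same discipline can be sketched in Python. This is an illustrative sketch: the function names are hypothetical, the lark-cli arguments are the ones shown in step 3, and unlike `jq -r` the body is written without an added trailing newline.

```python
import json
import os
import subprocess
from pathlib import Path


def lark_env() -> dict:
    """Copy the current environment with the proxy disabled for lark-cli."""
    env = dict(os.environ)
    env["LARK_CLI_NO_PROXY"] = "1"
    return env


def run_fetch(obj_token: str, out_json: Path, err_log: Path) -> int:
    """Run the step 3 fetch with stdout and stderr kept in separate files."""
    cmd = ["lark-cli", "docs", "+fetch", "--doc", obj_token, "--format", "json"]
    with out_json.open("wb") as out, err_log.open("wb") as err:
        return subprocess.run(cmd, stdout=out, stderr=err, env=lark_env()).returncode


def write_markdown_body(fetch_json: Path, out_md: Path) -> int:
    """Copy .data.markdown to disk byte-for-byte; return its length in characters."""
    payload = json.loads(fetch_json.read_text(encoding="utf-8"))
    body = (payload.get("data") or {}).get("markdown")
    if not isinstance(body, str):
        raise ValueError(".data.markdown is missing or not a string")
    out_md.write_text(body, encoding="utf-8")
    return len(body)
```

Because stderr lands in its own file, the deprecation notice can be inspected afterwards instead of being discarded or mixed into the JSON.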
**4. If it's a collection/hub, follow the reference graph (BFS).** The hub body contains `<mention-doc>`, `<sheet>`, `<image>` tags and cross-tenant / Minutes / Tencent-Meeting URLs. Extract every reference, dispatch by type, fetch, and **repeat on each newly fetched doc until no new references remain** (leaf nodes). Use the bundled extractor so nothing is silently missed (a missed reference = a missing document, the #1 hub-scraping failure):
```bash
python3 scripts/feishu_extract_refs.py source.md   # → JSON list of {type, token, title}
```
Recursion loop, dispatch table, and the cross-tenant/`my.feishu.cn` personal-space rules are in the reference.

**5. Final residual-tag check (acceptance gate for collections).** Every rich-media reference must have been resolved and rendered:
```bash
grep -rlE '<(lark-table|lark-tr|sheet token=|mention-doc|view type=)' . && echo "UNRESOLVED — keep recursing" || echo "clean"
```
Must be empty before you stop.
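The breadth-first traversal in step 4 can be sketched as below. It is a generic BFS under stated assumptions, not the skill's recursion engine: `fetch_body` and `extract_refs` are hypothetical callables standing in for the lark-cli fetch and for `scripts/feishu_extract_refs.py`, and a reference is reduced to a `(type, token)` pair.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Tuple

Ref = Tuple[str, str]  # (type, token)


def crawl_references(
    root: Ref,
    fetch_body: Callable[[Ref], str],
    extract_refs: Callable[[str], Iterable[Ref]],
) -> Dict[Ref, str]:
    """Fetch the root and every document reachable from it, each exactly once."""
    bodies: Dict[Ref, str] = {}
    seen = {root}
    queue = deque([root])
    while queue:
        ref = queue.popleft()
        body = fetch_body(ref)
        bodies[ref] = body
        # Enqueue only references not seen before, so cycles terminate.
        for child in extract_refs(body):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return bodies
```

The `seen` set is what lets the loop stop at leaf nodes even when two documents reference each other.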
Path B — permission denied → owner-exported .docx
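When `lark-cli wiki spaces get_node` fails with `code 131006 … node permission denied, user needs read permission`, the boundary is server-side and cannot be bypassed. Ask the document owner to export the doc as `.docx`, then convert that file to faithful Markdown: heading levels come from the real font sizes and `w:shd` highlights are restored, without retyping body text. Procedure: references/docx-export-to-markdown.md.
Path C — Feishu Minutes (妙记) transcript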
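Take the transcript from the platform's native transcription through lark-cli (`lark-cli minutes`, `lark-cli api`); never download the media and run ASR again. Native transcript API and scope authorization: references/feishu-minutes-transcript.md.
Path D — browser DOM fallback (last resort)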
Use this only when lark-cli genuinely cannot reach the content (no install possible, and the doc is not permission-walled). This is the old virtual-scroll / TOC-driven `data-block-id` DOM capture workflow. It is slower, depends on a connected browser surface (the in-browser extension frequently fails to connect), and an anonymous debugging Chrome can only tell you whether a page is publicly reachable; it cannot read login-walled content. Workflow: references/browser-dom-fallback.md. Battle-tested DOM rules (virtual scroll, ordering, table/bullet extraction, image streams): references/browser-failure-rules.md.
Hard rules
These are the rules whose violation silently ruins the output. Each has a reason — follow the reason, not just the letter.
- Never let the document body pass through the model. Extract with `jq`/`cat`/scripts straight to disk. The model paraphrasing source text is undetectable later and destroys fidelity. This is why Path A beats the browser path structurally.
- `export LARK_CLI_NO_PROXY=1` for `*.feishu.cn`. Otherwise credentials transit a local proxy and DNS is hijacked.
- Transcripts come from the platform's native transcription, never re-ASR. Downloading media and transcribing again loses speaker labels, timestamps, and accuracy.
- Markdown generated from a docx is not done until it has been visually verified against the source (render to image, read it). Feishu-exported docx uses font-size+bold for headings rather than Word heading styles, so a "no errors, word count matches" check passes while the entire heading hierarchy is silently flat. Text-level checks cannot catch this.
- Do not grind (死磕) on docx embedded-image download. lark-cli (through 1.0.32) cannot download `<image>` tokens from a docx; this was exhaustively verified. Register the image tokens and note "needs document owner to right-click → save"; the text is the value, images are a tracked gap.
- HTTP 200 from anonymous curl ≠ accessible. A Feishu login wall returns 200 with a body containing `accounts.feishu.cn/login` / `passport` / an empty `<title>`. Check the body, never infer "public" from the status code.
- A file "not found" by a search agent is not authoritative. Verify against authoritative sources before concluding (this is general Inference Discipline; relevant when locating where ingested content already lives).
- U+FFFD final check on every produced file: `LC_ALL=C grep -rl $'\xef\xbf\xbd' .` must be empty. A replacement character means an encoding step corrupted the text.
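The "HTTP 200 ≠ accessible" rule above can be sketched as a body check. This is a crude heuristic built only from the three markers the rule names; `looks_login_walled` is a hypothetical helper, and a plain substring match on `passport` can misfire on a public page that happens to contain that word.

```python
import re

# Markers named in the rule above; substring matching is a deliberate simplification.
_LOGIN_MARKERS = ("accounts.feishu.cn/login", "passport")
_TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_login_walled(body: str) -> bool:
    """Guess from the response body (not the status code) whether a login wall was served."""
    if any(marker in body for marker in _LOGIN_MARKERS):
        return True
    # An empty <title> is the third signal; a missing <title> is not treated as one.
    match = _TITLE_RE.search(body)
    return match is not None and match.group(1).strip() == ""
```

A `True` result only means "do not call this page public"; it does not say anything about whether the API path can read the document.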
Acceptance contract
Stop only when all that apply are true:
- Every fetched body reached disk via `jq`/script, not retyped by the model.
- Collections: the residual rich-media-tag grep (Path A step 5) is empty: every `mention-doc`/`sheet`/cross-tenant reference was followed to a leaf.
- `LC_ALL=C grep -rl $'\xef\xbf\xbd' .` is empty.
- docx path: rendered to an image and visually compared to the source; heading hierarchy and highlights match (see docx reference's checklist).
- Browser fallback only: TOC coverage + scale check (see browser-failure-rules.md).
- Each output file's frontmatter records `source` (the original URL/token) and, if any post-processing was applied, a `post_process` provenance line.
- Permission gaps (131006 docs not exported yet, undownloadable images) are explicitly listed for the user; a transparent gap beats a silent omission.
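The two grep-based gates in this contract can be combined into one scan, sketched below. It reuses the tag pattern from Path A step 5 and the U+FFFD byte sequence from the hard rules; `acceptance_gaps` is a hypothetical name, and limiting the scan to `*.md` files is an assumption (the grep commands scan every file under the directory).

```python
import re
from pathlib import Path
from typing import Dict, List

# Same alternatives as the residual-tag grep in Path A step 5.
RESIDUAL_TAG = re.compile(rb"<(lark-table|lark-tr|sheet token=|mention-doc|view type=)")
# UTF-8 encoding of U+FFFD, i.e. b"\xef\xbf\xbd".
REPLACEMENT_CHAR = "\ufffd".encode("utf-8")


def acceptance_gaps(root: Path) -> Dict[str, List[str]]:
    """List Markdown files that still fail the residual-tag or U+FFFD gate."""
    gaps: Dict[str, List[str]] = {"residual_tags": [], "replacement_chars": []}
    for path in sorted(root.rglob("*.md")):
        data = path.read_bytes()  # bytes, so a bad encoding cannot hide the marker
        if RESIDUAL_TAG.search(data):
            gaps["residual_tags"].append(str(path))
        if REPLACEMENT_CHAR in data:
            gaps["replacement_chars"].append(str(path))
    return gaps
```

Both lists being empty corresponds to the two "is empty" bullets; the remaining bullets (visual docx check, frontmatter, listed permission gaps) still need their own verification.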
Do NOT attempt
Verified dead-ends — retrying them only wastes the session. Full table with failure modes and root causes: references/permission-and-failure-boundaries.md. The top ones:
- Bypassing `131006` permission-denied by any means (lark-cli / curl / anonymous browser): it is a server-side boundary.
- Downloading docx embedded images via `docs +media-download`, `api …/drive/v1/medias/<t>/download` (with or without `extra`), or `schema drive.medias.download`: none work; lark-cli even mis-reports the real HTTP 400 as "empty JSON".
- `WebFetch` against `open.feishu.cn/document/server-docs/...` for API specs: the backend is flaky; use `open.feishu.cn/llms-docs/zh-CN/llms-<module>.txt` instead (LLM-friendly, stable).
- AppleScript/JXA `executeJavaScript`, Chrome CDP on port 9222: disabled/empty in this environment (browser path only).
- Using `minimax-docx` to convert docx→md: it is a docx authoring tool; use the doc-to-markdown skill instead.
Bundled resources
- `scripts/feishu_extract_refs.py`: deterministic reference-token extractor; the recursion engine's core. Run it on every fetched body to enumerate `<mention-doc>`/`<sheet>`/`<image>`/cross-tenant/Minutes/Tencent-Meeting references as JSON.
- `scripts/restore_docx_headings.py`: for Path B: reads true font sizes via python-docx, maps them to heading levels, restores `w:shd` highlights to Obsidian `==…==`, without retyping body text.
- `scripts/feishu_dom_capture.js`: Path D: injectable end-to-end browser DOM capture.
- `scripts/download_feishu_images.py`: Path D: SSR image extraction when browser automation is unavailable.
- `scripts/build_feishu_markdown.py`: Path D: render a capture manifest into Markdown.
- `scripts/check_heading_coverage.py`: coverage verification (both paths).
- `references/lark-cli-api-extraction.md`: Path A full reference (commands, recursion, sheets, cross-tenant).
- `references/feishu-minutes-transcript.md`: Path C native transcript API + scope auth.
- `references/permission-and-failure-boundaries.md`: error codes + the full Do-NOT-attempt table.
- `references/docx-export-to-markdown.md`: Path B faithful conversion procedure.
- `references/browser-dom-fallback.md` + `references/browser-failure-rules.md`: Path D.
- `references/capture-manifest.md`: manifest shape for `build_feishu_markdown.py`.
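The font-size-to-heading idea behind `scripts/restore_docx_headings.py` can be illustrated with a pure function. This is a hypothetical sketch of one possible mapping, not the bundled script's actual algorithm: it assumes paragraphs were already read (for example with python-docx) into `(text, font size in pt, is_bold)` tuples, treats the most common size as body text, and ranks larger bold sizes as heading levels.

```python
from collections import Counter
from typing import List, Tuple

Para = Tuple[str, float, bool]  # (text, font size in pt, is_bold)


def to_markdown_headings(paragraphs: List[Para]) -> List[str]:
    """Prefix bold, larger-than-body paragraphs with '#' markers; leave text untouched."""
    if not paragraphs:
        return []
    # Assumption: the most frequent font size is the body size.
    body_size = Counter(size for _, size, _ in paragraphs).most_common(1)[0][0]
    # Distinct heading sizes, largest first: largest becomes level 1.
    heading_sizes = sorted(
        {size for _, size, bold in paragraphs if bold and size > body_size},
        reverse=True,
    )
    level = {size: rank + 1 for rank, size in enumerate(heading_sizes)}
    lines = []
    for text, size, bold in paragraphs:
        if bold and size in level:
            lines.append("#" * min(level[size], 6) + " " + text)
        else:
            lines.append(text)
    return lines
```

Because the function only prepends markers, the paragraph text itself is never retyped, which is the property the Path B conversion depends on.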
Next step
After extraction completes, the clean Markdown typically feeds the user's own knowledge-base ingestion (filing, indexing, dedup) — which is deliberately out of this skill's scope. If the source went through Path B (a docx), the doc-to-markdown skill is already part of that flow. Offer the handoff; do not auto-organize:
Extraction complete: [N] sources → faithful Markdown ([M] permission/image gaps listed).
Options:
A) Hand off to your PKM/organizing workflow — file & index these (Recommended if part of a vault)
B) Run /daymade-docs:docs-cleaner — consolidate redundant content across the extracted files
C) Stop here — the faithful Markdown is the deliverable