feishu-doc-scraper

Feishu Doc Scraper

Extract a Feishu/Lark source into faithful local Markdown. Prefer the lark-cli API — it extracts the body programmatically (no model paraphrasing), follows a collection's reference graph, and reads permission boundaries from error codes instead of guessing. Treat the rendered browser page as a fallback, not the source of truth: in real collection-scraping work the API path consistently does the whole job while the browser path is never needed.

Scope (read this first)

This skill's contract is faithful per-source Markdown + a record of what was extracted. It does not decide how the resulting files are named, indexed, deduplicated against existing notes, or organized into a knowledge base — that belongs to the host PKM / the user's own conventions. Stopping at faithful extraction keeps this skill orthogonal and reusable. When the user wants the output filed into a vault, extract first, then hand the clean Markdown to their organizing workflow.

Choose the path

Is the source a Feishu/Lark URL (wiki / docx / sheets / minutes / base)?
├── YES → is lark-cli installed and authenticated to that tenant?
│        ├── YES → PATH A: lark-cli API extraction  (primary — start here)
│        │         └── hit code 131006 / 99991679 (permission denied)?
│        │              └── PATH B: owner-exported .docx → faithful Markdown
│        └── NO  → install/auth lark-cli first (it is worth it); only if
│                  truly impossible → PATH D: browser DOM fallback
├── the URL is a Minutes / 妙记 link, or a doc references one → PATH C: Minutes transcript
└── you were handed an exported .docx (not a URL) → PATH B
A collection/hub is just a docx whose body references other docs — Path A handles it by recursively following the reference graph, not by visiting pages in a browser.
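
The decision tree above can be sketched as two small functions. This is a minimal illustration, not the skill's own code: `classify_source` and `choose_path` are hypothetical names, and the URL pattern assumes the token is the path segment right after `wiki`, `docx`, `sheets`, `minutes`, or `base`.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: the source kind, then the token in the next path segment.
_KIND_RE = re.compile(r"/(wiki|docx|sheets|minutes|base)/([A-Za-z0-9_-]+)")


def classify_source(source: str) -> Tuple[str, Optional[str]]:
    """Return (kind, token) for a URL, or ("docx_file", None) for an exported file."""
    if source.lower().endswith(".docx") and "://" not in source:
        return ("docx_file", None)
    match = _KIND_RE.search(source)
    if match is None:
        return ("unknown", None)
    return (match.group(1), match.group(2))


def choose_path(kind: str, cli_ready: bool, permission_denied: bool = False) -> str:
    """Map a classified source onto Path A/B/C/D as the tree describes."""
    if kind == "docx_file":
        return "B"  # handed an exported .docx, not a URL
    if kind == "minutes":
        return "C"  # Minutes transcript
    if kind == "unknown":
        raise ValueError("not a recognised Feishu/Lark source")
    if not cli_ready:
        return "D"  # only after install/auth has proven truly impossible
    # Permission denied (131006 / 99991679) falls through to the owner export.
    return "B" if permission_denied else "A"
```

For a wiki URL the returned token is still a node token, so it has to be resolved to a doc token before fetching.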

Path A — lark-cli API extraction (primary)

Full command catalog, recursion engine, cross-tenant and personal-space nuances: references/lark-cli-api-extraction.md. The essentials for the common case:
1. Disable the proxy for Feishu domestic domains. Feishu's `*.feishu.cn` endpoints are direct-connect in mainland China; routing them through a local proxy sends credentials through that proxy and exposes the requests to DNS hijacking. lark-cli itself warns about this. Always:
bash
export LARK_CLI_NO_PROXY=1
This does not conflict with any "Claude/Anthropic domains must use the proxy" rule — Feishu is a different host and is direct.
2. Classify the URL, then resolve to a fetchable doc token.
  • `…/wiki/<node_token>`: a wiki node token is not a doc token. Resolve it first:
    bash
    lark-cli wiki spaces get_node --params '{"token":"<node_token>"}'
    # → .data.node.obj_token  and  .data.node.obj_type  (e.g. "docx")
  • `…/docx/<doc_token>`: already a doc token, fetch directly.
  • `…/sheets/<token>`: spreadsheet, use the sheets commands (see reference).
  • `…/minutes/<token>`: Minutes, go to Path C.
3. Fetch the body as Markdown — programmatically, never via the model.
bash
lark-cli docs +fetch --doc <obj_token> --format json > /tmp/fetch.json 2> /tmp/fetch.err
# body is .data.markdown: extract with jq, do NOT retype or summarize it
jq -r '.data.markdown' /tmp/fetch.json > source.md

Keep stdout and stderr separate. A harmless `[deprecated] docs +fetch with v1 API is deprecated` notice goes to stderr; combining `2>/dev/null` with a pipe into `jq` produced a false `Exit code 5` in practice, so redirect to files and inspect them rather than blind-piping. The body must reach disk without passing through the model: paraphrasing silently corrupts source text, and this is the single most important fidelity rule.

4. If it's a collection/hub, follow the reference graph (BFS). The hub body contains `<mention-doc>`, `<sheet>`, `<image>` tags and cross-tenant / Minutes / Tencent-Meeting URLs. Extract every reference, dispatch by type, fetch, and **repeat on each newly fetched doc until no new references remain** (leaf nodes). Use the bundled extractor so nothing is silently missed (a missed reference = a missing document, the #1 hub-scraping failure):

```bash
python3 scripts/feishu_extract_refs.py source.md   # → JSON list of {type, token, title}
```

The recursion loop, dispatch table, and the cross-tenant / `my.feishu.cn` personal-space rules are in the reference.
5. Final residual-tag check (acceptance gate for collections). Every rich-media reference must have been resolved and rendered:
bash
grep -rlE '<(lark-table|lark-tr|sheet token=|mention-doc|view type=)' . && echo "UNRESOLVED — keep recursing" || echo "clean"
The list of matching files must be empty (the command prints `clean`) before you stop.
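
The reference-graph traversal in step 4 can be sketched as a plain breadth-first search. This is an illustration under stated assumptions, not the recursion engine from the reference: `crawl_reference_graph`, `fetch_body`, and `extract_refs` are hypothetical names, and the per-type dispatch (sheets, Minutes, cross-tenant URLs) is left out. In real use `fetch_body` would wrap the `lark-cli` fetch plus `jq` step and `extract_refs` would wrap `scripts/feishu_extract_refs.py`, so the body text still never passes through the model.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl_reference_graph(
    root_token: str,
    fetch_body: Callable[[str], str],
    extract_refs: Callable[[str], Iterable[str]],
) -> Dict[str, str]:
    """Breadth-first walk: fetch each doc once and queue every new reference."""
    bodies: Dict[str, str] = {}
    seen = {root_token}
    queue = deque([root_token])
    while queue:
        token = queue.popleft()
        body = fetch_body(token)
        bodies[token] = body
        for ref in extract_refs(body):
            # The seen-set stops cycles (hub -> doc -> hub) from looping forever.
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return bodies
```

The loop ends exactly when no fetched body yields an unseen reference, which is the leaf condition the step describes.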

Path B — permission denied → owner-exported .docx

`lark-cli wiki spaces get_node` returning `code 131006 … node permission denied, user needs read permission` (or a fetch returning it) is a hard Feishu-side boundary. lark-cli, anonymous curl, and the browser all fail against it; this has been verified exhaustively, so do not spend cycles trying to bypass it. The only correct move: ask the permission holder to export the doc as `.docx` and send it back out-of-band, then convert with fidelity (font-size→heading and `w:shd`→highlight restoration, then visual verification). Full procedure: references/docx-export-to-markdown.md.
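
To make the font-size→heading idea concrete, here is a minimal sketch of one possible mapping. It is an assumption for illustration only: the actual rules live in `scripts/restore_docx_headings.py` and the reference, and may differ (for example, by also requiring bold). The function name `heading_levels_by_font_size` and the "rank distinct sizes above body size" heuristic are hypothetical.

```python
from typing import Dict, Iterable


def heading_levels_by_font_size(
    sizes_pt: Iterable[float], body_size_pt: float, max_level: int = 6
) -> Dict[float, int]:
    """Rank distinct font sizes larger than body text: largest size -> level 1."""
    larger = sorted({size for size in sizes_pt if size > body_size_pt}, reverse=True)
    # Clamp to max_level so deep size ladders still map onto valid Markdown headings.
    return {size: min(rank, max_level) for rank, size in enumerate(larger, start=1)}
```

A mapping like this only decides heading levels; the paragraph text itself should be copied through unchanged, and the result still needs the visual check against the source.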

Path C — Feishu Minutes (妙记) transcript

`lark-cli minutes` only returns metadata and can download audio/video; it cannot export the text transcript. The transcript comes from a native endpoint called through `lark-cli api`, and needs an extra scope granted via a device-flow login. Native AI transcription is far better than downloading the media and re-running ASR; never do the latter. Endpoint, scope name, the device-flow timeout trap, and the per-Minutes-item (not per-tenant) permission behavior: references/feishu-minutes-transcript.md.

Path D — browser DOM fallback (last resort)

Use this only when lark-cli genuinely cannot reach the content (no install possible, and the doc is not permission-walled). This is the old virtual-scroll / TOC-driven DOM capture workflow. It is slower, depends on a connected browser surface (the in-browser extension frequently fails to connect), and an anonymous debugging Chrome can only tell you whether a page is publicly reachable; it cannot read login-walled content. Workflow: references/browser-dom-fallback.md. Battle-tested DOM rules (virtual scroll, `data-block-id` ordering, table/bullet extraction, image streams): references/browser-failure-rules.md.

Hard rules

These are the rules whose violation silently ruins the output. Each has a reason; follow the reason, not just the letter.
  • Never let the document body pass through the model. Extract with `jq` / `cat` / scripts straight to disk. The model paraphrasing source text is undetectable later and destroys fidelity. This is why Path A beats the browser path structurally.
  • `export LARK_CLI_NO_PROXY=1` for `*.feishu.cn`. Otherwise credentials transit a local proxy and DNS is hijacked.
  • Transcripts come from the platform's native transcription, never re-ASR. Downloading media and transcribing again loses speaker labels, timestamps, and accuracy.
  • A generated docx Markdown is not done until it has been visually verified against the source (render to image, read it). Feishu-exported docx uses font-size+bold for headings rather than Word heading styles, so a "no errors, word count matches" check passes while the entire heading hierarchy is silently flat. Text-level checks cannot catch this.
  • Do not grind away at docx embedded-image download. lark-cli (through 1.0.32) cannot download `<image>` tokens from a docx (exhaustively verified). Register the image tokens and note "needs document owner to right-click → save"; the text is the value, images are a tracked gap.
  • HTTP 200 from anonymous curl ≠ accessible. A Feishu login wall returns 200 with a body containing `accounts.feishu.cn` / `login` / `passport` / an empty `<title>`. Check the body, never infer "public" from the status code.
  • A file "not found" by a search agent is not authoritative. Verify against authoritative sources before concluding (this is general Inference Discipline; relevant when locating where ingested content already lives).
  • U+FFFD final check on every produced file: `LC_ALL=C grep -rl $'\xef\xbf\xbd' .` must be empty. A replacement character means an encoding step corrupted the text.
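
The "HTTP 200 ≠ accessible" rule can be sketched as a body check. This is a minimal illustration: `looks_like_login_wall` is a hypothetical helper, and the markers are just the ones listed in the rule. Plain substrings such as `login` are broad and can flag a genuinely public page, so a positive result is better read as "inspect manually" than as a verdict.

```python
import re

# Markers taken from the rule above; matching is case-insensitive.
LOGIN_WALL_MARKERS = ("accounts.feishu.cn", "login", "passport")
_TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_login_wall(body: str) -> bool:
    """Return True when a response body shows signs of a login wall."""
    lowered = body.lower()
    if any(marker in lowered for marker in LOGIN_WALL_MARKERS):
        return True
    title = _TITLE_RE.search(body)
    # An empty <title> is the remaining signal named in the rule.
    return title is not None and title.group(1).strip() == ""
```

The status code never enters the function: only the body is inspected.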

Acceptance contract

Stop only when all that apply are true:
  • Every fetched body reached disk via `jq`/script, not retyped by the model.
  • Collections: the residual rich-media-tag grep (Path A step 5) is empty; every `mention-doc` / `sheet` / cross-tenant reference was followed to a leaf.
  • `LC_ALL=C grep -rl $'\xef\xbf\xbd' .` is empty.
  • docx path: rendered to an image and visually compared to the source; heading hierarchy and highlights match (see docx reference's checklist).
  • Browser fallback only: TOC coverage + scale check (see browser-failure-rules.md).
  • Each output file's frontmatter records `source` (the original URL/token) and, if any post-processing was applied, a `post_process` provenance line.
  • Permission gaps (131006 docs not exported yet, undownloadable images) are explicitly listed for the user: a transparent gap beats a silent omission.
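
The text-level parts of this contract can be collected into one check per file. The sketch below is an assumption-laden illustration: `acceptance_problems` and `has_source_frontmatter` are hypothetical names, the frontmatter is assumed to be a YAML block delimited by `---` lines with a `source:` key, and the tag pattern is copied from the Path A step 5 grep. It does not replace the visual docx comparison, which no text check can cover.

```python
import re
from typing import List

# Same alternatives as the residual-tag grep in Path A step 5.
RESIDUAL_TAG_RE = re.compile(r"<(lark-table|lark-tr|sheet token=|mention-doc|view type=)")
REPLACEMENT_CHAR = "\ufffd"


def has_source_frontmatter(markdown: str) -> bool:
    """True when a leading '---' block contains a line starting with 'source:'."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return False
    for line in lines[1:]:
        if line.strip() == "---":
            return False  # frontmatter closed without a source key
        if line.startswith("source:"):
            return True
    return False


def acceptance_problems(markdown: str) -> List[str]:
    """List the text-level acceptance failures found in one output file."""
    problems = []
    if RESIDUAL_TAG_RE.search(markdown):
        problems.append("unresolved rich-media tag")
    if REPLACEMENT_CHAR in markdown:
        problems.append("U+FFFD replacement character")
    if not has_source_frontmatter(markdown):
        problems.append("frontmatter missing source")
    return problems
```

An empty list means the file passes these three checks only; permission gaps still have to be listed for the user separately.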

Do NOT attempt

Verified dead-ends; retrying them only wastes the session. Full table with failure modes and root causes: references/permission-and-failure-boundaries.md. The top ones:
  • Bypassing `131006` permission-denied by any means (lark-cli / curl / anonymous browser): it is a server-side boundary.
  • Downloading docx embedded images via `docs +media-download`, `api …/drive/v1/medias/<t>/download` (with or without `extra`), or `schema drive.medias.download`: none work; lark-cli even mis-reports the real HTTP 400 as "empty JSON".
  • `WebFetch` against `open.feishu.cn/document/server-docs/...` for API specs: the backend is flaky; use `open.feishu.cn/llms-docs/zh-CN/llms-<module>.txt` instead (LLM-friendly, stable).
  • AppleScript/JXA `executeJavaScript`, Chrome CDP on port 9222: disabled/empty in this environment (browser path only).
  • Using `minimax-docx` to convert docx→md: it is a docx authoring tool; use the doc-to-markdown skill instead.

Bundled resources

  • `scripts/feishu_extract_refs.py`: deterministic reference-token extractor; the recursion engine's core. Run it on every fetched body to enumerate `<mention-doc>` / `<sheet>` / `<image>` / cross-tenant / Minutes / Tencent-Meeting references as JSON.
  • `scripts/restore_docx_headings.py`: for Path B. Reads true font sizes via python-docx, maps them to heading levels, and restores `w:shd` highlights to Obsidian `==…==`, without retyping body text.
  • `scripts/feishu_dom_capture.js`: Path D. Injectable end-to-end browser DOM capture.
  • `scripts/download_feishu_images.py`: Path D. SSR image extraction when browser automation is unavailable.
  • `scripts/build_feishu_markdown.py`: Path D. Renders a capture manifest into Markdown.
  • `scripts/check_heading_coverage.py`: coverage verification (both paths).
  • `references/lark-cli-api-extraction.md`: Path A full reference (commands, recursion, sheets, cross-tenant).
  • `references/feishu-minutes-transcript.md`: Path C native transcript API + scope auth.
  • `references/permission-and-failure-boundaries.md`: error codes + the full Do-NOT-attempt table.
  • `references/docx-export-to-markdown.md`: Path B faithful conversion procedure.
  • `references/browser-dom-fallback.md` + `references/browser-failure-rules.md`: Path D.
  • `references/capture-manifest.md`: manifest shape for `build_feishu_markdown.py`.

Next step

After extraction completes, the clean Markdown typically feeds the user's own knowledge-base ingestion (filing, indexing, dedup) — which is deliberately out of this skill's scope. If the source went through Path B (a docx), the doc-to-markdown skill is already part of that flow. Offer the handoff; do not auto-organize:
Extraction complete: [N] sources → faithful Markdown ([M] permission/image gaps listed).

Options:
A) Hand off to your PKM/organizing workflow — file & index these (Recommended if part of a vault)
B) Run /daymade-docs:docs-cleaner — consolidate redundant content across the extracted files
C) Stop here — the faithful Markdown is the deliverable