feishu-doc-scraper

Feishu Doc Scraper

Extract a Feishu/Lark source into faithful local Markdown. Prefer the lark-cli API — it extracts the body programmatically (no model paraphrasing), follows a collection's reference graph, and reads permission boundaries from error codes instead of guessing. Treat the rendered browser page as a fallback, not the source of truth: in real collection-scraping work the API path consistently does the whole job while the browser path is never needed.

Scope (read this first)

This skill's contract is faithful per-source Markdown + a record of what was extracted. It does not decide how the resulting files are named, indexed, deduplicated against existing notes, or organized into a knowledge base — that belongs to the host PKM / the user's own conventions. Stopping at faithful extraction keeps this skill orthogonal and reusable. When the user wants the output filed into a vault, extract first, then hand the clean Markdown to their organizing workflow.

Choose the path

```
Is the source a Feishu/Lark URL (wiki / docx / sheets / minutes / base)?
├── YES → is lark-cli installed and authenticated to that tenant?
│        ├── YES → PATH A: lark-cli API extraction  (primary — start here)
│        │         └── hit code 131006 / 99991679 (permission denied)?
│        │              └── PATH B: owner-exported .docx → faithful Markdown
│        └── NO  → install/auth lark-cli first (it is worth it); only if
│                  truly impossible → PATH D: browser DOM fallback
├── the URL is a Minutes / 妙记 link, or a doc references one → PATH C: Minutes transcript
└── you were handed an exported .docx (not a URL) → PATH B
```

A collection/hub is just a docx whose body references other docs — Path A handles it by recursively following the reference graph, not by visiting pages in a browser.

Path A — lark-cli API extraction (primary)

Full command catalog, recursion engine, cross-tenant and personal-space nuances: references/lark-cli-api-extraction.md. The essentials for the common case:

**1. Disable the proxy for Feishu domestic domains.** Feishu's `*.feishu.cn` endpoints are direct-connect in mainland China; routing them through a local proxy leaks credentials through the proxy and gets DNS-hijacked. lark-cli itself warns about this. Always:

```bash
export LARK_CLI_NO_PROXY=1
```

This does not conflict with any "Claude/Anthropic domains must use the proxy" rule: Feishu is a different host and is reached directly.

**2. Classify the URL, then resolve to a fetchable doc token.**

- `…/wiki/<node_token>`: a wiki node token is not a doc token. Resolve it first:

  ```bash
  lark-cli wiki spaces get_node --params '{"token":"<node_token>"}'
  # → .data.node.obj_token  and  .data.node.obj_type  (e.g. "docx")
  ```

- `…/docx/<doc_token>`: already a doc token, fetch directly.
- `…/sheets/<token>`: spreadsheet, use the sheets commands (see reference).
- `…/minutes/<token>`: Minutes, go to Path C.
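The classification in step 2 can be sketched in Python as below. This is a minimal illustration, not part of the skill: `classify_feishu_url` is a hypothetical helper, and it assumes the kind (`wiki`, `docx`, `sheets`, `minutes`) appears as a path segment immediately followed by the token, as in the URL shapes listed above.

```python
from urllib.parse import urlparse

# Path segments named in step 2; anything else is left unclassified.
KNOWN_KINDS = ("wiki", "docx", "sheets", "minutes")


def classify_feishu_url(url: str) -> tuple[str, str]:
    """Return (kind, token) for a Feishu/Lark URL, e.g. ("wiki", "<node_token>")."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Look for a known kind that is followed by another segment (the token).
    for i, segment in enumerate(segments[:-1]):
        if segment in KNOWN_KINDS:
            return segment, segments[i + 1]
    raise ValueError(f"unrecognized Feishu URL shape: {url}")
```

A `wiki` result still needs the `get_node` resolution before fetching; the helper only decides which branch to take.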

**3. Fetch the body as Markdown, programmatically, never via the model.**

```bash
lark-cli docs +fetch --doc <obj_token> --format json > /tmp/fetch.json 2> /tmp/fetch.err
# body is .data.markdown: extract with jq, do NOT retype or summarize it
jq -r '.data.markdown' /tmp/fetch.json > source.md
```

Keep stdout and stderr separate. A harmless `[deprecated] docs +fetch with v1 API is deprecated` goes to stderr; piping `2>/dev/null` *and* `jq` together produced a false `Exit code 5` in practice — redirect to files and inspect, don't blind-pipe. The body must reach disk without passing through the model (paraphrasing silently corrupts source text — this is the single most important fidelity rule).
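Where `jq` is unavailable, the same disk-only extraction can be sketched in Python. This is an illustrative equivalent, not a bundled script: `write_body` is a hypothetical name, and the only response field it relies on is `.data.markdown` as described above. It refuses to write a file when that field is missing, so a failed fetch is not mistaken for an empty document.

```python
import json
from pathlib import Path


def write_body(fetch_json: str, out_path: str) -> int:
    """Copy .data.markdown from a saved fetch response to disk; return bytes written."""
    payload = json.loads(Path(fetch_json).read_text(encoding="utf-8"))
    body = (payload.get("data") or {}).get("markdown")
    if not isinstance(body, str):
        # No body: surface the response keys instead of writing an empty file.
        raise ValueError(f"no .data.markdown in {fetch_json}: keys={sorted(payload)}")
    data = body.encode("utf-8")
    Path(out_path).write_bytes(data)  # bytes go straight to disk, untouched
    return len(data)
```

Unlike `jq -r`, this sketch does not append a trailing newline; the body is written byte for byte.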

**4. If it's a collection/hub, follow the reference graph (BFS).** The hub body contains `<mention-doc>`, `<sheet>`, `<image>` tags and cross-tenant / Minutes / Tencent-Meeting URLs. Extract every reference, dispatch by type, fetch, and **repeat on each newly fetched doc until no new references remain** (leaf nodes). Use the bundled extractor so nothing is silently missed (a missed reference = a missing document, the #1 hub-scraping failure):

```bash
python3 scripts/feishu_extract_refs.py source.md   # → JSON list of {type, token, title}
```

The recursion loop, dispatch table, and the cross-tenant / `my.feishu.cn` personal-space rules are in the reference.
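The breadth-first traversal in step 4 can be sketched as follows. This is not the recursion engine from the reference: `crawl_references`, `fetch_body`, and `extract_refs` are hypothetical names, the two callables stand in for the fetch step and for `feishu_extract_refs.py`, and dispatch by reference type is deliberately omitted. The only assumption about the extractor's output is that each item carries a `token` key, as in the `{type, token, title}` shape noted above.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_references(
    root_token: str,
    fetch_body: Callable[[str], str],
    extract_refs: Callable[[str], Iterable[dict]],
) -> dict[str, str]:
    """Breadth-first walk of the reference graph; returns {token: body}."""
    bodies: dict[str, str] = {}
    queue = deque([root_token])
    seen = {root_token}
    while queue:
        token = queue.popleft()
        body = fetch_body(token)        # hypothetical: wraps the fetch step
        bodies[token] = body
        for ref in extract_refs(body):  # hypothetical: wraps the bundled extractor
            child = ref["token"]
            if child not in seen:       # fetch each doc once, so cycles terminate
                seen.add(child)
                queue.append(child)
    return bodies
```

The `seen` set is what makes the loop stop at leaf nodes even when documents reference each other or point back at the hub.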

**5. Final residual-tag check (acceptance gate for collections).** Every rich-media reference must have been resolved and rendered:

```bash
grep -rlE '<(lark-table|lark-tr|sheet token=|mention-doc|view type=)' . && echo "UNRESOLVED — keep recursing" || echo "clean"
```

The grep must list no files (the command prints `clean`) before you stop.
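The same gate can be expressed in Python when the result needs to be consumed by a script rather than read from a terminal. A minimal sketch: `unresolved_files` is a hypothetical helper, the pattern is copied from the grep above, and, unlike the grep, it only inspects `*.md` files, which is an assumption about where the output lives.

```python
import re
from pathlib import Path

# Same alternation as the grep pattern in step 5.
RESIDUAL = re.compile(r"<(lark-table|lark-tr|sheet token=|mention-doc|view type=)")


def unresolved_files(root: str) -> list[str]:
    """List Markdown files under root that still contain an unresolved rich-media tag."""
    hits = []
    for path in sorted(Path(root).rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if RESIDUAL.search(text):
            hits.append(str(path.relative_to(root)))
    return hits
```

An empty list corresponds to the `clean` outcome; any entry means the recursion is not finished for that file.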

Path B — permission denied → owner-exported .docx

`lark-cli wiki spaces get_node` returning `code 131006 … node permission denied, user needs read permission` (or fetch returning it) is a hard Feishu-side boundary. lark-cli, anonymous curl, and the browser all fail against it; this has been verified exhaustively, so do not spend cycles trying to bypass it. The only correct move: ask the permission holder to export the doc as `.docx` and send it back out-of-band, then convert with fidelity (font-size→heading and `w:shd`→highlight restoration, then visual verification). Full procedure: references/docx-export-to-markdown.md.
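The font-size→heading idea can be illustrated with a small pure function. This is a sketch under stated assumptions, not the logic of the bundled `restore_docx_headings.py`: `heading_levels` is a hypothetical name, the input is assumed to be already-extracted `(text, font_size_pt, is_bold)` tuples, the most frequent size is assumed to be body text, and headings are assumed to be bold and strictly larger than body text.

```python
from collections import Counter


def heading_levels(paragraphs: list[tuple[str, float, bool]]) -> list[int]:
    """Map (text, font_size_pt, is_bold) paragraphs to heading levels; 0 means body text."""
    if not paragraphs:
        return []
    # Assumption: the most frequent font size is the body size.
    body_size = Counter(size for _, size, _ in paragraphs).most_common(1)[0][0]
    # Assumption: headings are bold and strictly larger than body text.
    heading_sizes = sorted(
        {size for _, size, bold in paragraphs if bold and size > body_size},
        reverse=True,
    )
    # Largest heading size becomes level 1, the next level 2, and so on.
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [level_of.get(size, 0) if bold else 0 for _, size, bold in paragraphs]
```

Because the levels are derived from relative sizes rather than Word styles, the result still needs the visual verification step: a document with unusual sizing can produce a plausible but wrong hierarchy.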

Path C — Feishu Minutes (妙记) transcript

`lark-cli minutes` only returns metadata and can download audio/video; it cannot export the text transcript. The transcript comes from a native endpoint called through `lark-cli api`, and needs an extra scope granted via a device-flow login. Native AI transcription is far better than downloading the media and re-running ASR; never do the latter. Endpoint, scope name, the device-flow timeout trap, and per-minute (not per-tenant) permission behavior: references/feishu-minutes-transcript.md.

Path D — browser DOM fallback (last resort)

Only when lark-cli genuinely cannot reach the content (no install possible, and the doc is not permission-walled). This is the old virtual-scroll / TOC-driven DOM capture workflow. It is slower, depends on a connected browser surface (the in-browser extension frequently fails to connect), and an anonymous debugging Chrome can only tell you whether a page is publicly reachable; it cannot read login-walled content. Workflow: references/browser-dom-fallback.md. Battle-tested DOM rules (virtual scroll, `data-block-id` ordering, table/bullet extraction, image streams): references/browser-failure-rules.md.

Hard rules

These are the rules whose violation silently ruins the output. Each has a reason — follow the reason, not just the letter.

- Never let the document body pass through the model. Extract with `jq` / `cat` / scripts straight to disk. The model paraphrasing source text is undetectable later and destroys fidelity. This is why Path A beats the browser path structurally.
- `export LARK_CLI_NO_PROXY=1` for `*.feishu.cn`. Otherwise credentials transit a local proxy and DNS is hijacked.
- Transcripts come from the platform's native transcription, never re-ASR. Downloading media and transcribing again loses speaker labels, timestamps, and accuracy.
- Markdown generated from a docx is not done until it has been visually verified against the source (render to image, read it). Feishu-exported docx uses font-size+bold for headings rather than Word heading styles, so a "no errors, word count matches" check passes while the entire heading hierarchy is silently flat. Text-level checks cannot catch this.
- Do not 死磕 (grind) on docx embedded-image download. lark-cli (through 1.0.32) cannot download `<image>` tokens from a docx; this has been exhaustively verified. Register the image tokens and note "needs document owner to right-click → save"; the text is the value, images are a tracked gap.
- HTTP 200 from anonymous curl ≠ accessible. A Feishu login wall returns 200 with a body containing `accounts.feishu.cn` / `login` / `passport` / an empty `<title>`. Check the body, never infer "public" from the status code.
- A file "not found" by a search agent is not authoritative. Verify against authoritative sources before concluding (this is general Inference Discipline; relevant when locating where ingested content already lives).
- U+FFFD final check on every produced file: `LC_ALL=C grep -rl $'\xef\xbf\xbd' .` must be empty. A replacement character means an encoding step corrupted the text.
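The "HTTP 200 ≠ accessible" rule can be sketched as a body check. This is a deliberately conservative heuristic built only from the markers listed above; `looks_like_login_wall` is a hypothetical helper, and a public page that merely mentions "login" will also be flagged, which errs on the side of not declaring a page public.

```python
import re

# Markers named in the rule above; matching is case-insensitive.
LOGIN_MARKERS = ("accounts.feishu.cn", "login", "passport")
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_login_wall(body: str) -> bool:
    """Heuristic: True if a fetched HTML body shows any login-wall marker."""
    lowered = body.lower()
    if any(marker in lowered for marker in LOGIN_MARKERS):
        return True
    # An empty <title> element is the remaining signal.
    match = TITLE.search(body)
    return match is not None and match.group(1).strip() == ""
```

A `False` result only means none of the known markers were seen; it is a reason to keep checking the content, not proof that the page is public.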

Acceptance contract

Stop only when all that apply are true:

- Every fetched body reached disk via `jq` / script, not retyped by the model.
- Collections: the residual rich-media-tag grep (Path A step 5) is empty; every `mention-doc` / `sheet` / cross-tenant reference was followed to a leaf.
- `LC_ALL=C grep -rl $'\xef\xbf\xbd' .` is empty.
- docx path: rendered to an image and visually compared to the source; heading hierarchy and highlights match (see the docx reference's checklist).
- Browser fallback only: TOC coverage + scale check (see browser-failure-rules.md).
- Each output file's frontmatter records `source` (the original URL/token) and, if any post-processing was applied, a `post_process` provenance line.
- Permission gaps (131006 docs not exported yet, undownloadable images) are explicitly listed for the user; a transparent gap beats a silent omission.
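Two of the mechanical checks in this contract, the U+FFFD scan and the `source` frontmatter field, can be sketched together. This is illustrative only: `acceptance_gaps` is a hypothetical helper, and it assumes frontmatter is a leading block delimited by `---` lines with a `source:` key, which the text above does not specify. The visual and coverage checks are out of its reach.

```python
from pathlib import Path

REPLACEMENT = "\ufffd".encode("utf-8")  # the bytes EF BF BD matched by the grep


def acceptance_gaps(root: str) -> dict[str, list[str]]:
    """Return {relative_path: [problems]} for Markdown files failing either check."""
    gaps: dict[str, list[str]] = {}
    for path in sorted(Path(root).rglob("*.md")):
        raw = path.read_bytes()
        problems = []
        if REPLACEMENT in raw:
            problems.append("contains U+FFFD")
        lines = raw.decode("utf-8", errors="replace").splitlines()
        # Assumption: frontmatter is a leading block delimited by '---' lines.
        in_frontmatter = bool(lines) and lines[0].strip() == "---"
        has_source = False
        for line in lines[1:] if in_frontmatter else []:
            if line.strip() == "---":
                break
            if line.startswith("source:"):
                has_source = True
        if not has_source:
            problems.append("frontmatter has no source")
        if problems:
            gaps[str(path.relative_to(root))] = problems
    return gaps
```

An empty dictionary means both checks passed for every Markdown file under `root`; anything else is a list of files to fix or to report as gaps.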

Do NOT attempt

Verified dead-ends — retrying them only wastes the session. Full table with failure modes and root causes: references/permission-and-failure-boundaries.md. The top ones:

- Bypassing `131006` permission-denied by any means (lark-cli / curl / anonymous browser): it is a server-side boundary.
- Downloading docx embedded images via `docs +media-download`, `api …/drive/v1/medias/<t>/download` (with or without `extra`), or `schema drive.medias.download`: none work; lark-cli even mis-reports the real HTTP 400 as "empty JSON".
- `WebFetch` against `open.feishu.cn/document/server-docs/...` for API specs: the backend is flaky; use `open.feishu.cn/llms-docs/zh-CN/llms-<module>.txt` instead (LLM-friendly, stable).
- AppleScript/JXA `executeJavaScript`, Chrome CDP on port 9222: disabled/empty in this environment (browser path only).
- Using `minimax-docx` to convert docx→md: it is a docx authoring tool; use the doc-to-markdown skill instead.

Bundled resources

- `scripts/feishu_extract_refs.py`: deterministic reference-token extractor; the recursion engine's core. Run it on every fetched body to enumerate `<mention-doc>` / `<sheet>` / `<image>` / cross-tenant / Minutes / Tencent-Meeting references as JSON.
- `scripts/restore_docx_headings.py`: for Path B. Reads true font sizes via python-docx, maps them to heading levels, and restores `w:shd` highlights to Obsidian `==…==`, without retyping body text.
- `scripts/feishu_dom_capture.js`: Path D, injectable end-to-end browser DOM capture.
- `scripts/download_feishu_images.py`: Path D, SSR image extraction when browser automation is unavailable.
- `scripts/build_feishu_markdown.py`: Path D, renders a capture manifest into Markdown.
- `scripts/check_heading_coverage.py`: coverage verification (both paths).
- `references/lark-cli-api-extraction.md`: Path A full reference (commands, recursion, sheets, cross-tenant).
- `references/feishu-minutes-transcript.md`: Path C native transcript API + scope auth.
- `references/permission-and-failure-boundaries.md`: error codes + the full Do-NOT-attempt table.
- `references/docx-export-to-markdown.md`: Path B faithful conversion procedure.
- `references/browser-dom-fallback.md`, `references/browser-failure-rules.md`: Path D.
- `references/capture-manifest.md`: manifest shape for `build_feishu_markdown.py`.

Next step

After extraction completes, the clean Markdown typically feeds the user's own knowledge-base ingestion (filing, indexing, dedup) — which is deliberately out of this skill's scope. If the source went through Path B (a docx), the doc-to-markdown skill is already part of that flow. Offer the handoff; do not auto-organize:

Extraction complete: [N] sources → faithful Markdown ([M] permission/image gaps listed).

Options:
A) Hand off to your PKM/organizing workflow — file & index these (Recommended if part of a vault)
B) Run /daymade-docs:docs-cleaner — consolidate redundant content across the extracted files
C) Stop here — the faithful Markdown is the deliverable
