Feishu Doc Scraper
Extract a Feishu/Lark source into faithful local Markdown. Prefer the lark-cli API — it extracts the body programmatically (no model paraphrasing), follows a collection's reference graph, and reads permission boundaries from error codes instead of guessing. Treat the rendered browser page as a fallback, not the source of truth: in real collection-scraping work the API path consistently does the whole job while the browser path is never needed.
Scope (read this first)
This skill's contract is faithful per-source Markdown + a record of what was extracted. It does not decide how the resulting files are named, indexed, deduplicated against existing notes, or organized into a knowledge base — that belongs to the host PKM / the user's own conventions. Stopping at faithful extraction keeps this skill orthogonal and reusable. When the user wants the output filed into a vault, extract first, then hand the clean Markdown to their organizing workflow.
Choose the path
Is the source a Feishu/Lark URL (wiki / docx / sheets / minutes / base)?
├── YES → is lark-cli installed and authenticated to that tenant?
│ ├── YES → PATH A: lark-cli API extraction (primary — start here)
│ │ └── hit code 131006 / 99991679 (permission denied)?
│ │ └── PATH B: owner-exported .docx → faithful Markdown
│ └── NO → install/auth lark-cli first (it is worth it); only if
│ truly impossible → PATH D: browser DOM fallback
├── the URL is a Minutes / 妙记 link, or a doc references one → PATH C: Minutes transcript
└── you were handed an exported .docx (not a URL) → PATH B
A collection/hub is just a docx whose body references other docs — Path A handles it by recursively following the reference graph, not by visiting pages in a browser.
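The decision tree above can be sketched as a small pure function. This is an illustration only, not part of the skill: the Path enum, choose_path, and its parameter names are hypothetical, and only the two permission-denied codes are taken from the tree itself. Checking for Minutes before checking lark-cli readiness is also an assumption about branch order.

```python
from enum import Enum
from typing import Optional


class Path(Enum):
    A = "lark-cli API extraction"
    B = "owner-exported .docx"
    C = "Minutes transcript"
    D = "browser DOM fallback"


# Permission-denied codes named in the decision tree.
PERMISSION_DENIED = {131006, 99991679}


def choose_path(kind: str, cli_ready: bool = True,
                cli_installable: bool = True,
                error_code: Optional[int] = None) -> Path:
    """Pick an extraction path (hypothetical helper mirroring the tree).

    kind is "exported_docx", "minutes", or any other Feishu URL kind.
    """
    if kind == "exported_docx":
        return Path.B  # handed a file, not a URL
    if kind == "minutes":
        return Path.C
    if not cli_ready and not cli_installable:
        return Path.D  # last resort only
    if error_code in PERMISSION_DENIED:
        return Path.B  # hard boundary: ask the owner for an export
    return Path.A  # primary path (install/auth lark-cli first if needed)


print(choose_path("wiki").name)
print(choose_path("docx", error_code=131006).name)
```

Keeping the choice in one function makes the fallback order explicit: Path D is reachable only when lark-cli is neither ready nor installable.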
Path A — lark-cli API extraction (primary)
Full command catalog, recursion engine, cross-tenant and personal-space nuances: references/lark-cli-api-extraction.md. The essentials for the common case:
1. Disable the proxy for Feishu domestic domains. Feishu's endpoints are direct-connect in mainland China; routing them through a local proxy exposes credentials to that proxy and invites DNS hijacking. lark-cli itself warns about this. Always:
bash
export LARK_CLI_NO_PROXY=1
This does not conflict with any "Claude/Anthropic domains must use the proxy" rule — Feishu is a different host and is direct.
2. Classify the URL, then resolve to a fetchable doc token.
- Wiki URL: a wiki node token is not a doc token. Resolve it first:
bash
lark-cli wiki spaces get_node --params '{"token":"<node_token>"}'
# → .data.node.obj_token and .data.node.obj_type (e.g. "docx")
- Docx URL: already a doc token, fetch directly.
- Sheets URL: spreadsheet, use the sheets commands (see reference).
- Minutes URL: Minutes, go to Path C.
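The classification in step 2 can be sketched as follows. This is a minimal sketch under a stated assumption: that the URL path has the form /<kind>/<token>, with the kinds taken from the decision tree (wiki, docx, sheets, minutes, base). The function names and the example.feishu.cn host are hypothetical.

```python
from urllib.parse import urlparse

# URL kinds listed in the decision tree; the path layout is an assumption.
KNOWN_KINDS = ("wiki", "docx", "sheets", "minutes", "base")


def classify_feishu_url(url: str):
    """Return (kind, token), or (None, None) if no known kind is found."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    for index, segment in enumerate(segments[:-1]):
        if segment in KNOWN_KINDS:
            return segment, segments[index + 1]
    return None, None


def needs_node_resolution(kind) -> bool:
    """A wiki token is a node token and must be resolved before fetching."""
    return kind == "wiki"


kind, token = classify_feishu_url("https://example.feishu.cn/wiki/AbCd1234?from=hub")
print(kind, token, needs_node_resolution(kind))
```

Parsing with urlparse rather than a regex keeps query strings and fragments out of the token.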
3. Fetch the body as Markdown — programmatically, never via the model.
bash
lark-cli docs +fetch --doc <obj_token> --format json > /tmp/fetch.json 2> /tmp/fetch.err
# body is .data.markdown — extract with jq, do NOT retype or summarize it
jq -r '.data.markdown' /tmp/fetch.json > source.md
Keep stdout and stderr separate. A harmless "[deprecated] docs +fetch with v1 API is deprecated" notice goes to stderr; piping stdout and stderr together produced a false failure in practice, so redirect each stream to a file and inspect it rather than blind-piping. The body must reach disk without passing through the model: paraphrasing silently corrupts source text, which makes this the single most important fidelity rule.
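The stream-separation rule in step 3 can be illustrated in Python. This is a sketch, not the skill's own tooling: extract_markdown and save_body are hypothetical helpers, and the only things taken from the text are the .data.markdown field and the deprecation notice. The sample payload is made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def extract_markdown(stdout_text: str, stderr_text: str = "") -> str:
    """Return the Markdown body from the fetch JSON on stdout.

    stderr is only used for the error message and is never parsed as
    JSON, so a deprecation notice there cannot corrupt the payload.
    """
    payload = json.loads(stdout_text)
    data = payload.get("data") if isinstance(payload, dict) else None
    body = data.get("markdown") if isinstance(data, dict) else None
    if not isinstance(body, str) or not body.strip():
        detail = stderr_text.strip() or "no stderr output"
        raise ValueError(f"no .data.markdown in fetch output ({detail})")
    return body


def save_body(body: str, out_path) -> Path:
    """Write the body verbatim; nothing is retyped or summarized."""
    out = Path(out_path)
    out.write_text(body, encoding="utf-8")
    return out


# Made-up sample streams standing in for a real lark-cli run.
stdout_text = json.dumps({"data": {"markdown": "# Title\n\nBody text.\n"}})
stderr_text = "[deprecated] docs +fetch with v1 API is deprecated"

with tempfile.TemporaryDirectory() as workdir:
    saved = save_body(extract_markdown(stdout_text, stderr_text),
                      Path(workdir) / "source.md")
    print(saved.read_text(encoding="utf-8") == "# Title\n\nBody text.\n")
```

Parsing the two streams concatenated would raise json.JSONDecodeError on the notice even though the fetch itself succeeded, which is the kind of false failure the rule guards against.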
4. If it's a collection/hub, follow the reference graph (BFS). The hub body contains mention-doc, sheet, and view tags (the same tags the residual check in step 5 looks for) plus cross-tenant / Minutes / Tencent-Meeting URLs. Extract every reference, dispatch by type, fetch, and repeat on each newly fetched doc until no new references remain (leaf nodes). Use the bundled extractor so nothing is silently missed (a missed reference = a missing document, the #1 hub-scraping failure):
bash
python3 scripts/feishu_extract_refs.py source.md # → JSON list of {type, token, title}
Recursion loop, dispatch table, and the cross-tenant/personal-space rules are in the reference.
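The traversal in step 4 is an ordinary breadth-first search with a visited set. The sketch below shows only that shape and is not the bundled recursion engine: crawl_references is a hypothetical name, fetch and extract_refs are caller-supplied stand-ins for the lark-cli fetch and the bundled extractor, dispatch by type is omitted, and the graph is a toy example.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_references(root_token: str,
                     fetch: Callable[[str], str],
                     extract_refs: Callable[[str], Iterable[dict]]) -> Dict[str, str]:
    """Breadth-first walk of the reference graph starting at root_token.

    fetch(token) returns a Markdown body; extract_refs(body) yields
    {type, token, title} dicts. Both are caller-supplied stand-ins.
    """
    bodies: Dict[str, str] = {}
    queue = deque([root_token])
    seen = {root_token}
    while queue:
        token = queue.popleft()
        body = fetch(token)
        bodies[token] = body
        for ref in extract_refs(body):
            child = ref["token"]
            if child not in seen:  # guards against cycles and duplicates
                seen.add(child)
                queue.append(child)
    return bodies


# Toy in-memory graph standing in for real docs: hub -> a, b; a -> b, hub.
GRAPH = {"hub": ["a", "b"], "a": ["b", "hub"], "b": []}


def fake_fetch(token: str) -> str:
    return " ".join(GRAPH[token])


def fake_extract(body: str) -> List[dict]:
    return [{"type": "docx", "token": t, "title": t} for t in body.split()]


print(sorted(crawl_references("hub", fake_fetch, fake_extract)))
```

The visited set is what makes the loop terminate on hubs whose children link back to the hub.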
5. Final residual-tag check (acceptance gate for collections). Every rich-media reference must have been resolved and rendered:
bash
grep -rlE '<(lark-table|lark-tr|sheet token=|mention-doc|view type=)' . && echo "UNRESOLVED — keep recursing" || echo "clean"
The grep must match no files (the command prints "clean") before you stop.
Path B — permission denied → owner-exported .docx
lark-cli wiki spaces get_node returning code 131006 … node permission denied, user needs read permission (or fetch returning it) is a hard Feishu-side boundary. lark-cli, anonymous curl, and the browser all fail it; this has been verified exhaustively, so do not spend cycles trying to bypass it. The only correct move: ask the permission holder to export the doc as .docx and send it back out-of-band, then convert with fidelity (font-size→heading mapping and highlight restoration, then visual verification). Full procedure: references/docx-export-to-markdown.md.
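The font-size→heading idea can be sketched on already-extracted paragraph data. This is an assumption-laden illustration, not the logic of the bundled restore script: the tuple layout, the to_markdown name, and the heuristic (most frequent size is body text, larger bold sizes rank as heading levels) are all hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Paragraph = Tuple[str, float, bool]  # (text, font size in pt, is bold)


def to_markdown(paragraphs: List[Paragraph], max_level: int = 6) -> str:
    """Map bold paragraphs larger than body text to Markdown headings."""
    if not paragraphs:
        return ""
    sizes = Counter(size for _, size, _ in paragraphs)
    body_size = sizes.most_common(1)[0][0]  # most frequent size = body text
    heading_sizes = sorted(
        {size for _, size, bold in paragraphs if bold and size > body_size},
        reverse=True,
    )
    level_of = {size: min(rank + 1, max_level)
                for rank, size in enumerate(heading_sizes)}
    lines = []
    for text, size, bold in paragraphs:
        if bold and size in level_of:
            lines.append("#" * level_of[size] + " " + text)
        else:
            lines.append(text)  # body text is copied through unchanged
    return "\n\n".join(lines)


sample = [
    ("Guide", 22.0, True),
    ("Setup", 16.0, True),
    ("Install the tool.", 11.0, False),
    ("Then log in.", 11.0, False),
    ("Note", 11.0, True),
]
print(to_markdown(sample))
```

Body text is never rewritten, only prefixed, so the conversion stays faithful; the result still needs the visual verification described above, because a size-based heuristic can misrank headings.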
Path C — Feishu Minutes (妙记) transcript
lark-cli's built-in Minutes support only returns metadata and can download audio/video; it cannot export the text transcript. The transcript comes from a native endpoint that has to be called separately and needs an extra scope granted via a device-flow login. Native AI transcription is far better than downloading the media and re-running ASR, so never do the latter. Endpoint, scope name, the device-flow timeout trap, and per-minute (not per-tenant) permission behavior: references/feishu-minutes-transcript.md.
Path D — browser DOM fallback (last resort)
Only when lark-cli genuinely cannot reach the content (no install possible, and the doc is not permission-walled). This is the old virtual-scroll / TOC-driven DOM capture workflow. It is slower and depends on a connected browser surface (the in-browser extension frequently fails to connect). An anonymous debugging Chrome can only tell you whether a page is publicly reachable; it cannot read login-walled content. Workflow: references/browser-dom-fallback.md. Battle-tested DOM rules (virtual scroll, ordering, table/bullet extraction, image streams): references/browser-failure-rules.md.
Hard rules
These are the rules whose violation silently ruins the output. Each has a reason — follow the reason, not just the letter.
- Never let the document body pass through the model. Extract with lark-cli, jq, and the bundled scripts straight to disk. Paraphrasing by the model is undetectable later and destroys fidelity. This is why Path A beats the browser path structurally.
- Always export LARK_CLI_NO_PROXY=1 for Feishu domestic domains. Otherwise credentials transit a local proxy and DNS can be hijacked.
- Transcripts come from the platform's native transcription, never re-ASR. Downloading media and transcribing again loses speaker labels, timestamps, and accuracy.
- A generated docx Markdown is not done until it has been visually verified against the source (render to image, read it). Feishu-exported docx uses font-size+bold for headings rather than Word heading styles, so a "no errors, word count matches" check passes while the entire heading hierarchy is silently flat. Text-level checks cannot catch this.
- Do not grind away at docx embedded-image download. lark-cli (through 1.0.32) cannot download image tokens from a docx; this has been exhaustively verified. Register the image tokens and note "needs document owner to right-click → save"; the text is the value, images are a tracked gap.
- HTTP 200 from anonymous curl ≠ accessible. A Feishu login wall also returns 200, but its body is a login page or an otherwise empty shell rather than the document. Check the body; never infer "public" from the status code.
- A file "not found" by a search agent is not authoritative. Verify against authoritative sources before concluding (this is general Inference Discipline; relevant when locating where ingested content already lives).
- U+FFFD final check on every produced file:
LC_ALL=C grep -rl $'\xef\xbf\xbd' .
must be empty. A replacement character means an encoding step corrupted the text.
Acceptance contract
Stop only when all that apply are true:
- Every fetched body reached disk via jq or a script, not retyped by the model.
- Collections: the residual rich-media-tag grep (Path A step 5) is empty, meaning every rich-media and cross-tenant reference was followed to a leaf.
- LC_ALL=C grep -rl $'\xef\xbf\xbd' . is empty.
- docx path: rendered to an image and visually compared to the source; heading hierarchy and highlights match (see docx reference's checklist).
- Browser fallback only: TOC coverage + scale check (see browser-failure-rules.md).
- Each output file's frontmatter records its source (the original URL/token) and, if any post-processing was applied, a provenance line.
- Permission gaps (131006 docs not exported yet, undownloadable images) are explicitly listed for the user — a transparent gap beats a silent omission.
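The two mechanical gates in this contract (residual rich-media tags and U+FFFD) can be combined into one scan. This is a sketch rather than a bundled script: check_outputs is a hypothetical name, the regex is copied from the Path A step 5 grep, and limiting the scan to .md files is an assumption (the greps in this document scan every file).

```python
import re
from pathlib import Path
from typing import Dict, List

# Same pattern as the residual-tag grep in Path A step 5.
RESIDUAL_TAG = re.compile(r"<(lark-table|lark-tr|sheet token=|mention-doc|view type=)")
REPLACEMENT_CHAR = "\ufffd"


def check_outputs(root: str) -> Dict[str, List[str]]:
    """Scan every .md file under root and report files that fail a gate."""
    failures: Dict[str, List[str]] = {"residual_tags": [], "replacement_chars": []}
    for path in sorted(Path(root).rglob("*.md")):
        # errors="replace" turns invalid bytes into U+FFFD, so they are flagged too
        text = path.read_text(encoding="utf-8", errors="replace")
        if RESIDUAL_TAG.search(text):
            failures["residual_tags"].append(str(path))
        if REPLACEMENT_CHAR in text:
            failures["replacement_chars"].append(str(path))
    return failures
```

Calling check_outputs on the output directory should return two empty lists before you stop; any listed file means either a reference was not followed to a leaf or an encoding step corrupted the text.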
Do NOT attempt
Verified dead-ends — retrying them only wastes the session. Full table with failure modes and root causes: references/permission-and-failure-boundaries.md. The top ones:
- Bypassing permission-denied by any means (lark-cli / curl / anonymous browser) — it is a server-side boundary.
- Downloading docx embedded images through lark-cli, including api …/drive/v1/medias/<t>/download and schema drive.medias.download: none work; lark-cli even mis-reports the real HTTP 400 as "empty JSON".
- Fetching open.feishu.cn/document/server-docs/... for API specs: the backend is flaky; use open.feishu.cn/llms-docs/zh-CN/llms-<module>.txt instead (LLM-friendly, stable).
- AppleScript/JXA browser control and Chrome CDP on port 9222: disabled/empty in this environment (browser path only).
- Converting docx→md with a docx authoring tool: use the doc-to-markdown skill instead.
Bundled resources
- scripts/feishu_extract_refs.py: deterministic reference-token extractor; the recursion engine's core. Run it on every fetched body to enumerate rich-media-tag, cross-tenant, Minutes, and Tencent-Meeting references as JSON.
- scripts/restore_docx_headings.py: for Path B. Reads true font sizes via python-docx, maps them to heading levels, and restores highlights to Obsidian ==highlight== syntax, without retyping body text.
- scripts/feishu_dom_capture.js: Path D, injectable end-to-end browser DOM capture.
- scripts/download_feishu_images.py: Path D, SSR image extraction when browser automation is unavailable.
- scripts/build_feishu_markdown.py: Path D, renders a capture manifest into Markdown.
- scripts/check_heading_coverage.py: coverage verification (both paths).
- references/lark-cli-api-extraction.md: Path A full reference (commands, recursion, sheets, cross-tenant).
- references/feishu-minutes-transcript.md: Path C native transcript API + scope auth.
- references/permission-and-failure-boundaries.md: error codes + the full Do-NOT-attempt table.
- references/docx-export-to-markdown.md: Path B faithful conversion procedure.
- references/browser-dom-fallback.md + references/browser-failure-rules.md: Path D.
- references/capture-manifest.md: shape of the capture manifest (Path D).
Next step
After extraction completes, the clean Markdown typically feeds the user's own knowledge-base ingestion (filing, indexing, dedup) — which is deliberately out of this skill's scope. If the source went through Path B (a docx), the doc-to-markdown skill is already part of that flow. Offer the handoff; do not auto-organize:
Extraction complete: [N] sources → faithful Markdown ([M] permission/image gaps listed).
Options:
A) Hand off to your PKM/organizing workflow — file & index these (Recommended if part of a vault)
B) Run /daymade-docs:docs-cleaner — consolidate redundant content across the extracted files
C) Stop here — the faithful Markdown is the deliverable