
/extract — Knowledge Extraction Skill


When NOT to use this skill


  • User wants a summary, digest, or TL;DR → use summarize instead
  • Content is news, announcements, or product launches with no methodology → decline and explain
  • User wants real-time data, prices, or live information → use web_search instead
  • Content is purely promotional/marketing with no transferable insight → say so, don't extract


Quick Reference


| Source | Method |
| --- | --- |
| YouTube | yt-dlp subtitles → Groq audio fallback |
| Podcast / direct audio | yt-dlp download → Groq transcription |
| X/Twitter thread | X API v2 (X_BEARER_TOKEN required) |
| Web article | WebFetch tool |
| Local file / PDF | Read tool |
| Paywalled content | Extract what's accessible, note the wall |

Output always starts with `Source: [title] — [URL]`, then knowledge by category.


Workflow

Step 1: Identify source type and fetch content

**YouTube video** (youtube.com or youtu.be):

```bash
TMPDIR=$(mktemp -d)

# Step 1: try subtitle extraction — use SRT conversion to handle auto-caption dedup
yt-dlp --write-auto-sub --sub-lang "en" --skip-download --convert-subs srt \
  -o "${TMPDIR}/sub" "<url>" 2>/dev/null
TRANSCRIPT=$(cat ${TMPDIR}/sub*.srt 2>/dev/null | grep -v "^[0-9]" \
  | grep -v "^-->" | grep -v "^$" | sed 's/<[^>]*>//g' | tr '\n' ' ')

# Step 2: if no subtitles, transcribe via Groq
if [ -z "$TRANSCRIPT" ]; then
  if [ -z "$GROQ_API_KEY" ]; then
    echo "ERROR: No subtitles found and GROQ_API_KEY is not set. Cannot transcribe."
    rm -rf "$TMPDIR"
    exit 1
  fi
  yt-dlp -x --audio-format mp3 --audio-quality 9 \
    -o "${TMPDIR}/audio.%(ext)s" "<url>" 2>/dev/null
  if [ ! -f "${TMPDIR}/audio.mp3" ]; then
    echo "ERROR: Audio download failed."
    rm -rf "$TMPDIR"
    exit 1
  fi

  # Check file size — Groq limit: 25MB free tier, 100MB dev tier
  FILESIZE=$(stat -f%z "${TMPDIR}/audio.mp3" 2>/dev/null \
    || stat -c%s "${TMPDIR}/audio.mp3" 2>/dev/null)
  if [ "$FILESIZE" -gt 100000000 ]; then
    echo "ERROR: Audio too large for Groq ($(($FILESIZE/1048576))MB, limit 100MB). Try a shorter video."
    rm -rf "$TMPDIR"
    exit 1
  fi
  TRANSCRIPT=$(curl -s https://api.groq.com/openai/v1/audio/transcriptions \
    -H "Authorization: Bearer $GROQ_API_KEY" \
    -F "file=@${TMPDIR}/audio.mp3" \
    -F "model=whisper-large-v3-turbo" | jq -r '.text')
fi
rm -rf "$TMPDIR"
```

If both paths fail: tell the user exactly what failed and stop.

---

**Podcast / direct audio** (MP3/M4A URL, SoundCloud, podcast episode):

yt-dlp handles most audio URLs natively. Use the same Groq transcription path as the YouTube fallback (requires GROQ_API_KEY). For RSS feeds: extract the episode `<enclosure>` URL first, then treat as direct audio.
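The RSS path can be sketched as follows. The sample feed and URL are hypothetical stand-ins, and the grep/sed extraction is a rough heuristic, not a full XML parse:

```bash
# Sample feed stands in for `curl -s "$FEED_URL"` (network call omitted in this sketch)
FEED='<rss><channel><item>
<enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/>
</item></channel></rss>'

# Pull the first episode's <enclosure url="..."> value
AUDIO_URL=$(printf '%s\n' "$FEED" \
  | grep -oE '<enclosure[^>]*url="[^"]+"' \
  | head -1 \
  | sed 's/.*url="\([^"]*\)".*/\1/')

# Then treat $AUDIO_URL as direct audio (yt-dlp download → Groq transcription)
echo "$AUDIO_URL"
```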

---

**X/Twitter thread** (x.com or twitter.com):

X blocks all unauthenticated access. Requires X_BEARER_TOKEN in environment.
X API uses pay-per-use credits — each call costs credits from your balance.

If X_BEARER_TOKEN is not set: ask the user to paste the thread text directly.
"X requires a paid API for access. Paste the thread text and I'll extract from that."

```bash
if [ -z "$X_BEARER_TOKEN" ]; then
  echo "X_BEARER_TOKEN is not set. Ask user to paste thread text."
  exit 1
fi

TWEET_ID=$(echo "<url>" | grep -oE '[0-9]{15,}' | tail -1)

# Fetch root tweet with conversation context
# (the two curl calls below are reconstructed from the standard X API v2
#  tweet-lookup and recent-search endpoints; adjust if your plan differs)
TWEET_DATA=$(curl -s "https://api.x.com/2/tweets/${TWEET_ID}?tweet.fields=conversation_id,author_id" \
  -H "Authorization: Bearer $X_BEARER_TOKEN")
CONV_ID=$(echo "$TWEET_DATA" | jq -r '.data.conversation_id')
AUTHOR_ID=$(echo "$TWEET_DATA" | jq -r '.data.author_id')

# Fetch thread (search/recent — last 7 days only)
THREAD=$(curl -s "https://api.x.com/2/tweets/search/recent?query=conversation_id:${CONV_ID}&max_results=100&tweet.fields=author_id" \
  -H "Authorization: Bearer $X_BEARER_TOKEN")

# Filter to author only, reverse to chronological
THREAD_TWEETS=$(echo "$THREAD" | jq -r --arg aid "$AUTHOR_ID" \
  '[.data[] | select(.author_id == $aid)] | reverse | .[].text')
```

If the thread is older than 7 days (search returns empty): ask the user to paste the thread text.
"This thread is older than 7 days — X's search API can't reach it. Paste the thread text and I'll extract from that."

Reconstruct thread as sequential blockquotes before extracting.
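The blockquote reconstruction can be sketched as below; the sample tweets are hypothetical stand-ins for $THREAD_TWEETS:

```bash
# Hypothetical stand-in for $THREAD_TWEETS (one or more lines of tweet text)
THREAD_TWEETS='First tweet of the thread.
Second tweet with the key claim.'

# Prefix every line with "> " to turn the thread into sequential blockquotes
BLOCKQUOTED=$(printf '%s\n' "$THREAD_TWEETS" | sed 's/^/> /')
printf '%s\n' "$BLOCKQUOTED"
```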

---

**Web article** (any HTTP/HTTPS URL, not YouTube or X):
Use WebFetch: "Return the complete text content of this page. Preserve all details, quotes, examples, and structure. Do not summarize."

If WebFetch returns garbage (login wall, JS-only rendering): ask the user to paste the article text.

**Paywalled content:** Note it clearly at the top:
> Warning: Paywalled — only the free preview was accessible. Extraction is based on partial content.

Then extract whatever is accessible. Do not fabricate beyond the paywall.

**Local file / PDF**: Use the Read tool directly.

---

Step 2: Assess content quality


Before extracting, scan the raw content:
  • Shallow (listicle, hype, no real methodology) — say so in 1-2 lines. Don't manufacture depth.
  • News/announcement-only — extract only if real methodology is buried in it.
  • Substantial — proceed.

Step 3: Extract knowledge


Target depth by insight density, not raw length.
First, assess density:
  • High density (every minute has new ideas): technical talks, dense essays, practitioner deep-dives
  • Medium density (mixed signal/filler): most interviews, conference talks, long-form articles
  • Low density (mostly filler, few real insights): casual podcasts, rambling discussions
| Content length | High density | Medium density | Low density |
| --- | --- | --- | --- |
| Short (<30m / short article) | 1,000–1,500 | 800–1,200 | 500–800 |
| Medium (30–60m / long article) | 2,000–3,000 | 1,500–2,000 | 800–1,200 |
| Long (1–2hr) | 3,000–5,000 | 2,000–3,000 | 1,000–1,500 |
| Very long (2hr+) | 5,000–7,000 | 3,000–4,000 | 1,500–2,500 |

These are word-count floors. Dense content warrants more; below the floor you're under-extracting. Low density + short content may not be worth extracting at all — say so.
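The floor lookup can be sketched as a simple function; tier names are shorthand, and the values are the lower bounds from the table above:

```bash
# Word-count floors keyed by "<length-tier>,<density-tier>" (values from the table above)
floor_words() {
  case "$1" in
    short,high)      echo 1000 ;;
    short,medium)    echo 800  ;;
    short,low)       echo 500  ;;
    medium,high)     echo 2000 ;;
    medium,medium)   echo 1500 ;;
    medium,low)      echo 800  ;;
    long,high)       echo 3000 ;;
    long,medium)     echo 2000 ;;
    long,low)        echo 1000 ;;
    verylong,high)   echo 5000 ;;
    verylong,medium) echo 3000 ;;
    verylong,low)    echo 1500 ;;
  esac
}

floor_words long,medium   # lower bound for a 1–2hr, medium-density source
```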


Extraction Framework


Use whichever categories are present. Skip empty ones.

Mental Models & Frameworks

Ways of thinking. Decision heuristics. How experts frame situations differently.

Systematic Methods & Processes

Step-by-step techniques. Playbooks. Include sequence and reasoning behind each step.

Specific Techniques & Tactics

Named techniques, scripts, templates, prompt structures. Concrete and immediately applicable.

Key Numbers & Benchmarks

Every specific statistic, threshold, ratio, percentage, timeframe, quantity. Never omit or round.

Use Cases & Applications

Concrete examples — situation, action, result. Include vivid anecdotes even if specific to one person.

Principles & Heuristics

Underlying truths. Rules of thumb. "Always X, never Y" guidance.

Contrarian & Non-Obvious Insights

Challenges conventional wisdom. Only include if genuinely non-obvious.
Include practitioner honesty: when the speaker admits their practice contradicts their advice, or acknowledges failure.

Predictions & Future Signals

Forward-looking bets, timeline estimates, emerging trends. Preserve the reasoning chain.

Tools & Resources

Named tools, books, people, communities with organic use-case context. Strip affiliate/sponsored mentions.

Filtering Rules


Always strip: sponsored segments, ad reads, CTAs, self-promotion, filler intros/outros.
Strip time-sensitive noise: "[Tool] just launched" (unless the build methodology is the insight), pricing, availability dates, capability comparisons that will expire.
Always preserve: reasoning behind tool choices, prediction reasoning chains, historical context, every specific number, practitioner admissions, vivid examples, organic tool recommendations with context.


Output Format


Source: [title] — [URL]
Then categories as markdown headers (###). Bullet points for discrete insights, numbered lists for processes, blockquotes for sharp direct quotes.
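A minimal output skeleton under this format; the source, categories, and bullets are placeholders, not prescribed content:

```markdown
Source: How Experts Learn — https://example.com/video

### Mental Models & Frameworks
- First discrete insight...
- Second discrete insight...

### Systematic Methods & Processes
1. Step one, with the reasoning behind it
2. Step two

> "A sharp direct quote worth keeping verbatim."
```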

Output Destination


By default, print the extraction to the conversation.
If the user says "save this" or "write this", write to:
~/brain/extractions/YYYY-MM-DD-<slugified-title>.md
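A slugify step might look like this; the title is hypothetical and the exact normalization rules (lowercase, non-alphanumerics to hyphens) are an assumption the skill doesn't pin down:

```bash
# Hypothetical title used to illustrate the filename format
TITLE="Knowledge Extraction: A Field Guide"

# Lowercase, collapse non-alphanumeric runs to "-", trim leading/trailing hyphens
SLUG=$(printf '%s' "$TITLE" \
  | tr '[:upper:]' '[:lower:]' \
  | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')

echo "$HOME/brain/extractions/$(date +%F)-${SLUG}.md"
```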


Edge Cases


  • Multiple topics: Extract all. Separate with ---.
  • Interview format: Extract from ALL participants. Attribute when it matters.
  • Tutorial/how-to: Full process. Do not skip steps.
  • Non-English: Extract in English. Keep original terms in parentheses when needed.
  • 2hr+ content: Do not compress. Match depth to density tier.