/extract — Knowledge Extraction Skill
When NOT to use this skill
- User wants a summary, digest, or TL;DR → use summarize instead
- Content is news, announcements, or product launches with no methodology → decline and explain
- User wants real-time data, prices, or live information → use web_search instead
- Content is purely promotional/marketing with no transferable insight → say so, don't extract
Quick Reference
| Source | Method |
|---|---|
| YouTube | yt-dlp subtitles → Groq audio fallback |
| Podcast / direct audio | yt-dlp download → Groq transcription |
| X/Twitter thread | X API v2 (X_BEARER_TOKEN required) |
| Web article | WebFetch tool |
| Local file / PDF | Read tool |
| Paywalled content | Extract what's accessible, note the wall |
Output always starts with Source: [title] — [URL] then knowledge by category.
Workflow
Step 1: Identify source type and fetch content
**YouTube video** (youtube.com or youtu.be):
```bash
TMPDIR=$(mktemp -d)

# Step 1: try subtitle extraction — use SRT conversion to handle auto-caption dedup
yt-dlp --write-auto-sub --sub-lang "en" --skip-download --convert-subs srt -o "${TMPDIR}/sub" "<url>" 2>/dev/null
TRANSCRIPT=$(cat ${TMPDIR}/sub*.srt 2>/dev/null | grep -v "^[0-9]" | grep -v "^-->" | grep -v "^$" | sed 's/<[^>]*>//g' | tr '\n' ' ')

# Step 2: if no subtitles, transcribe via Groq
if [ -z "$TRANSCRIPT" ]; then
  if [ -z "$GROQ_API_KEY" ]; then
    echo "ERROR: No subtitles found and GROQ_API_KEY is not set. Cannot transcribe."
    rm -rf "$TMPDIR"
    exit 1
  fi
  yt-dlp -x --audio-format mp3 --audio-quality 9 -o "${TMPDIR}/audio.%(ext)s" "<url>" 2>/dev/null
  # Check file size — Groq limit: 25MB free tier, 100MB dev tier
  FILESIZE=$(stat -f%z "${TMPDIR}/audio.mp3" 2>/dev/null || stat -c%s "${TMPDIR}/audio.mp3" 2>/dev/null)
  if [ "$FILESIZE" -gt 100000000 ]; then
    echo "ERROR: Audio too large for Groq ($(($FILESIZE/1048576))MB, limit 100MB). Try a shorter video."
    rm -rf "$TMPDIR"
    exit 1
  fi
  TRANSCRIPT=$(curl -s https://api.groq.com/openai/v1/audio/transcriptions \
    -H "Authorization: Bearer $GROQ_API_KEY" \
    -F "file=@${TMPDIR}/audio.mp3" \
    -F "model=whisper-large-v3-turbo" | jq -r '.text')
fi
rm -rf "$TMPDIR"
```
If both paths fail: tell the user exactly what failed and stop.
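To see why the filter chain works, here is the same cleanup pipeline run on a tiny inline sample (two made-up SRT cues): the `grep`s drop cue numbers, timestamps, and blank lines, `sed` strips inline tags, and `tr` joins the remaining text.

```shell
# Two fabricated SRT cues standing in for a real subtitle file.
SRT='1
00:00:01,000 --> 00:00:03,000
Hello <i>world</i>

2
00:00:03,000 --> 00:00:05,000
from subtitles'
# Cue numbers and timestamps both start with a digit, so one grep drops them.
CLEANED=$(echo "$SRT" | grep -v "^[0-9]" | grep -v "^-->" | grep -v "^$" | sed 's/<[^>]*>//g' | tr '\n' ' ')
echo "$CLEANED"
```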
---
**Podcast / direct audio** (MP3/M4A URL, SoundCloud, podcast episode):
yt-dlp handles most audio URLs natively. Use the same Groq transcription path as the YouTube fallback (requires GROQ_API_KEY). For RSS feeds: extract the episode `<enclosure>` URL first, then treat as direct audio.
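The `<enclosure>` step can be sketched as a small pipeline; the feed below is an inline sample (for a real feed, replace the `FEED=` assignment with `FEED=$(curl -s "$FEED_URL")`).

```shell
# Fabricated minimal RSS feed; real feeds have the same <enclosure url="..."> shape.
FEED='<rss><channel><item>
  <title>Episode 42</title>
  <enclosure url="https://example.com/ep42.mp3" length="1234" type="audio/mpeg"/>
</item></channel></rss>'
# Grab the first enclosure tag, then cut the URL out of its url="..." attribute.
EPISODE_URL=$(echo "$FEED" | grep -oE '<enclosure[^>]*url="[^"]*"' | head -1 | sed 's/.*url="\([^"]*\)".*/\1/')
echo "$EPISODE_URL"
```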
---
**X/Twitter thread** (x.com or twitter.com):
X blocks all unauthenticated access. Requires X_BEARER_TOKEN in environment.
X API uses pay-per-use credits — each call costs credits from your balance.
If X_BEARER_TOKEN is not set: ask the user to paste the thread text directly.
"X requires a paid API for access. Paste the thread text and I'll extract from that."
```bash
if [ -z "$X_BEARER_TOKEN" ]; then
  echo "X_BEARER_TOKEN is not set. Ask user to paste thread text."
  exit 1
fi
TWEET_ID=$(echo "<url>" | grep -oE '[0-9]{15,}' | tail -1)

# Fetch root tweet with conversation context
TWEET_DATA=$(curl -s "https://api.twitter.com/2/tweets/${TWEET_ID}?tweet.fields=conversation_id,author_id,text,created_at&expansions=author_id&user.fields=name,username" \
  -H "Authorization: Bearer $X_BEARER_TOKEN")
CONV_ID=$(echo "$TWEET_DATA" | jq -r '.data.conversation_id')
AUTHOR_ID=$(echo "$TWEET_DATA" | jq -r '.data.author_id')

# Fetch thread (search/recent — last 7 days only)
THREAD=$(curl -s "https://api.twitter.com/2/tweets/search/recent?query=conversation_id:${CONV_ID}&tweet.fields=text,created_at,author_id&max_results=100&sort_order=recency" \
  -H "Authorization: Bearer $X_BEARER_TOKEN")

# Filter to author only, reverse to chronological
THREAD_TWEETS=$(echo "$THREAD" | jq -r --arg aid "$AUTHOR_ID" \
  '[.data[] | select(.author_id == $aid)] | reverse | .[].text')
```
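The ID extraction relies on status URLs ending in a long numeric ID; a quick check on a made-up URL (tweet IDs are currently 18-19 digits, so a run of 15+ digits is unambiguous):

```shell
# Hypothetical status URL; tail -1 keeps the last long digit run in case the
# URL also contains other numbers.
URL="https://x.com/someuser/status/1712345678901234567"
TWEET_ID=$(echo "$URL" | grep -oE '[0-9]{15,}' | tail -1)
echo "$TWEET_ID"
```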
If the thread is older than 7 days (search returns empty): ask the user to paste the thread text.
"This thread is older than 7 days — X's search API can't reach it. Paste the thread text and I'll extract from that."
Reconstruct thread as sequential blockquotes before extracting.
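The reconstruction step amounts to prefixing each line of the chronological tweet text with `> `; a minimal sketch, using sample data in place of a real `THREAD_TWEETS`:

```shell
# Sample thread text; in practice THREAD_TWEETS comes from the jq filter above.
THREAD_TWEETS='First tweet in the thread.
Second tweet, continuing the argument.'
# Prefix every line to produce sequential markdown blockquotes.
QUOTED=$(echo "$THREAD_TWEETS" | sed 's/^/> /')
echo "$QUOTED"
```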
---
**Web article** (any HTTP/HTTPS URL, not YouTube or X):
Use WebFetch: "Return the complete text content of this page. Preserve all details, quotes, examples, and structure. Do not summarize."
If WebFetch returns garbage (login wall, JS-only rendering): ask the user to paste the article text.
**Paywalled content:** Note it clearly at the top:
> Warning: Paywalled — only the free preview was accessible. Extraction is based on partial content.
Then extract whatever is accessible. Do not fabricate beyond the paywall.
**Local file / PDF**: Use the Read tool directly.
---
Step 2: Assess content quality
Before extracting, scan the raw content:
- Shallow (listicle, hype, no real methodology) — say so in 1-2 lines. Don't manufacture depth.
- News/announcement-only — extract only if real methodology is buried in it.
- Substantial — proceed.
Step 3: Extract knowledge
Target depth by insight density, not raw length.
First, assess density:
- High density (every minute has new ideas): technical talks, dense essays, practitioner deep-dives
- Medium density (mixed signal/filler): most interviews, conference talks, long-form articles
- Low density (mostly filler, few real insights): casual podcasts, rambling discussions
| Content length | High density | Medium density | Low density |
|---|---|---|---|
| Short (<30m / short article) | 1,000–1,500 | 800–1,200 | 500–800 |
| Medium (30–60m / long article) | 2,000–3,000 | 1,500–2,000 | 800–1,200 |
| Long (1–2hr) | 3,000–5,000 | 2,000–3,000 | 1,000–1,500 |
| Very long (2hr+) | 5,000–7,000 | 3,000–4,000 | 1,500–2,500 |
These are floors. Dense content warrants more. Below the floor = you're under-extracting.
Low density + short content may not be worth extracting at all — say so.
Extraction Framework
Use whichever categories are present. Skip empty ones.
Mental Models & Frameworks
Ways of thinking. Decision heuristics. How experts frame situations differently.
Systematic Methods & Processes
Step-by-step techniques. Playbooks. Include sequence and reasoning behind each step.
Specific Techniques & Tactics
Named techniques, scripts, templates, prompt structures. Concrete and immediately applicable.
Key Numbers & Benchmarks
Every specific statistic, threshold, ratio, percentage, timeframe, quantity. Never omit or round.
Use Cases & Applications
Concrete examples — situation, action, result. Include vivid anecdotes even if specific to one person.
Principles & Heuristics
Underlying truths. Rules of thumb. "Always X, never Y" guidance.
Contrarian & Non-Obvious Insights
Challenges conventional wisdom. Only include if genuinely non-obvious.
Include practitioner honesty: when the speaker admits their practice contradicts their advice, or acknowledges failure.
Predictions & Future Signals
Forward-looking bets, timeline estimates, emerging trends. Preserve the reasoning chain.
Tools & Resources
Named tools, books, people, communities with organic use-case context. Strip affiliate/sponsored mentions.
Filtering Rules
Always strip: sponsored segments, ad reads, CTAs, self-promotion, filler intros/outros.
Strip time-sensitive noise: "[Tool] just launched" (unless the build methodology is the insight), pricing, availability dates, capability comparisons that will expire.
Always preserve: reasoning behind tool choices, prediction reasoning chains, historical context, every specific number, practitioner admissions, vivid examples, organic tool recommendations with context.
Output Format
Source: [title] — [URL]
Then categories as markdown headers (###). Bullet points for discrete insights, numbered lists for processes, blockquotes for sharp direct quotes.
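A minimal skeleton of the expected shape (title, URL, insights, and quote below are placeholders):

```markdown
Source: Example Talk Title — https://example.com/talk

### Mental Models & Frameworks
- Discrete insight one
- Discrete insight two

### Systematic Methods & Processes
1. First step, with the reasoning behind it
2. Second step

> A sharp direct quote from the speaker.
```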
Output Destination
By default, print the extraction to the conversation.
If the user says "save this" or "write this", write to:
`~/brain/extractions/YYYY-MM-DD-<slugified-title>.md`
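One plausible way to build that path, shown on a made-up title (lowercase, collapse non-alphanumeric runs to hyphens, trim edge hyphens):

```shell
# Hypothetical title; a real one comes from the extracted source.
TITLE="How I Ship Fast: Lessons from 10 Years"
SLUG=$(echo "$TITLE" | tr '[:upper:]' '[:lower:]' | sed -E 's/[^a-z0-9]+/-/g; s/^-//; s/-$//')
SAVE_PATH="$HOME/brain/extractions/$(date +%F)-${SLUG}.md"
echo "$SAVE_PATH"
```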
Edge Cases
- Multiple topics: Extract all. Separate with ---.
- Interview format: Extract from ALL participants. Attribute when it matters.
- Tutorial/how-to: Full process. Do not skip steps.
- Non-English: Extract in English. Keep original terms in parentheses when needed.
- 2hr+ content: Do not compress. Match depth to density tier.