/extract — Knowledge Extraction Skill
When NOT to use this skill
- User wants a summary, digest, or TL;DR → use summarize instead
- Content is news, announcements, or product launches with no methodology → decline and explain
- User wants real-time data, prices, or live information → use web_search instead
- Content is purely promotional/marketing with no transferable insight → say so, don't extract
Quick Reference
| Source | Method |
|---|---|
| YouTube | yt-dlp subtitles → Groq audio fallback |
| Podcast / direct audio | yt-dlp download → Groq transcription |
| X/Twitter thread | X API v2 (X_BEARER_TOKEN required) |
| Web article | WebFetch tool |
| Local file / PDF | Read tool |
| Paywalled content | Extract what's accessible, note the wall |
Output always starts with Source: [title] — [URL] then knowledge by category.
Workflow
Step 1: Identify source type and fetch content
**YouTube video** (youtube.com or youtu.be):
```bash
TMPDIR=$(mktemp -d)

# Step 1: try subtitle extraction — use SRT conversion to handle auto-caption dedup
yt-dlp --write-auto-sub --sub-lang "en" --skip-download --convert-subs srt -o "${TMPDIR}/sub" "<url>" 2>/dev/null
TRANSCRIPT=$(cat ${TMPDIR}/sub*.srt 2>/dev/null | grep -v "^[0-9]" | grep -v "^-->" | grep -v "^$" | sed 's/<[^>]*>//g' | tr '\n' ' ')

# Step 2: if no subtitles, transcribe via Groq
if [ -z "$TRANSCRIPT" ]; then
  if [ -z "$GROQ_API_KEY" ]; then
    echo "ERROR: No subtitles found and GROQ_API_KEY is not set. Cannot transcribe."
    rm -rf "$TMPDIR"
    exit 1
  fi
  yt-dlp -x --audio-format mp3 --audio-quality 9 -o "${TMPDIR}/audio.%(ext)s" "<url>" 2>/dev/null
  # Check file size — Groq limit: 25MB free tier, 100MB dev tier
  FILESIZE=$(stat -f%z "${TMPDIR}/audio.mp3" 2>/dev/null || stat -c%s "${TMPDIR}/audio.mp3" 2>/dev/null)
  if [ "$FILESIZE" -gt 100000000 ]; then
    echo "ERROR: Audio too large for Groq ($(($FILESIZE/1048576))MB, limit 100MB). Try a shorter video."
    rm -rf "$TMPDIR"
    exit 1
  fi
  TRANSCRIPT=$(curl -s https://api.groq.com/openai/v1/audio/transcriptions \
    -H "Authorization: Bearer $GROQ_API_KEY" \
    -F "file=@${TMPDIR}/audio.mp3" \
    -F "model=whisper-large-v3-turbo" | jq -r '.text')
fi
rm -rf "$TMPDIR"
```
If both paths fail: tell the user exactly what failed and stop.
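To see why the filter chain works, here is the same cleanup pipeline run on a tiny inline sample (two made-up SRT cues): the `grep`s drop cue numbers, timestamps, and blank lines, `sed` strips inline tags, and `tr` joins the remaining text.

```shell
# Two fabricated SRT cues standing in for a real subtitle file.
SRT='1
00:00:01,000 --> 00:00:03,000
Hello <i>world</i>

2
00:00:03,000 --> 00:00:05,000
from subtitles'
# Cue numbers and timestamps both start with a digit, so one grep drops them.
CLEANED=$(echo "$SRT" | grep -v "^[0-9]" | grep -v "^-->" | grep -v "^$" | sed 's/<[^>]*>//g' | tr '\n' ' ')
echo "$CLEANED"
```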
---
**Podcast / direct audio** (MP3/M4A URL, SoundCloud, podcast episode):
yt-dlp handles most audio URLs natively. Use the same Groq transcription path as the YouTube fallback (requires GROQ_API_KEY). For RSS feeds: extract the episode `<enclosure>` URL first, then treat as direct audio.
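The `<enclosure>` step can be sketched as a small pipeline; the feed below is an inline sample (for a real feed, replace the `FEED=` assignment with `FEED=$(curl -s "$FEED_URL")`).

```shell
# Fabricated minimal RSS feed; real feeds have the same <enclosure url="..."> shape.
FEED='<rss><channel><item>
  <title>Episode 42</title>
  <enclosure url="https://example.com/ep42.mp3" length="1234" type="audio/mpeg"/>
</item></channel></rss>'
# Grab the first enclosure tag, then cut the URL out of its url="..." attribute.
EPISODE_URL=$(echo "$FEED" | grep -oE '<enclosure[^>]*url="[^"]*"' | head -1 | sed 's/.*url="\([^"]*\)".*/\1/')
echo "$EPISODE_URL"
```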
---
**X/Twitter thread** (x.com or twitter.com):
X blocks all unauthenticated access. Requires X_BEARER_TOKEN in environment.
X API uses pay-per-use credits — each call costs credits from your balance.
If X_BEARER_TOKEN is not set: ask the user to paste the thread text directly.
"X requires a paid API for access. Paste the thread text and I'll extract from that."
```bash
if [ -z "$X_BEARER_TOKEN" ]; then
  echo "X_BEARER_TOKEN is not set. Ask user to paste thread text."
  exit 1
fi
TWEET_ID=$(echo "<url>" | grep -oE '[0-9]{15,}' | tail -1)

# Fetch root tweet with conversation context
TWEET_DATA=$(curl -s "https://api.twitter.com/2/tweets/${TWEET_ID}?tweet.fields=conversation_id,author_id,text,created_at&expansions=author_id&user.fields=name,username" \
  -H "Authorization: Bearer $X_BEARER_TOKEN")
CONV_ID=$(echo "$TWEET_DATA" | jq -r '.data.conversation_id')
AUTHOR_ID=$(echo "$TWEET_DATA" | jq -r '.data.author_id')

# Fetch thread (search/recent — last 7 days only)
THREAD=$(curl -s "https://api.twitter.com/2/tweets/search/recent?query=conversation_id:${CONV_ID}&tweet.fields=text,created_at,author_id&max_results=100&sort_order=recency" \
  -H "Authorization: Bearer $X_BEARER_TOKEN")

# Filter to author only, reverse to chronological
THREAD_TWEETS=$(echo "$THREAD" | jq -r --arg aid "$AUTHOR_ID" \
  '[.data[] | select(.author_id == $aid)] | reverse | .[].text')
```
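The ID extraction relies on status URLs ending in a long numeric ID; a quick check on a made-up URL (tweet IDs are currently 18-19 digits, so a run of 15+ digits is unambiguous):

```shell
# Hypothetical status URL; tail -1 keeps the last long digit run in case the
# URL also contains other numbers.
URL="https://x.com/someuser/status/1712345678901234567"
TWEET_ID=$(echo "$URL" | grep -oE '[0-9]{15,}' | tail -1)
echo "$TWEET_ID"
```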
If the thread is older than 7 days (search returns empty): ask the user to paste the thread text.
"This thread is older than 7 days — X's search API can't reach it. Paste the thread text and I'll extract from that."
Reconstruct thread as sequential blockquotes before extracting.
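The reconstruction step amounts to prefixing each line of the chronological tweet text with `> `; a minimal sketch, using sample data in place of a real `THREAD_TWEETS`:

```shell
# Sample thread text; in practice THREAD_TWEETS comes from the jq filter above.
THREAD_TWEETS='First tweet in the thread.
Second tweet, continuing the argument.'
# Prefix every line to produce sequential markdown blockquotes.
QUOTED=$(echo "$THREAD_TWEETS" | sed 's/^/> /')
echo "$QUOTED"
```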
---
**Web article** (any HTTP/HTTPS URL, not YouTube or X):
Use WebFetch: "Return the complete text content of this page. Preserve all details, quotes, examples, and structure. Do not summarize."
If WebFetch returns garbage (login wall, JS-only rendering): ask the user to paste the article text.
**Paywalled content:** Note it clearly at the top:
> Warning: Paywalled — only the free preview was accessible. Extraction is based on partial content.
Then extract whatever is accessible. Do not fabricate beyond the paywall.
**Local file / PDF**: Use the Read tool directly.
---
Step 2: Assess content quality
Before extracting, scan the raw content:
- Shallow (listicle, hype, no real methodology) — say so in 1-2 lines. Don't manufacture depth.
- News/announcement-only — extract only if real methodology is buried in it.
- Substantial — proceed.
Step 3: Extract knowledge
Target depth by insight density, not raw length.
First, assess density:
- High density (every minute has new ideas): technical talks, dense essays, practitioner deep-dives
- Medium density (mixed signal/filler): most interviews, conference talks, long-form articles
- Low density (mostly filler, few real insights): casual podcasts, rambling discussions
| Content length | High density | Medium density | Low density |
|---|---|---|---|
| Short (<30m / short article) | 1,000–1,500 | 800–1,200 | 500–800 |
| Medium (30–60m / long article) | 2,000–3,000 | 1,500–2,000 | 800–1,200 |
| Long (1–2hr) | 3,000–5,000 | 2,000–3,000 | 1,000–1,500 |
| Very long (2hr+) | 5,000–7,000 | 3,000–4,000 | 1,500–2,500 |
These are floors. Dense content warrants more. Below the floor = you're under-extracting.
Low density + short content may not be worth extracting at all — say so.
Extraction Framework
Use whichever categories are present. Skip empty ones.
Mental Models & Frameworks
Ways of thinking. Decision heuristics. How experts frame situations differently.
Systematic Methods & Processes
Step-by-step techniques. Playbooks. Include sequence and reasoning behind each step.
Specific Techniques & Tactics
Named techniques, scripts, templates, prompt structures. Concrete and immediately applicable.
Key Numbers & Benchmarks
Every specific statistic, threshold, ratio, percentage, timeframe, quantity. Never omit or round.
Use Cases & Applications
Concrete examples — situation, action, result. Include vivid anecdotes even if specific to one person.
Principles & Heuristics
Underlying truths. Rules of thumb. "Always X, never Y" guidance.
Contrarian & Non-Obvious Insights
Challenges conventional wisdom. Only include if genuinely non-obvious.
Include practitioner honesty: when the speaker admits their practice contradicts their advice, or acknowledges failure.
Predictions & Future Signals
Forward-looking bets, timeline estimates, emerging trends. Preserve the reasoning chain.
Tools & Resources
Named tools, books, people, communities with organic use-case context. Strip affiliate/sponsored mentions.
Filtering Rules
Always strip: sponsored segments, ad reads, CTAs, self-promotion, filler intros/outros.
Strip time-sensitive noise: "[Tool] just launched" (unless the build methodology is the insight), pricing, availability dates, capability comparisons that will expire.
Always preserve: reasoning behind tool choices, prediction reasoning chains, historical context, every specific number, practitioner admissions, vivid examples, organic tool recommendations with context.
Output Format
Source: [title] — [URL]
Then categories as markdown headers (###). Bullet points for discrete insights, numbered lists for processes, blockquotes for sharp direct quotes.
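A minimal skeleton of the expected shape (title, URL, insights, and quote below are placeholders):

```markdown
Source: Example Talk Title — https://example.com/talk

### Mental Models & Frameworks
- Discrete insight one
- Discrete insight two

### Systematic Methods & Processes
1. First step, with the reasoning behind it
2. Second step

> A sharp direct quote from the speaker.
```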
Output Destination
By default, print the extraction to the conversation.
If the user says "save this" or "write this", write to:
`~/brain/extractions/YYYY-MM-DD-<slugified-title>.md`
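One plausible way to build that path, shown on a made-up title (lowercase, collapse non-alphanumeric runs to hyphens, trim edge hyphens):

```shell
# Hypothetical title; a real one comes from the extracted source.
TITLE="How I Ship Fast: Lessons from 10 Years"
SLUG=$(echo "$TITLE" | tr '[:upper:]' '[:lower:]' | sed -E 's/[^a-z0-9]+/-/g; s/^-//; s/-$//')
SAVE_PATH="$HOME/brain/extractions/$(date +%F)-${SLUG}.md"
echo "$SAVE_PATH"
```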
Edge Cases
- Multiple topics: Extract all. Separate with ---.
- Interview format: Extract from ALL participants. Attribute when it matters.
- Tutorial/how-to: Full process. Do not skip steps.
- Non-English: Extract in English. Keep original terms in parentheses when needed.
- 2hr+ content: Do not compress. Match depth to density tier.