research-collector
Research Collector
This skill does exactly one thing:
- Batch-collect YouTube videos and web articles on a topic, feed them into NotebookLM, run analysis queries, and save the results to a local directory (default `./research/<topic>/`, configurable)
It does NOT handle:
- Writing the finished article (leave that to your own writing tool or skill)
- Choosing the main title
- Downloading videos (use the `yt-dlp-direct` skill in this repository)
- Publishing to multiple platforms (use the `publisher-wechatsync` skill in this repository)
Rule of thumb: when the user says "help me collect material on topic X" or "pull a batch of YouTube videos + articles into NotebookLM", follow this fixed pipeline instead of redesigning it each time.
When To Use
Applicable scenarios:
- The user wants to write a recommendation, review, or opinion piece on a topic and needs background research first
- The user says "help me find popular YouTube videos and articles about X"
- The user says "collect them into NotebookLM for analysis"
- The user says "put together a source-material study on topic X for me"
Not applicable:
- The user already has a definite list of sources and only wants a summary → run `nlm notebook query` directly
- The user wants live conversational research with no need to persist anything to a notebook → use WebSearch + WebFetch
- The user only wants to download a single video → use `yt-dlp-direct`
Preconditions
Confirm the following before starting:
- The `nlm` CLI is installed and logged in: `nlm login --check`
- `yt-dlp` is on PATH: `which yt-dlp`
- The user has clearly stated the topic and the angle
- The output directory is writable (default `./research/<topic>/`; another path can be set via the `RESEARCH_OUTPUT_DIR` environment variable or given directly in the conversation)
If a precondition is not met:
- `nlm login --check` fails → ask the user to run `nlm login`; a session is valid for ~20 minutes
- `yt-dlp` is not installed → stop and tell the user
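The PATH and output-directory checks above could be scripted roughly as follows. This is a minimal sketch, not part of the skill: `resolve_output_dir`, `missing_tools`, and `is_writable` are hypothetical helper names, and the code assumes that `RESEARCH_OUTPUT_DIR`, when set, holds the complete output path rather than a parent directory. The login check still has to be done with `nlm login --check`.

```python
import os
import shutil
from pathlib import Path


def resolve_output_dir(topic, env=None):
    """Return the output directory for a topic.

    Assumption: RESEARCH_OUTPUT_DIR, when set, is the full output path
    and replaces ./research/<topic>/ entirely.
    """
    env = os.environ if env is None else env
    override = env.get("RESEARCH_OUTPUT_DIR", "").strip()
    return Path(override) if override else Path("research") / topic


def missing_tools(tools=("nlm", "yt-dlp")):
    """List the required CLIs that are not on PATH (same idea as `which`)."""
    return [tool for tool in tools if shutil.which(tool) is None]


def is_writable(directory):
    """Create the directory if needed and report whether it is writable."""
    directory.mkdir(parents=True, exist_ok=True)
    return os.access(directory, os.W_OK)
```

If `missing_tools()` returns a non-empty list, the workflow should stop and report the missing tool to the user, as the rules above require.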
Working Rules
- Align on topic, angle, and volume with the user before doing anything
- Default to 15 results per ytsearch round; adjust as needed
- NotebookLM deep research runs only one task at a time; no concurrency
- Sleep 2 seconds between source additions to avoid rate limiting
- All outputs (raw JSON + summary markdown) go under `./research/<topic>/` (or the directory the user specified)
- This skill only collects and analyzes; do not go on to write the finished article on your own initiative
- Do not delete the notebook; the user may want to come back and run more queries
Core Workflow
Phase 0: Align Objectives
Before starting, confirm with the user:
- The topic (a one-line keyword phrase that can be fed straight to ytsearch)
- The angle (e.g., "most commonly used + personal creation" vs "latest release + technical details")
- The notebook name (default: `<Topic> Materials`)
- The volume (default: 15 YouTube videos + ~40 web pages found automatically by deep research)
Phase 1: Create Notebook + Set Alias
bash
nlm notebook create "<Topic> Materials"
Extract the notebook ID from the output, then:
bash
nlm alias set <short-name> <notebook-id>
Pick a short alias, such as `skills-research` or `vps-2026`, and use it in all subsequent commands.
Phase 2: Search for Popular YouTube Videos with yt-dlp ytsearch
Run 2-3 searches from different angles in parallel, 15 results each:
bash
yt-dlp --simulate --print "%(title)s|%(webpage_url)s|%(view_count)s|%(uploader)s" \
  "ytsearch15:<Keyword A>"
yt-dlp --simulate --print "%(title)s|%(webpage_url)s|%(view_count)s|%(uploader)s" \
  "ytsearch15:<Keyword B>"
JS runtime warnings in the output can be ignored.
Narrow the results down to the top 15 using these rules:
- Deduplicate (the same video can appear in several searches)
- Prefer official accounts (e.g., Anthropic, OpenAI)
- Sort by view count, highest first, but keep 2-3 mid-tier videos with a niche angle so the list is not all viral, press-release-style content
- Keep at least 5 results for each angle
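The mechanical part of these rules (deduplicating and ranking) can be sketched in Python. This is an illustration under stated assumptions, not the skill's own code: `parse_line` and `rank_videos` are hypothetical names, "prefer official accounts" is interpreted as "official uploaders sort first", and the uploader field is assumed not to contain `|`. Keeping 2-3 mid-tier videos and at least 5 per angle is left to manual judgment.

```python
OFFICIAL_UPLOADERS = {"Anthropic", "OpenAI"}  # example accounts named in the text


def parse_line(line):
    """Split one 'title|url|views|uploader' line produced by the --print template.

    Titles may contain '|', so the split is done from the right.
    Returns None for lines that do not match (e.g. warnings).
    """
    parts = line.rstrip("\n").rsplit("|", 3)
    if len(parts) != 4:
        return None
    title, url, views, uploader = parts
    return {
        "title": title,
        "url": url,
        # yt-dlp prints a placeholder (NA) when the view count is unknown
        "views": int(views) if views.isdigit() else 0,
        "uploader": uploader,
    }


def rank_videos(lines, limit=15):
    """Dedupe by URL, put official uploaders first, then sort by views descending."""
    seen = {}
    for line in lines:
        video = parse_line(line)
        if video and video["url"] not in seen:
            seen[video["url"]] = video
    ranked = sorted(
        seen.values(),
        key=lambda v: (v["uploader"] not in OFFICIAL_UPLOADERS, -v["views"]),
    )
    return ranked[:limit]
```

Feeding it the concatenated output of all searches gives a first-pass shortlist that can then be adjusted by hand.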
Phase 3: Add YouTube Videos as Sources
Add them one at a time in a bash loop, sleeping 2 seconds after each:
bash
cat > /tmp/yt_urls.txt <<'EOF'
https://www.youtube.com/watch?v=XXX1
https://www.youtube.com/watch?v=XXX2
...
EOF
while IFS= read -r url; do
  echo "=== Adding: $url ==="
  nlm source add <alias> --url "$url" 2>&1 | tail -5
  sleep 2
done < /tmp/yt_urls.txt
Individual additions occasionally fail (private or region-restricted video); skip them, keep going, and report the success rate at the end.
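The "skip failures, keep going, report the success rate" behaviour can be sketched as follows. The names `add_sources` and `success_rate` are hypothetical, and `add_one` is a caller-supplied function (for instance a wrapper around the `nlm source add` call shown in the loop) that is assumed to return True on success; how the CLI signals failure is not specified in this document.

```python
import time


def add_sources(urls, add_one, delay_s=2.0, sleep=time.sleep):
    """Call add_one(url) for each URL, pausing between calls.

    Exceptions are treated as failures so one bad video does not
    stop the batch.
    """
    added, failed = [], []
    for index, url in enumerate(urls):
        if index:  # pause between additions, not before the first one
            sleep(delay_s)
        try:
            ok = add_one(url)
        except Exception:
            ok = False
        (added if ok else failed).append(url)
    return added, failed


def success_rate(added, failed):
    """Fraction of URLs that were added successfully (0.0 for an empty batch)."""
    total = len(added) + len(failed)
    return len(added) / total if total else 0.0
```

The `failed` list is what the final report should present as failed or skipped sources.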
Phase 4: Run NotebookLM Deep Research to Discover Web Articles
bash
nlm research start "<English query suitable for web research>" \
  --notebook-id <alias> --mode deep
Deep mode takes ~5 minutes and returns ~40 web sources.
Key point: a notebook can have only one research task running at a time. To run a second round, either wait until the first round has finished importing or use `--force`.
Wait for completion:
bash
nlm research status <alias> --max-wait 360
The Bash tool's default timeout is 120 seconds, so the call must set `timeout: 400000` (i.e., 400 seconds).
Phase 5: Import Research Results
Once the research completes, take the task-id from the output, then:
bash
nlm research import <alias> <task-id> --timeout 600
Set `timeout: 700000` on the Bash tool call.
Note: if the user says "that's enough material, no need to import more", stop and go straight to Phase 6.
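The tool-call timeouts used in Phases 4 and 5 are given in milliseconds and sit a little above the CLI's own wait in seconds. A small worked example of that arithmetic is shown below; `tool_timeout_ms` is a hypothetical helper, and the headroom values (40 s and 100 s) are inferred from the figures in the text rather than stated by it.

```python
def tool_timeout_ms(cli_wait_s, headroom_s):
    """Tool-call timeout in milliseconds: CLI wait plus headroom, in seconds."""
    return (cli_wait_s + headroom_s) * 1000


# Phase 4: --max-wait 360 with 40 s of headroom
status_timeout = tool_timeout_ms(360, 40)
# Phase 5: --timeout 600 with 100 s of headroom
import_timeout = tool_timeout_ms(600, 100)
```

The point of the headroom is that the tool call should outlive the CLI's own wait, so the CLI gets the chance to report its result before the tool call is cut off.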
Phase 6: Run 3 Analysis Queries
By default, run queries from 3 angles and redirect each command straight to a file so the output does not flood the terminal:
bash
mkdir -p "./research/<topic>"
nlm notebook query <alias> "<Chinese prompt for question 1>" \
  > "./research/<topic>/query1-<slug>-raw.json" 2>&1
nlm notebook query <alias> "<Chinese prompt for question 2>" \
  > "./research/<topic>/query2-<slug>-raw.json" 2>&1
nlm notebook query <alias> "<Chinese prompt for question 3>" \
  > "./research/<topic>/query3-<slug>-raw.json" 2>&1
Set `timeout: 240000` on the Bash call for each query.
The 3 default query templates (change the keywords as needed):
- Top list: "Based on all sources, please list the Top 10 X recommended by the most sources. For each X, state: (1) name (2) what exactly it does (3) main usage scenarios (4) number of sources recommending it (5) type/category. Order by recommendation frequency, highest first, and answer in Chinese."
- Audience-oriented: "I am writing an article for <audience profile>. Please pick the Top 8 X most helpful to <audience>, and for each state: (1) name (2) the specific pain point (3) typical usage in one sentence (4) type (5) the most specific source number. Drop anything irrelevant, focus on <scenario>, and answer in Chinese."
- Getting started + pitfalls: "For <audience> using X, please summarize: (1) the fastest way to get started (2) where to get it (3) the 5 easiest pitfalls to fall into (4) when it is actually not needed (5) the latest important updates. Attach source numbers to each point and answer in Chinese."
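Pairing each template with its output file can be sketched as below. Everything here is illustrative: `build_queries` is a hypothetical helper, the slugs (`top-list`, `audience`, `pitfalls`) are made-up values for the `<slug>` placeholder, and the template strings are shortened stand-ins for the full prompts above.

```python
from pathlib import Path

# Shortened stand-ins for the three templates; the real prompts are longer.
TEMPLATES = {
    "top-list": "Based on all sources, list the Top 10 {subject} recommended by the most sources.",
    "audience": "I am writing an article for {audience}. Pick the Top 8 {subject} most helpful to them.",
    "pitfalls": "For {audience} using {subject}, summarize the fastest start and the 5 easiest pitfalls.",
}


def build_queries(topic_dir, subject, audience):
    """Return (prompt, output_path) pairs, one per template, in order."""
    jobs = []
    for number, (slug, template) in enumerate(TEMPLATES.items(), start=1):
        prompt = template.format(subject=subject, audience=audience)
        path = Path(topic_dir) / f"query{number}-{slug}-raw.json"
        jobs.append((prompt, path))
    return jobs
```

Each pair maps onto one `nlm notebook query <alias> "<prompt>" > <path> 2>&1` invocation from the block above.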
Phase 7: Extract Answer Field and Generate Summary Markdown
The raw output is JSON containing answer + citations; extract the `value.answer` field with Python:
bash
python3 <<'PY'
import json
import pathlib

base = pathlib.Path("./research/<topic>")

# (raw query file, heading printed above its answer)
files = [
    ("query1-<slug>-raw.json", "## Query 1: <Title>"),
    ("query2-<slug>-raw.json", "## Query 2: <Title>"),
    ("query3-<slug>-raw.json", "## Query 3: <Title>"),
]

out = [
    "# <Topic> Material Research", "",
    "> Analysis results based on NotebookLM notebook `<notebook-name>`", "",
    "---", "",
]

for fname, heading in files:
    out.append(heading)
    out.append("")
    try:
        # A missing file or non-JSON content is reported inline instead of aborting.
        data = json.loads((base / fname).read_text(encoding="utf-8"))
        out.append(data.get("value", {}).get("answer", ""))
    except Exception as e:
        out.append(f"(Parsing failed: {e})")
    out.extend(["", "---", ""])

summary = base / "Material Research Summary.md"
summary.write_text("\n".join(out), encoding="utf-8")
print("Written:", summary.stat().st_size, "bytes")
PY
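Because the Phase 6 commands redirect stderr into the same file as stdout (`2>&1`), a raw file may contain warning lines around the JSON. A more tolerant extraction could look like the sketch below; `extract_answer` is a hypothetical helper, and the `value.answer` layout is taken from the script above.

```python
import json


def extract_answer(raw):
    """Return value.answer from raw CLI output, or None if none is found.

    Tries each '{' in turn so that non-JSON text before the payload
    (for example warnings captured via 2>&1) is skipped.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            data, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            data = None
        if isinstance(data, dict):
            value = data.get("value")
            if isinstance(value, dict) and "answer" in value:
                return value["answer"]
        start = raw.find("{", start + 1)
    return None
```

A `None` result can be rendered as a "parsing failed" note in the summary, in the same way the script handles exceptions.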
Output Contract
After the run, report to the user:
- Notebook name + alias + actual number of sources
- Paths on disk of the 3 raw JSON files and the 1 summary markdown file
- Failed or skipped sources (if any)
- A preview of the top of the summary file (roughly the first 20 lines)
- Suggested next steps (downstream use is the user's decision; this skill ends here)
Safety and Boundaries
- Do not run audio/video/slides generation by default; it consumes quota, so leave it alone unless the user asks
- Do not automatically run a second round of research; one round is enough for the vast majority of cases
- Do not overwrite an existing `Material Research Summary.md`; if one exists, append a `-v2` suffix to the new file's name
- Do not put the user's private information into research queries (notebooks are searchable)
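The no-overwrite rule can be sketched as a small path helper. `next_free_path` is a hypothetical name; placing the suffix before the file extension and continuing with `-v3`, `-v4`, and so on are assumptions that go beyond what the rule states.

```python
from pathlib import Path


def next_free_path(path):
    """Return path if unused, else the first free '<stem>-vN<suffix>' starting at N=2."""
    path = Path(path)
    if not path.exists():
        return path
    version = 2
    while True:
        candidate = path.with_name(f"{path.stem}-v{version}{path.suffix}")
        if not candidate.exists():
            return candidate
        version += 1
```

Writing the summary to `next_free_path(base / "Material Research Summary.md")` instead of the fixed name keeps earlier summaries intact.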
Troubleshooting
nlm Login Expired
bash
nlm login --check   # Tells you whether the session is valid
nlm login           # Log in again
A session is valid for about 20 minutes.
yt-dlp Search Returns No Output
Check the version first:
bash
yt-dlp --version
If it is too old, prompt the user to update. JS runtime / ffmpeg warnings can be ignored; they do not affect `--simulate` mode.
Research Times Out or Gets Stuck
Check the status on its own (non-blocking):
bash
nlm research status <alias> --max-wait 0
If the status stays `in_progress` for more than 10 minutes, restart with `--force`:
bash
nlm research start "..." --notebook-id <alias> --mode deep --force
Query Output Is Too Large to View Directly
Redirect every query to a file, then extract the answer with Python; do not try to print large JSON directly in the terminal.
Continuous Failures When Adding Sources
- Check for rate limiting → increase the sleep to 3-5 seconds
- Check the URL format (YouTube links must use the standard `watch?v=` form, not shorts/live)
- Check the login state → `nlm login --check`
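For the URL-format check, shorts and live links can be rewritten to the `watch?v=` form before they are added. The sketch below is an illustration only: `to_watch_url` is a hypothetical helper, and whether NotebookLM can ingest the rewritten video is not guaranteed by this document.

```python
from urllib.parse import parse_qs, urlparse


def to_watch_url(url):
    """Rewrite youtube.com shorts/live links to watch?v= form; None if no video ID is found."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host != "youtube.com":
        return None
    segments = [s for s in parts.path.split("/") if s]
    video_id = None
    if segments[:1] == ["watch"]:
        # Already in standard form: keep only the v parameter
        video_id = parse_qs(parts.query).get("v", [None])[0]
    elif len(segments) >= 2 and segments[0] in ("shorts", "live"):
        video_id = segments[1]
    return f"https://www.youtube.com/watch?v={video_id}" if video_id else None
```

URLs for which the function returns `None` are better reported to the user than passed to `nlm source add`.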
References
- Full NotebookLM CLI guide: the nlm-skill bundled with `notebooklm-mcp-cli` (pip package by jacob-bd), or the upstream README at https://github.com/jacob-bd/notebooklm-mcp-cli
- yt-dlp command library: `../yt-dlp-direct/SKILL.md` in the same repository
- Project-specific conventions: if your working directory has a `CLAUDE.md` / `AGENTS.md`, this skill does not strictly depend on it; reading it is optional