research-collector

Research Collector

This skill does only one thing:

Batch-collect YouTube videos and web articles on a given topic, feed them into NotebookLM, run analysis queries, and save the results to a local directory (default `./research/<topic>/`, configurable).

It does NOT handle:

Writing the finished article (leave this to your own writing tools or skills)
Choosing the headline
Downloading videos (use the `yt-dlp-direct` skill in this repository)
Publishing to multiple platforms (use the `publisher-wechatsync` skill in this repository)

Rule of thumb: when the user says "help me collect material on topic X" or "pull a batch of YouTube videos + articles into NotebookLM", follow this fixed pipeline instead of redesigning it each time.

When To Use

Applicable scenarios:

The user wants to write a recommendation, review, or opinion piece on a topic and needs background research first
The user says "help me find popular YouTube videos and articles about X"
The user says "collect them into NotebookLM for analysis"
The user says "put together a source-material research brief on topic X for me"

Inapplicable scenarios:

The user already has a definite list of sources and only wants a summary → run `nlm notebook query` directly
The user wants live conversational research with nothing persisted to a notebook → use WebSearch + WebFetch
The user only needs to download a single video → use `yt-dlp-direct`

Preconditions

Confirm the following before starting:

The `nlm` CLI is installed and logged in: `nlm login --check`
`yt-dlp` is on PATH: `which yt-dlp`
The user has clearly stated the topic and angle
The output directory is writable (default `./research/<topic>/`; override it with the `RESEARCH_OUTPUT_DIR` environment variable or by naming another path in the conversation)

If a precondition is not met:

`nlm login --check` fails → ask the user to run `nlm login`; a session stays valid for about 20 minutes
`yt-dlp` is not installed → stop and tell the user
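
The PATH and output-directory checks above could be scripted roughly as follows. This is a minimal sketch, not part of the skill: the function names are hypothetical, and it assumes `RESEARCH_OUTPUT_DIR` names the whole output directory (the text does not say whether the topic is appended to it). The login check still has to be done by running `nlm login --check`.

```python
import os
import shutil
from pathlib import Path


def resolve_output_dir(topic, env=None):
    """Return the output directory for a topic.

    Assumption: RESEARCH_OUTPUT_DIR, when set, names the full output
    directory; otherwise fall back to ./research/<topic>/.
    """
    env = os.environ if env is None else env
    override = env.get("RESEARCH_OUTPUT_DIR")
    return Path(override) if override else Path("research") / topic


def missing_tools(tools=("nlm", "yt-dlp"), which=shutil.which):
    """Return the required tools that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def is_writable(directory):
    """Create the directory if needed and report whether it is writable."""
    directory.mkdir(parents=True, exist_ok=True)
    return os.access(directory, os.W_OK)
```

If `missing_tools()` returns anything, stop and tell the user, as the rules above require.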

Working Rules

Agree on the topic, angle, and volume with the user before doing anything
Each ytsearch round defaults to 15 results; adjust as needed
NotebookLM deep research runs one task at a time; never start tasks concurrently
Sleep 2 seconds between source additions to avoid rate limiting
All outputs (raw JSON + summary markdown) go under `./research/<topic>/` (or the directory the user specified)
This skill only collects and analyzes; do not go on to write the finished article unprompted
Do not delete the notebook; the user may want to come back and run more queries

Core Workflow

Phase 0: Align Objectives

Before starting, you must confirm with the user:

Topic (a one-line keyword phrase that can be fed straight to ytsearch)
Angle (e.g., "most commonly used + personal creation" vs. "latest release + technical details")
Notebook name (default: `<Topic> Materials`)
Volume (default: 15 YouTube videos, plus the ~40 web pages that deep research finds on its own)

Phase 1: Create Notebook + Set Alias

```bash
nlm notebook create "<Topic> Materials"
```

Extract the notebook ID from the output, then:

```bash
nlm alias set <short-name> <notebook-id>
```

Pick a short alias, such as `skills-research` or `vps-2026`, and use it in every subsequent command.

Phase 2: Search for Popular YouTube Videos with yt-dlp ytsearch

Run 2-3 searches from different angles in parallel, 15 results each:

```bash
yt-dlp --simulate --print "%(title)s|%(webpage_url)s|%(view_count)s|%(uploader)s" \
  "ytsearch15:<Keyword A>"
yt-dlp --simulate --print "%(title)s|%(webpage_url)s|%(view_count)s|%(uploader)s" \
  "ytsearch15:<Keyword B>"
```

JS runtime warnings in the output can be ignored.

Pick the top 15 from the results using these rules:

Remove duplicates (the same video can show up in several searches)
Prefer official accounts (e.g., Anthropic, OpenAI)
Sort by view count, highest first, but keep 2-3 mid-tier videos with a niche angle so the list is not all generic viral hits
Keep at least 5 results per angle
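
The mechanical part of these rules (deduplication, official accounts first, then view count) might be sketched like this. It assumes the `title|url|views|uploader` line format printed by the commands above; `OFFICIAL_UPLOADERS`, `parse_line`, and `rank_videos` are hypothetical names. Reserving mid-tier slots and keeping at least 5 results per angle are left to manual judgment here.

```python
OFFICIAL_UPLOADERS = {"Anthropic", "OpenAI"}  # examples taken from the rules above


def parse_line(line):
    """Split one 'title|url|views|uploader' line into a dict.

    rsplit keeps a '|' inside the title intact; a '|' inside the
    uploader name would still break the split.
    """
    title, url, views, uploader = line.rstrip("\n").rsplit("|", 3)
    return {
        "title": title,
        "url": url,
        # yt-dlp prints NA when a field is unknown; treat that as 0 views.
        "views": int(views) if views.isdigit() else 0,
        "uploader": uploader,
    }


def rank_videos(lines, limit=15):
    """Deduplicate by URL, then order official uploaders first, then by views."""
    unique = {}
    for line in lines:
        if line.count("|") < 3:
            continue  # skip warnings and other non-result lines
        video = parse_line(line)
        unique.setdefault(video["url"], video)
    ordered = sorted(
        unique.values(),
        key=lambda v: (v["uploader"] not in OFFICIAL_UPLOADERS, -v["views"]),
    )
    return ordered[:limit]
```

Feed it the combined output lines of all searches, then adjust the resulting list by hand for the niche-angle and per-angle rules.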

Phase 3: Add YouTube Videos as Sources

Add them one at a time in a bash loop, sleeping 2 seconds after each:

```bash
cat > /tmp/yt_urls.txt <<'EOF'
https://www.youtube.com/watch?v=XXX1
https://www.youtube.com/watch?v=XXX2
...
EOF

while IFS= read -r url; do
  echo "=== Adding: $url ==="
  nlm source add <alias> --url "$url" 2>&1 | tail -5
  sleep 2
done < /tmp/yt_urls.txt
```

An individual URL will occasionally fail (video not public, region-restricted); ignore it, keep going, and report the success rate at the end.
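
The bash loop does not itself count failures. One way to get the success rate is a Python version of the same loop, sketched below. It assumes `nlm source add` exits with a non-zero status when a URL fails, which the text does not confirm; `add_sources` and `success_rate` are hypothetical helper names.

```python
import subprocess
import time


def add_sources(alias, urls, run=subprocess.run, delay=2.0):
    """Add each URL as a source, pausing between calls.

    Assumption: `nlm source add` exits non-zero when a URL fails.
    Returns (succeeded, failed) lists of URLs.
    """
    succeeded, failed = [], []
    for url in urls:
        result = run(
            ["nlm", "source", "add", alias, "--url", url],
            capture_output=True,
            text=True,
        )
        (succeeded if result.returncode == 0 else failed).append(url)
        time.sleep(delay)  # same 2-second pause as the bash loop
    return succeeded, failed


def success_rate(succeeded, failed):
    """Format a one-line success report."""
    total = len(succeeded) + len(failed)
    return f"{len(succeeded)}/{total} sources added"
```

The `failed` list can go straight into the final report as the failed/skipped sources.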

Phase 4: Run NotebookLM Deep Research to Discover Web Articles

```bash
nlm research start "<English query suitable for web research>" \
  --notebook-id <alias> --mode deep
```

Deep mode takes about 5 minutes and returns about 40 web sources.

Important: a notebook can have only one research task running at a time. To run a second round, wait until the first round has finished importing, or pass `--force`.

Wait for completion:

```bash
nlm research status <alias> --max-wait 360
```

The Bash tool's default timeout is 120 seconds, so the call must set `timeout: 400000` (i.e., 400 seconds).

Phase 5: Import Research Results

Once the research finishes, take the task-id from its output, then:

```bash
nlm research import <alias> <task-id> --timeout 600
```

Set `timeout: 700000` on the Bash tool call.

Note: users sometimes say "I have enough material, no need to import more"; in that case stop and go straight to Phase 6.

Phase 6: Run 3 Analysis Queries

Run three angles by default, redirecting each command straight to a file so the output does not flood the terminal:

```bash
mkdir -p "./research/<topic>"

nlm notebook query <alias> "<Chinese prompt for question 1>" \
  > "./research/<topic>/query1-<slug>-raw.json" 2>&1

nlm notebook query <alias> "<Chinese prompt for question 2>" \
  > "./research/<topic>/query2-<slug>-raw.json" 2>&1

nlm notebook query <alias> "<Chinese prompt for question 3>" \
  > "./research/<topic>/query3-<slug>-raw.json" 2>&1
```

Set `timeout: 240000` on the Bash call for each query.

The 3 default query templates (adjust the keywords as needed):

Top list: "Based on all sources, please list the Top 10 X recommended by the most sources. For each X, explain: (1) Name (2) What it does specifically (3) Main usage scenarios (4) Number of sources recommending it (5) Type classification. Sort by recommendation frequency from highest to lowest, output in Chinese."
Audience-oriented: "I want to write an article for <audience profile>. Please select the Top 8 X that are most helpful to <audience>, and for each give: (1) Name (2) The specific pain point (3) Typical usage in one sentence (4) Type (5) The most specific source number. Drop anything irrelevant, focus on <scenario>, output in Chinese."
Getting started + pitfalls: "For <audience> using X, please summarize: (1) Fastest way to get started (2) Where to obtain it (3) The 5 most common pitfalls (4) When it is actually not needed (5) Latest important updates. Attach source numbers to each point, output in Chinese."

Phase 7: Extract Answer Field and Generate Summary Markdown

The raw output is JSON containing answer + citations; use Python to extract the `value.answer` field:

```bash
python3 <<'PY'
import json
import pathlib

# Directory holding the raw query output; replace <topic> before running.
base = pathlib.Path("./research/<topic>")

# (raw JSON file, markdown heading) for each query, in output order.
files = [
    ("query1-<slug>-raw.json", "## Query 1: <Title>"),
    ("query2-<slug>-raw.json", "## Query 2: <Title>"),
    ("query3-<slug>-raw.json", "## Query 3: <Title>"),
]

# Header of the summary document.
out = ["# <Topic> Material Research", "",
       "> Analysis results based on NotebookLM notebook `<notebook-name>`", "",
       "---", ""]

for fname, heading in files:
    out.append(heading)
    out.append("")
    raw = (base / fname).read_text(encoding="utf-8")
    try:
        # The answer text sits under value.answer in the query output.
        data = json.loads(raw)
        out.append(data.get("value", {}).get("answer", ""))
    except Exception as e:
        # Record the failure and keep going so one bad file does not block the rest.
        out.append(f"(Parsing failed: {e})")
    out.extend(["", "---", ""])

summary = base / "Material Research Summary.md"
summary.write_text("\n".join(out), encoding="utf-8")
print("Written:", summary.stat().st_size, "bytes")
PY
```

Output Contract

After execution, provide the user with a report including:

Notebook name + alias + actual number of sources
Paths on disk of the 3 raw JSON files and the 1 summary markdown file
Sources that failed or were skipped (if any)
A preview of the top of the summary file (roughly the first 20 lines)
Suggested next steps (how to use the material downstream is the user's call; this skill ends here)

Safety and Boundaries

Do not run audio/video/slides generation by default; it burns quota, so leave it alone unless the user asks
Do not start a second research round automatically; one round covers the vast majority of cases
Do not overwrite an existing `Material Research Summary.md`; if one already exists, append `-v2`
Do not put the user's private information in research queries (notebooks are searchable)
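
The no-overwrite rule could be enforced with a small helper like the one below. This is a sketch under two assumptions: that `-v2` is appended to the file stem (before `.md`), and that later runs continue with `-v3`, `-v4`, and so on, which the rule does not spell out. `next_free_path` is a hypothetical name.

```python
from pathlib import Path


def next_free_path(path):
    """Return `path` if unused, else the first free '-vN' variant.

    Assumption: the version suffix goes on the file stem, so
    'Summary.md' becomes 'Summary-v2.md', then 'Summary-v3.md'.
    """
    if not path.exists():
        return path
    version = 2
    while True:
        candidate = path.with_name(f"{path.stem}-v{version}{path.suffix}")
        if not candidate.exists():
            return candidate
        version += 1
```

The Phase 7 script would then write to `next_free_path(base / "Material Research Summary.md")` instead of the fixed path.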

Troubleshooting

nlm Login Expired

```bash
nlm login --check  # Tells you if the session is valid
nlm login          # Re-login
```

A session stays valid for roughly 20 minutes.

yt-dlp Search Returns No Output

Check the version first:

```bash
yt-dlp --version
```

If it is too old, ask the user to update. JS runtime / ffmpeg warnings can be ignored; they do not affect `--simulate` mode.

Research Times Out or Gets Stuck

Check the status on its own (non-blocking):

```bash
nlm research status <alias> --max-wait 0
```

If the status stays `in_progress` for more than 10 minutes, restart with `--force`:

```bash
nlm research start "..." --notebook-id <alias> --mode deep --force
```
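
The 10-minute rule can be turned into a polling loop. The sketch below assumes the status output contains the literal text `in_progress` while a task is running; the exact output format of `nlm research status` is not shown here, so treat the string match as a placeholder. `research_status` and `wait_for_research` are hypothetical names.

```python
import subprocess
import time


def research_status(alias):
    """Return the raw text printed by a non-blocking status check."""
    result = subprocess.run(
        ["nlm", "research", "status", alias, "--max-wait", "0"],
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr


def wait_for_research(alias, check=research_status, limit=600, interval=30,
                      sleep=time.sleep):
    """Poll until the status text no longer mentions in_progress.

    Assumption: the output contains 'in_progress' while a task runs.
    Returns True if the task left that state within `limit` seconds,
    False if it looks stuck and a --force restart may be needed.
    """
    waited = 0
    while True:
        if "in_progress" not in check(alias):
            return True
        if waited >= limit:
            return False
        sleep(interval)
        waited += interval
```

A `False` result corresponds to the stuck case above; the restart itself is still the `--force` command.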

Query Output Is Too Large to View Directly

Redirect every query to a file, then extract the answer with Python; do not try to print large JSON straight to the terminal.

Repeated Failures When Adding Sources

Check for rate limiting → increase the sleep to 3-5 seconds
Check the URL format (YouTube URLs must use the standard `watch?v=` form, not shorts/live)
Check the login state → `nlm login --check`
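
For the URL-format check, shorts, live, and `youtu.be` links can be rewritten to the `watch?v=` form before they go into the URL list. A minimal sketch, with `to_watch_url` as a hypothetical helper; whether a rewritten shorts or live link is then accepted as a source is not confirmed here.

```python
from urllib.parse import parse_qs, urlparse


def to_watch_url(url):
    """Rewrite shorts, live, and youtu.be links to the watch?v= form.

    Returns None when no video ID can be found in the URL.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    parts = [p for p in parsed.path.split("/") if p]
    video_id = None
    if host == "youtu.be" and parts:
        # Short links carry the ID as the first path segment.
        video_id = parts[0]
    elif host.endswith("youtube.com"):
        if parts[:1] == ["watch"]:
            video_id = parse_qs(parsed.query).get("v", [None])[0]
        elif len(parts) >= 2 and parts[0] in ("shorts", "live"):
            video_id = parts[1]
    return f"https://www.youtube.com/watch?v={video_id}" if video_id else None
```

URLs that come back as `None` are better reported as skipped than passed to `nlm source add`.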

References

Full NotebookLM CLI guide: the nlm-skill that ships with `notebooklm-mcp-cli` (pip package by jacob-bd), or the upstream README at https://github.com/jacob-bd/notebooklm-mcp-cli
yt-dlp command reference: `../yt-dlp-direct/SKILL.md` in the same repository
Project conventions: if your working directory has a `CLAUDE.md` / `AGENTS.md`, reading it is optional; this skill does not depend on it