baoyu-url-to-markdown

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

URL to Markdown

URL转Markdown

Fetches any URL via
baoyu-fetch
CLI (Chrome CDP + site-specific adapters) and converts it to clean markdown.
通过
baoyu-fetch
CLI(Chrome CDP + 站点专属适配器)抓取任意URL并转换为整洁的markdown格式。

CLI Setup

CLI安装配置

Important: The CLI source is vendored in the
scripts/vendor/baoyu-fetch/
subdirectory of this skill.
Agent Execution Instructions:
  1. Determine this SKILL.md file's directory path as
    {baseDir}
  2. CLI entry point =
    {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
  3. Resolve
    ${BUN_X}
    runtime: if
    bun
    installed →
    bun
    ; if
    npx
    available →
    npx -y bun
    ; else suggest installing bun
  4. ${READER}
    =
    ${BUN_X} {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
  5. Replace all
    ${READER}
    in this document with the resolved value
重要提示:CLI源码存放于本skill的
scripts/vendor/baoyu-fetch/
子目录下。
Agent执行说明:
  1. 确定当前SKILL.md文件的目录路径为
    {baseDir}
  2. CLI入口文件 =
    {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
  3. 解析
    ${BUN_X}
    运行时:如果已安装
    bun
    则使用
    bun
    ;如果可用
    npx
    则使用
    npx -y bun
    ;否则建议用户安装bun
  4. ${READER}
    =
    ${BUN_X} {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
  5. 将本文档中所有
    ${READER}
    替换为解析后的实际值

Preferences (EXTEND.md)

偏好配置 (EXTEND.md)

Check EXTEND.md existence (priority order):
bash
undefined
按优先级顺序检查EXTEND.md是否存在:
bash
undefined

macOS, Linux, WSL, Git Bash

macOS, Linux, WSL, Git Bash

test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project" test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg" test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"

```powershell
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project" test -f "${XDG_CONFIG_HOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "xdg" test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"

```powershell

PowerShell (Windows)

PowerShell (Windows)

if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" } $xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" } if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" } if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }

| Path | Location |
|------|----------|
| `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | Project directory |
| `$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | User home |

| Result | Action |
|--------|--------|
| Found | Read, parse, apply settings |
| Not found | **MUST** run first-time setup (see below) — do NOT silently create defaults |

**EXTEND.md Supports**: Download media by default | Default output directory
if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { "project" } $xdg = if ($env:XDG_CONFIG_HOME) { $env:XDG_CONFIG_HOME } else { "$HOME/.config" } if (Test-Path "$xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "xdg" } if (Test-Path "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md") { "user" }

| 路径 | 位置 |
|------|----------|
| `.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | 项目目录 |
| `$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md` | 用户根目录 |

| 结果 | 操作 |
|--------|--------|
| 已找到 | 读取、解析并应用配置 |
| 未找到 | **必须**执行首次配置(见下文)—— 请勿静默创建默认配置 |

**EXTEND.md支持配置项**:默认下载媒体资源 | 默认输出目录

First-Time Setup (BLOCKING)

首次配置(阻塞操作)

CRITICAL: When EXTEND.md is not found, you MUST use
AskUserQuestion
to ask the user for their preferences before creating EXTEND.md. NEVER create EXTEND.md with defaults without asking. This is a BLOCKING operation — do NOT proceed with any conversion until setup is complete.
Use
AskUserQuestion
with ALL questions in ONE call:
Question 1 — header: "Media", question: "How to handle images and videos in pages?"
  • "Ask each time (Recommended)" — After saving markdown, ask whether to download media
  • "Always download" — Always download media to local imgs/ and videos/ directories
  • "Never download" — Keep original remote URLs in markdown
Question 2 — header: "Output", question: "Default output directory?"
  • "url-to-markdown (Recommended)" — Save to ./url-to-markdown/{domain}/{slug}.md
  • (User may choose "Other" to type a custom path)
Question 3 — header: "Save", question: "Where to save preferences?"
  • "User (Recommended)" — ~/.baoyu-skills/ (all projects)
  • "Project" — .baoyu-skills/ (this project only)
After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.
Full reference: references/config/first-time-setup.md
关键要求:未找到EXTEND.md时,你必须使用
AskUserQuestion
在创建EXTEND.md前询问用户偏好。绝对不要未经询问就创建带默认配置的EXTEND.md。这是阻塞
操作——配置完成前不要执行任何转换操作。
调用
AskUserQuestion
时需将所有问题放在单次请求中:
问题1 — 标题: "媒体资源", 问题: "如何处理页面中的图片和视频?"
  • "每次询问(推荐)" — 保存markdown后询问是否下载媒体资源
  • "始终下载" — 始终将媒体资源下载到本地imgs/和videos/目录
  • "从不下载" — 在markdown中保留原始远程URL
问题2 — 标题: "输出路径", 问题: "默认输出目录是?"
  • "url-to-markdown(推荐)" — 保存到 ./url-to-markdown/{domain}/{slug}.md
  • (用户可选择「其他」输入自定义路径)
问题3 — 标题: "保存位置", 问题: "偏好配置保存到哪里?"
  • "用户全局(推荐)" — ~/.baoyu-skills/ (对所有项目生效)
  • "当前项目" — .baoyu-skills/ (仅对本项目生效)
用户回答后,在选择的位置创建EXTEND.md,提示「偏好配置已保存到[路径]」,再继续后续操作。
完整参考: references/config/first-time-setup.md

Supported Keys

支持的配置项

KeyDefaultValuesDescription
download_media
ask
ask
/
1
/
0
ask
= prompt each time,
1
= always download,
0
= never
default_output_dir
emptypath or emptyDefault output directory (empty =
./url-to-markdown/
)
EXTEND.md → CLI mapping:
EXTEND.md keyCLI argumentNotes
download_media: 1
--download-media
Requires
--output
to be set
default_output_dir: ./posts/
Agent constructs
--output ./posts/{domain}/{slug}.md
Agent generates path, not a direct CLI flag
Value priority:
  1. CLI arguments (
    --download-media
    ,
    --output
    )
  2. EXTEND.md
  3. Skill defaults
配置键默认值可选值描述
download_media
ask
ask
/
1
/
0
ask
= 每次操作前询问,
1
= 始终下载,
0
= 从不下载
default_output_dir
路径或空默认输出目录(空 = 默认为
./url-to-markdown/
EXTEND.md → CLI参数映射:
EXTEND.md配置键CLI参数说明
download_media: 1
--download-media
需要同时设置
--output
default_output_dir: ./posts/
Agent自动构建
--output ./posts/{domain}/{slug}.md
由Agent生成路径,不是直接传入CLI的参数
配置优先级:
  1. CLI参数 (
    --download-media
    ,
    --output
    )
  2. EXTEND.md配置
  3. Skill默认值

Features

功能特性

  • Chrome CDP for full JavaScript rendering via
    baoyu-fetch
    CLI
  • Site-specific adapters: X/Twitter, YouTube, Hacker News, generic (Defuddle)
  • Automatic adapter selection based on URL, or force with
    --adapter
  • Interaction gate detection: Cloudflare, reCAPTCHA, hCAPTCHA, custom challenges
  • Two capture modes: headless (default) or interactive with wait-for-interaction
  • Clean markdown output with YAML front matter
  • Structured JSON output available via
    --format json
  • X/Twitter: extracts tweets, threads, and X Articles with media
  • YouTube: transcript/caption extraction, chapters, cover images
  • Hacker News: threaded comment parsing with proper nesting
  • Generic: Defuddle extraction with Readability fallback
  • Download images and videos to local directories
  • Chrome profile persistence for authenticated sessions
  • Debug artifact output for troubleshooting
  • 基于Chrome CDP实现完整JavaScript渲染,由
    baoyu-fetch
    CLI提供支持
  • 站点专属适配器:X/Twitter、YouTube、Hacker News、通用站点(Defuddle)
  • 基于URL自动选择适配器,也可通过
    --adapter
    强制指定
  • 交互门槛检测:Cloudflare、reCAPTCHA、hCAPTCHA、自定义验证
  • 两种捕获模式:无头模式(默认)或带交互等待的可视化模式
  • 整洁的markdown输出,自带YAML front matter
  • 可通过
    --format json
    输出结构化JSON结果
  • X/Twitter:提取推文、讨论线程、X专栏文章及关联媒体资源
  • YouTube:提取字幕/文案、章节信息、封面图
  • Hacker News:带层级嵌套的评论线程解析
  • 通用站点:Defuddle提取,兜底使用Readability方案
  • 支持将图片和视频下载到本地目录
  • Chrome配置文件持久化,支持已登录会话抓取
  • 调试产物输出,方便排查问题

Usage

使用方法

bash
undefined
bash
undefined

Default: headless capture, markdown to stdout

默认:无头模式捕获,markdown输出到标准输出

${READER} <url>
${READER} <url>

Save to file

保存到文件

${READER} <url> --output article.md
${READER} <url> --output article.md

Save with media download

保存同时下载媒体资源

${READER} <url> --output article.md --download-media
${READER} <url> --output article.md --download-media

Headless mode (explicit)

显式指定无头模式

${READER} <url> --headless --output article.md
${READER} <url> --headless --output article.md

Wait for interaction (login/CAPTCHA) — auto-detect and continue

等待交互(登录/CAPTCHA)—— 自动检测完成后继续

${READER} <url> --wait-for interaction --output article.md
${READER} <url> --wait-for interaction --output article.md

Wait for interaction — manual control (Enter to continue)

等待交互——手动控制(按Enter继续)

${READER} <url> --wait-for force --output article.md
${READER} <url> --wait-for force --output article.md

JSON output

JSON格式输出

${READER} <url> --format json --output article.json
${READER} <url> --format json --output article.json

Force specific adapter

强制使用指定适配器

${READER} <url> --adapter youtube --output transcript.md
${READER} <url> --adapter youtube --output transcript.md

Connect to existing Chrome

连接到已运行的Chrome实例

${READER} <url> --cdp-url http://localhost:9222 --output article.md
${READER} <url> --cdp-url http://localhost:9222 --output article.md

Debug artifacts

输出调试产物

${READER} <url> --output article.md --debug-dir ./debug/
undefined
${READER} <url> --output article.md --debug-dir ./debug/
undefined

Options

可选参数

OptionDescription
<url>
URL to fetch
--output <path>
Output file path (default: stdout)
--format <type>
Output format:
markdown
(default) or
json
--json
Shorthand for
--format json
--adapter <name>
Force adapter:
x
,
youtube
,
hn
, or
generic
(default: auto-detect)
--headless
Force headless Chrome (no visible window)
--wait-for <mode>
Interaction wait mode:
none
(default),
interaction
, or
force
--wait-for-interaction
Alias for
--wait-for interaction
--wait-for-login
Alias for
--wait-for interaction
--timeout <ms>
Page load timeout (default: 30000)
--interaction-timeout <ms>
Login/CAPTCHA wait timeout (default: 600000 = 10 min)
--interaction-poll-interval <ms>
Poll interval for interaction checks (default: 1500)
--download-media
Download images/videos to local
imgs/
and
videos/
, rewrite markdown links. Requires
--output
--media-dir <dir>
Base directory for downloaded media (default: same as
--output
directory)
--cdp-url <url>
Reuse existing Chrome DevTools Protocol endpoint
--browser-path <path>
Custom Chrome/Chromium binary path
--chrome-profile-dir <path>
Chrome user data directory (default:
BAOYU_CHROME_PROFILE_DIR
env or
./baoyu-skills/chrome-profile
)
--debug-dir <dir>
Write debug artifacts (document.json, markdown.md, page.html, network.json)
参数描述
<url>
要抓取的URL
--output <path>
输出文件路径(默认:标准输出)
--format <type>
输出格式:
markdown
(默认)或
json
--json
--format json
的简写
--adapter <name>
强制指定适配器:
x
,
youtube
,
hn
, 或
generic
(默认:自动检测)
--headless
强制使用无头Chrome(无可视化窗口)
--wait-for <mode>
交互等待模式:
none
(默认),
interaction
, 或
force
--wait-for-interaction
--wait-for interaction
的别名
--wait-for-login
--wait-for interaction
的别名
--timeout <ms>
页面加载超时时间(默认:30000)
--interaction-timeout <ms>
登录/CAPTCHA等待超时时间(默认:600000 = 10分钟)
--interaction-poll-interval <ms>
交互状态检测轮询间隔(默认:1500)
--download-media
将图片/视频下载到本地
imgs/
videos/
目录,重写markdown中的链接。需要同时设置
--output
--media-dir <dir>
下载媒体资源的根目录(默认:与
--output
目录相同)
--cdp-url <url>
复用已有的Chrome DevTools Protocol端点
--browser-path <path>
自定义Chrome/Chromium二进制文件路径
--chrome-profile-dir <path>
Chrome用户数据目录(默认:
BAOYU_CHROME_PROFILE_DIR
环境变量 或
./baoyu-skills/chrome-profile
--debug-dir <dir>
输出调试产物(document.json, markdown.md, page.html, network.json)

Capture Modes

捕获模式

ModeBehaviorUse When
DefaultHeadless Chrome, auto-extract on network idlePublic pages, static content
--headless
Explicit headless (same as default)Clarify intent
--wait-for interaction
Opens visible Chrome, auto-detects login/CAPTCHA gates, waits for them to clear, then continuesLogin-required, CAPTCHA-protected
--wait-for force
Opens visible Chrome, auto-detects OR accepts Enter keypress to continueComplex flows, lazy loading, paywalls
Interaction gate auto-detection:
  • Cloudflare Turnstile / "just a moment" pages
  • Google reCAPTCHA
  • hCaptcha
  • Custom challenge / verification screens
Wait-for-interaction workflow:
  1. Run with
    --wait-for interaction
    → Chrome opens visibly
  2. CLI auto-detects login/CAPTCHA gates
  3. User completes login or solves CAPTCHA in the browser
  4. CLI auto-detects gate cleared → captures page
  5. If
    --wait-for force
    is used, user can also press Enter to trigger capture manually
模式行为适用场景
默认无头Chrome,网络空闲时自动提取内容公开页面、静态内容
--headless
显式指定无头模式(与默认行为一致)明确声明执行意图
--wait-for interaction
打开可视化Chrome窗口,自动检测登录/CAPTCHA门槛,等待验证完成后继续需要登录、CAPTCHA保护的页面
--wait-for force
打开可视化Chrome窗口,自动检测或等待用户按Enter触发后续流程复杂交互流程、懒加载页面、付费墙
交互门槛自动检测支持:
  • Cloudflare Turnstile / "请稍候"页面
  • Google reCAPTCHA
  • hCAPTCHA
  • 自定义挑战/验证页面
交互等待工作流:
  1. 携带
    --wait-for interaction
    运行 → Chrome可视化窗口打开
  2. CLI自动检测登录/CAPTCHA门槛
  3. 用户在浏览器中完成登录或完成CAPTCHA验证
  4. CLI自动检测到门槛已清除 → 捕获页面内容
  5. 如果使用
    --wait-for force
    ,用户也可以按Enter手动触发捕获

Agent Quality Gate

Agent质量检测门槛

CRITICAL: The agent must treat default headless capture as provisional. Some sites render differently in headless mode and can silently return low-quality content without causing the CLI to fail.
After every headless run, the agent MUST inspect the saved markdown output.
关键要求:Agent必须将默认无头模式的捕获结果视为临时结果。部分站点在无头模式下渲染结果不同,可能静默返回低质量内容但不会导致CLI执行失败。
每次无头模式运行后,Agent必须检查保存的markdown输出内容。

Quality checks the agent must perform

Agent必须执行的质量检查

  1. Confirm the markdown title matches the target page, not a generic site shell
  2. Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
  3. Watch for obvious failure signs:
    • Application error
    • This page could not be found
    • Login, signup, subscribe, or verification shells
    • Extremely short markdown for a page that should be long-form
    • Raw framework payloads or mostly boilerplate content
  4. If the result is low quality, incomplete, or clearly wrong, do not accept the run as successful just because the CLI exited with code 0
Tip: Use
--format json
to get structured output including
status
,
login.state
, and
interaction
fields for programmatic quality assessment. A
"status": "needs_interaction"
response means the page requires manual interaction.
  1. 确认markdown标题与目标页面匹配,不是通用的站点框架标题
  2. 确认正文包含预期的文章或页面内容,而不只是导航、页脚或通用错误提示
  3. 注意明显的失败特征:
    • Application error
      (应用错误)
    • This page could not be found
      (页面未找到)
    • 登录、注册、订阅或验证引导页面
    • 本该是长内容的页面输出极短的markdown
    • 原始框架 payload 或大部分是模板内容
  4. 如果结果质量低、不完整或明显错误,不要仅因为CLI退出码为0就认为运行成功
提示:使用
--format json
可以获得包含
status
login.state
interaction
字段的结构化输出,方便程序化质量评估。返回
"status": "needs_interaction"
表示页面需要手动交互。

Recovery workflow the agent must follow

Agent必须遵循的恢复工作流

  1. First run headless (default) unless there is already a clear reason to use interaction mode
  2. Review markdown quality immediately after the run
  3. If the content is low quality or indicates login/CAPTCHA:
    • --wait-for interaction
      for auto-detected gates (login, CAPTCHA, Cloudflare)
    • --wait-for force
      when the page needs manual browsing, scroll loading, or complex interaction
  4. If
    --wait-for
    is used, tell the user exactly what to do:
    • If login is required, ask them to sign in in the browser
    • If CAPTCHA appears, ask them to solve it
    • If the page needs time to load, ask them to wait until content is visible
    • For
      --wait-for force
      : tell them to press Enter when ready
  5. If JSON output shows
    "status": "needs_interaction"
    , switch to
    --wait-for interaction
    automatically
  1. 首次运行默认使用无头模式,除非已经有明确理由使用交互模式
  2. 运行完成后立即检查markdown质量
  3. 如果内容质量低或提示需要登录/CAPTCHA:
    • 对于可自动检测的门槛(登录、CAPTCHA、Cloudflare)使用
      --wait-for interaction
    • 当页面需要手动浏览、滚动加载或复杂交互时使用
      --wait-for force
  4. 如果使用
    --wait-for
    参数,明确告知用户需要执行的操作:
    • 如果需要登录,提示用户在浏览器中完成登录
    • 如果出现CAPTCHA,提示用户完成验证
    • 如果页面需要加载时间,提示用户等待内容显示完整
    • 对于
      --wait-for force
      模式:告知用户准备完成后按Enter
  5. 如果JSON输出显示
    "status": "needs_interaction"
    ,自动切换到
    --wait-for interaction
    模式

Output Path Generation

输出路径生成规则

The agent must construct the output file path since
baoyu-fetch
does not auto-generate paths.
Algorithm:
  1. Determine base directory from EXTEND.md
    default_output_dir
    or default
    ./url-to-markdown/
  2. Extract domain from URL (e.g.,
    example.com
    )
  3. Generate slug from URL path or page title (kebab-case, 2-6 words)
  4. Construct:
    {base_dir}/{domain}/{slug}/{slug}.md
    — each URL gets its own directory so media files stay isolated
  5. Conflict resolution: append timestamp
    {slug}-YYYYMMDD-HHMMSS/{slug}-YYYYMMDD-HHMMSS.md
Pass the constructed path to
--output
. Media files (
--download-media
) are saved into subdirectories next to the markdown file, keeping each URL's assets self-contained.
Agent必须自行构建输出文件路径,因为
baoyu-fetch
不会自动生成路径。
生成算法:
  1. 从EXTEND.md的
    default_output_dir
    获取基础目录,默认使用
    ./url-to-markdown/
  2. 从URL提取域名(例如
    example.com
  3. 从URL路径或页面标题生成slug(短横线分隔格式,2-6个单词)
  4. 构建路径:
    {base_dir}/{domain}/{slug}/{slug}.md
    —— 每个URL对应独立目录,方便媒体资源隔离
  5. 冲突处理:追加时间戳
    {slug}-YYYYMMDD-HHMMSS/{slug}-YYYYMMDD-HHMMSS.md
将构建好的路径传给
--output
参数。媒体文件(开启
--download-media
时)会保存到markdown文件同级的子目录中,保证每个URL的资源独立存储。

Output Format

输出格式

Markdown output to stdout (or file with
--output
) as clean markdown text.
JSON output (
--format json
) returns structured data including:
  • adapter
    — which adapter handled the URL
  • status
    "ok"
    or
    "needs_interaction"
  • login
    — login state detection (
    logged_in
    ,
    logged_out
    ,
    unknown
    )
  • interaction
    — interaction gate details (kind, provider, prompt)
  • document
    — structured content (url, title, author, publishedAt, content blocks, metadata)
  • media
    — collected media assets with url, kind, role
  • markdown
    — converted markdown text
  • downloads
    — media download results (when
    --download-media
    used)
When
--download-media
is enabled:
  • Images are saved to
    imgs/
    next to the output file (or in
    --media-dir
    )
  • Videos are saved to
    videos/
    next to the output file (or in
    --media-dir
    )
  • Markdown media links are rewritten to local relative paths
Markdown输出到标准输出(或通过
--output
指定的文件),为整洁的markdown文本。
JSON输出(
--format json
)返回结构化数据,包含:
  • adapter
    —— 处理该URL的适配器
  • status
    ——
    "ok"
    "needs_interaction"
  • login
    —— 登录状态检测结果(
    logged_in
    已登录,
    logged_out
    未登录,
    unknown
    未知)
  • interaction
    —— 交互门槛详情(类型、提供商、提示)
  • document
    —— 结构化内容(url、标题、作者、发布时间、内容块、元数据)
  • media
    —— 采集到的媒体资源,包含url、类型、作用
  • markdown
    —— 转换后的markdown文本
  • downloads
    —— 媒体下载结果(开启
    --download-media
    时返回)
开启
--download-media
时:
  • 图片保存到输出文件同级的
    imgs/
    目录(或
    --media-dir
    指定的目录)
  • 视频保存到输出文件同级的
    videos/
    目录(或
    --media-dir
    指定的目录)
  • Markdown中的媒体链接会重写为本地相对路径

Built-in Adapters

内置适配器

AdapterURLsKey Features
x
x.com, twitter.comTweets, threads, X Articles, media, login detection
youtube
youtube.com, youtu.beTranscript/captions, chapters, cover image, metadata
hn
news.ycombinator.comThreaded comments, story metadata, nested replies
generic
Any URL (fallback)Defuddle extraction, Readability fallback, auto-scroll, network idle detection
Adapter is auto-selected based on URL. Use
--adapter <name>
to override.
适配器适配URL核心功能
x
x.com, twitter.com推文、讨论线程、X专栏文章、媒体资源、登录状态检测
youtube
youtube.com, youtu.be字幕/文案、章节信息、封面图、元数据
hn
news.ycombinator.com层级化评论、帖子元数据、嵌套回复
generic
任意URL(兜底适配)Defuddle提取、Readability兜底、自动滚动、网络空闲检测
适配器会根据URL自动选择,可通过
--adapter <name>
强制指定。

Media Download Workflow

媒体下载工作流

Based on
download_media
setting in EXTEND.md:
SettingBehavior
1
(always)
Run CLI with
--download-media --output <path>
0
(never)
Run CLI with
--output <path>
(no media download)
ask
(default)
Follow the ask-each-time flow below
基于EXTEND.md中的
download_media
配置执行:
配置值行为
1
(始终下载)
运行CLI时携带
--download-media --output <path>
参数
0
(从不下载)
运行CLI时携带
--output <path>
参数(不下载媒体)
ask
(默认)
遵循下文的每次询问流程

Ask-Each-Time Flow

每次询问流程

  1. Run CLI without
    --download-media
    with
    --output <path>
    → markdown saved
  2. Check saved markdown for remote media URLs (
    https://
    in image/video links)
  3. If no remote media found → done, no prompt needed
  4. If remote media found → use
    AskUserQuestion
    :
    • header: "Media", question: "Download N images/videos to local files?"
    • "Yes" — Download to local directories
    • "No" — Keep remote URLs
  5. If user confirms → run CLI again with
    --download-media --output <same-path>
    (overwrites markdown with localized links)
  1. 运行CLI不携带
    --download-media
    参数,携带
    --output <path>
    → 保存markdown
  2. 检查保存的markdown中是否存在远程媒体URL(图片/视频链接中包含
    https://
  3. 如果未找到远程媒体 → 流程结束,无需提示
  4. 如果找到远程媒体 → 调用
    AskUserQuestion
    :
    • 标题: "媒体资源", 问题: "是否下载N个图片/视频到本地?"
    • "是" —— 下载到本地目录
    • "否" —— 保留远程URL
  5. 如果用户确认 → 再次运行CLI携带
    --download-media --output <相同路径>
    (覆盖markdown,将链接替换为本地路径)

Environment Variables

环境变量

VariableDescription
BAOYU_CHROME_PROFILE_DIR
Chrome user data directory (can also use
--chrome-profile-dir
)
Troubleshooting: Chrome not found → use
--browser-path
. Timeout → increase
--timeout
. Login/CAPTCHA pages → use
--wait-for interaction
. Debug → use
--debug-dir
to inspect captured HTML and network logs.
变量名描述
BAOYU_CHROME_PROFILE_DIR
Chrome用户数据目录(也可通过
--chrome-profile-dir
指定)
故障排查:找不到Chrome → 使用
--browser-path
指定路径。超时 → 调大
--timeout
参数。登录/CAPTCHA页面 → 使用
--wait-for interaction
。调试 → 使用
--debug-dir
检查捕获的HTML和网络日志。

YouTube Notes

YouTube使用注意

  • YouTube adapter extracts transcripts/captions automatically when available
  • Transcript format:
    [MM:SS] Text segment
    with chapter headings
  • Transcript availability depends on YouTube exposing a caption track. Videos with captions disabled or restricted playback may produce description-only output
  • Use
    --wait-for force
    if the page needs time to finish loading player metadata
  • 可用时YouTube适配器会自动提取字幕/文案
  • 字幕格式:
    [MM:SS] 文本片段
    ,附带章节标题
  • 字幕可用性取决于YouTube是否开放字幕轨道。关闭字幕或播放受限的视频可能仅返回描述内容
  • 如果页面需要时间加载播放器元数据,使用
    --wait-for force

X/Twitter Notes

X/Twitter使用注意

  • Extracts single tweets, threads, and X Articles
  • Auto-detects login state; if logged out and content requires auth, JSON output will show
    "status": "needs_interaction"
  • Use
    --wait-for interaction
    for login-protected content
  • 支持提取单条推文、讨论线程、X专栏文章
  • 自动检测登录状态;如果未登录且内容需要权限,JSON输出会返回
    "status": "needs_interaction"
  • 对于登录保护的内容使用
    --wait-for interaction

Hacker News Notes

Hacker News使用注意

  • Parses threaded comments with proper nesting and reply hierarchy
  • Includes story metadata (title, URL, author, score, comment count)
  • Shows comment deletion/dead status
  • 解析带正确层级和回复关系的嵌套评论
  • 包含帖子元数据(标题、URL、作者、评分、评论数)
  • 显示评论删除/失效状态

Extension Support

扩展支持

Custom configurations via EXTEND.md. See Preferences section for paths and supported options.
通过EXTEND.md实现自定义配置,路径和支持的选项见偏好配置章节。