web-to-markdown

Web to Markdown

Overview

Captures web pages with a headless Playwright browser (so JavaScript-rendered content is handled), converts the HTML to clean Markdown with the Turndown library, and saves all URLs from a single request into one timestamped file.
Key features:
  • Headless browser (no visible window)
  • Handles JavaScript-rendered content (SPAs, React, Vue, etc.)
  • Batch processing multiple URLs → single file
  • Self-contained (no MCP dependency)

Prerequisites

  • Node.js/pnpm for running TypeScript scripts
  • Dependencies installed:
    cd skills/web-to-markdown && pnpm install
  • Playwright browsers will auto-install on first run

Usage

Use this skill when the user provides URLs and asks to:
  • "capture these pages as markdown"
  • "save web content for documentation"
  • "fetch and convert these webpages"
  • "get markdown from these sites"
  • "download and convert to markdown"
Output: `docs/web-captures/YYYYMMDD_HHMMSS.md`, containing all pages.
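
The timestamped file name can be derived with a small pure function. The sketch below is illustrative only: `formatTimestamp` and `buildOutputPath` are hypothetical names, and the use of local time is an assumption, since the script's actual choice of time zone is not stated here.

```typescript
// Pads a number to two digits, e.g. 3 -> "03".
const pad2 = (n: number): string => (n < 10 ? "0" + n : String(n));

// Formats a Date as YYYYMMDD_HHMMSS. Local time is an assumption.
const formatTimestamp = (date: Date): string => {
  const day = `${date.getFullYear()}${pad2(date.getMonth() + 1)}${pad2(date.getDate())}`;
  const time = `${pad2(date.getHours())}${pad2(date.getMinutes())}${pad2(date.getSeconds())}`;
  return `${day}_${time}`;
};

// Hypothetical helper: joins the output directory and the timestamped file name.
const buildOutputPath = (outputDir: string, date: Date): string =>
  `${outputDir}/${formatTimestamp(date)}.md`;
```

For example, `buildOutputPath("docs/web-captures", new Date(2025, 10, 3, 14, 30, 52))` returns `docs/web-captures/20251103_143052.md`.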

Workflow

Single Command

```bash
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts <url1> [url2] [url3] ...
```
That's it! The script:
  1. Creates timestamped output file with header
  2. Launches headless Playwright browser
  3. Scrapes each URL sequentially
  4. Converts HTML → markdown (Turndown)
  5. Appends to output file with formatted headers
  6. Closes browser and reports summary
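
The control flow of these steps can be sketched without a real browser by injecting the scraping and conversion functions. This is not the script's actual code: `captureAll`, `fetchHtml`, and `toMarkdown` are hypothetical names, and the `## <url>` section header and `---` separator are placeholders for whatever header format the script really writes.

```typescript
// Dependencies are injected so the loop can be read and tested in isolation.
type CaptureDeps = {
  fetchHtml: (url: string) => Promise<string>; // e.g. load the page in a browser, return its HTML
  toMarkdown: (html: string) => string;        // e.g. a Turndown wrapper
};

type CaptureSummary = {
  document: string;
  succeeded: string[];
  failed: { url: string; reason: string }[];
};

const captureAll = async (urls: string[], deps: CaptureDeps): Promise<CaptureSummary> => {
  const sections: string[] = [];
  const succeeded: string[] = [];
  const failed: { url: string; reason: string }[] = [];

  // One URL at a time, in the order given.
  for (const url of urls) {
    try {
      const html = await deps.fetchHtml(url);
      const markdown = deps.toMarkdown(html);
      sections.push(`## ${url}\n\n${markdown}`); // placeholder header format
      succeeded.push(url);
    } catch (error) {
      // A failed page is recorded and the loop moves on to the next URL.
      failed.push({ url, reason: error instanceof Error ? error.message : String(error) });
    }
  }

  return { document: sections.join("\n\n---\n\n"), succeeded, failed };
};
```

Keeping the loop free of direct browser calls means the sequencing and the continue-on-failure behaviour can be checked with stub functions.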

Examples

**Single URL:**
```bash
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts https://example.com/docs
```

Output: docs/web-captures/20251103_143052.md


**Multiple URLs (Batch):**
```bash
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts \
  https://example.com/guide \
  https://example.com/api \
  https://example.com/faq
```

Output: docs/web-captures/20251103_143052.md (all 3 pages)


**From project root:**
```bash
pnpm --filter @skills/web-to-markdown tsx scripts/scrape-and-convert.ts <urls...>
```

Output Format

```markdown
Web Captures - YYYY-MM-DD HH:MM:SS

Generated: YYYYMMDD_HHMMSS
URLs: N

[Converted markdown content...]

[Converted markdown content...]
```

Implementation Notes

TypeScript + FP Patterns:
  • Pure functions (no classes except custom errors)
  • Explicit error handling with typed errors: `BrowserError`, `FileError`, `HtmlConversionError`
  • Small, focused functions
  • Side effects isolated at edges
  • CLI-style logging with chalk/ora
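
As an illustration of the typed-error style, a conversion step can return an explicit result instead of throwing. The sketch below is an assumption about how such a pattern might look, not the code in `html-to-markdown.ts`: the `Result` type and `safeConvert` are hypothetical, and only the error class name comes from the list above.

```typescript
// Custom error classes are the only classes; everything else is plain functions.
class HtmlConversionError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "HtmlConversionError";
  }
}

// A Result makes success and failure explicit in the return type.
type Result<T, E extends Error> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// Wraps a converter that may throw and returns a Result instead.
const safeConvert = (
  convert: (html: string) => string,
  html: string
): Result<string, HtmlConversionError> => {
  try {
    return { ok: true, value: convert(html) };
  } catch (cause) {
    const detail = cause instanceof Error ? cause.message : String(cause);
    return { ok: false, error: new HtmlConversionError(`Conversion failed: ${detail}`) };
  }
};
```

Callers check `result.ok` before touching `result.value`, so a failed conversion cannot be silently ignored.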
File Structure:
skills/web-to-markdown/
├── SKILL.md                    # This file (workflow instructions)
├── package.json                # pnpm workspace config
├── tsconfig.json               # TypeScript config
├── scripts/
│   ├── scrape-and-convert.ts   # Main CLI (Playwright + Turndown)
│   ├── html-to-markdown.ts     # Pure conversion function (Turndown wrapper)
│   └── convert-and-append.ts   # Legacy CLI (deprecated, kept for reference)
└── tests/
    └── html-to-markdown.test.ts # Unit tests

Error Handling

  • Browser launch failure: Check the Playwright installation, then run `pnpm exec playwright install chromium`
  • Page not found (404): Logs error, continues with other URLs
  • Timeout (>30s): Reports slow page, continues to next URL
  • Navigation error: Logs error, continues to next URL
  • Conversion failure: Reports malformed HTML, skips page
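
How the script enforces the per-page timeout is not shown here; Playwright navigation calls accept their own `timeout` option, and the script may simply rely on that. As a generic illustration of the idea, a promise can be bounded with a helper like the hypothetical `withTimeout` below, which rejects with a hypothetical `TimeoutError` so the caller can log the slow page and move on.

```typescript
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Timed out after ${ms} ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```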

Configuration

Default timeout: 30 seconds per page
To customize, edit `DEFAULT_CONFIG` in `scripts/scrape-and-convert.ts`:
```typescript
const DEFAULT_CONFIG = {
  outputDir: 'docs/web-captures',
  timeout: 30000, // milliseconds
};
```
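
If editing the constant in place is inconvenient, one possible alternative is to derive a per-run config from the defaults. The sketch below is an assumption, not something the script is known to provide: the `ScrapeConfig` type and the `withOverrides` helper are hypothetical, and only the two fields and their default values come from the snippet above.

```typescript
type ScrapeConfig = {
  outputDir: string;
  timeout: number; // milliseconds
};

const DEFAULT_CONFIG: ScrapeConfig = {
  outputDir: "docs/web-captures",
  timeout: 30000,
};

// Hypothetical helper: returns a new config with overrides applied,
// leaving DEFAULT_CONFIG itself untouched.
const withOverrides = (overrides: Partial<ScrapeConfig>): ScrapeConfig => {
  const merged: ScrapeConfig = { ...DEFAULT_CONFIG, ...overrides };
  if (!isFinite(merged.timeout) || merged.timeout <= 0) {
    throw new RangeError("timeout must be a positive number of milliseconds");
  }
  return merged;
};
```

For example, `withOverrides({ timeout: 60000 })` yields a 60-second timeout while keeping the default output directory.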

Performance

  • Headless mode (no GUI overhead)
  • Sequential processing (one URL at a time for stability)
  • Browser reuse across URLs (faster than launching per-page)
  • ~2-5 seconds per page (depends on site complexity)

Comparison with Alternatives

| Feature | web-to-markdown | scratchpad-fetch | Jina AI Reader |
| --- | --- | --- | --- |
| Transport | Playwright (headless) | curl (HTTP) | Cloud API |
| JavaScript | ✅ Full rendering | ❌ No | ✅ Server-side |
| Conversion | ✅ Turndown | ❌ Raw HTML | ✅ LLM-powered |
| Self-hosted | ✅ Yes | ✅ Yes | ❌ Cloud only |
| Setup | pnpm install | None | API key |
| Speed | Medium (2-5s/page) | Fast (<1s) | Fast (~2s) |
| Visible browser | ❌ No (headless) | N/A | N/A |

Troubleshooting

"Executable doesn't exist" error:
```bash
cd skills/web-to-markdown
pnpm exec playwright install chromium
```
Pages timing out:
  • Increase `timeout` in `DEFAULT_CONFIG`
  • Check network connectivity
  • Some sites block automated browsers; Jina AI Reader may work as an alternative
Empty markdown output:
  • Site may use heavy client-side rendering
  • Try waiting longer (increase `timeout`)
  • Check whether the site blocks headless browsers (User-Agent detection)

Notes

  • One request = one file (all URLs aggregated)
  • Handles JavaScript-rendered content (React, Vue, Angular, etc.)
  • Headless by default (no visible browser window)
  • Browser auto-installs on first run
  • Ideal for documentation scraping and archival
  • No external API dependencies