web-to-markdown
Web to Markdown
Overview
Captures web pages with headless Playwright browser automation (which handles JavaScript-rendered content), converts the HTML to clean markdown via the Turndown library, and saves all URLs from a single request into one timestamped file.
Key features:
- Headless browser (no visible window)
- Handles JavaScript-rendered content (SPAs, React, Vue, etc.)
- Batch processing multiple URLs → single file
- Self-contained (no MCP dependency)
Prerequisites
- Node.js/pnpm for running TypeScript scripts
- Dependencies installed: `cd skills/web-to-markdown && pnpm install`
- Playwright browsers will auto-install on first run
Usage
When the user provides URLs and asks to:
- "capture these pages as markdown"
- "save web content for documentation"
- "fetch and convert these webpages"
- "get markdown from these sites"
- "download and convert to markdown"
Output: `docs/web-captures/YYYYMMDD_HHMMSS.md`, containing all pages.
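The `YYYYMMDD_HHMMSS` file name can be derived from a `Date` with a small pure function. The sketch below is illustrative only: `formatTimestamp` and `buildOutputPath` are hypothetical names, and the use of local time (rather than UTC) is an assumption, since the source does not say which the script uses.

```typescript
// Zero-pad a number to two digits, e.g. 3 -> "03".
const pad2 = (n: number): string => String(n).padStart(2, "0");

// Format a Date as YYYYMMDD_HHMMSS.
// Assumption: local time is used; the real script may use UTC instead.
function formatTimestamp(date: Date): string {
  const ymd = `${date.getFullYear()}${pad2(date.getMonth() + 1)}${pad2(date.getDate())}`;
  const hms = `${pad2(date.getHours())}${pad2(date.getMinutes())}${pad2(date.getSeconds())}`;
  return `${ymd}_${hms}`;
}

// Build the output path for one capture run (hypothetical helper).
function buildOutputPath(outputDir: string, date: Date): string {
  return `${outputDir}/${formatTimestamp(date)}.md`;
}
```

For example, `buildOutputPath("docs/web-captures", new Date(2025, 10, 3, 14, 30, 52))` yields `docs/web-captures/20251103_143052.md` (JavaScript months are zero-based, so `10` is November).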
Workflow
Single Command
```bash
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts <url1> [url2] [url3] ...
```
That's it! The script handles:
- Creates timestamped output file with header
- Launches headless Playwright browser
- Scrapes each URL sequentially
- Converts HTML → markdown (Turndown)
- Appends to output file with formatted headers
- Closes browser and reports summary
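The scrape-convert-continue loop in the steps above can be sketched with the browser and converter injected as plain functions. This is not the script's actual code: `PageResult`, `processUrls`, and `summarize` are hypothetical names. In the real script, `fetchHtml` would presumably wrap a Playwright page load and `toMarkdown` would wrap Turndown.

```typescript
// Hypothetical result shape for illustration; the real script's types may differ.
type PageResult =
  | { url: string; ok: true; markdown: string }
  | { url: string; ok: false; error: string };

type FetchHtml = (url: string) => Promise<string>;
type ToMarkdown = (html: string) => string;

// Process URLs one at a time. A failure on one URL is recorded
// and the loop moves on to the next URL instead of aborting.
async function processUrls(
  urls: readonly string[],
  fetchHtml: FetchHtml,
  toMarkdown: ToMarkdown,
): Promise<PageResult[]> {
  const results: PageResult[] = [];
  for (const url of urls) {
    try {
      const html = await fetchHtml(url);
      results.push({ url, ok: true, markdown: toMarkdown(html) });
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      results.push({ url, ok: false, error: message });
    }
  }
  return results;
}

// Produce a one-line summary of how many pages were captured.
function summarize(results: readonly PageResult[]): string {
  const succeeded = results.filter((r) => r.ok).length;
  return `${succeeded}/${results.length} pages captured`;
}
```

Injecting the two functions keeps the loop itself free of side effects that are hard to test, which fits the "side effects isolated at edges" approach the skill describes.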
Examples
**Single URL:**
```bash
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts https://example.com/docs
```
Output: `docs/web-captures/20251103_143052.md`
**Multiple URLs (Batch):**
```bash
cd skills/web-to-markdown
pnpm tsx scripts/scrape-and-convert.ts \
  https://example.com/guide \
  https://example.com/api \
  https://example.com/faq
```
Output: `docs/web-captures/20251103_143052.md` (all 3 pages)
**From project root:**
```bash
pnpm --filter @skills/web-to-markdown tsx scripts/scrape-and-convert.ts <urls...>
```
Output Format
```markdown
Web Captures - YYYY-MM-DD HH:MM:SS
Generated: YYYYMMDD_HHMMSS
URLs: N
[Converted markdown content...]
[Converted markdown content...]
```
Implementation Notes
TypeScript + FP Patterns:
- Pure functions (no classes except custom errors)
- Explicit error handling with typed errors: `BrowserError`, `FileError`, `HtmlConversionError`
- Small, focused functions
- Side effects isolated at edges
- CLI-style logging with chalk/ora
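As a rough illustration of the typed-error pattern listed above, the three error names could be modelled as small `Error` subclasses combined with a result type. Only the class names come from the source. The `kind` field, the `Result` type, and `safeConvert` are assumptions made for this sketch and may not match the real implementation.

```typescript
// Custom errors are the only classes; each carries a literal `kind` tag
// so that TypeScript can tell them apart in a union.
class BrowserError extends Error {
  readonly kind = "BrowserError";
  constructor(message: string) {
    super(message);
    this.name = "BrowserError";
  }
}

class FileError extends Error {
  readonly kind = "FileError";
  constructor(message: string) {
    super(message);
    this.name = "FileError";
  }
}

class HtmlConversionError extends Error {
  readonly kind = "HtmlConversionError";
  constructor(message: string) {
    super(message);
    this.name = "HtmlConversionError";
  }
}

type ScrapeError = BrowserError | FileError | HtmlConversionError;

// Hypothetical result type: callers get a value or a typed error, never an exception.
type Result<T> = { ok: true; value: T } | { ok: false; error: ScrapeError };

// Wrap a converter that may throw so the failure becomes an HtmlConversionError value.
function safeConvert(convert: (html: string) => string, html: string): Result<string> {
  try {
    return { ok: true, value: convert(html) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: new HtmlConversionError(message) };
  }
}
```

Returning errors as values lets the calling code decide per URL whether to log and continue or to stop, without wrapping every call site in `try`/`catch`.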
File Structure:
```
skills/web-to-markdown/
├── SKILL.md                      # This file (workflow instructions)
├── package.json                  # pnpm workspace config
├── tsconfig.json                 # TypeScript config
├── scripts/
│   ├── scrape-and-convert.ts     # Main CLI (Playwright + Turndown)
│   ├── html-to-markdown.ts       # Pure conversion function (Turndown wrapper)
│   └── convert-and-append.ts     # Legacy CLI (deprecated, kept for reference)
└── tests/
    └── html-to-markdown.test.ts  # Unit tests
```
Error Handling
- Browser launch failure: Check Playwright installation, run `pnpm exec playwright install chromium`
- Page not found (404): Logs error, continues with other URLs
- Timeout (>30s): Reports slow page, continues to next URL
- Navigation error: Logs error, continues to next URL
- Conversion failure: Reports malformed HTML, skips page
Configuration
Default timeout: 30 seconds per page
To customize, edit `DEFAULT_CONFIG` in `scripts/scrape-and-convert.ts`:
```typescript
const DEFAULT_CONFIG = {
  outputDir: 'docs/web-captures',
  timeout: 30000, // milliseconds
};
```
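The source only describes editing the constant directly. If per-run overrides were ever wanted, one possible approach is a non-mutating merge over the defaults, sketched below. The `ScrapeConfig` interface and `withOverrides` helper are hypothetical and are not part of the script.

```typescript
// Shape of the configuration shown above (interface name is an assumption).
interface ScrapeConfig {
  outputDir: string;
  timeout: number; // milliseconds
}

const DEFAULT_CONFIG: ScrapeConfig = {
  outputDir: 'docs/web-captures',
  timeout: 30000,
};

// Return a new config with the given fields replaced; DEFAULT_CONFIG is left untouched.
function withOverrides(overrides: Partial<ScrapeConfig>): ScrapeConfig {
  const merged: ScrapeConfig = { ...DEFAULT_CONFIG, ...overrides };
  // Reject non-positive or non-finite timeouts early rather than at page-load time.
  if (!Number.isFinite(merged.timeout) || merged.timeout <= 0) {
    throw new RangeError(`timeout must be a positive number of milliseconds, got ${merged.timeout}`);
  }
  return merged;
}
```

For a slow site, `withOverrides({ timeout: 60000 })` would keep the default output directory and double the per-page timeout.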
Performance
- Headless mode (no GUI overhead)
- Sequential processing (one URL at a time for stability)
- Browser reuse across URLs (faster than launching per-page)
- ~2-5 seconds per page (depends on site complexity)
Comparison with Alternatives
| Feature | web-to-markdown | scratchpad-fetch | Jina AI Reader |
|---|---|---|---|
| Transport | Playwright (headless) | curl (HTTP) | Cloud API |
| JavaScript | ✅ Full rendering | ❌ No | ✅ Server-side |
| Conversion | ✅ Turndown | ❌ Raw HTML | ✅ LLM-powered |
| Self-hosted | ✅ Yes | ✅ Yes | ❌ Cloud only |
| Setup | pnpm install | None | API key |
| Speed | Medium (2-5s/page) | Fast (<1s) | Fast (~2s) |
| Visible browser | ❌ No (headless) | N/A | N/A |
Troubleshooting
"Executable doesn't exist" error:
```bash
cd skills/web-to-markdown
pnpm exec playwright install chromium
```
Pages timing out:
- Increase timeout in `DEFAULT_CONFIG`
- Check network connectivity
- Some sites may block automated browsers (use Jina AI Reader as an alternative)
Empty markdown output:
- Site may use heavy client-side rendering
- Try waiting longer (increase timeout)
- Check if site blocks headless browsers (User-Agent detection)
Notes
- One request = one file (all URLs aggregated)
- Handles JavaScript-rendered content (React, Vue, Angular, etc.)
- Headless by default (no visible browser window)
- Browser auto-installs on first run
- Ideal for documentation scraping and archival
- No external API dependencies