scrape-webpage
Scrape Webpage
Extract content, metadata, and images from a webpage for import/migration.
When to Use This Skill
Use this skill when:
- You are starting a page import and need to extract content from the source URL
- You need webpage analysis with local image downloads
- You want metadata extraction (Open Graph, JSON-LD, etc.)
Invoked by: page-import skill (Step 1)
Prerequisites
Before using this skill, ensure:
- ✅ Node.js is available
- ✅ npm playwright is installed (npm install playwright)
- ✅ Chromium browser is installed (npx playwright install chromium)
- ✅ Sharp image library is installed (cd .claude/skills/scrape-webpage/scripts && npm install)
Related Skills
- page-import - Orchestrator that invokes this skill
- identify-page-structure - Uses this skill's output (screenshot, HTML, metadata)
- generate-import-html - Uses image mapping and paths from this skill
Scraping Workflow
Step 1: Run Analysis Script
Command:
bash
node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work
What the script does:
- Sets up network interception to capture all images
- Loads page in headless Chromium
- Scrolls through entire page to trigger lazy-loaded images
- Downloads all images locally (converts WebP/AVIF/SVG to PNG)
- Captures full-page screenshot for visual reference
- Extracts metadata (title, description, Open Graph, JSON-LD, canonical)
- Fixes images in DOM (background-image→img, picture elements, srcset→src, relative→absolute, inline SVG→img)
- Extracts cleaned HTML (removes scripts/styles)
- Replaces image URLs in HTML with local paths (./images/...)
- Generates document paths (sanitized, lowercase, no .html extension)
- Saves complete analysis with image mapping to metadata.json
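To make the "sanitized, lowercase, no .html extension" step concrete, here is a rough sketch of how document paths could be derived from a source URL. The function names and the exact sanitizing rules (hyphens for unsafe characters, "index" for the site root) are assumptions for illustration. The authoritative logic lives in scripts/analyze-webpage.js and may differ.

```javascript
// Hypothetical sketch: clean one URL path segment.
function sanitizeSegment(segment) {
  return segment
    .toLowerCase()
    .replace(/\.html?$/, "")      // drop a trailing .html or .htm
    .replace(/[^a-z0-9]+/g, "-")  // assumption: unsafe characters become hyphens
    .replace(/^-+|-+$/g, "");     // trim leading and trailing hyphens
}

// Hypothetical sketch: build the `paths` object from a source URL.
function generateDocumentPaths(sourceUrl) {
  const { pathname } = new URL(sourceUrl);
  const segments = pathname.split("/").map(sanitizeSegment).filter(Boolean);

  // Assumption: the site root maps to a document named "index".
  if (segments.length === 0) segments.push("index");

  const relative = segments.join("/");
  return {
    documentPath: `/${relative}`,
    htmlFilePath: `${relative}.plain.html`,
    mdFilePath: `${relative}.md`,
    dirPath: segments.slice(0, -1).join("/"),
    filename: segments[segments.length - 1],
  };
}
```

Under these assumptions, a URL ending in /US/en/About.html yields a documentPath of /us/en/about and an htmlFilePath of us/en/about.plain.html.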
For a detailed explanation, see resources/web-page-analysis.md
Step 2: Verify Output
Output files:
- ./import-work/metadata.json - Complete analysis with paths and image mapping
- ./import-work/screenshot.png - Visual reference for layout comparison
- ./import-work/cleaned.html - Main content HTML with local image paths
- ./import-work/images/ - All downloaded images (WebP/AVIF/SVG converted to PNG)
Verify files exist:
bash
ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html
ls -lh ./import-work/images/ | head -5
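When the check needs to run from a script rather than by eye, the same verification can be sketched in Node. findMissingOutputs is a hypothetical helper, and treating a zero-byte file as missing is an added assumption, not something the skill itself specifies.

```javascript
const fs = require("fs");
const path = require("path");

// The four outputs listed under "Output files" above.
const EXPECTED_OUTPUTS = ["metadata.json", "screenshot.png", "cleaned.html", "images"];

// Hypothetical helper: returns the expected outputs that are absent
// (or are zero-byte files) under outputDir.
function findMissingOutputs(outputDir) {
  return EXPECTED_OUTPUTS.filter((name) => {
    const target = path.join(outputDir, name);
    if (!fs.existsSync(target)) return true;
    const stats = fs.statSync(target);
    // Assumption: an empty file means the run was interrupted.
    return stats.isFile() && stats.size === 0;
  });
}

// Possible usage:
//   const missing = findMissingOutputs("./import-work");
//   if (missing.length > 0) console.error("Missing outputs:", missing.join(", "));
```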
Step 3: Review Metadata JSON
Output JSON structure:
json
{
  "url": "https://example.com/page",
  "timestamp": "2025-01-12T10:30:00.000Z",
  "paths": {
    "documentPath": "/us/en/about",
    "htmlFilePath": "us/en/about.plain.html",
    "mdFilePath": "us/en/about.md",
    "dirPath": "us/en",
    "filename": "about"
  },
  "screenshot": "./import-work/screenshot.png",
  "html": {
    "filePath": "./import-work/cleaned.html",
    "size": 45230
  },
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "og:image": "https://example.com/image.jpg",
    "canonical": "https://example.com/page"
  },
  "images": {
    "count": 15,
    "mapping": {
      "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
      "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png"
    },
    "stats": {
      "total": 15,
      "converted": 3,
      "skipped": 12,
      "failed": 0
    }
  }
}
Key fields:
- paths.documentPath - Used for browser preview URL
- paths.htmlFilePath - Where to save final HTML file
- images.mapping - Original URLs → local paths
- metadata - Extracted page metadata
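As an illustration of how a downstream step might read these key fields, the sketch below takes an already-parsed metadata.json object. The helper names and the preview base URL (http://localhost:3000) are assumptions, not part of the skill. Substitute whatever your preview server actually uses.

```javascript
// Hypothetical helper: pull the key fields into one summary object.
// `previewBase` is an assumed local preview server address.
function summarizeMetadata(meta, previewBase = "http://localhost:3000") {
  return {
    previewUrl: new URL(meta.paths.documentPath, previewBase).href,
    saveTo: meta.paths.htmlFilePath,
    title: meta.metadata.title,
    imageCount: Object.keys(meta.images.mapping).length,
  };
}

// Hypothetical helper: look up the local path for an original image URL.
// Returns null when the URL is not in images.mapping.
function localImageFor(meta, originalUrl) {
  const mapping = meta.images.mapping;
  return Object.prototype.hasOwnProperty.call(mapping, originalUrl)
    ? mapping[originalUrl]
    : null;
}

// Possible usage:
//   const meta = JSON.parse(require("fs").readFileSync("./import-work/metadata.json", "utf8"));
//   console.log(summarizeMetadata(meta));
```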
Output
This skill provides:
- ✅ metadata.json with paths, metadata, image mapping
- ✅ screenshot.png for visual reference
- ✅ cleaned.html with local image references
- ✅ images/ folder with all downloaded images
Next step: Pass these outputs to the identify-page-structure skill
Troubleshooting
Browser not installed:
bash
npx playwright install chromium
Sharp not installed:
bash
cd .claude/skills/scrape-webpage/scripts && npm install
Image download failures:
- Check images.stats.failed count in metadata.json
- Some images may require authentication or be blocked by CORS
- Failed images will be noted but won't stop the scraping process
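The failed count can be checked programmatically once metadata.json is parsed. This is a minimal sketch assuming the images.stats shape shown in Step 3. checkImageDownloads is a hypothetical helper, and the message wording is illustrative only.

```javascript
// Hypothetical helper: turn images.stats into a pass/fail result.
function checkImageDownloads(meta) {
  const stats = (meta.images && meta.images.stats) || {};
  const total = stats.total || 0;
  const failed = stats.failed || 0;
  return {
    ok: failed === 0,
    message:
      failed === 0
        ? `All ${total} images downloaded`
        : `${failed} of ${total} images failed; check authentication or blocking on the source site`,
  };
}

// Possible usage:
//   const meta = JSON.parse(require("fs").readFileSync("./import-work/metadata.json", "utf8"));
//   const result = checkImageDownloads(meta);
//   if (!result.ok) console.warn(result.message);
```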
Lazy-loaded images not captured:
- Script scrolls through page to trigger lazy loading
- Some advanced lazy-loading may need customization in scripts/analyze-webpage.js
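One possible shape for such a customization is a slower, stepped scroll with a pause at each stop. The sketch below is not the code in scripts/analyze-webpage.js. It assumes `page` is a Playwright Page object, and the function names, step size, and pause length are illustrative defaults. It measures the page height once, so pages that keep growing while scrolling (infinite scroll) would need the height re-measured inside the loop.

```javascript
// Scroll offsets covering the page in fixed steps, ending at the bottom.
function scrollOffsets(pageHeight, stepPx) {
  const offsets = [];
  for (let y = 0; y < pageHeight; y += stepPx) offsets.push(y);
  offsets.push(pageHeight);
  return offsets;
}

// Hypothetical helper: `page` is expected to be a Playwright Page.
// stepPx and pauseMs are assumed defaults; tune them per site.
async function scrollForLazyImages(page, { stepPx = 600, pauseMs = 250 } = {}) {
  const pageHeight = await page.evaluate(
    () => document.documentElement.scrollHeight
  );
  for (const y of scrollOffsets(pageHeight, stepPx)) {
    await page.evaluate((top) => window.scrollTo(0, top), y);
    // Give lazy-loading observers time to request their images.
    await page.waitForTimeout(pauseMs);
  }
  // Return to the top so a later screenshot starts from the page head.
  await page.evaluate(() => window.scrollTo(0, 0));
}
```

Smaller steps and longer pauses trade speed for a better chance that every lazy-loaded image is requested before the page is captured.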