scrape-webpage

Scrape Webpage

Extract content, metadata, and images from a webpage for import/migration.

When to Use This Skill

Use this skill when:

You are starting a page import and need to extract content from the source URL
You need webpage analysis with local image downloads
You want metadata extraction (Open Graph, JSON-LD, etc.)

Invoked by: page-import skill (Step 1)

Prerequisites

Before using this skill, ensure:

✅ Node.js is available
✅ The playwright npm package is installed (`npm install playwright`)
✅ Chromium browser is installed (`npx playwright install chromium`)
✅ Sharp image library is installed (`cd .claude/skills/scrape-webpage/scripts && npm install`)

Related Skills

Scraping Workflow

Step 1: Run Analysis Script

Command:

```bash
node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work
```

What the script does:

Sets up network interception to capture all images
Loads page in headless Chromium
Scrolls through entire page to trigger lazy-loaded images
Downloads all images locally (converts WebP/AVIF/SVG to PNG)
Captures full-page screenshot for visual reference
Extracts metadata (title, description, Open Graph, JSON-LD, canonical)
Fixes images in DOM (background-image→img, picture elements, srcset→src, relative→absolute, inline SVG→img)
Extracts cleaned HTML (removes scripts/styles)
Replaces image URLs in HTML with local paths (./images/...)
Generates document paths (sanitized, lowercase, no .html extension)
Saves complete analysis with image mapping to metadata.json

For a detailed explanation, see resources/web-page-analysis.md.
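
The path-generation step in the list above (sanitized, lowercase, no .html extension) can be illustrated with a small sketch. This is not the code in analyze-webpage.js: the function name `toDocumentPath`, the rule of replacing runs of characters outside a-z and 0-9 with a hyphen, and the mapping of the site root to `/index` are all assumptions made for the example.

```javascript
// Hypothetical helper: illustrates the idea only, not the script's actual rules.

// Decode percent-escapes, falling back to the raw segment if they are malformed.
function safeDecode(segment) {
  try {
    return decodeURIComponent(segment);
  } catch {
    return segment;
  }
}

function toDocumentPath(pageUrl) {
  const { pathname } = new URL(pageUrl);
  const segments = pathname
    .split('/')
    .filter(Boolean)
    .map((segment) =>
      safeDecode(segment)
        .toLowerCase()
        .replace(/\.html?$/, '') // drop a trailing .html or .htm
        .replace(/[^a-z0-9]+/g, '-') // assumed sanitizing rule
        .replace(/^-+|-+$/g, '') // trim leading and trailing hyphens
    )
    .filter(Boolean);
  // Assumption: the site root maps to /index.
  return '/' + (segments.length ? segments.join('/') : 'index');
}
```

Under these assumed rules, `toDocumentPath('https://example.com/US/en/About.html')` yields `/us/en/about`, the same shape as `paths.documentPath` in metadata.json.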

Step 2: Verify Output

Output files:

`./import-work/metadata.json` - Complete analysis with paths and image mapping
`./import-work/screenshot.png` - Visual reference for layout comparison
`./import-work/cleaned.html` - Main content HTML with local image paths
`./import-work/images/` - All downloaded images (WebP/AVIF/SVG converted to PNG)

Verify files exist:

```bash
ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html
ls -lh ./import-work/images/ | head -5
```
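
When the check needs to run from a Node.js script rather than a shell, the same verification might look like the sketch below. The helper name `findMissingOutputs` is hypothetical; the file names are the ones listed above, and the directory is whatever was passed to `--output`.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Output names as listed in this step.
const EXPECTED_OUTPUTS = ['metadata.json', 'screenshot.png', 'cleaned.html', 'images'];

// Hypothetical helper: returns the expected outputs that are absent from outputDir.
function findMissingOutputs(outputDir) {
  return EXPECTED_OUTPUTS.filter(
    (name) => !fs.existsSync(path.join(outputDir, name))
  );
}
```

Calling `findMissingOutputs('./import-work')` returns an empty array when every expected file and the images folder are present.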

Step 3: Review Metadata JSON

Output JSON structure:

```json
{
  "url": "https://example.com/page",
  "timestamp": "2025-01-12T10:30:00.000Z",
  "paths": {
    "documentPath": "/us/en/about",
    "htmlFilePath": "us/en/about.plain.html",
    "mdFilePath": "us/en/about.md",
    "dirPath": "us/en",
    "filename": "about"
  },
  "screenshot": "./import-work/screenshot.png",
  "html": {
    "filePath": "./import-work/cleaned.html",
    "size": 45230
  },
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "og:image": "https://example.com/image.jpg",
    "canonical": "https://example.com/page"
  },
  "images": {
    "count": 15,
    "mapping": {
      "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
      "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png"
    },
    "stats": {
      "total": 15,
      "converted": 3,
      "skipped": 12,
      "failed": 0
    }
  }
}
```

Key fields:

`paths.documentPath` - Used for browser preview URL
`paths.htmlFilePath` - Where to save final HTML file
`images.mapping` - Original URLs → local paths
`metadata` - Extracted page metadata
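
As a rough illustration of how a later step might consume these key fields, the sketch below pulls them out of a parsed metadata.json object. The function name `summarizeAnalysis` and the default preview origin `http://localhost:3000` are assumptions for the example; the source does not say where the preview server runs.

```javascript
// Hypothetical helper: collects the key fields from a parsed metadata.json object.
// previewOrigin is an assumed local preview server address, not taken from the source.
function summarizeAnalysis(meta, previewOrigin = 'http://localhost:3000') {
  return {
    // paths.documentPath is resolved against the preview origin
    previewUrl: new URL(meta.paths.documentPath, previewOrigin).href,
    // where the final HTML file should be saved
    htmlFilePath: meta.paths.htmlFilePath,
    title: meta.metadata.title,
    // local paths of every image that has a mapping entry
    localImages: Object.values(meta.images.mapping),
  };
}
```

The input would typically come from `JSON.parse(fs.readFileSync('./import-work/metadata.json', 'utf8'))`.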

Output

This skill provides:

✅ metadata.json with paths, metadata, image mapping
✅ screenshot.png for visual reference
✅ cleaned.html with local image references
✅ images/ folder with all downloaded images

Next step: Pass these outputs to the identify-page-structure skill

Troubleshooting

Browser not installed:

```bash
npx playwright install chromium
```

Sharp not installed:

```bash
cd .claude/skills/scrape-webpage/scripts && npm install
```

Image download failures:

Check images.stats.failed count in metadata.json
Some images may require authentication or be blocked by CORS
Failed images will be noted but won't stop the scraping process
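
One way to follow up on failures is sketched below. It reads `images.stats.failed` and also lists `<img>` tags in cleaned.html that still point at a remote URL. The second part assumes that an image which could not be downloaded keeps its original absolute URL in the HTML, which the source does not confirm; the function name `reportImageFailures` is hypothetical.

```javascript
// Hypothetical helper: reports the failed count and any <img> src still pointing at http(s).
// Assumption: images that failed to download keep their original URL in cleaned.html.
function reportImageFailures(meta, cleanedHtml) {
  const remoteImgPattern = /<img\b[^>]*\bsrc=["'](https?:\/\/[^"']+)["']/gi;
  const remoteSrcs = [...cleanedHtml.matchAll(remoteImgPattern)].map(
    (match) => match[1]
  );
  return { failed: meta.images.stats.failed, remoteSrcs };
}
```

A non-empty `remoteSrcs` list gives concrete URLs to retry by hand or to check for authentication requirements.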

Lazy-loaded images not captured:

Script scrolls through page to trigger lazy loading
Some advanced lazy-loading may need customization in scripts/analyze-webpage.js
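
A customization for stubborn lazy loading might look like the sketch below, which scrolls in small steps, pauses after each one, and re-reads the page height so that content appended during scrolling is also covered. It is not the scroll logic in scripts/analyze-webpage.js; the function name `scrollThroughPage` and the default step, pause, and step-limit values are assumptions. `page` is expected to be a Playwright `Page`.

```javascript
// Hypothetical scroll routine for a Playwright Page; defaults are illustrative guesses.
async function scrollThroughPage(page, { stepPx = 800, pauseMs = 300, maxSteps = 200 } = {}) {
  let scrolled = 0;
  for (let step = 0; step < maxSteps; step += 1) {
    // Re-read the height on every step: lazy content can make the page grow.
    const height = await page.evaluate(() => document.documentElement.scrollHeight);
    if (scrolled >= height) break;
    await page.mouse.wheel(0, stepPx); // scroll down by stepPx pixels
    await page.waitForTimeout(pauseMs); // give lazy loaders time to fire
    scrolled += stepPx;
  }
  return scrolled;
}
```

The `maxSteps` cap keeps pages with infinite scroll from looping forever; a longer `pauseMs` trades speed for a better chance that slow image requests start before the next step.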
