authenticated-web-scraper
Authenticated Web Scraper
Purpose
Scrapes content from websites that require authentication (2FA, SSO, corporate login) by leveraging the user's Windows Edge browser via Chrome DevTools Protocol (CDP). Designed for WSL2 environments where Playwright/Puppeteer can't directly reach Windows browser ports.
When to Use
- Mirroring internal documentation sites behind corporate auth
- Scraping content from sites requiring 2FA/SSO that can't be automated
- Extracting structured content (text, HTML, links) from authenticated web pages
- Crawling site navigation trees and following links to a configurable depth
Architecture
```
WSL2                                Windows
┌─────────────────┐                ┌────────────────────────┐
│ Claude Code     │                │ Edge Browser           │
│                 │     kill       │ (user's profile)       │
│ 1. Kill Edge ───┼───────────────>│                        │
│                 │     launch     │                        │
│ 2. Launch Edge ─┼───────────────>│ --remote-debug:9222    │
│                 │                │ --debug-addr:0.0.0.0   │
│ [User auths     │                │                        │
│  in browser]    │                │ CDP WebSocket on :9222 │
│                 │     cmd.exe    │                        │
│ 3. Run scraper ─┼───────────────>│ node scraper.mjs       │
│                 │                │ connects localhost:9222│
│ 4. Read output <┼────────────────│ writes to C:\Temp\...  │
└─────────────────┘                └────────────────────────┘
```

Key insight: WSL2 cannot reach Windows `localhost:9222` directly. The scraper script must run on the Windows side via `cmd.exe /c "node script.mjs"`.
Quick Start
When a user asks to scrape an authenticated website:
- Kill existing Edge processes and relaunch with debug flags
- User authenticates in the headed browser
- Copy scraper script to Windows temp and run via `cmd.exe`
- Script connects to CDP, navigates pages, extracts content
- Read results from shared filesystem (`/mnt/c/Temp/...`)
Core Workflow
Phase 0: Prerequisites
- Node.js must be installed on Windows (`cmd.exe /c "where node"`)
- The `ws` npm package installed on the Windows side (`cmd.exe /c "cd C:\Temp && npm install ws"`)
- Edge browser installed (check `/mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe`)
Phase 1: Launch Edge with Remote Debugging
```javascript
import { execSync, spawn } from "child_process";

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// CRITICAL: Kill ALL Edge processes first, otherwise debug flags are ignored
try {
  execSync('cmd.exe /c "taskkill /F /IM msedge.exe /T"');
} catch {
  // taskkill exits non-zero when no Edge process exists; safe to ignore
}
await sleep(3000);

const EDGE = "/mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe";
// targetUrl: the site to scrape, supplied by the caller
spawn(
  EDGE,
  [
    "--remote-debugging-port=9222",
    "--remote-debugging-address=0.0.0.0",
    "--remote-allow-origins=*",
    targetUrl,
  ],
  { detached: true, stdio: "ignore" }
).unref();
```
Phase 2: Verify CDP and User Auth
Verify CDP is running (must query from Windows side):

```bash
powershell.exe -Command "Invoke-RestMethod -Uri http://localhost:9222/json/version"
```

Tell user to authenticate, then confirm they can see content.

Phase 3: Scrape via CDP
Write a Node.js script that:
- Queries `http://localhost:9222/json/list` for open pages
- Connects to the target page via WebSocket (`ws` package)
- Uses `Runtime.evaluate` to extract DOM content
- Uses `Page.navigate` + `Page.enable` for crawling
- Saves `.txt` (clean text), `.html` (full), `_links.json` per page
Run on Windows side:

```bash
cp script.mjs /mnt/c/Temp/scraper.mjs
cmd.exe /c "cd C:\Temp && node scraper.mjs C:\Temp\output" 2>&1
```
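Inside the scraper, picking the right entry out of the `/json/list` response is worth isolating into a pure function. A sketch; the helper name `pickTarget` and the example URLs are illustrative, not part of this skill:

```javascript
// Pick the CDP target for the page being scraped.
// `targets` is the parsed JSON array from http://localhost:9222/json/list;
// `urlPrefix` restricts the match to the authenticated site.
function pickTarget(targets, urlPrefix) {
  return (
    targets.find((t) => t.type === "page" && t.url.startsWith(urlPrefix)) ??
    null
  );
}

// Usage sketch (Node 18+ global fetch, run on the Windows side):
//   const targets = await (await fetch("http://localhost:9222/json/list")).json();
//   const page = pickTarget(targets, "https://docs.example.internal/");
//   // then open a WebSocket to page.webSocketDebuggerUrl
```

Filtering on `type === "page"` matters because the list also includes extensions, service workers, and background pages.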
Phase 4: Crawl Navigation
- Extract sidebar/nav links from the initial page
- Filter to same-domain pages (skip anchor links)
- Visit each nav page, extract content + links
- Follow discovered links one level deep (deduplicating)
- Write summary JSON with page inventory
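The crawl steps above might be sketched as follows; `visitPage` is a hypothetical callback that navigates via CDP, saves the page's content, and returns the hrefs it found:

```javascript
// One-level crawl with same-domain filtering, anchor skipping, and dedup.
// `visitPage(url)` (hypothetical) navigates, saves content, returns string[] of hrefs.
async function crawl(startUrl, visitPage, { maxDepth = 1, delayMs = 2000 } = {}) {
  const origin = new URL(startUrl).origin;
  const visited = new Set();
  const queue = [{ url: startUrl, depth: 0 }];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue; // dedup
    visited.add(url);
    const links = await visitPage(url);
    if (depth < maxDepth) {
      for (const href of links) {
        let u;
        try { u = new URL(href, url); } catch { continue; } // skip unparseable hrefs
        if (u.origin !== origin || u.hash) continue; // same domain only; skip anchors
        if (!visited.has(u.href)) queue.push({ url: u.href, depth: depth + 1 });
      }
    }
    await new Promise((r) => setTimeout(r, delayMs)); // respectful delay between loads
  }
  return [...visited]; // page inventory for the summary JSON
}
```

The returned inventory is what the summary JSON would be built from.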
CDP Command Reference
```javascript
// Navigate to a page
await cdpSend(ws, "Page.navigate", { url });

// Extract text content
await cdpSend(ws, "Runtime.evaluate", {
  expression: 'document.querySelector("main").innerText',
  returnByValue: true,
});

// Extract links as JSON
await cdpSend(ws, "Runtime.evaluate", {
  expression:
    'JSON.stringify([...document.querySelectorAll("a[href]")].map(a => ({href: a.href, text: a.textContent.trim()})))',
  returnByValue: true,
});

// Get full HTML
await cdpSend(ws, "Runtime.evaluate", {
  expression: "document.documentElement.outerHTML",
  returnByValue: true,
});
```
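Note that `Page.navigate` returns before the page finishes loading. With `Page.enable` active, CDP emits a `Page.loadEventFired` event when the load completes; a sketch of a wait helper, assuming the same `ws` event interface as the code above:

```javascript
// Resolve once the page fires its load event; reject on timeout.
// Requires Page.enable to have been sent first so Page.* events are delivered.
function waitForLoad(ws, timeoutMs = 15000) {
  return new Promise((resolve, reject) => {
    const onMessage = (data) => {
      const msg = JSON.parse(data);
      if (msg.method === "Page.loadEventFired") {
        clearTimeout(timer);
        ws.off("message", onMessage);
        resolve(msg.params);
      }
    };
    const timer = setTimeout(() => {
      ws.off("message", onMessage);
      reject(new Error(`Page load timed out after ${timeoutMs}ms`));
    }, timeoutMs);
    ws.on("message", onMessage);
  });
}
```

A typical pattern is `await Promise.all([waitForLoad(ws), cdpSend(ws, "Page.navigate", { url })])` so the listener is registered before navigation starts.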
Critical Details
- Must kill Edge first: If Edge is already running, new instances join the existing process and ignore `--remote-debugging-port`
- WSL2 networking: WSL2 has its own network stack; `127.0.0.1` in WSL does NOT reach Windows. Scripts must run on Windows via `cmd.exe`
- Respectful crawling: Add 2-second delays between page loads
- Auth persistence: Edge uses the user's default profile with saved sessions
- Output path: Use Windows paths (`C:\Temp\...`) in scripts, read via `/mnt/c/Temp/...` from WSL
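The dual-path convention in the last bullet can be captured in a tiny helper; `winToWsl` is a hypothetical name, not part of this skill:

```javascript
// Translate an absolute Windows path (C:\Temp\out) to its WSL mount (/mnt/c/Temp/out).
// Assumes the default WSL automount root of /mnt.
function winToWsl(winPath) {
  const match = /^([A-Za-z]):[\\/](.*)$/.exec(winPath);
  if (!match) throw new Error(`Not an absolute Windows path: ${winPath}`);
  const [, drive, rest] = match;
  return `/mnt/${drive.toLowerCase()}/${rest.replace(/\\/g, "/")}`;
}
```

This lets one script pass `C:\Temp\output` to the Windows-side scraper and read the same directory from WSL without hand-editing paths.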
Integration Points
- Works with any documentation site behind corporate auth (SSO, SAML, FIDO2, etc.)
- Output can be fed to other skills for analysis, summarization, or knowledge base building
- Pairs well with the `investigation-workflow` and `knowledge-builder` skills