authenticated-web-scraper
Authenticated Web Scraper
Purpose
Scrapes content from websites that require authentication (2FA, SSO, corporate login) by leveraging the user's Windows Edge browser via Chrome DevTools Protocol (CDP). Designed for WSL2 environments where Playwright/Puppeteer can't directly reach Windows browser ports.
When to Use
- Mirroring internal documentation sites behind corporate auth
- Scraping content from sites requiring 2FA/SSO that can't be automated
- Extracting structured content (text, HTML, links) from authenticated web pages
- Crawling site navigation trees and following links to a configurable depth
Architecture
```
WSL2                                Windows
┌─────────────────┐                ┌────────────────────────┐
│ Claude Code     │                │ Edge Browser           │
│                 │     kill       │ (user's profile)       │
│ 1. Kill Edge ───┼───────────────>│                        │
│                 │     launch     │                        │
│ 2. Launch Edge ─┼───────────────>│ --remote-debug:9222    │
│                 │                │ --debug-addr:0.0.0.0   │
│ [User auths     │                │                        │
│  in browser]    │                │ CDP WebSocket on :9222 │
│                 │     cmd.exe    │                        │
│ 3. Run scraper ─┼───────────────>│ node scraper.mjs       │
│                 │                │ connects localhost:9222│
│ 4. Read output <┼────────────────│ writes to C:\Temp\...  │
└─────────────────┘                └────────────────────────┘
```

Key insight: WSL2 cannot reach Windows `localhost:9222` directly. The scraper script must run on the Windows side via `cmd.exe /c "node script.mjs"`.
Quick Start
When a user asks to scrape an authenticated website:
- Kill existing Edge processes and relaunch with debug flags
- User authenticates in the headed browser
- Copy scraper script to Windows temp and run via `cmd.exe`
- Script connects to CDP, navigates pages, extracts content
- Read results from shared filesystem (`/mnt/c/Temp/...`)
Core Workflow
Phase 0: Prerequisites
- Node.js must be installed on Windows (`cmd.exe /c "where node"`)
- The `ws` npm package installed on the Windows side (`cmd.exe /c "cd C:\Temp && npm install ws"`)
- Edge browser installed (check `/mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe`)
Phase 1: Launch Edge with Remote Debugging
```javascript
import { execSync, spawn } from "child_process";

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// CRITICAL: Kill ALL Edge processes first, otherwise debug flags are ignored
try {
  execSync('cmd.exe /c "taskkill /F /IM msedge.exe /T"');
} catch {
  // taskkill exits non-zero when no Edge process exists; safe to ignore
}
await sleep(3000);

const EDGE = "/mnt/c/Program Files (x86)/Microsoft/Edge/Application/msedge.exe";
// targetUrl: the site to scrape, supplied by the caller
spawn(
  EDGE,
  [
    "--remote-debugging-port=9222",
    "--remote-debugging-address=0.0.0.0",
    "--remote-allow-origins=*",
    targetUrl,
  ],
  { detached: true, stdio: "ignore" }
).unref();
```
Phase 2: Verify CDP and User Auth
Verify CDP is running (must query from Windows side):

```bash
powershell.exe -Command "Invoke-RestMethod -Uri http://localhost:9222/json/version"
```

Tell user to authenticate, then confirm they can see content.

Phase 3: Scrape via CDP
Write a Node.js script that:
- Queries `http://localhost:9222/json/list` for open pages
- Connects to the target page via WebSocket (`ws` package)
- Uses `Runtime.evaluate` to extract DOM content
- Uses `Page.navigate` + `Page.enable` for crawling
- Saves `.txt` (clean text), `.html` (full), `_links.json` per page
Run on Windows side:

```bash
cp script.mjs /mnt/c/Temp/scraper.mjs
cmd.exe /c "cd C:\Temp && node scraper.mjs C:\Temp\output" 2>&1
```
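Inside the scraper, picking the right entry out of the `/json/list` response is worth isolating into a pure function. A sketch; the helper name `pickTarget` and the example URLs are illustrative, not part of this skill:

```javascript
// Pick the CDP target for the page being scraped.
// `targets` is the parsed JSON array from http://localhost:9222/json/list;
// `urlPrefix` restricts the match to the authenticated site.
function pickTarget(targets, urlPrefix) {
  return (
    targets.find((t) => t.type === "page" && t.url.startsWith(urlPrefix)) ??
    null
  );
}

// Usage sketch (Node 18+ global fetch, run on the Windows side):
//   const targets = await (await fetch("http://localhost:9222/json/list")).json();
//   const page = pickTarget(targets, "https://docs.example.internal/");
//   // then open a WebSocket to page.webSocketDebuggerUrl
```

Filtering on `type === "page"` matters because the list also includes extensions, service workers, and background pages.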
Phase 4: Crawl Navigation
- Extract sidebar/nav links from the initial page
- Filter to same-domain pages (skip anchor links)
- Visit each nav page, extract content + links
- Follow discovered links one level deep (deduplicating)
- Write summary JSON with page inventory
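The crawl steps above might be sketched as follows; `visitPage` is a hypothetical callback that navigates via CDP, saves the page's content, and returns the hrefs it found:

```javascript
// One-level crawl with same-domain filtering, anchor skipping, and dedup.
// `visitPage(url)` (hypothetical) navigates, saves content, returns string[] of hrefs.
async function crawl(startUrl, visitPage, { maxDepth = 1, delayMs = 2000 } = {}) {
  const origin = new URL(startUrl).origin;
  const visited = new Set();
  const queue = [{ url: startUrl, depth: 0 }];
  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue; // dedup
    visited.add(url);
    const links = await visitPage(url);
    if (depth < maxDepth) {
      for (const href of links) {
        let u;
        try { u = new URL(href, url); } catch { continue; } // skip unparseable hrefs
        if (u.origin !== origin || u.hash) continue; // same domain only; skip anchors
        if (!visited.has(u.href)) queue.push({ url: u.href, depth: depth + 1 });
      }
    }
    await new Promise((r) => setTimeout(r, delayMs)); // respectful delay between loads
  }
  return [...visited]; // page inventory for the summary JSON
}
```

The returned inventory is what the summary JSON would be built from.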
CDP Command Reference
```javascript
// Navigate to a page
await cdpSend(ws, "Page.navigate", { url });

// Extract text content
await cdpSend(ws, "Runtime.evaluate", {
  expression: 'document.querySelector("main").innerText',
  returnByValue: true,
});

// Extract links as JSON
await cdpSend(ws, "Runtime.evaluate", {
  expression:
    'JSON.stringify([...document.querySelectorAll("a[href]")].map(a => ({href: a.href, text: a.textContent.trim()})))',
  returnByValue: true,
});

// Get full HTML
await cdpSend(ws, "Runtime.evaluate", {
  expression: "document.documentElement.outerHTML",
  returnByValue: true,
});
```
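Note that `Page.navigate` returns before the page finishes loading. With `Page.enable` active, CDP emits a `Page.loadEventFired` event when the load completes; a sketch of a wait helper, assuming the same `ws` event interface as the code above:

```javascript
// Resolve once the page fires its load event; reject on timeout.
// Requires Page.enable to have been sent first so Page.* events are delivered.
function waitForLoad(ws, timeoutMs = 15000) {
  return new Promise((resolve, reject) => {
    const onMessage = (data) => {
      const msg = JSON.parse(data);
      if (msg.method === "Page.loadEventFired") {
        clearTimeout(timer);
        ws.off("message", onMessage);
        resolve(msg.params);
      }
    };
    const timer = setTimeout(() => {
      ws.off("message", onMessage);
      reject(new Error(`Page load timed out after ${timeoutMs}ms`));
    }, timeoutMs);
    ws.on("message", onMessage);
  });
}
```

A typical pattern is `await Promise.all([waitForLoad(ws), cdpSend(ws, "Page.navigate", { url })])` so the listener is registered before navigation starts.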
Critical Details
- Must kill Edge first: If Edge is already running, new instances join the existing process and ignore `--remote-debugging-port`
- WSL2 networking: WSL2 has its own network stack; `127.0.0.1` in WSL does NOT reach Windows. Scripts must run on Windows via `cmd.exe`
- Respectful crawling: Add 2-second delays between page loads
- Auth persistence: Edge uses the user's default profile with saved sessions
- Output path: Use Windows paths (`C:\Temp\...`) in scripts, read via `/mnt/c/Temp/...` from WSL
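The dual-path convention in the last bullet can be captured in a tiny helper; `winToWsl` is a hypothetical name, not part of this skill:

```javascript
// Translate an absolute Windows path (C:\Temp\out) to its WSL mount (/mnt/c/Temp/out).
// Assumes the default WSL automount root of /mnt.
function winToWsl(winPath) {
  const match = /^([A-Za-z]):[\\/](.*)$/.exec(winPath);
  if (!match) throw new Error(`Not an absolute Windows path: ${winPath}`);
  const [, drive, rest] = match;
  return `/mnt/${drive.toLowerCase()}/${rest.replace(/\\/g, "/")}`;
}
```

This lets one script pass `C:\Temp\output` to the Windows-side scraper and read the same directory from WSL without hand-editing paths.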
Integration Points
- Works with any documentation site behind corporate auth (SSO, SAML, FIDO2, etc.)
- Output can be fed to other skills for analysis, summarization, or knowledge base building
- Pairs well with the `investigation-workflow` and `knowledge-builder` skills