linkedin-scraper

LinkedIn Scraper — Chrome Profile Web Scraping

Scrape LinkedIn profiles and search results using the user's authenticated Chrome browser session. No API keys needed — uses the browser tool with the Chrome profile relay.

Prerequisites

  • Chrome browser with active LinkedIn login
  • Browser relay connected (Chrome extension or openclaw browser profile)
  • DuckDB workspace for storing results (optional)

Core Workflow

1. Single Profile Scrape

browser → open LinkedIn profile URL
browser → snapshot (extract structured data)
→ Parse: name, headline, title, company, location, education, experience, connections, about
→ Return structured JSON or insert into DuckDB

2. Search + Bulk Scrape

browser → open LinkedIn search URL with filters
browser → snapshot (extract result cards)
→ Parse each result: name, title, company, profile URL
→ For each profile URL: open → snapshot → parse full profile
→ Batch insert into DuckDB
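
Search result links often carry tracking parameters, and the same person can show up on more than one results page. A minimal sketch of normalizing and de-duplicating profile URLs before the per-profile step, assuming the links are plain `https` URLs (the function names and the `trk` parameter in the usage note are illustrative, not part of the skill):

```python
from urllib.parse import urlsplit


def canonical_profile_url(url: str) -> str:
    """Drop the query string, fragment, and trailing slash from a profile URL."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return f"{parts.scheme}://{parts.netloc.lower()}{path}"


def unique_profile_urls(urls: list[str]) -> list[str]:
    """Return canonical URLs in first-seen order, skipping duplicates."""
    seen: set[str] = set()
    ordered: list[str] = []
    for url in urls:
        canonical = canonical_profile_url(url)
        if canonical not in seen:
            seen.add(canonical)
            ordered.append(canonical)
    return ordered
```

For example, `https://www.linkedin.com/in/janedoe/?trk=x` and `https://www.linkedin.com/in/janedoe` collapse to a single entry, so the profile is opened only once.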

3. Company Page Scrape

browser → open LinkedIn company page
→ Parse: company name, industry, size, description, specialties, employee count
→ Navigate to /people tab for employee list

Implementation Rules

Rate Limiting (CRITICAL)

  • Wait at least 3 seconds between page loads
  • Maximum 80 profiles per session (LinkedIn rate limits)
  • Randomize each delay within the 3-8 second range (avoid detection)
  • After every 20 profiles, take a 60-second break
  • If CAPTCHA or "unusual activity" detected, stop immediately and alert user
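
The pacing rules above can be expressed as a small planner that decides how long to pause after each profile. This is only a sketch under the stated numbers (80 per session, a 60-second break every 20 profiles, 3-8 second delays otherwise); the constant and function names are hypothetical, and the caller is assumed to do the actual sleeping.

```python
import random
from typing import Iterator, Optional

MAX_PROFILES_PER_SESSION = 80   # session cap from the rules above
BREAK_EVERY = 20                # take a long break after this many profiles
BREAK_SECONDS = 60.0
MIN_DELAY, MAX_DELAY = 3.0, 8.0


def planned_pauses(total: int, rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield the pause (in seconds) to take after each profile in a session."""
    rng = rng or random.Random()
    count = min(total, MAX_PROFILES_PER_SESSION)
    for index in range(1, count + 1):
        if index == count:
            yield 0.0                 # nothing left to wait for
        elif index % BREAK_EVERY == 0:
            yield BREAK_SECONDS       # long break every 20 profiles
        else:
            yield rng.uniform(MIN_DELAY, MAX_DELAY)
```

A caller would iterate over the pauses alongside its work queue and pass each value to `time.sleep`. Requests beyond the session cap are simply not scheduled.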

Stealth Patterns

  • Use natural scrolling (scroll down slowly, pause, scroll more)
  • Don't scrape the same search results page more than twice
  • Vary the order of profile visits (don't go sequentially)
  • Close and reopen tabs periodically

Data Extraction — Profile Page

From a LinkedIn profile snapshot, extract these fields:

| Field | Location | Notes |
|---|---|---|
| name | Main heading h1 | Full name |
| headline | Below name | Title + Company usually |
| location | Location section | City, State/Country |
| current_title | Experience section, first entry | Most recent role |
| current_company | Experience section, first entry | Company name |
| education | Education section | School, degree, dates |
| connections | Connections count | Number or "500+" |
| about | About section | Bio text (may need "see more" click) |
| experience | Experience section | All roles with dates |
| profile_url | Browser URL bar | Canonical LinkedIn URL |

Data Extraction — Search Results

From a LinkedIn search results page, extract these fields:

| Field | Location |
|---|---|
| name | Result card heading |
| headline | Below name in card |
| location | Card metadata |
| profile_url | Link href on name |
| mutual_connections | Card footer |

Search URL Patterns


People search

With filters

&geoUrn=%5B%22103644278%22%5D              # United States
&network=%5B%22F%22%2C%22S%22%5D           # 1st + 2nd connections
&currentCompany=%5B%22{company_id}%22%5D   # Current company
&schoolFilter=%5B%22{school_id}%22%5D      # School filter
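
Each filter value above is a compact JSON array of strings, percent-encoded. A short sketch of producing those fragments, assuming that encoding holds for every filter (the helper names are illustrative; the base search URL is not reproduced here):

```python
import json
from urllib.parse import quote


def encode_filter(values: list[str]) -> str:
    """Percent-encode a list of IDs as a compact JSON array."""
    compact = json.dumps(values, separators=(",", ":"))
    return quote(compact, safe="")


def filter_query(filters: dict[str, list[str]]) -> str:
    """Build the '&name=value' suffix for a set of search filters."""
    return "".join(
        f"&{name}={encode_filter(values)}" for name, values in filters.items()
    )
```

Calling `filter_query({"network": ["F", "S"]})` yields `&network=%5B%22F%22%2C%22S%22%5D`, matching the fragment listed above.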

YC founders (common query)

Company employees

DuckDB Integration

When storing to DuckDB, use the Ironclaw workspace database:

```sql
-- Check if leads/contacts object exists
SELECT * FROM objects WHERE name = 'leads' OR name = 'contacts';

-- Insert via the EAV pattern or direct pivot view
INSERT INTO v_leads ("Name", "Title", "Company", "LinkedIn URL", "Location", "Source")
VALUES (?, ?, ?, ?, ?, 'LinkedIn Scrape');
```

If no suitable object exists, create one:

```sql
-- Use Ironclaw's object creation pattern from the dench skill
```
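
The `INSERT` above takes five positional parameters. A minimal sketch of mapping a scraped record onto them, assuming the record uses the keys from the JSON output format and that they map to the `v_leads` columns in the order shown (the mapping itself is an assumption, not something the skill specifies):

```python
from typing import Any

# Same statement as above, kept as a constant so the placeholders stay in sync.
INSERT_LEAD_SQL = (
    'INSERT INTO v_leads ("Name", "Title", "Company", "LinkedIn URL", "Location", "Source") '
    "VALUES (?, ?, ?, ?, ?, 'LinkedIn Scrape')"
)


def lead_params(record: dict[str, Any]) -> tuple:
    """Order a scraped record's fields to match the INSERT placeholders."""
    return (
        record.get("name"),
        record.get("current_title"),
        record.get("current_company"),
        record.get("linkedin_url"),
        record.get("location"),
    )
```

Missing fields come through as `None`, which a database driver would normally bind as NULL. For a batch insert, a list of these tuples could be passed to the driver's bulk-execute call.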

Error Handling

| Error | Action |
|---|---|
| "Sign in" page | LinkedIn session expired; alert user to re-login in Chrome |
| CAPTCHA / Security check | Stop immediately, wait 30+ min, alert user |
| "Profile not found" | Skip, log URL as invalid |
| Rate limit (429) | Stop, wait 15 min, retry with longer delays |
| Empty snapshot | Page still loading; wait 3s and re-snapshot |
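
The error table can be read as a lookup from what the snapshot shows to what to do next. The sketch below is one way to encode it. The matching phrases are assumptions about what the page text contains, and a plain substring check such as `"sign in"` may need tightening in practice.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RELOGIN = "alert user to re-login in Chrome"
    STOP_AND_ALERT = "stop immediately, wait 30+ min, alert user"
    SKIP_INVALID = "skip, log URL as invalid"
    BACK_OFF = "stop, wait 15 min, retry with longer delays"
    RESNAPSHOT = "wait 3s and re-snapshot"


def classify(snapshot_text: str, status: Optional[int] = None) -> Optional[Action]:
    """Map a page snapshot (and optional HTTP status) to an action, or None if OK."""
    text = snapshot_text.strip().lower()
    if status == 429:
        return Action.BACK_OFF
    if not text:
        return Action.RESNAPSHOT          # page probably still loading
    if "captcha" in text or "security check" in text or "unusual activity" in text:
        return Action.STOP_AND_ALERT
    if "profile not found" in text:
        return Action.SKIP_INVALID
    if "sign in" in text:
        return Action.RELOGIN
    return None
```

The checks are ordered so that the stop conditions win over the softer ones when more than one phrase appears on the page.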

Output Formats

JSON (default)

```json
{
  "name": "Jane Doe",
  "headline": "CEO at Acme Corp",
  "current_title": "CEO",
  "current_company": "Acme Corp",
  "location": "San Francisco, CA",
  "linkedin_url": "https://www.linkedin.com/in/janedoe",
  "connections": "500+",
  "education": [{"school": "Stanford", "degree": "BS CS", "years": "2010-2014"}],
  "experience": [{"title": "CEO", "company": "Acme Corp", "duration": "2020-Present"}],
  "scraped_at": "2026-02-17T14:30:00Z"
}
```
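
The `scraped_at` field in the example is a UTC timestamp with a trailing `Z`. A small sketch of stamping a parsed profile in that format, assuming timezone-aware datetimes are used throughout (`with_timestamp` is an illustrative helper name):

```python
from datetime import datetime, timezone
from typing import Any, Optional


def with_timestamp(profile: dict[str, Any], now: Optional[datetime] = None) -> dict[str, Any]:
    """Return a copy of the profile with a UTC 'scraped_at' field added."""
    # Pass a timezone-aware datetime; a naive one is treated as local time.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {**profile, "scraped_at": moment.strftime("%Y-%m-%dT%H:%M:%SZ")}
```

The `now` argument exists mainly so the timestamp can be fixed in tests. In normal use it is left out and the current time is taken.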

Progress Reporting

For bulk scrapes, report progress:
Scraping: 15/50 profiles (30%) — Last: Jane Doe (Acme Corp)
Rate: ~4 profiles/min — ETA: 9 min remaining
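
The two progress lines follow directly from the counts and the observed rate. The sketch below reproduces the sample output, assuming the ETA is the remaining profiles divided by the rate, rounded up (the function name and signature are illustrative):

```python
import math

DASH = "\u2014"  # the separator used in the sample report lines


def progress_lines(done: int, total: int, last_name: str,
                   last_company: str, rate_per_min: float) -> tuple[str, str]:
    """Format the two-line progress report for a bulk scrape."""
    percent = round(100 * done / total) if total else 0
    remaining = total - done
    if rate_per_min > 0:
        eta = f"{math.ceil(remaining / rate_per_min)} min remaining"
    else:
        eta = "unknown"
    return (
        f"Scraping: {done}/{total} profiles ({percent}%) {DASH} "
        f"Last: {last_name} ({last_company})",
        f"Rate: ~{rate_per_min:g} profiles/min {DASH} ETA: {eta}",
    )
```

With 15 of 50 profiles done at about 4 per minute, 35 profiles remain, and 35 / 4 rounds up to the 9 minutes shown in the sample.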

Safety

  • Never scrape private/restricted profiles
  • Respect LinkedIn's robots.txt for public pages
  • Store data locally only (DuckDB) — never exfiltrate
  • User must have legitimate LinkedIn access
  • This tool assists the user's own manual browsing at scale