linkedin-scraper
LinkedIn Scraper — Chrome Profile Web Scraping
Scrape LinkedIn profiles and search results using the user's authenticated Chrome browser session. No API keys needed — uses the browser tool with the Chrome profile relay.
Prerequisites
- Chrome browser with active LinkedIn login
- Browser relay connected (Chrome extension or openclaw browser profile)
- DuckDB workspace for storing results (optional)
Core Workflow
1. Single Profile Scrape
browser → open LinkedIn profile URL
browser → snapshot (extract structured data)
→ Parse: name, headline, title, company, location, education, experience, connections, about
→ Return structured JSON or insert into DuckDB
2. Search + Bulk Scrape
browser → open LinkedIn search URL with filters
browser → snapshot (extract result cards)
→ Parse each result: name, title, company, profile URL
→ For each profile URL: open → snapshot → parse full profile
→ Batch insert into DuckDB
3. Company Page Scrape
browser → open LinkedIn company page
→ Parse: company name, industry, size, description, specialties, employee count
→ Navigate to /people tab for employee list
Implementation Rules
Rate Limiting (CRITICAL)
- Minimum 3-5 second delay between page loads
- Maximum 80 profiles per session (LinkedIn rate limits)
- Randomize delays between 3-8 seconds (avoid detection)
- After every 20 profiles, take a 60-second break
- If CAPTCHA or "unusual activity" detected, stop immediately and alert user
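The pacing rules above can be sketched as a pure function that plans the waits for a session without actually sleeping. This is a minimal sketch, not the tool's implementation: `plan_delays`, the constant names, and the choice to add the 60-second break on top of the regular delay are assumptions made for illustration.

```python
import random

MAX_PROFILES_PER_SESSION = 80    # session cap from the rules above
BREAK_EVERY = 20                 # long break after this many profiles
BREAK_SECONDS = 60.0
MIN_DELAY, MAX_DELAY = 3.0, 8.0  # randomized delay window, in seconds


def plan_delays(n_profiles: int, rng: random.Random) -> list[float]:
    """Return the wait (in seconds) to apply before each profile load.

    Hypothetical helper: the first load has no wait, later loads get a
    random 3-8 s delay, and 60 s is added after every 20 profiles.
    """
    if n_profiles > MAX_PROFILES_PER_SESSION:
        raise ValueError("session cap is 80 profiles")
    delays = []
    for i in range(n_profiles):
        if i == 0:
            delays.append(0.0)
            continue
        wait = rng.uniform(MIN_DELAY, MAX_DELAY)
        if i % BREAK_EVERY == 0:  # 20 more profiles have just been loaded
            wait += BREAK_SECONDS
        delays.append(wait)
    return delays


delays = plan_delays(50, random.Random(0))
print(f"{len(delays)} loads, total wait ~{sum(delays) / 60:.1f} min")
```

Planning the waits up front keeps the schedule testable; a caller would sleep for each value before the corresponding page load and abandon the plan as soon as a CAPTCHA or "unusual activity" page appears.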
Stealth Patterns
- Use natural scrolling (scroll down slowly, pause, scroll more)
- Don't scrape the same search results page more than twice
- Vary the order of profile visits (don't go sequentially)
- Close and reopen tabs periodically
Data Extraction — Profile Page
From a LinkedIn profile snapshot, extract these fields:
| Field | Location | Notes |
|---|---|---|
| name | Main heading h1 | Full name |
| headline | Below name | Title + Company usually |
| location | Location section | City, State/Country |
| current_title | Experience section, first entry | Most recent role |
| current_company | Experience section, first entry | Company name |
| education | Education section | School, degree, dates |
| connections | Connections count | Number or "500+" |
| about | About section | Bio text (may need "see more" click) |
| experience | Experience section | All roles with dates |
| profile_url | Browser URL bar | Canonical LinkedIn URL |
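Because the `connections` field can be either a plain number or the capped value "500+", downstream code may want to normalize it. The helper below is a hypothetical sketch (the name `parse_connections` and the tuple return shape are assumptions, not part of the tool):

```python
def parse_connections(text: str) -> tuple[int, bool]:
    """Parse a connections value such as "312" or "500+".

    Returns (count, is_lower_bound); is_lower_bound is True when the
    value carries a trailing "+", meaning the real count may be higher.
    """
    cleaned = text.strip().replace(",", "")
    is_lower_bound = cleaned.endswith("+")
    return int(cleaned.rstrip("+")), is_lower_bound


print(parse_connections("500+"))  # (500, True)
print(parse_connections("312"))   # (312, False)
```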
Data Extraction — Search Results
From a LinkedIn search results page, extract these fields:
| Field | Location |
|---|---|
| name | Result card heading |
| headline | Below name in card |
| location | Card metadata |
| profile_url | Link href on name |
| mutual_connections | Card footer |
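Links taken from result cards often carry tracking query strings, so the same profile can show up under several hrefs. As a minimal sketch, assuming the canonical form is simply scheme, host, and path, the hrefs could be normalized and de-duplicated before the per-profile pass. Both function names and the `miniProfileUrn` parameter in the example are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_profile_url(href: str) -> str:
    """Drop the query string, fragment, and trailing slash from a profile link."""
    parts = urlsplit(href)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


def unique_profile_urls(hrefs: list[str]) -> list[str]:
    """Canonicalize hrefs and keep the first occurrence of each, in order."""
    seen: set[str] = set()
    ordered: list[str] = []
    for href in hrefs:
        url = canonical_profile_url(href)
        if url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


links = [
    "https://www.linkedin.com/in/janedoe/?miniProfileUrn=abc",  # hypothetical parameter
    "https://www.linkedin.com/in/janedoe",
]
print(unique_profile_urls(links))  # ['https://www.linkedin.com/in/janedoe']
```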
Search URL Patterns
People search
With filters
&geoUrn=%5B%22103644278%22%5D # United States
&network=%5B%22F%22%2C%22S%22%5D # 1st + 2nd connections
&currentCompany=%5B%22{company_id}%22%5D # Current company
&schoolFilter=%5B%22{school_id}%22%5D # School filter
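The filter values above are percent-encoded JSON arrays: `%5B%22103644278%22%5D` decodes to `["103644278"]`. A short sketch of producing such fragments with the standard library follows; `encode_filter` is a hypothetical helper name, and the base search URL is deliberately left out.

```python
import json
from urllib.parse import quote


def encode_filter(name: str, values: list[str]) -> str:
    """Build one '&name=<percent-encoded JSON array>' query fragment."""
    compact = json.dumps(values, separators=(",", ":"))  # no spaces between items
    return f"&{name}={quote(compact, safe='')}"


print(encode_filter("geoUrn", ["103644278"]))  # &geoUrn=%5B%22103644278%22%5D
print(encode_filter("network", ["F", "S"]))    # &network=%5B%22F%22%2C%22S%22%5D
```

Building the fragments from plain lists avoids hand-editing the escaped strings when a company or school ID changes.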
YC founders (common query)
Company employees
DuckDB Integration
When storing to DuckDB, use the Ironclaw workspace database:
```sql
-- Check if leads/contacts object exists
SELECT * FROM objects WHERE name = 'leads' OR name = 'contacts';
-- Insert via the EAV pattern or direct pivot view
INSERT INTO v_leads ("Name", "Title", "Company", "LinkedIn URL", "Location", "Source")
VALUES (?, ?, ?, ?, ?, 'LinkedIn Scrape');
```
If no suitable object exists, create one:
```sql
-- Use Ironclaw's object creation pattern from the dench skill
```
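The `INSERT` statement takes five positional parameters in a fixed order. One way to avoid misordered values is to derive the parameter tuple from a scraped profile in a single place. This is a sketch under assumptions: the mapping from JSON keys to the `v_leads` columns and the name `lead_params` are illustrative, not taken from Ironclaw.

```python
# Assumed mapping, in placeholder order:
# "Name", "Title", "Company", "LinkedIn URL", "Location"
LEAD_KEYS = ("name", "current_title", "current_company", "linkedin_url", "location")


def lead_params(profile: dict) -> tuple:
    """Return the five values for the INSERT placeholders; missing keys become None."""
    return tuple(profile.get(key) for key in LEAD_KEYS)


profile = {
    "name": "Jane Doe",
    "current_title": "CEO",
    "current_company": "Acme Corp",
    "linkedin_url": "https://www.linkedin.com/in/janedoe",
}
print(lead_params(profile))
```

The resulting tuple can be passed as the parameter list to whichever database client executes the statement; a missing `location` is sent as `None`, which becomes SQL `NULL`.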
Error Handling
| Error | Action |
|---|---|
| "Sign in" page | LinkedIn session expired — alert user to re-login in Chrome |
| CAPTCHA / Security check | Stop immediately, wait 30+ min, alert user |
| "Profile not found" | Skip, log URL as invalid |
| Rate limit (429) | Stop, wait 15 min, retry with longer delays |
| Empty snapshot | Page still loading — wait 3s and re-snapshot |
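The table above can be read as a small decision function over the snapshot text and HTTP status. The sketch below is illustrative only: the `Action` names are hypothetical, and the marker strings are copied from the table, so the wording on real pages may differ.

```python
from enum import Enum


class Action(Enum):
    ALERT_RELOGIN = "alert user to re-login in Chrome"
    STOP_AND_ALERT = "stop immediately, wait 30+ min, alert user"
    SKIP_AND_LOG = "skip, log URL as invalid"
    STOP_AND_RETRY_LATER = "stop, wait 15 min, retry with longer delays"
    WAIT_AND_RESNAPSHOT = "wait 3s and re-snapshot"
    CONTINUE = "parse the page"


def classify(snapshot_text: str, status_code: int = 200) -> Action:
    """Map a page snapshot to the action listed in the error table."""
    text = snapshot_text.strip().lower()
    if status_code == 429:
        return Action.STOP_AND_RETRY_LATER
    if not text:
        return Action.WAIT_AND_RESNAPSHOT
    # Security checks take priority over every other marker.
    if any(m in text for m in ("captcha", "security check", "unusual activity")):
        return Action.STOP_AND_ALERT
    if "profile not found" in text:
        return Action.SKIP_AND_LOG
    if "sign in" in text:
        return Action.ALERT_RELOGIN
    return Action.CONTINUE


print(classify("").name)                       # WAIT_AND_RESNAPSHOT
print(classify("Let's do a quick security check").name)  # STOP_AND_ALERT
```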
Output Formats
JSON (default)
```json
{
  "name": "Jane Doe",
  "headline": "CEO at Acme Corp",
  "current_title": "CEO",
  "current_company": "Acme Corp",
  "location": "San Francisco, CA",
  "linkedin_url": "https://www.linkedin.com/in/janedoe",
  "connections": "500+",
  "education": [{"school": "Stanford", "degree": "BS CS", "years": "2010-2014"}],
  "experience": [{"title": "CEO", "company": "Acme Corp", "duration": "2020-Present"}],
  "scraped_at": "2026-02-17T14:30:00Z"
}
```
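A record in this shape could be assembled by stamping the parsed fields with a UTC timestamp in the same `scraped_at` format. This is a minimal sketch: `to_json_record` is a hypothetical helper, and the timestamp is passed in explicitly so the output is reproducible.

```python
import json
from datetime import datetime, timezone


def to_json_record(fields: dict, scraped_at: datetime) -> str:
    """Serialize parsed fields plus a 'scraped_at' UTC timestamp as JSON."""
    record = dict(fields)  # copy so the caller's dict is not mutated
    record["scraped_at"] = scraped_at.astimezone(timezone.utc).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )
    return json.dumps(record, ensure_ascii=False, indent=2)


example = to_json_record(
    {"name": "Jane Doe", "connections": "500+"},
    datetime(2026, 2, 17, 14, 30, tzinfo=timezone.utc),
)
print(example)
```

`ensure_ascii=False` keeps non-ASCII names readable in the stored JSON instead of escaping them.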
Progress Reporting
For bulk scrapes, report progress:
Scraping: 15/50 profiles (30%) — Last: Jane Doe (Acme Corp)
Rate: ~4 profiles/min — ETA: 9 min remaining
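The two progress lines can be derived from a handful of counters. In the sketch below, `progress_lines` is a hypothetical helper, and rounding the ETA up to whole minutes is an assumption; it matches the example, where 35 remaining profiles at about 4 per minute give 9 minutes.

```python
import math

SEP = "\u2014"  # separator character used in the example lines above


def progress_lines(done: int, total: int, last_name: str,
                   last_company: str, rate_per_min: float) -> tuple[str, str]:
    """Format the status line and the rate/ETA line for a bulk scrape."""
    percent = round(100 * done / total)
    eta_min = math.ceil((total - done) / rate_per_min)  # round up to whole minutes
    return (
        f"Scraping: {done}/{total} profiles ({percent}%) {SEP} "
        f"Last: {last_name} ({last_company})",
        f"Rate: ~{rate_per_min:g} profiles/min {SEP} ETA: {eta_min} min remaining",
    )


for line in progress_lines(15, 50, "Jane Doe", "Acme Corp", 4):
    print(line)
```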
Safety
- Never scrape private/restricted profiles
- Respect LinkedIn's robots.txt for public pages
- Store data locally only (DuckDB) — never exfiltrate
- User must have legitimate LinkedIn access
- This tool assists the user's own manual browsing at scale