karpathy-jobs-bls-visualizer

karpathy/jobs — BLS Job Market Visualizer

Skill by ara.so — Daily 2026 Skills collection.
A research tool for visually exploring Bureau of Labor Statistics Occupational Outlook Handbook data across 342 occupations. In the interactive treemap, each rectangle's area encodes employment size and its color encodes a chosen metric: BLS growth outlook, median pay, education requirements, or LLM-scored AI exposure. The pipeline is fully forkable: write a new prompt, re-run scoring, and get a new color layer.
Live demo: karpathy.ai/jobs

Installation & Setup

After cloning the repo, install dependencies (uses uv):

```bash
uv sync
uv run playwright install chromium
```

Create a `.env` file with your OpenRouter API key (required only for LLM scoring):

```bash
OPENROUTER_API_KEY=your_openrouter_key_here
```
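Before running the scoring step, it can help to confirm that the key is actually readable from `.env`. The following is a minimal sketch, assuming the file uses plain `KEY=value` lines; the helper name `read_env_key` is hypothetical and not part of the repo.

```python
from pathlib import Path
from typing import Optional


def read_env_key(env_text: str, name: str) -> Optional[str]:
    """Return the value for `name` from KEY=value text, or None if absent or empty."""
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == name:
            return value.strip().strip('"').strip("'") or None
    return None


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    found = read_env_key(text, "OPENROUTER_API_KEY")
    print("OPENROUTER_API_KEY found" if found else "OPENROUTER_API_KEY missing")
```

The check only reports whether a value is present; it does not verify that the key is valid with OpenRouter.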

Full Pipeline — Key Commands

Run these in order for a complete fresh build:

```bash
# 1. Scrape BLS pages (non-headless Playwright; BLS blocks bots)
#    Results are cached in html/, so this is only needed once
uv run python scrape.py

# 2. Convert raw HTML → clean Markdown in pages/
uv run python process.py

# 3. Extract structured fields → occupations.csv
uv run python make_csv.py

# 4. Score AI exposure via LLM (uses OpenRouter API, saves scores.json)
uv run python score.py

# 5. Merge CSV + scores → site/data.json for the frontend
uv run python build_site_data.py

# 6. Serve the visualization locally
cd site && python -m http.server 8000
```
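The build steps above (1 to 5) could also be driven from a single script. This is a sketch rather than anything shipped with the repo: `plan` and `run_pipeline` are hypothetical names, and the only commands used are the ones listed above.

```python
import subprocess
from typing import Callable, List, Sequence

# Steps 1-5 from the list above, in order. Step 6 (serving) is left out
# because it blocks until interrupted.
PIPELINE: List[List[str]] = [
    ["uv", "run", "python", "scrape.py"],
    ["uv", "run", "python", "process.py"],
    ["uv", "run", "python", "make_csv.py"],
    ["uv", "run", "python", "score.py"],
    ["uv", "run", "python", "build_site_data.py"],
]


def plan(skip_scrape: bool) -> List[List[str]]:
    """Return the commands to run, optionally dropping the scrape step."""
    return [cmd for cmd in PIPELINE if not (skip_scrape and cmd[-1] == "scrape.py")]


def run_pipeline(
    skip_scrape: bool,
    runner: Callable[[Sequence[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
) -> None:
    """Run each step in order; with the default runner, a failing step raises and stops the run."""
    for cmd in plan(skip_scrape):
        print("$", " ".join(cmd))
        runner(cmd)
```

Calling `run_pipeline(skip_scrape=True)` would reuse the cached `html/` directory and start from `process.py`. The `runner` parameter exists only so the sequence can be exercised without launching real processes.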

---

Key Files Reference

| File | Description |
| --- | --- |
| `occupations.json` | Master list of 342 occupations (title, URL, category, slug) |
| `occupations.csv` | Summary stats: pay, education, job count, growth projections |
| `scores.json` | AI exposure scores (0–10) + rationales for all 342 occupations |
| `prompt.md` | All data in one ~45K-token file for pasting into an LLM |
| `html/` | Raw HTML pages from BLS (~40MB, source of truth) |
| `pages/` | Clean Markdown versions of each occupation page |
| `site/index.html` | The treemap visualization (single HTML file) |
| `site/data.json` | Compact merged data consumed by the frontend |
| `score.py` | LLM scoring pipeline; fork this to write custom prompts |

Writing a Custom LLM Scoring Layer

The most powerful feature: write any scoring prompt, run `score.py`, and get a new treemap color layer.

1. Edit the prompt in `score.py`

```python
# score.py (simplified structure)

SYSTEM_PROMPT = """
You are evaluating occupations for exposure to humanoid robotics over the next 10 years.

Score each occupation from 0 to 10:
- 0 = no meaningful exposure (e.g., requires fine social judgment, non-physical)
- 5 = moderate exposure (some tasks automatable, but humans still central)
- 10 = high exposure (repetitive physical tasks, predictable environments)

Consider: physical task complexity, environment predictability, dexterity requirements, cost of robot vs human, regulatory barriers.

Respond ONLY with JSON: {"score": <int 0-10>, "rationale": "<1-2 sentences>"}
"""
```
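Because the prompt asks for JSON only, a custom layer benefits from checking each reply before it is stored. The following is a minimal sketch of such a check; `parse_score_response` is a hypothetical helper, and how `score.py` actually handles replies is not shown in this document.

```python
import json
from typing import Tuple

FENCE = "`" * 3  # Markdown code-fence marker that some models wrap around JSON


def parse_score_response(text: str) -> Tuple[int, str]:
    """Parse a '{"score": ..., "rationale": ...}' reply and validate both fields."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the surrounding fence and an optional "json" language tag.
        cleaned = cleaned.strip("`").strip()
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    data = json.loads(cleaned)
    score = data["score"]
    rationale = data["rationale"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError(f"score must be an integer from 0 to 10, got {score!r}")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    return score, rationale.strip()
```

A reply that fails validation raises an exception, which makes it easy to retry that occupation instead of writing a malformed entry.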

2. Run the scoring pipeline

The pipeline reads each occupation's Markdown from `pages/`, sends it to the LLM, and writes results to `scores.json`.

`scores.json` structure:

```json
{
  "software-developers": {
    "score": 1,
    "rationale": "Software development is digital and cognitive; humanoid robots provide no advantage."
  },
  "construction-laborers": {
    "score": 7,
    "rationale": "Physical, repetitive outdoor tasks are targets for humanoid robotics, though unstructured environments remain challenging."
  }
  // ... 342 occupations total
}
```
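The read, score, write loop described above can be sketched as follows. This is an illustration under stated assumptions, not the repo's `score.py`: it assumes pages are stored as `pages/<slug>.md`, and `ask_llm` is a hypothetical callable that takes the page Markdown and returns a dict with `score` and `rationale`.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def score_pages(
    pages_dir: Path,
    scores_path: Path,
    ask_llm: Callable[[str], Dict[str, object]],
) -> Dict[str, Dict[str, object]]:
    """Score every pages/<slug>.md that has no entry yet in the scores file."""
    scores: Dict[str, Dict[str, object]] = {}
    if scores_path.exists():
        scores = json.loads(scores_path.read_text(encoding="utf-8"))

    for page in sorted(pages_dir.glob("*.md")):
        slug = page.stem
        if slug in scores:
            continue  # already scored, so an interrupted run can resume
        result = ask_llm(page.read_text(encoding="utf-8"))
        scores[slug] = {"score": result["score"], "rationale": result["rationale"]}
        # Write after each occupation so progress survives a crash.
        scores_path.write_text(
            json.dumps(scores, indent=2, ensure_ascii=False), encoding="utf-8"
        )
    return scores
```

Skipping slugs that already have an entry also gives a simple way to fill in only the missing occupations after a partial run.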

3. Rebuild site data

```bash
uv run python build_site_data.py
cd site && python -m http.server 8000
```

Data Structures

`occupations.json` entry

```json
{
  "title": "Software Developers",
  "url": "https://www.bls.gov/ooh/computer-and-information-technology/software-developers.htm",
  "category": "Computer and Information Technology",
  "slug": "software-developers"
}
```

`occupations.csv` columns

```
slug, title, category, median_pay, education, job_count, growth_percent, growth_outlook
```

Example row:

```
software-developers, Software Developers, Computer and Information Technology, 130160, Bachelor's degree, 1847900, 17, Much faster than average
```

`site/data.json` entry (merged frontend data)

```json
{
  "slug": "software-developers",
  "title": "Software Developers",
  "category": "Computer and Information Technology",
  "median_pay": 130160,
  "education": "Bachelor's degree",
  "job_count": 1847900,
  "growth_percent": 17,
  "growth_outlook": "Much faster than average",
  "ai_score": 9,
  "ai_rationale": "AI is deeply transforming software development workflows..."
}
```
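To make the relationship between the three structures concrete, here is a sketch of how CSV rows and scores could be joined on `slug` to produce entries of this shape. It is not the repo's `build_site_data.py`; `merge_rows` is a hypothetical name, and the sketch assumes every numeric column holds a plain integer.

```python
import csv
import io
from typing import Dict, List


def merge_rows(csv_text: str, scores: Dict[str, dict]) -> List[dict]:
    """Join occupations.csv rows with scores.json entries on slug."""
    entries = []
    # skipinitialspace tolerates "a, b, c" style rows with spaces after commas.
    for row in csv.DictReader(io.StringIO(csv_text), skipinitialspace=True):
        scored = scores.get(row["slug"], {})
        entries.append({
            "slug": row["slug"],
            "title": row["title"],
            "category": row["category"],
            "median_pay": int(row["median_pay"]),  # assumes integer values
            "education": row["education"],
            "job_count": int(row["job_count"]),
            "growth_percent": int(row["growth_percent"]),
            "growth_outlook": row["growth_outlook"],
            "ai_score": scored.get("score"),  # None when the occupation is unscored
            "ai_rationale": scored.get("rationale"),
        })
    return entries
```

Occupations without a score keep `None` (serialized as `null`) rather than being dropped, so the treemap area data stays complete even when scoring is partial.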

Frontend Treemap (`site/index.html`)

The visualization is a single self-contained HTML file using D3.js.

Color layers (toggle in UI)

| Layer | What it shows |
| --- | --- |
| BLS Outlook | BLS projected growth category (green = fast growth) |
| Median Pay | Annual median wage (color gradient) |
| Education | Minimum education required |
| Digital AI Exposure | LLM-scored 0–10 AI impact estimate |

Adding a new color layer to the frontend

```html
<!-- In site/index.html, find the layer toggle buttons -->
<button onclick="setLayer('ai_score')">Digital AI Exposure</button>

<!-- Add your new layer button -->
<button onclick="setLayer('robotics_score')">Humanoid Robotics</button>
```

```javascript
// In the color-scale logic, add a case for your new field:
function getColor(d, layer) {
  if (layer === 'robotics_score') {
    // scores 0-10, blue = low exposure, red = high
    return d3.interpolateRdYlBu(1 - d.robotics_score / 10);
  }
  // ... existing cases
}
```

Then update `build_site_data.py` to include your new score field in `data.json`.
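One way to carry a new score field into the frontend data is to attach it to each existing entry by slug. The sketch below assumes `data.json` is a list of entries and that the new layer's scores use the same shape as `scores.json`; `add_score_layer` is a hypothetical helper, not a function from `build_site_data.py`.

```python
from typing import Dict, List


def add_score_layer(entries: List[dict], layer_scores: Dict[str, dict], field: str) -> List[dict]:
    """Return copies of the entries with `field` set from layer_scores[slug]["score"]."""
    return [
        {**entry, field: layer_scores.get(entry["slug"], {}).get("score")}
        for entry in entries
    ]
```

For example, `add_score_layer(entries, robotics_scores, "robotics_score")` would add the field read by the `robotics_score` case shown above. Entries with no score get `None`, which serializes to `null`; in JavaScript `null / 10` evaluates to `0`, so such entries would be colored as low exposure unless the frontend handles them separately.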

Generating the LLM-Ready Prompt File

Package all 342 occupations + aggregate stats into a single file for LLM chat:

```bash
uv run python make_prompt.py

# Produces prompt.md (~45K tokens)
# Paste into Claude, GPT-4, Gemini, etc. for data-grounded conversation
```


---

Scraping Notes

The BLS blocks automated bots, so `scrape.py` uses non-headless Playwright (real visible browser window):

```python
# scrape.py key behavior
browser = await p.chromium.launch(headless=False)  # Must be visible

# Pages saved to html/<slug>.html
# Already-scraped pages are skipped (cached)
```

If scraping fails or is rate-limited:
- The `html/` directory already contains cached pages in the repo
- You can skip scraping entirely and run from `process.py` onward
- If re-scraping, add delays between requests to avoid blocks
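The caching and delay behavior described above can be sketched like this. It is an illustration, not the repo's `scrape.py`: the function names and the 3-second default delay are assumptions, and Playwright is imported only when there is something left to fetch.

```python
import asyncio
from pathlib import Path
from typing import Dict, List


def pending(occupations: List[Dict[str, str]], html_dir: Path) -> List[Dict[str, str]]:
    """Occupations whose html/<slug>.html is not cached yet."""
    return [o for o in occupations if not (html_dir / f"{o['slug']}.html").exists()]


async def scrape(occupations: List[Dict[str, str]], html_dir: Path, delay_s: float = 3.0) -> None:
    """Fetch uncached pages in a visible browser, pausing between requests."""
    todo = pending(occupations, html_dir)
    if not todo:
        return  # everything is cached; no browser needed

    from playwright.async_api import async_playwright  # imported lazily

    html_dir.mkdir(parents=True, exist_ok=True)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)  # must be visible
        page = await browser.new_page()
        for occ in todo:
            await page.goto(occ["url"])
            html = await page.content()
            (html_dir / f"{occ['slug']}.html").write_text(html, encoding="utf-8")
            await asyncio.sleep(delay_s)  # assumed delay; tune as needed
        await browser.close()
```

With entries loaded from `occupations.json`, `asyncio.run(scrape(occupations, Path("html")))` would fetch only the pages that are missing from the cache.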

---

Common Patterns

Re-score only missing occupations

```python
import json

with open("scores.json") as f:
    existing = json.load(f)

with open("occupations.json") as f:
    all_occupations = json.load(f)

# Find gaps
missing = [o for o in all_occupations if o["slug"] not in existing]
print(f"Missing scores: {len(missing)}")

# Then run score.py with a filter for missing slugs
```

Parse a single occupation page manually

```python
from pathlib import Path

from parse_detail import parse_occupation_page

html = Path("html/software-developers.html").read_text(encoding="utf-8")
data = parse_occupation_page(html)
print(data["median_pay"])     # e.g. 130160
print(data["job_count"])      # e.g. 1847900
print(data["growth_outlook"]) # e.g. "Much faster than average"
```

Load and query occupations.csv

```python
import pandas as pd

df = pd.read_csv("occupations.csv")

# Top 10 highest paying occupations
top_pay = df.nlargest(10, "median_pay")[["title", "median_pay", "growth_outlook"]]
print(top_pay)

# Filter: fast growth + high pay
high_value = df[
    (df["growth_percent"] > 10) & (df["median_pay"] > 80000)
].sort_values("median_pay", ascending=False)
```

Combine CSV with AI scores for analysis

```python
import json

import pandas as pd

df = pd.read_csv("occupations.csv")

with open("scores.json") as f:
    scores = json.load(f)

df["ai_score"] = df["slug"].map(lambda s: scores.get(s, {}).get("score"))
df["ai_rationale"] = df["slug"].map(lambda s: scores.get(s, {}).get("rationale"))

# High AI exposure, high pay: reshaping, not disappearing
high_exposure_high_pay = df[
    (df["ai_score"] >= 8) & (df["median_pay"] > 100000)
][["title", "median_pay", "ai_score", "growth_outlook"]]
print(high_exposure_high_pay)
```

---

Troubleshooting

**`playwright install` fails**

```bash
uv run playwright install --with-deps chromium
```

**BLS scraping blocked / returns empty pages**
- Ensure `headless=False` in `scrape.py` (already the default)
- Add manual delays; do not run in CI
- The cached `html/` directory in the repo can be used directly

**`score.py` OpenRouter errors**
- Verify `OPENROUTER_API_KEY` is set in `.env`
- Check your OpenRouter account has credits
- Default model is Gemini Flash; change `model` in `score.py` to use a different LLM

**`site/data.json` not updating after re-scoring**

```bash
# Always rebuild site data after changing scores.json
uv run python build_site_data.py
```

**Treemap shows blank / no data**
- Confirm `site/data.json` exists and is valid JSON
- Serve with `python -m http.server` (not `file://`; CORS blocks local JSON fetch)
- Check browser console for fetch errors
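The first check in the blank-treemap list can be scripted. The sketch below assumes `site/data.json` is a JSON array of occupation entries; `check_data` and the `REQUIRED` field list are hypothetical, chosen from the fields shown in the Data Structures section.

```python
import json
from typing import List

# Fields assumed to be needed for any treemap cell; adjust to match the frontend.
REQUIRED = ("slug", "title", "category", "job_count")


def check_data(raw: str) -> List[str]:
    """Return a list of problems found in data.json text (empty means OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, list) or not data:
        return ["expected a non-empty JSON array of occupation entries"]
    problems = []
    for i, entry in enumerate(data):
        if not isinstance(entry, dict):
            problems.append(f"entry {i} is not an object")
            continue
        missing = [key for key in REQUIRED if key not in entry]
        if missing:
            problems.append(f"entry {i} is missing {', '.join(missing)}")
    return problems
```

Running `check_data(Path("site/data.json").read_text(encoding="utf-8"))` (with `from pathlib import Path`) and printing the result would narrow a blank page down to either a data problem or a serving problem.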

---

Important Caveats (from the project)

  • AI Exposure ≠ job disappearance. A score of 9/10 means AI is transforming the work, not eliminating demand. Software developers score 9/10 but demand is growing.
  • Scores are rough LLM estimates (Gemini Flash via OpenRouter), not rigorous economic predictions.
  • The tool does not account for demand elasticity, latent demand, regulatory barriers, or social preferences for human workers.
  • This is a development/research tool, not an economic publication.