karpathy-jobs-bls-visualizer

karpathy/jobs — BLS Job Market Visualizer

Skill by ara.so — Daily 2026 Skills collection.

A research tool for visually exploring Bureau of Labor Statistics Occupational Outlook Handbook data across 342 occupations. The interactive treemap sizes each rectangle by employment (area) and colors it by a chosen metric: BLS growth outlook, median pay, education requirements, or LLM-scored AI exposure. The pipeline is fully forkable: write a new prompt, re-run scoring, and get a new color layer.

Live demo: karpathy.ai/jobs

Installation & Setup

```bash
# Clone the repo
git clone https://github.com/karpathy/jobs
cd jobs

# Install dependencies (uses uv)
uv sync
uv run playwright install chromium
```

Create a `.env` file with your OpenRouter API key (required only for LLM scoring):

```bash
OPENROUTER_API_KEY=your_openrouter_key_here
```
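Before starting a long scoring run, it can help to confirm the key is actually readable. The sketch below is a minimal, stdlib-only check; `read_env_file` and `has_openrouter_key` are hypothetical helper names, not part of the repo, and how `score.py` itself loads `.env` is not shown here.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openrouter_key(path=".env"):
    """True if the key is set in the process environment or in the .env file."""
    from_env = os.environ.get("OPENROUTER_API_KEY")
    from_file = read_env_file(path).get("OPENROUTER_API_KEY")
    return bool(from_env or from_file)
```

Calling `has_openrouter_key()` from the repo root before `uv run python score.py` gives an early failure instead of an API error partway through.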

Full Pipeline — Key Commands

Run these in order for a complete fresh build:

```bash
# 1. Scrape BLS pages (non-headless Playwright; BLS blocks bots)
#    Results cached in html/; only needed once
uv run python scrape.py

# 2. Convert raw HTML → clean Markdown in pages/
uv run python process.py

# 3. Extract structured fields → occupations.csv
uv run python make_csv.py

# 4. Score AI exposure via LLM (uses OpenRouter API, saves scores.json)
uv run python score.py

# 5. Merge CSV + scores → site/data.json for the frontend
uv run python build_site_data.py

# 6. Serve the visualization locally
cd site && python -m http.server 8000
# Open http://localhost:8000
```
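The build steps can also be chained from one small driver script. This is a sketch only: the script names and the `uv run python` prefix come from the commands listed here, while `build_commands`, `run_pipeline`, and the `skip` parameter are hypothetical and do not exist in the repo.

```python
import subprocess

# Build steps in the order they must run (serving the site is left out).
PIPELINE_STEPS = [
    "scrape.py",
    "process.py",
    "make_csv.py",
    "score.py",
    "build_site_data.py",
]


def build_commands(steps=PIPELINE_STEPS, skip=()):
    """Return the uv command for each step, leaving out any skipped scripts."""
    return [["uv", "run", "python", step] for step in steps if step not in skip]


def run_pipeline(skip=(), dry_run=False):
    """Run each step in order; stop at the first failing step."""
    for command in build_commands(skip=skip):
        print("$", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run that reuses the cached html/ directory instead of scraping.
    run_pipeline(skip=("scrape.py",), dry_run=True)
```

Passing `skip=("scrape.py", "score.py")` would rebuild the site from cached HTML and existing scores without touching the network.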

---

Key Files Reference

| File | Description |
|---|---|
| `occupations.json` | Master list of 342 occupations (title, URL, category, slug) |
| `occupations.csv` | Summary stats: pay, education, job count, growth projections |
| `scores.json` | AI exposure scores (0–10) + rationales for all 342 occupations |
| `prompt.md` | All data in one ~45K-token file for pasting into an LLM |
| `html/` | Raw HTML pages from BLS (~40MB, source of truth) |
| `pages/` | Clean Markdown versions of each occupation page |
| `site/index.html` | The treemap visualization (single HTML file) |
| `site/data.json` | Compact merged data consumed by the frontend |
| `score.py` | LLM scoring pipeline; fork this to write custom prompts |

Writing a Custom LLM Scoring Layer

The most powerful feature: write any scoring prompt, run `score.py`, and get a new treemap color layer.

1. Edit the prompt in `score.py`

```python
# score.py (simplified structure)
SYSTEM_PROMPT = """
You are evaluating occupations for exposure to humanoid robotics over the next 10 years.

Score each occupation from 0 to 10:

0 = no meaningful exposure (e.g., requires fine social judgment, non-physical)
5 = moderate exposure (some tasks automatable, but humans still central)
10 = high exposure (repetitive physical tasks, predictable environments)

Consider: physical task complexity, environment predictability, dexterity requirements, cost of robot vs human, regulatory barriers.

Respond ONLY with JSON: {"score": <int 0-10>, "rationale": "<1-2 sentences>"}
"""
```

2. Run the scoring pipeline

The pipeline reads each occupation's Markdown from `pages/`, sends it to the LLM, and writes results to `scores.json`.

`scores.json` structure:

```json
{
  "software-developers": {
    "score": 1,
    "rationale": "Software development is digital and cognitive; humanoid robots provide no advantage."
  },
  "construction-laborers": {
    "score": 7,
    "rationale": "Physical, repetitive outdoor tasks are targets for humanoid robotics, though unstructured environments remain challenging."
  }
  // ... 342 occupations total
}
```
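Because the prompt asks the model to reply only with `{"score": ..., "rationale": ...}`, a custom layer benefits from checking each reply before it is written to `scores.json`. The function below is a minimal sketch of such a check; `parse_score_reply` is a hypothetical name, and the actual parsing done inside `score.py` may differ.

```python
import json


def parse_score_reply(reply_text):
    """Validate one LLM reply and return {"score": int, "rationale": str}.

    Raises ValueError if the reply is not JSON or breaks the expected shape.
    """
    data = json.loads(reply_text)  # json.JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")

    rationale = data.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")

    return {"score": score, "rationale": rationale.strip()}
```

Rejecting malformed replies up front keeps out-of-range or missing scores from silently producing blank or miscolored rectangles in the treemap.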

3. Rebuild site data

```bash
uv run python build_site_data.py
cd site && python -m http.server 8000
```

Data Structures

`occupations.json` entry

```json
{
  "title": "Software Developers",
  "url": "https://www.bls.gov/ooh/computer-and-information-technology/software-developers.htm",
  "category": "Computer and Information Technology",
  "slug": "software-developers"
}
```

`occupations.csv` columns

```
slug, title, category, median_pay, education, job_count, growth_percent, growth_outlook
```

Example row:

```
software-developers, Software Developers, Computer and Information Technology,
130160, Bachelor's degree, 1847900, 17, Much faster than average
```

`site/data.json` entry (merged frontend data)

```json
{
  "slug": "software-developers",
  "title": "Software Developers",
  "category": "Computer and Information Technology",
  "median_pay": 130160,
  "education": "Bachelor's degree",
  "job_count": 1847900,
  "growth_percent": 17,
  "growth_outlook": "Much faster than average",
  "ai_score": 9,
  "ai_rationale": "AI is deeply transforming software development workflows..."
}
```

Frontend Treemap (`site/index.html`)

The visualization is a single self-contained HTML file using D3.js.

Color layers (toggle in UI)

| Layer | What it shows |
|---|---|
| BLS Outlook | BLS projected growth category (green = fast growth) |
| Median Pay | Annual median wage (color gradient) |
| Education | Minimum education required |
| Digital AI Exposure | LLM-scored 0–10 AI impact estimate |

Adding a new color layer to the frontend

```html
<!-- In site/index.html, find the layer toggle buttons -->
<button onclick="setLayer('ai_score')">Digital AI Exposure</button>

<!-- Add your new layer button -->
<button onclick="setLayer('robotics_score')">Humanoid Robotics</button>
```

```javascript
// In the function that maps a record to a color (shown here as getColor),
// add a case for your new field:
function getColor(d, layer) {
  if (layer === 'robotics_score') {
    // scores 0-10, blue = low exposure, red = high
    return d3.interpolateRdYlBu(1 - d.robotics_score / 10);
  }
  // ... existing cases
}
```

Then update `build_site_data.py` to include your new score field in `data.json`.
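What that update to `build_site_data.py` might look like can be sketched as a small merge step. The internals of `build_site_data.py` are not shown in this document, so `merge_score_layer` and the `robotics_rationale` field name are assumptions for illustration; only `robotics_score` and the `scores.json` shape come from the examples here.

```python
def merge_score_layer(records, scores, score_field, rationale_field):
    """Return copies of records with one extra score layer attached.

    records: list of dicts that each have a "slug" key (as in data.json)
    scores:  dict mapping slug -> {"score": ..., "rationale": ...}
    Occupations without a score get None so the frontend can detect the gap.
    """
    merged = []
    for record in records:
        entry = scores.get(record["slug"], {})
        merged.append({
            **record,
            score_field: entry.get("score"),
            rationale_field: entry.get("rationale"),
        })
    return merged


records = [{"slug": "construction-laborers", "title": "Construction Laborers"}]
scores = {"construction-laborers": {"score": 7, "rationale": "Physical, repetitive tasks."}}
layered = merge_score_layer(records, scores, "robotics_score", "robotics_rationale")
print(layered[0]["robotics_score"])
```

Keeping the new layer under its own field name leaves `ai_score` untouched, so both layers can be toggled in the UI.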

Generating the LLM-Ready Prompt File

Package all 342 occupations + aggregate stats into a single file for LLM chat:

```bash
uv run python make_prompt.py
# Produces prompt.md (~45K tokens)
# Paste into Claude, GPT-4, Gemini, etc. for data-grounded conversation
```

---

Scraping Notes

The BLS blocks automated bots, so `scrape.py` uses non-headless Playwright (real visible browser window):

```python
# scrape.py key behavior
browser = await p.chromium.launch(headless=False)  # Must be visible

# Pages saved to html/<slug>.html
# Already-scraped pages are skipped (cached)
```


If scraping fails or is rate-limited:
- The `html/` directory already contains cached pages in the repo
- You can skip scraping entirely and run from `process.py` onward
- If re-scraping, add delays between requests to avoid blocks
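The skip-if-cached and delay advice above can be sketched as a loop that is independent of the browser code. This is not the repo's `scrape.py`: `scrape_missing` and `delay_seconds` are hypothetical, and `fetch_html` stands in for whatever function returns a page's HTML (in practice, a wrapper around the Playwright page).

```python
import time
from pathlib import Path


def scrape_missing(occupations, fetch_html, out_dir="html", delay_seconds=2.0):
    """Fetch only pages not yet cached, pausing between requests.

    occupations: list of dicts with "slug" and "url" (as in occupations.json)
    fetch_html:  callable taking a URL and returning the page HTML as a string
    Returns the slugs that were fetched in this run.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    fetched = []
    for occupation in occupations:
        target = out / f"{occupation['slug']}.html"
        if target.exists():
            continue  # already cached; skip
        if fetched:
            time.sleep(delay_seconds)  # pause between real requests only
        target.write_text(fetch_html(occupation["url"]), encoding="utf-8")
        fetched.append(occupation["slug"])
    return fetched
```

Because cached pages are skipped before any delay, an interrupted run can be restarted and will only pay the wait for pages it still needs.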

---



Common Patterns

Re-score only missing occupations

```python
import json

with open("scores.json") as f:
    existing = json.load(f)

with open("occupations.json") as f:
    all_occupations = json.load(f)

# Find gaps
missing = [o for o in all_occupations if o["slug"] not in existing]
print(f"Missing scores: {len(missing)}")

# Then run score.py with a filter for missing slugs
```
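Once the missing occupations have been scored, the new entries need to be folded back into `scores.json` without disturbing existing ones. A minimal sketch of both steps on in-memory data follows; `find_missing_slugs` and `merge_new_scores` are hypothetical helpers, and `score.py` may handle this differently.

```python
def find_missing_slugs(all_occupations, existing_scores):
    """Return slugs that appear in occupations.json but not in scores.json."""
    return [o["slug"] for o in all_occupations if o["slug"] not in existing_scores]


def merge_new_scores(existing_scores, new_scores):
    """Add newly scored entries; existing entries are never overwritten."""
    merged = dict(existing_scores)
    for slug, entry in new_scores.items():
        merged.setdefault(slug, entry)
    return merged


occupations = [{"slug": "software-developers"}, {"slug": "construction-laborers"}]
existing = {"software-developers": {"score": 1, "rationale": "Digital work."}}
print(find_missing_slugs(occupations, existing))
```

Writing the merged dict back with `json.dump` and then running `build_site_data.py` completes the refresh.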

Parse a single occupation page manually

```python
from parse_detail import parse_occupation_page
from pathlib import Path

html = Path("html/software-developers.html").read_text()
data = parse_occupation_page(html)
print(data["median_pay"])     # e.g. 130160
print(data["job_count"])      # e.g. 1847900
print(data["growth_outlook"]) # e.g. "Much faster than average"
```

Load and query occupations.csv

```python
import pandas as pd

df = pd.read_csv("occupations.csv")

# Top 10 highest paying occupations
top_pay = df.nlargest(10, "median_pay")[["title", "median_pay", "growth_outlook"]]
print(top_pay)

# Filter: fast growth + high pay
high_value = df[
    (df["growth_percent"] > 10) & (df["median_pay"] > 80000)
].sort_values("median_pay", ascending=False)
```

Combine CSV with AI scores for analysis

```python
import pandas as pd, json

df = pd.read_csv("occupations.csv")

with open("scores.json") as f:
    scores = json.load(f)

df["ai_score"] = df["slug"].map(lambda s: scores.get(s, {}).get("score"))
df["ai_rationale"] = df["slug"].map(lambda s: scores.get(s, {}).get("rationale"))

# High AI exposure, high pay: reshaping, not disappearing
high_exposure_high_pay = df[
    (df["ai_score"] >= 8) & (df["median_pay"] > 100000)
][["title", "median_pay", "ai_score", "growth_outlook"]]
print(high_exposure_high_pay)
```

---

Troubleshooting

**`playwright install` fails**

```bash
uv run playwright install --with-deps chromium
```

**BLS scraping blocked / returns empty pages**
- Ensure `headless=False` in `scrape.py` (already the default)
- Add manual delays; do not run in CI
- The cached `html/` directory in the repo can be used directly

**`score.py` OpenRouter errors**
- Verify `OPENROUTER_API_KEY` is set in `.env`
- Check your OpenRouter account has credits
- Default model is Gemini Flash; change `model` in `score.py` for a different LLM

**`site/data.json` not updating after re-scoring**

```bash
# Always rebuild site data after changing scores.json
uv run python build_site_data.py
```


**Treemap shows blank / no data**
- Confirm `site/data.json` exists and is valid JSON
- Serve with `python -m http.server` (not `file://` — CORS blocks local JSON fetch)
- Check browser console for fetch errors
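The first check in this list can be scripted. The sketch below assumes `site/data.json` holds either a list of records or an object whose values are records, which this document does not confirm; `check_site_data` and the default `required` fields are hypothetical choices based on the example entry.

```python
import json
from pathlib import Path


def check_site_data(path="site/data.json", required=("slug", "title", "job_count")):
    """Return a list of problems found in the data file (empty means OK)."""
    data_file = Path(path)
    if not data_file.exists():
        return [f"{path} does not exist; run build_site_data.py first"]
    try:
        data = json.loads(data_file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{path} is not valid JSON: {exc}"]

    # Accept either a list of records or a dict keyed by slug.
    if isinstance(data, list):
        records = data
    elif isinstance(data, dict):
        records = list(data.values())
    else:
        return [f"{path} has an unexpected top-level type"]

    problems = []
    if not records:
        problems.append("no records found")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index} is not an object")
            continue
        missing = [key for key in required if key not in record]
        if missing:
            problems.append(f"record {index} is missing {missing}")
    return problems
```

An empty result rules out the data file, which points the investigation toward how the page is being served or toward fetch errors in the browser console.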

---


Important Caveats (from the project)

- AI Exposure ≠ job disappearance. A score of 9/10 means AI is transforming the work, not eliminating demand. Software developers score 9/10 but demand is growing.
- Scores are rough LLM estimates (Gemini Flash via OpenRouter), not rigorous economic predictions.
- The tool does not account for demand elasticity, latent demand, regulatory barriers, or social preferences for human workers.
- This is a development/research tool, not an economic publication.
