ml-research-engineer-safeguards

ML / Research Engineer, Safeguards

When to Use

  • Define research questions on harm detection, jailbreak resistance, or policy categories
  • Curate or audit safety datasets — labeling guidelines, bias checks, version control
  • Train or fine-tune classifiers, rankers, or small LLM judges for moderation
  • Design benchmarks and eval suites — golden sets, adversarial slices, regression harnesses
  • Run ablations — architecture, threshold, data mix, ensemble vs single model
  • Analyze metrics — precision/recall, calibration, false positive/negative slices
  • Write research memos — methods, results, limitations, production recommendation
  • Specify the promotion bar for a new safeguard model version

When NOT to Use

  • Deploy gateways, GPU serving, canary routing → ml-infrastructure-engineer-safeguards
  • Execute structured red-team engagements on prod → ai-redteam
  • Draft acceptable-use policy or risk tiers → ai-risk-governance
  • Build customer-facing RAG/agents → ai-engineer
  • General literature survey unrelated to safety → ai-researcher
  • Token/context compression research → research-engineer-scientist-tokens
  • Product A/B and business metrics → data-scientist
  • PII detection benchmarks, memorization, logging minimization → privacy-research-engineer-safeguards

Related skills

Need → Skill

  • Privacy research for safeguards → privacy-research-engineer-safeguards
  • Production safeguard path and rollout → ml-infrastructure-engineer-safeguards
  • Adversarial attack campaigns → ai-redteam
  • Governance sign-off and model cards → ai-risk-governance
  • Production eval harness in app → ai-engineer
  • General research methodology → ai-researcher
  • Classical ML and statistics → data-scientist
  • Token efficiency ablations → research-engineer-scientist-tokens
  • Release gates and ops cadence → ai-lead-ops

Core Workflows

1. Research framing (safety)

Hypotheses, harm taxonomy, success metrics. See references/research_framing_safety.md.

2. Benchmarks and datasets

Golden sets, labeling, versioning. See references/safety_benchmarks_datasets.md.
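
One way to make golden-set versioning concrete is to pin each version to a content hash. The sketch below is illustrative only: the function name `dataset_fingerprint` and the record fields `text` and `label` are assumptions, not taken from the referenced guide.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest that is stable under record reordering."""
    # Canonicalize each record (sorted keys, compact separators) so the same
    # content always serializes to the same bytes.
    lines = sorted(
        json.dumps(r, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for r in records
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical golden set with an assumed {text, label} schema.
golden_v1 = [
    {"text": "example benign prompt", "label": "allow"},
    {"text": "example policy-violating prompt", "label": "block"},
]
# Relabeling a single example yields a different fingerprint.
golden_v2 = [dict(golden_v1[0]), {**golden_v1[1], "label": "allow"}]

print(dataset_fingerprint(golden_v1) == dataset_fingerprint(golden_v1[::-1]))  # True
print(dataset_fingerprint(golden_v1) == dataset_fingerprint(golden_v2))        # False
```

Recording the fingerprint next to every reported metric makes it possible to tell whether two results were measured on the same data.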

3. Model development

Training, fine-tuning, ensembles. See references/classifier_model_development.md.
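
As a rough illustration of the ensemble option, the sketch below averages per-example scores from several classifiers. Weighted score averaging is only one possible ensembling scheme and is an assumption here; the function name, scores, and threshold are hypothetical.

```python
def ensemble_scores(per_model_scores, weights=None):
    """Weighted average of per-example scores from several classifiers.

    per_model_scores: one list of scores per model, all the same length.
    """
    n_models = len(per_model_scores)
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need one weight per model")
    total = sum(weights)
    return [
        sum(w * s for w, s in zip(weights, example_scores)) / total
        for example_scores in zip(*per_model_scores)
    ]


# Hypothetical harm scores from two models on three examples.
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.3]
combined = ensemble_scores([model_a, model_b])

threshold = 0.5  # illustrative operating point
single_flags = [s >= threshold for s in model_a]
ensemble_flags = [s >= threshold for s in combined]
print(single_flags, ensemble_flags)
```

Comparing `single_flags` with `ensemble_flags` on a labeled set is the ensemble-versus-single-model comparison in miniature.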

4. Evaluation and metrics

Slices, calibration, error analysis. See references/evaluation_metrics_analysis.md.
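
A minimal sketch of slice-level metrics and a calibration check, using only the standard library. The slice names, the 0.5 threshold, and the equal-width binning are assumptions chosen for illustration; the referenced guide may prescribe different choices.

```python
from collections import defaultdict


def precision_recall(labels, preds):
    """Precision and recall for binary labels/predictions (1 = harmful)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


def metrics_by_slice(rows, threshold=0.5):
    """rows: iterable of (slice_name, label, score) -> {slice: (P, R)}."""
    grouped = defaultdict(list)
    for slice_name, label, score in rows:
        grouped[slice_name].append((label, int(score >= threshold)))
    return {
        name: precision_recall([y for y, _ in pairs], [p for _, p in pairs])
        for name, pairs in grouped.items()
    }


def expected_calibration_error(labels, scores, n_bins=10):
    """Equal-width bins; weighted mean gap between the observed positive
    rate and the mean predicted score in each bin."""
    bins = defaultdict(list)
    for y, s in zip(labels, scores):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[idx].append((y, s))
    total = len(labels)
    ece = 0.0
    for members in bins.values():
        frac_pos = sum(y for y, _ in members) / len(members)
        mean_score = sum(s for _, s in members) / len(members)
        ece += len(members) / total * abs(frac_pos - mean_score)
    return ece


# Hypothetical eval rows: (slice, label, classifier score).
rows = [
    ("self_harm", 1, 0.9), ("self_harm", 1, 0.4), ("self_harm", 0, 0.2),
    ("jailbreak", 1, 0.8), ("jailbreak", 0, 0.7), ("jailbreak", 0, 0.1),
]
print(metrics_by_slice(rows))
print(expected_calibration_error([r[1] for r in rows], [r[2] for r in rows]))
```

In this toy data the aggregate hides opposite failure modes: one slice misses a harmful example (low recall) while the other over-flags (low precision), which is why the metrics are reported per slice.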

5. Ablations and experiments

Controls, reproducibility. See references/ablation_experiment_design.md.
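
A controlled ablation can be sketched as a full grid over the factors, with every configuration run on the same seeds. In the sketch below `run_experiment` is a hypothetical stand-in that returns a synthetic recall value so the harness runs end to end; a real version would train and evaluate a model.

```python
import itertools
import random
import statistics


def run_experiment(config, seed):
    """Hypothetical stand-in for a real train-and-eval run (synthetic output)."""
    rng = random.Random(seed)  # seeded, so a given (config, seed) reproduces
    base = 0.80 + (0.05 if config["ensemble"] else 0.0)
    return base + rng.uniform(-0.01, 0.01)


def ablation_grid(factors, seeds):
    """Run every factor combination across the same set of seeds."""
    names = sorted(factors)
    results = []
    for values in itertools.product(*(factors[n] for n in names)):
        config = dict(zip(names, values))
        scores = [run_experiment(config, s) for s in seeds]
        results.append({
            "config": config,
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),  # needs at least two seeds
        })
    return results


# Illustrative factors; the values are assumptions, not recommendations.
factors = {"ensemble": [False, True], "threshold": [0.3, 0.5]}
results = ablation_grid(factors, seeds=[0, 1, 2])
for row in results:
    print(row["config"], round(row["mean"], 4), round(row["stdev"], 4))
```

Reusing the same seeds for every configuration keeps seed noise from being mistaken for an effect of the factor under study.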

6. Handoff to production

Promotion criteria, monitoring hooks. See references/research_to_production_handoff.md.
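
Promotion criteria are easier to audit when written as an explicit check. The sketch below is one possible shape, assuming per-category recall floors and a cap on false-positive-rate increase; the field names, categories, and thresholds are all hypothetical.

```python
def promotion_decision(candidate, production, bar):
    """Compare per-category metrics against a promotion bar.

    candidate / production: {category: {"recall": float, "fpr": float}}
    bar: {"min_recall": {category: float}, "max_fpr_increase": float}
    Returns ("go" or "no-go", list of human-readable reasons).
    """
    reasons = []
    for category, floor in bar["min_recall"].items():
        recall = candidate[category]["recall"]
        if recall < floor:
            reasons.append(f"{category}: recall {recall:.3f} below floor {floor:.3f}")
        if recall < production[category]["recall"]:
            reasons.append(f"{category}: recall regressed vs production")
    for category, metrics in candidate.items():
        delta = metrics["fpr"] - production[category]["fpr"]
        if delta > bar["max_fpr_increase"]:
            reasons.append(f"{category}: FPR up by {delta:.3f}")
    return ("no-go" if reasons else "go"), reasons


# Illustrative numbers only.
production = {"violence": {"recall": 0.90, "fpr": 0.020}}
candidate_ok = {"violence": {"recall": 0.93, "fpr": 0.022}}
candidate_bad = {"violence": {"recall": 0.88, "fpr": 0.050}}
bar = {"min_recall": {"violence": 0.92}, "max_fpr_increase": 0.005}

print(promotion_decision(candidate_ok, production, bar))
print(promotion_decision(candidate_bad, production, bar))
```

Returning the reasons alongside the verdict gives the infrastructure owner something concrete to act on when the answer is no-go.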

Outputs

  • Research brief — question, baseline, hypothesis, metrics
  • Dataset card — sources, label schema, known limitations
  • Benchmark spec — cases, categories, pass/fail rubric
  • Results table — metrics by slice with confidence intervals where possible
  • Error analysis — representative FP/FN clusters
  • Promotion recommendation — go/no-go vs current production classifier
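
For the confidence intervals in the results table, a percentile bootstrap over evaluation examples is one common option. The sketch below assumes that method; the function names, resample count, and toy data are illustrative.

```python
import random


def recall(pairs):
    """pairs: list of (label, prediction); returns None without positives."""
    positives = [p for y, p in pairs if y == 1]
    return sum(positives) / len(positives) if positives else None


def bootstrap_ci(pairs, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` over resampled examples."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = []
    for _ in range(n_resamples):
        sample = [pairs[rng.randrange(len(pairs))] for _ in pairs]
        value = metric(sample)
        if value is not None:  # skip resamples with no positive examples
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(int((1 - alpha / 2) * len(stats)), len(stats) - 1)]
    return lo, hi


# Toy slice: 50 harmful examples (40 caught, 10 missed) and 50 benign ones.
pairs = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 0)] * 50
point = recall(pairs)
lo, hi = bootstrap_ci(pairs, recall)
print(f"recall = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Small slices produce wide intervals, which is worth showing in the table rather than hiding behind a point estimate.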

Principles

  • Measure what policy cares about — category-level recall on high-severity harms
  • Report failures honestly — FPs hurt UX; FNs hurt safety
  • Hold out adversarial refresh — do not train on the only test set
  • Reproducible — seeds, data version, model hash, eval script
  • Separate research from ops — research proves lift; infra ships it