ml-research-engineer-safeguards
ML / Research Engineer, Safeguards
When to Use
- Define research questions on harm detection, jailbreak resistance, or policy categories
- Curate or audit safety datasets — labeling guidelines, bias checks, version control
- Train or fine-tune classifiers, rankers, or small LLM judges for moderation
- Design benchmarks and eval suites — golden sets, adversarial slices, regression harnesses
- Run ablations — architecture, threshold, data mix, ensemble vs single model
- Analyze metrics — precision/recall, calibration, false positive/negative slices
- Write research memos — methods, results, limitations, production recommendation
- Specify promotion bar for a new safeguard model version
When NOT to Use
- Deploy gateways, GPU serving, canary routing → ml-infrastructure-engineer-safeguards
- Execute structured red-team engagements on prod → ai-redteam
- Draft acceptable-use policy or risk tiers → ai-risk-governance
- Build customer-facing RAG/agents → ai-engineer
- General literature survey unrelated to safety → ai-researcher
- Token/context compression research → research-engineer-scientist-tokens
- Product A/B and business metrics → data-scientist
- PII detection benchmarks, memorization, logging minimization → privacy-research-engineer-safeguards
Related skills
| Need | Skill |
|---|---|
| Privacy research for safeguards | |
| Production safeguard path and rollout | |
| Adversarial attack campaigns | |
| Governance sign-off and model cards | |
| Production eval harness in app | |
| General research methodology | |
| Classical ML and statistics | |
| Token efficiency ablations | |
| Release gates and ops cadence | |
Core Workflows
1. Research framing (safety)
Hypotheses, harm taxonomy, success metrics.
See references/research_framing_safety.md.
2. Benchmarks and datasets
Golden sets, labeling, versioning.
See references/safety_benchmarks_datasets.md.
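One way to make golden-set versioning concrete is to derive the version identifier from the content itself, so that any change to a text or label yields a new version. The sketch below is illustrative only: the record shape (`id`, `text`, `label`) and the helper name `dataset_fingerprint` are assumptions, not something taken from the referenced guide.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a short, order-independent content hash for a list of dict records."""
    # Canonicalize each record (sorted keys, compact separators) so that key
    # order inside a record does not affect the hash, then sort the records
    # so that row order does not affect it either.
    canonical = sorted(
        json.dumps(r, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for r in records
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # record separator
    return digest.hexdigest()[:12]


# Hypothetical golden-set records, for illustration only.
golden_v1 = [
    {"id": "g-001", "text": "example benign prompt", "label": "allow"},
    {"id": "g-002", "text": "example harmful prompt", "label": "block"},
]
reordered = list(reversed(golden_v1))  # same content, different row order
relabeled = [dict(golden_v1[0], label="block"), golden_v1[1]]  # one label changed

print(dataset_fingerprint(golden_v1) == dataset_fingerprint(reordered))
print(dataset_fingerprint(golden_v1) == dataset_fingerprint(relabeled))
```

Recording this fingerprint next to every result makes it possible to tell later which exact golden set a number was measured on.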
3. Model development
Training, fine-tuning, ensembles.
See references/classifier_model_development.md.
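As a rough illustration of the ensemble option, per-model harm scores can be combined with a weighted mean and then thresholded. The scores, weights, and the 0.5 threshold below are placeholder assumptions; the function names are hypothetical and not from the referenced guide.

```python
def ensemble_score(scores, weights=None):
    """Weighted mean of per-model harm scores, each expected in [0, 1]."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def flag(scores, threshold=0.5, weights=None):
    """Return True when the combined score meets the (assumed) threshold."""
    return ensemble_score(scores, weights) >= threshold


# One confident model outvoted by two others under equal weights...
print(flag([0.9, 0.1, 0.1]))
# ...but not when that model is trusted three times as much.
print(flag([0.9, 0.1, 0.1], weights=[3, 1, 1]))
```

Whether an ensemble earns its extra serving cost is an empirical question for the ablation stage; this sketch only fixes what "ensemble" means operationally.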
4. Evaluation and metrics
Slices, calibration, error analysis.
See references/evaluation_metrics_analysis.md.
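Slice-level metrics and calibration can be sketched in a few lines of plain Python. The example assumes each evaluation example is a `(slice_name, score, label)` tuple with label 1 meaning harmful, and it uses an equal-width-bin expected calibration error for a binary score; both choices are assumptions made for illustration.

```python
from collections import defaultdict


def slice_metrics(examples, threshold=0.5):
    """Precision and recall per slice; None when a denominator is zero."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for slice_name, score, label in examples:
        predicted = score >= threshold
        c = counts[slice_name]
        if predicted and label == 1:
            c["tp"] += 1
        elif predicted and label == 0:
            c["fp"] += 1
        elif not predicted and label == 1:
            c["fn"] += 1
    out = {}
    for name, c in counts.items():
        flagged = c["tp"] + c["fp"]
        positives = c["tp"] + c["fn"]
        out[name] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / positives if positives else None,
        }
    return out


def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted mean gap between mean score and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    n = len(scores)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_score = sum(s for s, _ in b) / len(b)
        positive_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_score - positive_rate)
    return ece


# Hypothetical evaluation rows, for illustration only.
examples = [
    ("self_harm", 0.9, 1), ("self_harm", 0.4, 1), ("self_harm", 0.7, 0),
    ("spam", 0.2, 0), ("spam", 0.8, 1),
]
print(slice_metrics(examples))
```

Returning `None` rather than 0 for an empty denominator keeps a slice with no positives from being misread as a slice with zero recall.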
5. Ablations and experiments
Controls, reproducibility.
See references/ablation_experiment_design.md.
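A controlled ablation changes one factor at a time against a fixed baseline and repeats every variant over the same seeds. The sketch below generates such a run plan; the factor names and values (`architecture`, `threshold`, `data_mix`) are hypothetical examples, and `ablation_plan` is an illustrative helper rather than anything defined in the referenced guide.

```python
def ablation_plan(baseline, factors, seeds):
    """Build run configs that differ from the baseline in at most one factor.

    Every variant, including the baseline, is repeated over the same seeds so
    that differences between variants are not confounded with seed noise.
    """
    variants = [("baseline", dict(baseline))]
    for name, values in factors.items():
        for value in values:
            if value == baseline[name]:
                continue  # already covered by the baseline run
            config = dict(baseline)
            config[name] = value
            variants.append((f"{name}={value}", config))
    return [
        {"variant": label, "seed": seed, **config}
        for label, config in variants
        for seed in seeds
    ]


# Hypothetical baseline and factors, for illustration only.
baseline = {"architecture": "small-encoder", "threshold": 0.5, "data_mix": "v1"}
factors = {
    "threshold": [0.3, 0.5, 0.7],
    "data_mix": ["v1", "v1+adversarial"],
}
runs = ablation_plan(baseline, factors, seeds=[0, 1, 2])
print(len(runs))  # 4 variants x 3 seeds
```

A full factorial grid is the alternative when interactions between factors matter; it costs more runs, so the one-factor-at-a-time plan is a reasonable default for a first pass.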
6. Handoff to production
Promotion criteria, monitoring hooks.
See references/research_to_production_handoff.md.
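Promotion criteria are easier to audit when they are written as an explicit check rather than prose. The following sketch compares a candidate against the production classifier under two assumed rules: no recall regression on any high-severity category, and a bounded increase in false positive rate. The metric layout, category names, and the 0.005 tolerance are all hypothetical.

```python
def promotion_decision(candidate, production, high_severity, max_fpr_increase=0.005):
    """Return ("go" | "no-go", reasons) for a candidate versus production.

    Both metric dicts are assumed to look like:
        {"recall": {category: float, ...}, "fpr": float}
    """
    reasons = []
    for category in high_severity:
        cand = candidate["recall"][category]
        prod = production["recall"][category]
        if cand < prod:
            reasons.append(
                f"recall regression on {category}: {prod:.3f} -> {cand:.3f}"
            )
    fpr_delta = candidate["fpr"] - production["fpr"]
    if fpr_delta > max_fpr_increase:
        reasons.append(f"false positive rate up by {fpr_delta:.3f}")
    return ("go" if not reasons else "no-go", reasons)


# Hypothetical metrics, for illustration only.
production = {"recall": {"self_harm": 0.92, "violence": 0.88}, "fpr": 0.010}
better = {"recall": {"self_harm": 0.95, "violence": 0.90}, "fpr": 0.012}
worse = {"recall": {"self_harm": 0.90, "violence": 0.93}, "fpr": 0.010}

print(promotion_decision(better, production, ["self_harm", "violence"]))
print(promotion_decision(worse, production, ["self_harm", "violence"]))
```

Returning the reasons alongside the decision gives the research memo something concrete to cite when the answer is no-go.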
Outputs
- Research brief — question, baseline, hypothesis, metrics
- Dataset card — sources, label schema, known limitations
- Benchmark spec — cases, categories, pass/fail rubric
- Results table — metrics by slice with confidence intervals where possible
- Error analysis — representative FP/FN clusters
- Promotion recommendation — go/no-go vs current production classifier
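For the confidence intervals in the results table, a Wilson score interval is one common choice for proportions such as recall, and it behaves better than the normal approximation on small slices. The sketch below is a minimal stdlib implementation; using Wilson here, and the default z of 1.96 for roughly 95% coverage, are assumptions rather than requirements stated above.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion, clamped to [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical slice: 45 of 50 harmful examples caught (recall 0.90).
low, high = wilson_interval(45, 50)
print(f"recall 0.90, 95% CI [{low:.3f}, {high:.3f}]")
```

Reporting the interval next to each slice metric makes it visible when a slice is too small to support a go/no-go conclusion.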
Principles
- Measure what policy cares about — category-level recall on high-severity harms
- Report failures honestly — FPs hurt UX; FNs hurt safety
- Hold out adversarial refresh — do not train on the only test set
- Reproducible — seeds, data version, model hash, eval script
- Separate research from ops — research proves lift; infra ships it
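The reproducibility principle can be made concrete with a small run manifest that records the four items it names: seed, data version, model hash, and eval script. The sketch below is one possible shape; the field names, the `golden-v3` label, and the stand-in model bytes are hypothetical, and a real run would hash the actual model file and script.

```python
import hashlib
import json
import random


def sha256_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(seed, data_version, model_bytes, eval_script_source):
    """Collect the identifiers needed to rerun an evaluation."""
    return {
        "seed": seed,
        "data_version": data_version,
        "model_sha256": sha256_bytes(model_bytes),
        "eval_script_sha256": sha256_bytes(eval_script_source.encode("utf-8")),
    }


manifest = build_manifest(
    seed=1234,
    data_version="golden-v3",           # hypothetical version label
    model_bytes=b"fake-model-weights",  # stand-in for the serialized model file
    eval_script_source="print('eval')",  # stand-in for the eval script text
)
random.seed(manifest["seed"])  # seed the RNG from the recorded value
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing this manifest with every results table lets a reviewer confirm that two numbers being compared came from the same data and the same evaluation code.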