ai-adversarial-robustness-engineer
AI Adversarial Robustness Engineer
When to Use
- Define threat models for evasion, poisoning, extraction, and inference attacks on ML/LLM systems
- Design robustness evaluation suites: attack success rate (ASR), perturbation budgets, slice metrics, regression harnesses
- Implement engineering defenses — adversarial training, input sanitization, detectors, ensembles
- Run lab/staging attack campaigns on model endpoints, APIs, or batch inference (authorized only)
- Audit training data and pipelines for poisoning, backdoors, and supply-chain tampering
- Specify production guardrails — input validation, output filtering, rate limits, anomaly monitors
- Compare certified vs empirical robustness claims and document limitations for stakeholders
- Investigate robustness regressions after model updates, fine-tunes, or data refreshes
When NOT to Use
- Broad LLM product red-team engagements, jailbreak policy, or ROE → ai-redteam
- AI governance, risk tiers, model cards, or compliance mapping → ai-risk-governance
- Safety classifier research, harm benchmarks, and moderation model training → ml-research-engineer-safeguards
- Safeguard gateways, GPU serving, canary routing, and inference SLOs → ml-infrastructure-engineer-safeguards
- PII, memorization, and privacy leakage research → privacy-research-engineer-safeguards
- Building production RAG/agents or LLM features → ai-engineer
- General literature survey without robustness scope → ai-researcher
- Network/web/AppSec penetration testing (non-model) → penetration-tester, web-pentester
Related skills
| Need | Skill |
|---|---|
| LLM jailbreak and app-surface red team | |
| Governance sign-off and risk tiers | |
| Safety classifier R&D and harm evals | |
| Production safeguard serving path | |
| Privacy and extraction research | |
| Production LLM/RAG implementation | |
| General ML research methodology | |
| Pipeline and artifact security | |
Core Workflows
1. Scope and threat model
- Identify assets: weights, embeddings, training data, inference API, logs
- Classify attacker goals and capabilities (white/gray/black box, budget, offline/online)
- Map attacks to lifecycle stage (data, train, deploy, monitor)
- Agree evaluation environment — no prod customer data without approval
See references/adversarial_robustness_scope.md.
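The scoping steps above can be captured as a small, checkable record. The following is a minimal sketch, not the skill's own schema: the class names (`ThreatModel`, `AttackPath`, `Access`, `Stage`) and the `validate` checks are hypothetical, chosen only to mirror the bullets (assets, attacker capability, lifecycle stage, environment approval).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Access(Enum):
    WHITE_BOX = "white"   # attacker sees weights and gradients
    GRAY_BOX = "gray"     # partial knowledge, e.g. architecture only
    BLACK_BOX = "black"   # query access only


class Stage(Enum):
    DATA = "data"
    TRAIN = "train"
    DEPLOY = "deploy"
    MONITOR = "monitor"


@dataclass
class AttackPath:
    name: str                           # e.g. "evasion", "poisoning"
    asset: str                          # asset the path targets
    access: Access
    stage: Stage
    query_budget: Optional[int] = None  # None for offline attacks


@dataclass
class ThreatModel:
    assets: List[str]
    environment: str                    # "lab", "staging", or "prod"
    prod_approved: bool = False
    paths: List[AttackPath] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return scoping problems; an empty list means the scope is consistent."""
        problems = []
        if self.environment == "prod" and not self.prod_approved:
            problems.append("production environment requires written approval")
        for p in self.paths:
            if p.asset not in self.assets:
                problems.append(f"path '{p.name}' targets unlisted asset '{p.asset}'")
        return problems
```

Running `validate()` before any test starts gives a cheap way to catch an attack path that targets an asset nobody agreed to put in scope.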
2. Attack taxonomy and scenarios
Document evasion, poisoning, extraction, and inference threats with realistic preconditions.
See references/threat_models_and_attack_taxonomy.md.
3. Metrics and benchmarks
Select perturbation norms, ASR definitions, slices, and baselines; pre-register pass/fail gates.
See references/evaluation_metrics_and_benchmarks.md.
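As an illustration of per-slice ASR with pre-registered gates, here is a minimal sketch. It assumes one common ASR definition (the fraction of examples classified correctly before the attack that are misclassified after it); the record keys (`slice`, `clean_correct`, `adv_correct`) and the function names are hypothetical.

```python
from collections import defaultdict


def attack_success_rate(records):
    """ASR over examples the model classified correctly before the attack.

    Each record is a dict with keys 'slice', 'clean_correct', 'adv_correct'.
    Returns None when no example was correct before the attack.
    """
    eligible = [r for r in records if r["clean_correct"]]
    if not eligible:
        return None
    flipped = sum(1 for r in eligible if not r["adv_correct"])
    return flipped / len(eligible)


def asr_by_slice(records):
    """Group records by slice name and compute ASR for each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r["slice"]].append(r)
    return {name: attack_success_rate(rs) for name, rs in groups.items()}


def failing_slices(slice_asr, max_asr):
    """Return slices whose ASR exceeds the pre-registered threshold.

    `max_asr` maps slice name to threshold and must contain a 'default' key.
    Slices with undefined ASR are reported as failing so they get reviewed.
    """
    default = max_asr["default"]
    return sorted(
        name for name, asr in slice_asr.items()
        if asr is None or asr > max_asr.get(name, default)
    )
```

Writing the thresholds to version control before the evaluation runs makes the gate hard to adjust after the results are known.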
4. Defenses and mitigations
Choose layered controls: robust training, preprocessing, detectors, ensembles, and operational limits.
See references/defenses_and_mitigations.md.
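To make the idea of layering concrete, the sketch below chains three simple controls: an operational query limit, input sanitization by clipping and quantization, and an ensemble vote that abstains on disagreement. It is an illustrative assumption rather than a recommended defense stack; all function names, the quantization level count, and the agreement threshold are hypothetical, and none of these controls is robust on its own against an adaptive attacker.

```python
from collections import Counter


def sanitize(x, lo=0.0, hi=1.0, levels=16):
    """Clip each feature to [lo, hi] and snap it to `levels` evenly spaced values."""
    step = (hi - lo) / (levels - 1)
    out = []
    for v in x:
        v = min(max(v, lo), hi)
        out.append(lo + round((v - lo) / step) * step)
    return out


def ensemble_predict(models, x, min_agreement=0.6):
    """Majority vote; return None (abstain) when agreement is below the threshold."""
    votes = Counter(m(x) for m in models)
    label, count = votes.most_common(1)[0]
    return label if count / len(models) >= min_agreement else None


def guarded_predict(models, x, queries_left):
    """Apply the operational, input, and model layers in order."""
    if queries_left <= 0:                     # operational layer
        return "rate_limited"
    clean = sanitize(x)                       # input layer
    label = ensemble_predict(models, clean)   # model layer
    return "abstain" if label is None else label
```

Abstentions are worth logging, since a rising abstention rate can be an early sign of probing.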
5. Red-team campaigns on models
Plan authorized attacks in lab/staging; capture reproduction packages and severity.
See references/red_team_campaigns_on_models.md.
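A reproduction package can be as simple as a JSON record that pins the model, the attack configuration, and the seed. The sketch below is one possible shape under stated assumptions: the field names, the severity labels, and the `build_repro_package` helper are hypothetical, and the only library calls used are `hashlib.sha256` and `json.dumps` from the Python standard library.

```python
import hashlib
import json

ALLOWED_ENVIRONMENTS = {"lab", "staging"}               # authorized targets only
SEVERITIES = {"low", "medium", "high", "critical"}      # hypothetical scale


def build_repro_package(model_bytes, attack_name, params, seed, environment, severity):
    """Return a dict that pins everything needed to rerun one finding."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError("campaigns are limited to lab or staging environments")
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "attack": attack_name,
        "params": params,
        "seed": seed,
        "environment": environment,
        "severity": severity,
    }


def to_json(package):
    """Serialize with sorted keys so identical packages diff cleanly."""
    return json.dumps(package, sort_keys=True, indent=2)
```

Hashing the exact model artifact, rather than recording a version string, guards against a finding being replayed against a silently updated checkpoint.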
6. Production guardrails and monitoring
Translate findings into input/output policies, drift monitors, and incident playbooks.
See references/production_guardrails_and_monitoring.md.
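Two of the guardrails mentioned in this skill, rate limits and rollback triggers, can be sketched in a few lines. This is a simplified illustration, assuming a single-process service and caller-supplied timestamps; `QueryRateMonitor`, `should_rollback`, and their default parameters are hypothetical names and values, not part of any existing library.

```python
from collections import deque


class QueryRateMonitor:
    """Sliding-window per-client query limit, e.g. to slow model-extraction attempts."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window = window_seconds
        self._seen = {}   # client_id -> deque of accepted request timestamps

    def allow(self, client_id, now):
        q = self._seen.setdefault(client_id, deque())
        while q and now - q[0] >= self.window:
            q.popleft()                 # drop timestamps outside the window
        if len(q) >= self.max_queries:
            return False                # denied requests are not recorded
        q.append(now)
        return True


def should_rollback(flagged, total, baseline_rate, tolerance=2.0, min_samples=100):
    """Trigger rollback when the flagged-input rate exceeds tolerance x baseline.

    Returns False until `min_samples` requests have been observed, to avoid
    reacting to noise in small samples.
    """
    if total < min_samples:
        return False
    return flagged / total > tolerance * baseline_rate
```

For example, with a baseline flag rate of 1% and the default tolerance of 2.0, `should_rollback(30, 1000, 0.01)` fires because 3% exceeds the 2% trigger level.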
Outputs
- Threat model — assets, adversaries, attack paths, assumptions
- Robustness eval spec — datasets, budgets, metrics, baselines, acceptance criteria
- Results report — ASR/slice tables, representative failures, confidence limits
- Defense plan — prioritized mitigations with residual risk
- Campaign log — authorized tests, payloads, reproduction steps (lab/staging)
- Guardrail spec — validation rules, monitors, rollback triggers
Principles
- Authorized testing only — written scope; never attack production without approval
- Empirical over claims — measure ASR and slices; treat certified bounds as supplementary
- Defense in depth — no single control; combine model, input, and operational layers
- Reproducibility — version data, model hash, attack code, and random seeds
- Honest limits — document threat-model mismatch and adaptive attackers