ai-adversarial-robustness-engineer

AI Adversarial Robustness Engineer

When to Use

Define threat models for evasion, poisoning, extraction, and inference attacks on ML/LLM systems
Design robustness evaluation suites: attack success rate (ASR), perturbation budgets, slice metrics, regression harnesses
Implement engineering defenses — adversarial training, input sanitization, detectors, ensembles
Run lab/staging attack campaigns on model endpoints, APIs, or batch inference (authorized only)
Audit training data and pipelines for poisoning, backdoors, and supply-chain tampering
Specify production guardrails — input validation, output filtering, rate limits, anomaly monitors
Compare certified vs empirical robustness claims and document limitations for stakeholders
Investigate robustness regressions after model updates, fine-tunes, or data refreshes

When NOT to Use

Broad LLM product red-team engagements, jailbreak policy, or ROE → `ai-redteam`
AI governance, risk tiers, model cards, or compliance mapping → `ai-risk-governance`
Safety classifier research, harm benchmarks, and moderation model training → `ml-research-engineer-safeguards`
Safeguard gateways, GPU serving, canary routing, and inference SLOs → `ml-infrastructure-engineer-safeguards`
PII, memorization, and privacy leakage research → `privacy-research-engineer-safeguards`
Building production RAG/agents or LLM features → `ai-engineer`
General literature survey without robustness scope → `ai-researcher`
Network/web/AppSec penetration testing (non-model) → `penetration-tester`, `web-pentester`

Related skills

Need	Skill
LLM jailbreak and app-surface red team	`ai-redteam`
Governance sign-off and risk tiers	`ai-risk-governance`
Safety classifier R&D and harm evals	`ml-research-engineer-safeguards`
Production safeguard serving path	`ml-infrastructure-engineer-safeguards`
Privacy and extraction research	`privacy-research-engineer-safeguards`
Production LLM/RAG implementation	`ai-engineer`
General ML research methodology	`ai-researcher`
Pipeline and artifact security	`devsecops`

Core Workflows

1. Scope and threat model

Identify assets: weights, embeddings, training data, inference API, logs
Classify attacker goals and capabilities (white/gray/black box, budget, offline/online)
Map attacks to lifecycle stage (data, train, deploy, monitor)
Agree evaluation environment — no prod customer data without approval
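
The scoping steps above can be captured as a small, reviewable record. The following is a minimal sketch rather than a prescribed schema: the class names (`ThreatModel`, `Adversary`, `Access`, `Stage`), the field names, and the `validate` checks are all hypothetical and chosen only to mirror the four bullets.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    WHITE_BOX = "white-box"
    GRAY_BOX = "gray-box"
    BLACK_BOX = "black-box"


class Stage(Enum):
    DATA = "data"
    TRAIN = "train"
    DEPLOY = "deploy"
    MONITOR = "monitor"


@dataclass(frozen=True)
class Adversary:
    goal: str          # e.g. "evasion", "poisoning", "extraction", "inference"
    access: Access     # knowledge of the model (white/gray/black box)
    query_budget: int  # maximum online queries; 0 for an offline-only attacker
    online: bool       # whether the attacker interacts with a live endpoint


@dataclass
class ThreatModel:
    assets: list[str]
    adversaries: list[Adversary]
    attack_stage: dict[str, Stage]    # attacker goal -> lifecycle stage
    environment: str = "staging"      # assumed labels: "lab", "staging", "prod"
    prod_data_approved: bool = False  # written approval on file (assumption)

    def validate(self) -> list[str]:
        """Return a list of scoping problems; an empty list means none found."""
        problems = []
        for adv in self.adversaries:
            if adv.goal not in self.attack_stage:
                problems.append(f"no lifecycle stage mapped for goal '{adv.goal}'")
        if self.environment == "prod" and not self.prod_data_approved:
            problems.append("production evaluation requires explicit approval")
        return problems


tm = ThreatModel(
    assets=["weights", "inference API"],
    adversaries=[
        Adversary("evasion", Access.BLACK_BOX, 10_000, True),
        Adversary("poisoning", Access.GRAY_BOX, 0, False),
    ],
    attack_stage={"evasion": Stage.DEPLOY},
)
print(tm.validate())
```

Here the poisoning adversary has no lifecycle stage mapped yet, so `validate` reports that gap before any evaluation work begins.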

See references/adversarial_robustness_scope.md.

2. Attack taxonomy and scenarios

Document evasion, poisoning, extraction, and inference threats with realistic preconditions.

See references/threat_models_and_attack_taxonomy.md.

3. Metrics and benchmarks

Select perturbation norms, ASR definitions, slices, and baselines; pre-register pass/fail gates.
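
As an illustration of a pre-registered gate, the sketch below computes attack success rate per slice and compares it against thresholds fixed before the run. It assumes one common ASR convention (successful attacks divided by attempts on inputs the model classified correctly before the attack); the record keys, function names, and threshold values are hypothetical, and the slice name "overall" is reserved for the aggregate.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Compute ASR per slice plus an "overall" aggregate.

    Each record is a dict with keys:
      clean_correct (bool): model was right on the unperturbed input
      adv_correct   (bool): model was still right after the attack
      slice         (str):  evaluation slice the example belongs to
    Examples the model already got wrong are excluded from the denominator.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for r in records:
        if not r["clean_correct"]:
            continue
        for key in ("overall", r["slice"]):
            attempts[key] += 1
            if not r["adv_correct"]:
                successes[key] += 1
    return {key: successes[key] / attempts[key] for key in attempts}


def failing_gates(asr, gates):
    """Return slices whose ASR exceeds the pre-registered maximum.

    A gated slice with no measurements is treated as ASR 1.0, so a missing
    slice fails instead of passing silently.
    """
    return sorted(k for k, limit in gates.items() if asr.get(k, 1.0) > limit)


records = [
    {"clean_correct": True, "adv_correct": False, "slice": "short"},
    {"clean_correct": True, "adv_correct": True, "slice": "short"},
    {"clean_correct": True, "adv_correct": True, "slice": "long"},
    {"clean_correct": False, "adv_correct": False, "slice": "long"},
]
gates = {"overall": 0.4, "short": 0.25, "long": 0.25}  # illustrative limits

asr = attack_success_rate(records)
print(asr)
print(failing_gates(asr, gates))
```

On this toy data the overall ASR passes its gate while the "short" slice fails, which is the kind of regression an aggregate-only metric would hide.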

See references/evaluation_metrics_and_benchmarks.md.

4. Defenses and mitigations

Choose layered controls: robust training, preprocessing, detectors, ensembles, and operational limits.
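
To make the layering concrete, here is a toy sketch that chains a preprocessing step (clamping and quantizing inputs, in the spirit of bit-depth reduction), a majority-vote ensemble, and a simple disagreement detector. The function names, the agreement threshold, and the stand-in lambda "models" are assumptions for illustration only; real models and tuned thresholds would replace them.

```python
from collections import Counter


def squeeze(x, levels=8):
    """Clamp each feature to [0, 1] and quantize it to `levels` values."""
    out = []
    for v in x:
        v = min(1.0, max(0.0, v))
        out.append(round(v * (levels - 1)) / (levels - 1))
    return out


def defended_predict(x, models, min_agreement=0.6):
    """Return (label, flagged).

    label:   majority vote of the ensemble on the preprocessed input
    flagged: True when the share of models agreeing with the majority
             is below `min_agreement`, a crude signal of a suspicious input
    """
    x_sq = squeeze(x)
    votes = Counter(model(x_sq) for model in models)
    label, count = votes.most_common(1)[0]
    flagged = count / len(models) < min_agreement
    return label, flagged


# Stand-in classifiers: callables mapping a feature list to a label.
models = [
    lambda x: int(sum(x) > 1.0),
    lambda x: int(x[0] > 0.5),
    lambda x: int(max(x) > 0.9),
]

print(defended_predict([0.9, 0.3], models))
print(defended_predict([0.9, 0.3], models, min_agreement=1.0))
```

Requiring unanimity (`min_agreement=1.0`) flags the same input that a two-thirds majority accepts, which shows how the detector threshold trades false alarms against missed attacks. None of these layers should be read as sufficient alone, particularly against an adaptive attacker who knows the preprocessing.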

See references/defenses_and_mitigations.md.

5. Red-team campaigns on models

Plan authorized attacks in lab/staging; capture reproduction packages and severity.

See references/red_team_campaigns_on_models.md.

6. Production guardrails and monitoring

Translate findings into input/output policies, drift monitors, and incident playbooks.
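
One way an input policy might look in code is sketched below: a length and control-character check combined with a sliding-window per-client rate limit, which raises the cost of query-heavy extraction attempts. The class name `QueryGuard`, its defaults, and the reason strings are hypothetical; a deployed guardrail would typically live in a gateway with shared state rather than in process memory.

```python
import time
from collections import defaultdict, deque


class QueryGuard:
    """Validate inputs and enforce a sliding-window rate limit per client."""

    def __init__(self, max_len=4096, max_queries=100, window_s=60.0,
                 clock=time.monotonic):
        self.max_len = max_len
        self.max_queries = max_queries
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self.history = defaultdict(deque)  # client_id -> accepted timestamps

    def check(self, client_id, text):
        """Return (allowed, reason) for one incoming request."""
        if not text or len(text) > self.max_len:
            return False, "invalid_length"
        if any(ord(c) < 32 and c not in "\n\t\r" for c in text):
            return False, "control_chars"
        now = self.clock()
        recent = self.history[client_id]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.max_queries:
            return False, "rate_limited"
        recent.append(now)
        return True, "ok"


fake_now = [0.0]
guard = QueryGuard(max_len=10, max_queries=2, window_s=60.0,
                   clock=lambda: fake_now[0])
print(guard.check("client-a", "hello"))
print(guard.check("client-a", "hello"))
print(guard.check("client-a", "hello"))
```

Rejected requests are returned with a reason code so that the same events can feed anomaly monitors and incident playbooks.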

See references/production_guardrails_and_monitoring.md.

Outputs

Threat model — assets, adversaries, attack paths, assumptions
Robustness eval spec — datasets, budgets, metrics, baselines, acceptance criteria
Results report — ASR/slice tables, representative failures, confidence limits
Defense plan — prioritized mitigations with residual risk
Campaign log — authorized tests, payloads, reproduction steps (lab/staging)
Guardrail spec — validation rules, monitors, rollback triggers

Principles

Authorized testing only — written scope; never attack production without approval
Empirical over claims — measure ASR and slices; treat certified bounds as supplementary
Defense in depth — no single control; combine model, input, and operational layers
Reproducibility — version data, model hash, attack code, and random seeds
Honest limits — document threat-model mismatch and adaptive attackers
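
The reproducibility principle can be supported by writing a small manifest next to each result. The sketch below is one possible shape, assuming the model is available as a single file on disk; the function names, manifest keys, and the example version strings are hypothetical.

```python
import hashlib
import json
import random


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(model_path, data_version, attack_code_rev, seed):
    """Collect the identifiers needed to rerun an evaluation."""
    return {
        "model_sha256": sha256_file(model_path),
        "data_version": data_version,        # e.g. a dataset tag or snapshot id
        "attack_code_rev": attack_code_rev,  # e.g. a VCS revision string
        "seed": seed,
    }


def seeded_rng(manifest):
    """Return a private RNG seeded from the manifest, avoiding global state."""
    return random.Random(manifest["seed"])


def manifest_json(manifest):
    """Serialize with sorted keys so manifests diff cleanly between runs."""
    return json.dumps(manifest, sort_keys=True, indent=2)
```

A run would call `build_manifest` once, draw all randomness from `seeded_rng(manifest)`, and store `manifest_json(manifest)` alongside the results report. Frameworks with their own random number generators need to be seeded separately.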
