ai-adversarial-robustness-engineer

AI Adversarial Robustness Engineer

When to Use

  • Define threat models for evasion, poisoning, extraction, and inference attacks on ML/LLM systems
  • Design robustness evaluation suites: attack success rate (ASR), perturbation budgets, slice metrics, regression harnesses
  • Implement engineering defenses — adversarial training, input sanitization, detectors, ensembles
  • Run lab/staging attack campaigns on model endpoints, APIs, or batch inference (authorized only)
  • Audit training data and pipelines for poisoning, backdoors, and supply-chain tampering
  • Specify production guardrails — input validation, output filtering, rate limits, anomaly monitors
  • Compare certified vs empirical robustness claims and document limitations for stakeholders
  • Investigate robustness regressions after model updates, fine-tunes, or data refreshes

When NOT to Use

  • Broad LLM product red-team engagements, jailbreak policy, or rules of engagement (ROE) → ai-redteam
  • AI governance, risk tiers, model cards, or compliance mapping → ai-risk-governance
  • Safety classifier research, harm benchmarks, and moderation model training → ml-research-engineer-safeguards
  • Safeguard gateways, GPU serving, canary routing, and inference SLOs → ml-infrastructure-engineer-safeguards
  • PII, memorization, and privacy leakage research → privacy-research-engineer-safeguards
  • Building production RAG/agents or LLM features → ai-engineer
  • General literature survey without robustness scope → ai-researcher
  • Network/web/AppSec penetration testing (non-model) → penetration-tester, web-pentester

Related skills

Need                                      Skill
LLM jailbreak and app-surface red team    ai-redteam
Governance sign-off and risk tiers        ai-risk-governance
Safety classifier R&D and harm evals      ml-research-engineer-safeguards
Production safeguard serving path         ml-infrastructure-engineer-safeguards
Privacy and extraction research           privacy-research-engineer-safeguards
Production LLM/RAG implementation         ai-engineer
General ML research methodology           ai-researcher
Pipeline and artifact security            devsecops

Core Workflows

1. Scope and threat model

  1. Identify assets: weights, embeddings, training data, inference API, logs
  2. Classify attacker goals and capabilities (white/gray/black box, budget, offline/online)
  3. Map attacks to lifecycle stage (data, train, deploy, monitor)
  4. Agree evaluation environment — no prod customer data without approval
See references/adversarial_robustness_scope.md.
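
The scoping steps above can be captured as a small structured record so that assets, attacker access, and lifecycle stage are written down before any testing starts. The following is a minimal sketch, not a prescribed schema: the class names, field names, and the `query_budget` unit are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    WHITE_BOX = "white"   # attacker sees weights and gradients
    GRAY_BOX = "gray"     # partial knowledge, e.g. architecture only
    BLACK_BOX = "black"   # query access only


class Stage(Enum):
    DATA = "data"
    TRAIN = "train"
    DEPLOY = "deploy"
    MONITOR = "monitor"


@dataclass(frozen=True)
class Threat:
    name: str
    asset: str          # e.g. "weights", "training data", "inference API"
    goal: str           # evasion, poisoning, extraction, or inference
    access: Access
    query_budget: int   # hypothetical unit: maximum queries allowed
    online: bool        # True if the attack needs live access to the system
    stage: Stage


@dataclass
class ThreatModel:
    system: str
    environment: str                 # "lab", "staging", or "prod"
    prod_data_approved: bool = False
    threats: list = field(default_factory=list)

    def validate(self):
        """Return a list of scoping problems; an empty list means none found."""
        problems = []
        if self.environment == "prod" and not self.prod_data_approved:
            problems.append("production environment requires written approval")
        if not self.threats:
            problems.append("no threats listed")
        return problems

    def by_stage(self, stage):
        """Threats mapped to one lifecycle stage."""
        return [t for t in self.threats if t.stage is stage]
```

A record like this can be reviewed alongside the written scope, and `validate()` gives a cheap check that nobody has pointed an evaluation at production without approval.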

2. Attack taxonomy and scenarios

Document evasion, poisoning, extraction, and inference threats with realistic preconditions.
See references/threat_models_and_attack_taxonomy.md.

3. Metrics and benchmarks

Select perturbation norms, ASR definitions, slices, and baselines; pre-register pass/fail gates.
See references/evaluation_metrics_and_benchmarks.md.
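
As an illustration of pinning down an ASR definition and a pre-registered gate, the sketch below assumes one common convention: attack success rate is the fraction of samples the model classified correctly on clean input that become incorrect under attack. The record keys (`slice`, `clean_correct`, `adv_correct`) and the reserved key `"overall"` are hypothetical; other ASR definitions exist, and the eval spec should state which one applies.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Compute ASR per slice and overall.

    records: iterable of dicts with keys 'slice', 'clean_correct', 'adv_correct'.
    Samples that were already wrong on clean input are excluded from the
    denominator. The name 'overall' is reserved and must not be used as a slice.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [eligible, successes]
    for r in records:
        if not r["clean_correct"]:
            continue
        for key in (r["slice"], "overall"):
            totals[key][0] += 1
            if not r["adv_correct"]:
                totals[key][1] += 1
    return {key: successes / eligible for key, (eligible, successes) in totals.items()}


def gate(asr_by_slice, max_asr):
    """Pass only if every slice, and the overall figure, is at or below max_asr."""
    failing = sorted(k for k, v in asr_by_slice.items() if v > max_asr)
    return (len(failing) == 0, failing)
```

Fixing `max_asr` before the run, rather than after seeing results, is what makes the gate pre-registered; reporting the failing slice names keeps a weak slice from hiding behind a good overall number.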

4. Defenses and mitigations

Choose layered controls: robust training, preprocessing, detectors, ensembles, and operational limits.
See references/defenses_and_mitigations.md.
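
To make the idea of layering concrete, here is a rough sketch that chains two of the controls named above: a preprocessing step (clipping and quantizing inputs) followed by an ensemble vote that abstains when members disagree. The function names, the quantization scheme, and the 0.6 agreement threshold are illustrative assumptions, and neither layer should be read as sufficient against an adaptive attacker.

```python
from collections import Counter


def sanitize(x, lo=0.0, hi=1.0, levels=16):
    """Clip each feature to [lo, hi], then snap it to one of `levels` evenly spaced values."""
    step = (hi - lo) / (levels - 1)
    out = []
    for v in x:
        v = min(max(v, lo), hi)
        out.append(lo + round((v - lo) / step) * step)
    return out


def ensemble_predict(models, x, min_agreement=0.6):
    """Majority vote over callables; returns None (abstain) if agreement is too low."""
    votes = Counter(m(x) for m in models)
    label, count = votes.most_common(1)[0]
    return label if count / len(models) >= min_agreement else None


def guarded_predict(models, x):
    """Preprocessing layer first, then the ensemble layer."""
    return ensemble_predict(models, sanitize(x))
```

An abstention (`None`) is meant to be handled by an operational layer, for example routing to review or returning a safe default, which is the third layer in this arrangement.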

5. Red-team campaigns on models

Plan authorized attacks in lab/staging; capture reproduction packages and severity.
See references/red_team_campaigns_on_models.md.
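
A reproduction package might look something like the record below, which pins the model hash, attack code version, and random seed next to the payload and severity. The field names and the four-level severity scale are assumptions for illustration, not a format defined by this skill.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

SEVERITIES = ("low", "medium", "high", "critical")  # hypothetical scale
ALLOWED_ENVIRONMENTS = ("lab", "staging")


@dataclass(frozen=True)
class ReproPackage:
    campaign_id: str
    environment: str
    model_sha256: str
    attack_code_version: str
    seed: int
    payload: str
    severity: str

    def to_json(self):
        """Serialize with sorted keys so identical findings produce identical text."""
        return json.dumps(asdict(self), sort_keys=True)


def make_package(campaign_id, environment, model_blob,
                 attack_code_version, seed, payload, severity):
    """Build a package, refusing out-of-scope environments and unknown severities."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError("campaigns are limited to lab or staging")
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    model_sha256 = hashlib.sha256(model_blob).hexdigest()
    return ReproPackage(campaign_id, environment, model_sha256,
                        attack_code_version, seed, payload, severity)
```

Hashing the model bytes rather than recording a file name means a finding can still be tied to the exact weights after artifacts are renamed or re-uploaded.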

6. Production guardrails and monitoring

Translate findings into input/output policies, drift monitors, and incident playbooks.
See references/production_guardrails_and_monitoring.md.
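
One way a finding could turn into a monitor is sketched below: a rolling window over recent requests that tracks how often an input detector fires and maps the rate to `ok`, `alert`, or `rollback`. The class name, window size, and both thresholds are hypothetical placeholders; real values would come from the evaluation results and the incident playbook.

```python
from collections import deque


class FlagRateMonitor:
    """Tracks the share of recent requests flagged by an input detector."""

    def __init__(self, window=100, alert_rate=0.05, rollback_rate=0.20):
        self.events = deque(maxlen=window)  # oldest events drop off automatically
        self.alert_rate = alert_rate
        self.rollback_rate = rollback_rate

    def record(self, flagged):
        """Add one request outcome and return the current status."""
        self.events.append(bool(flagged))
        return self.status()

    def status(self):
        # Avoid acting on a rate computed from too few requests.
        if len(self.events) < self.events.maxlen:
            return "warming_up"
        rate = sum(self.events) / len(self.events)
        if rate >= self.rollback_rate:
            return "rollback"
        if rate >= self.alert_rate:
            return "alert"
        return "ok"
```

Keeping the alert threshold well below the rollback threshold gives operators a chance to investigate before the rollback trigger fires.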

Outputs

  • Threat model — assets, adversaries, attack paths, assumptions
  • Robustness eval spec — datasets, budgets, metrics, baselines, acceptance criteria
  • Results report — ASR/slice tables, representative failures, confidence limits
  • Defense plan — prioritized mitigations with residual risk
  • Campaign log — authorized tests, payloads, reproduction steps (lab/staging)
  • Guardrail spec — validation rules, monitors, rollback triggers
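
For the confidence limits in a results report, one reasonable option is a Wilson score interval on each ASR figure, since ASR is a proportion and slice sample sizes are often small. This is a sketch of that one choice, not a requirement of the report format; the function name and the default `z=1.96` (roughly a 95% interval) are assumptions.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion such as ASR.

    successes: number of successful attacks; trials: number of eligible samples.
    Returns (lower, upper), both clamped to [0, 1].
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))
```

Unlike the naive normal approximation, this interval stays informative when zero attacks succeed: with 0 successes in 20 trials the upper limit is still about 0.16, which is a useful reminder that a small clean run does not show the model is robust.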

Principles

  • Authorized testing only — written scope; never attack production without approval
  • Empirical over claims — measure ASR and slices; treat certified bounds as supplementary
  • Defense in depth — no single control; combine model, input, and operational layers
  • Reproducibility — version data, model hash, attack code, and random seeds
  • Honest limits — document threat-model mismatch and adaptive attackers