# Advanced Guardrails
Production LLM safety using NeMo Guardrails, Guardrails AI, and OpenAI moderation, with red-teaming validation.

**NeMo Guardrails**: LangChain 1.x compatible, parallel rails execution, OpenTelemetry tracing. **DeepTeam**: 40+ vulnerabilities, OWASP Top 10 alignment.
## Overview
- Implementing input/output validation for LLM applications
- Preventing hallucinations and enforcing factuality
- Detecting and filtering toxic, harmful, or off-topic content
- Restricting LLM responses to specific domains/topics
- PII detection and redaction in LLM outputs
- Red-teaming and adversarial testing of LLM systems
- OWASP Top 10 for LLMs compliance
## Framework Comparison
| Framework | Best For | Key Features |
|---|---|---|
| NeMo Guardrails | Programmable flows, Colang 2.0 | Input/output rails, fact-checking, dialog control |
| Guardrails AI | Validator-based, modular | 100+ validators, PII, toxicity, structured output |
| OpenAI Guardrails | Drop-in wrapper | Simple integration, moderation API |
| DeepTeam | Red teaming, adversarial | GOAT attacks, multi-turn jailbreaking, vulnerability scanning |
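Of the four, the moderation path is the lightest-weight to adopt. A minimal sketch of screening text through OpenAI's Moderation API (the model name and client setup assume the current `openai` Python SDK):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(text: str) -> bool:
    """Return True if the Moderation API flags the text."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return response.results[0].flagged
```

In practice this runs on both the user prompt before generation and the model output after, with a refusal message returned on any flag.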
## Quick Reference
### NeMo Guardrails with Guardrails AI Integration
```yaml
# config.yml
models:
  - type: main
    engine: openai
    model: gpt-5.2

rails:
  config:
    guardrails_ai:
      validators:
        - name: toxic_language
          parameters:
            threshold: 0.5
            validation_method: "sentence"
        - name: guardrails_pii
          parameters:
            entities: ["phone_number", "email", "ssn", "credit_card"]
        - name: restricttotopic
          parameters:
            valid_topics: ["technology", "support"]
        - name: valid_length
          parameters:
            min: 10
            max: 500
  input:
    flows:
      - guardrailsai check input $validator="guardrails_pii"
      - guardrailsai check input $validator="competitor_check"
  output:
    flows:
      - guardrailsai check output $validator="toxic_language"
      - guardrailsai check output $validator="restricttotopic"
      - guardrailsai check output $validator="valid_length"
```
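Assuming the config above lives in a `./config` directory, loading and applying it follows the standard nemoguardrails pattern (a sketch; the directory path and example prompt are assumptions):

```python
from nemoguardrails import LLMRails, RailsConfig

# Load config.yml (and any .co flow files) from the config directory
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# All configured input/output rails run around this call
response = rails.generate(messages=[
    {"role": "user", "content": "How do I reset my password?"}
])
print(response["content"])
```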
### Colang 2.0 Fact-Checking Rails
```colang
define flow answer question with facts
  """Enable fact-checking for RAG responses."""
  user ...
  $answer = execute rag()
  $check_facts = True  # Enables fact-checking rail
  bot $answer

define flow check hallucination
  """Block responses about people without verification."""
  user ask about people
  $check_hallucination = True  # Blocking mode
  bot respond about people

define flow restrict competitor mentions
  """Prevent discussing competitor products."""
  user ask about $competitor
  if $competitor in ["CompetitorA", "CompetitorB"]
    bot "I can only discuss our products."
  else
    bot respond normally
```
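The `execute rag()` call above refers to a custom action that must be registered on the rails instance; a minimal sketch (the `rag` retrieval function itself is a hypothetical stand-in):

```python
async def rag() -> str:
    # Hypothetical placeholder: retrieve context and produce a grounded answer
    return "Answer grounded in retrieved documents."

# Makes the action available to `execute rag()` in the Colang flow
rails.register_action(rag, name="rag")
```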
### Guardrails AI Validators
```python
import openai

from guardrails import Guard
from guardrails.hub import (
    ToxicLanguage,
    DetectPII,
    RestrictToTopic,
    ValidLength,
    ResponseEvaluator,
)

# Create guard with multiple validators
guard = Guard().use_many(
    ToxicLanguage(threshold=0.5, on_fail="filter"),
    DetectPII(
        pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "SSN"],
        on_fail="fix",  # Redacts PII instead of rejecting
    ),
    RestrictToTopic(
        valid_topics=["technology", "customer support"],
        invalid_topics=["politics", "religion"],
        on_fail="refrain",
    ),
    ValidLength(min=10, max=500, on_fail="reask"),
)

# Validate LLM output
def respond(user_input: str) -> str:
    result = guard(
        llm_api=openai.chat.completions.create,
        model="gpt-5.2",
        messages=[{"role": "user", "content": user_input}],
    )
    if result.validation_passed:
        return result.validated_output
    return "I cannot respond to that request."
```
### DeepTeam Red Teaming
```python
from deepteam import red_team
from deepteam.vulnerabilities import (
    Bias, Toxicity, PIILeakage,
    PromptInjection, Jailbreaking,
    Misinformation, CompetitorEndorsement,
)

async def run_red_team_audit(
    target_model: callable,
    attacks_per_vulnerability: int = 10,
) -> dict:
    """Run comprehensive red team audit against target LLM."""
    results = await red_team(
        model=target_model,
        vulnerabilities=[
            Bias(categories=["gender", "race", "religion", "age"]),
            Toxicity(threshold=0.7),
            PIILeakage(types=["email", "phone", "ssn", "credit_card"]),
            PromptInjection(techniques=["direct", "indirect", "context"]),
            Jailbreaking(
                multi_turn=True,  # GOAT-style multi-turn attacks
                techniques=["dan", "roleplay", "context_manipulation"],
            ),
            Misinformation(domains=["health", "finance", "legal"]),
            CompetitorEndorsement(competitors=["competitor_list"]),
        ],
        attacks_per_vulnerability=attacks_per_vulnerability,
    )
    return {
        "total_attacks": results.total_attacks,
        "successful_attacks": results.successful_attacks,
        "attack_success_rate": results.successful_attacks / results.total_attacks,
        "vulnerabilities": [
            {
                "type": v.type,
                "severity": v.severity,
                "successful_prompts": v.successful_prompts[:3],
                "mitigation": v.suggested_mitigation,
            }
            for v in results.vulnerabilities
        ],
    }
```
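Invoking the audit only requires a callable that maps a prompt to a completion; a usage sketch (the `target` wrapper and `my_app` are hypothetical):

```python
import asyncio

async def target(prompt: str) -> str:
    # Hypothetical wrapper around the system under test
    return my_app.respond(prompt)

report = asyncio.run(run_red_team_audit(target, attacks_per_vulnerability=5))
print(f"Attack success rate: {report['attack_success_rate']:.1%}")
```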
## OWASP Top 10 for LLMs 2025 Mapping
| OWASP LLM Risk | Guardrail Solution |
|---|---|
| LLM01: Prompt Injection | NeMo input rails, Guardrails AI validators |
| LLM02: Insecure Output | Output rails, structured validation, sanitization |
| LLM03: Training Data Poisoning | N/A (training-time concern) |
| LLM04: Model Denial of Service | Rate limiting, token budgets, timeout rails |
| LLM05: Supply Chain Vulnerabilities | Dependency scanning, model provenance |
| LLM06: Sensitive Info Disclosure | PII detection, context separation, output filtering |
| LLM07: Insecure Plugin Design | Tool validation, permission boundaries |
| LLM08: Excessive Agency | Human-in-loop rails, action confirmation |
| LLM09: Overreliance | Factuality checking, confidence thresholds |
| LLM10: Model Theft | N/A (infrastructure concern) |
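Most rows map onto the rails and validators shown earlier; LLM04 is the exception, since it is enforced around the guardrail frameworks rather than inside them. A minimal sketch of a token budget plus timeout (the budget value and the chars-per-token heuristic are assumptions to tune per deployment):

```python
import asyncio

MAX_INPUT_TOKENS = 4000  # assumed budget; tune per deployment

async def generate_with_limits(llm_call, prompt: str, timeout_s: float = 30.0) -> str:
    """LLM04 mitigation sketch: reject oversized prompts and bound generation time."""
    if len(prompt) // 4 > MAX_INPUT_TOKENS:  # rough 4-chars-per-token heuristic
        raise ValueError("Prompt exceeds token budget")
    return await asyncio.wait_for(llm_call(prompt), timeout=timeout_s)
```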
## Anti-Patterns (FORBIDDEN)
```python
# NEVER trust LLM output without validation
response = llm.generate(prompt)
return response  # Raw, unvalidated output!

# NEVER skip input sanitization
user_input = request.json["message"]
llm.generate(user_input)  # Prompt injection risk!

# NEVER use single validation layer
if not is_toxic(output):  # Only one check
    return output

# ALWAYS use layered validation
guard = Guard().use_many(
    ToxicLanguage(threshold=0.5),
    DetectPII(on_fail="fix"),
    ValidLength(max=500),
)

# ALWAYS validate both input and output
input_result = input_guard.validate(user_input)
if not input_result.validation_passed:
    return "Invalid input"

llm_output = llm.generate(input_result.validated_output)
output_result = output_guard.validate(llm_output)
return output_result.validated_output
```
## Key Decisions
| Decision | Recommendation |
|---|---|
| Framework choice | NeMo for flows, Guardrails AI for validators |
| Toxicity threshold | 0.5 for content apps, 0.3 for children's apps |
| PII handling | Redact for logs, block for outputs |
| Topic restriction | Allowlist preferred over blocklist |
| Fact-checking | Required for factual domains (health, finance, legal) |
| Red-teaming frequency | Pre-release + quarterly |
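The PII row translates directly into two differently configured guards, since `on_fail` controls whether detected PII is rewritten or treated as a hard failure (a sketch; the entity lists are illustrative):

```python
from guardrails import Guard
from guardrails.hub import DetectPII

# Redact for logs: "fix" rewrites detected PII in place
log_guard = Guard().use(
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix")
)

# Block for outputs: "exception" fails loudly instead of rewriting
output_guard = Guard().use(
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="exception")
)
```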
## Detailed Documentation
| Resource | Description |
|---|---|
| references/nemo-guardrails.md | NeMo Guardrails with Colang 2.0 |
| references/guardrails-ai.md | Guardrails AI validators and patterns |
| references/openai-guardrails.md | OpenAI Moderation API integration |
| references/factuality-checking.md | Hallucination detection and grounding |
| references/red-teaming.md | DeepTeam and adversarial testing |
| scripts/nemo-config.yaml | Production NeMo configuration |
| scripts/rails-pipeline.py | Complete guardrails pipeline |
## Related Skills
- llm-safety-patterns - Context separation and attribution
- llm-evaluation - Quality assessment and hallucination detection
- input-validation - Request sanitization patterns
- owasp-top-10 - Web security fundamentals
## Capability Details
### nemo-guardrails
Keywords: NeMo, guardrails, rails, Colang, dialog flow, input rails, output rails
Solves:
- Configure NeMo Guardrails for LLM safety
- Implement Colang 2.0 dialog flows
- Create input/output validation rails
### guardrails-ai-validators
Keywords: Guardrails AI, validator, PII, toxicity, topic restriction, structured output
Solves:
- Use Guardrails AI validators for output validation
- Detect and redact PII from LLM responses
- Restrict LLM to specific topics
### factuality-checking
Keywords: fact-check, hallucination, grounding, RAG verification, NLI
Solves:
- Verify LLM claims against source documents
- Detect hallucinations in generated content
- Implement grounding checks for RAG
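A common grounding check runs an NLI model with the source passage as premise and the generated claim as hypothesis; a sketch using a public MNLI checkpoint (the model choice and the 0.5 threshold are assumptions):

```python
from transformers import pipeline

# Off-the-shelf NLI model; premise = source text, hypothesis = generated claim
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def is_grounded(source: str, claim: str, threshold: float = 0.5) -> bool:
    """Treat the claim as grounded only if the source entails it."""
    result = nli({"text": source, "text_pair": claim})
    # Label names (e.g. "ENTAILMENT") vary by checkpoint; check the model card
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold
```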
### red-teaming
Keywords: red team, adversarial, jailbreak, GOAT, prompt injection, DeepTeam
Solves:
- Run adversarial testing on LLM systems
- Detect jailbreaking vulnerabilities
- Test prompt injection resistance
### owasp-llm-compliance
Keywords: OWASP LLM, LLM security, LLM vulnerabilities, LLM Top 10
Solves:
- Implement OWASP Top 10 for LLMs mitigations
- Audit LLM systems for security compliance
- Design secure LLM architectures