advanced-guardrails

Advanced Guardrails


Production LLM safety using NeMo Guardrails, Guardrails AI, and OpenAI moderation, validated through red-teaming.

**NeMo Guardrails**: LangChain 1.x compatible, parallel rails execution, OpenTelemetry tracing. **DeepTeam**: 40+ vulnerabilities, OWASP Top 10 alignment.

Overview


  • Implementing input/output validation for LLM applications
  • Preventing hallucinations and enforcing factuality
  • Detecting and filtering toxic, harmful, or off-topic content
  • Restricting LLM responses to specific domains/topics
  • PII detection and redaction in LLM outputs
  • Red-teaming and adversarial testing of LLM systems
  • OWASP Top 10 for LLMs compliance

Framework Comparison


| Framework | Best For | Key Features |
|---|---|---|
| NeMo Guardrails | Programmable flows, Colang 2.0 | Input/output rails, fact-checking, dialog control |
| Guardrails AI | Validator-based, modular | 100+ validators, PII, toxicity, structured output |
| OpenAI Guardrails | Drop-in wrapper | Simple integration, moderation API |
| DeepTeam | Red teaming, adversarial | GOAT attacks, multi-turn jailbreaking, vulnerability scanning |
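
The OpenAI Guardrails row relies on the hosted Moderation API. A minimal sketch of using it as a shared pre- and post-filter (assuming the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_flagged(text: str) -> bool:
    """Return True if the hosted moderation endpoint flags the text."""
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return response.results[0].flagged
```

Because the check is model-agnostic, the same function can gate user input before generation and model output after it.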

Quick Reference


NeMo Guardrails with Guardrails AI Integration


```yaml
# config.yml
models:
  - type: main
    engine: openai
    model: gpt-5.2

rails:
  config:
    guardrails_ai:
      validators:
        - name: toxic_language
          parameters:
            threshold: 0.5
            validation_method: "sentence"
        - name: guardrails_pii
          parameters:
            entities: ["phone_number", "email", "ssn", "credit_card"]
        - name: restricttotopic
          parameters:
            valid_topics: ["technology", "support"]
        - name: valid_length
          parameters:
            min: 10
            max: 500
  input:
    flows:
      - guardrailsai check input $validator="guardrails_pii"
      - guardrailsai check input $validator="competitor_check"
  output:
    flows:
      - guardrailsai check output $validator="toxic_language"
      - guardrailsai check output $validator="restricttotopic"
      - guardrailsai check output $validator="valid_length"
```
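
Loading this configuration at runtime is straightforward; a sketch assuming config.yml (plus any Colang files) lives in a local ./config directory:

```python
from nemoguardrails import LLMRails, RailsConfig

# Load config.yml and any .co files from the config directory
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

response = rails.generate(
    messages=[{"role": "user", "content": "My email is jane@example.com"}]
)
print(response["content"])  # the PII input rail should intervene here
```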

Colang 2.0 Fact-Checking Rails


```colang
define flow answer question with facts
  """Enable fact-checking for RAG responses."""
  user ...
  $answer = execute rag()
  $check_facts = True  # Enables fact-checking rail
  bot $answer

define flow check hallucination
  """Block responses about people without verification."""
  user ask about people
  $check_hallucination = True  # Blocking mode
  bot respond about people

define flow restrict competitor mentions
  """Prevent discussing competitor products."""
  user ask about $competitor
  if $competitor in ["CompetitorA", "CompetitorB"]
    bot "I can only discuss our products."
  else
    bot respond normally
```
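
The `execute rag()` step resolves to a custom action registered on the rails app. A minimal sketch, where the action body is a hypothetical stand-in for your real retrieval chain:

```python
from nemoguardrails import LLMRails, RailsConfig

async def rag(context: dict | None = None) -> str:
    """Backs `execute rag()` in the flow above; replace the body with a real RAG chain."""
    question = (context or {}).get("last_user_message", "")
    return f"(grounded answer for: {question})"

rails = LLMRails(RailsConfig.from_path("./config"))
rails.register_action(rag, name="rag")
```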

Guardrails AI Validators


```python
import openai

from guardrails import Guard
from guardrails.hub import (
    ToxicLanguage,
    DetectPII,
    RestrictToTopic,
    ValidLength,
    ResponseEvaluator,  # optional response-level evaluation
)

# Create guard with multiple validators
guard = Guard().use_many(
    ToxicLanguage(threshold=0.5, on_fail="filter"),
    DetectPII(
        pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "SSN"],
        on_fail="fix",  # Redacts PII instead of rejecting the output
    ),
    RestrictToTopic(
        valid_topics=["technology", "customer support"],
        invalid_topics=["politics", "religion"],
        on_fail="refrain",
    ),
    ValidLength(min=10, max=500, on_fail="reask"),
)


def respond(user_input: str) -> str:
    # Validate LLM output
    result = guard(
        llm_api=openai.chat.completions.create,
        model="gpt-5.2",
        messages=[{"role": "user", "content": user_input}],
    )
    if result.validation_passed:
        return result.validated_output
    return "I cannot respond to that request."
```
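
Note that hub validators ship separately from the core package; each is installed once via the hub CLI, e.g. `guardrails hub install hub://guardrails/toxic_language`.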

DeepTeam Red Teaming


```python
from deepteam import red_team
from deepteam.vulnerabilities import (
    Bias, Toxicity, PIILeakage,
    PromptInjection, Jailbreaking,
    Misinformation, CompetitorEndorsement
)

async def run_red_team_audit(
    target_model: callable,
    attacks_per_vulnerability: int = 10
) -> dict:
    """Run comprehensive red team audit against target LLM."""
    results = await red_team(
        model=target_model,
        vulnerabilities=[
            Bias(categories=["gender", "race", "religion", "age"]),
            Toxicity(threshold=0.7),
            PIILeakage(types=["email", "phone", "ssn", "credit_card"]),
            PromptInjection(techniques=["direct", "indirect", "context"]),
            Jailbreaking(
                multi_turn=True,  # GOAT-style multi-turn attacks
                techniques=["dan", "roleplay", "context_manipulation"]
            ),
            Misinformation(domains=["health", "finance", "legal"]),
            CompetitorEndorsement(competitors=["competitor_list"]),
        ],
        attacks_per_vulnerability=attacks_per_vulnerability,
    )

    return {
        "total_attacks": results.total_attacks,
        "successful_attacks": results.successful_attacks,
        "attack_success_rate": results.successful_attacks / results.total_attacks,
        "vulnerabilities": [
            {
                "type": v.type,
                "severity": v.severity,
                "successful_prompts": v.successful_prompts[:3],
                "mitigation": v.suggested_mitigation,
            }
            for v in results.vulnerabilities
        ],
    }
```
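
`red_team` drives everything through the target callable, which maps an adversarial prompt to your system's reply. The exact callback signature here is an assumption to match against your DeepTeam version; a sketch for an OpenAI-backed app:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def target_model(attack_prompt: str) -> str:
    """Invoked once per attack; return whatever your production stack would say."""
    response = await client.chat.completions.create(
        model="gpt-5.2",
        messages=[{"role": "user", "content": attack_prompt}],
    )
    return response.choices[0].message.content

# report = await run_red_team_audit(target_model, attacks_per_vulnerability=5)
```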

OWASP Top 10 for LLMs Mapping


| OWASP LLM Risk | Guardrail Solution |
|---|---|
| LLM01: Prompt Injection | NeMo input rails, Guardrails AI validators |
| LLM02: Insecure Output | Output rails, structured validation, sanitization |
| LLM03: Training Data Poisoning | N/A (training-time concern) |
| LLM04: Model Denial of Service | Rate limiting, token budgets, timeout rails |
| LLM05: Supply Chain Vulnerabilities | Dependency scanning, model provenance |
| LLM06: Sensitive Info Disclosure | PII detection, context separation, output filtering |
| LLM07: Insecure Plugin Design | Tool validation, permission boundaries |
| LLM08: Excessive Agency | Human-in-loop rails, action confirmation |
| LLM09: Overreliance | Factuality checking, confidence thresholds |
| LLM10: Model Theft | N/A (infrastructure concern) |
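
As a concrete instance of the LLM04 row, a minimal sketch that caps both the token budget and wall-clock time per call (assuming the `openai` async client; the limits are illustrative):

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()

MAX_OUTPUT_TOKENS = 512   # hard cap on response size
REQUEST_TIMEOUT_S = 30    # hard cap on wall-clock time per call

async def bounded_generate(prompt: str) -> str:
    """Generation with denial-of-service guards: token budget plus timeout."""
    response = await asyncio.wait_for(
        client.chat.completions.create(
            model="gpt-5.2",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=MAX_OUTPUT_TOKENS,
        ),
        timeout=REQUEST_TIMEOUT_S,
    )
    return response.choices[0].message.content
```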

Anti-Patterns (FORBIDDEN)


```python
# NEVER trust LLM output without validation
response = llm.generate(prompt)
return response  # Raw, unvalidated output!

# NEVER skip input sanitization
user_input = request.json["message"]
llm.generate(user_input)  # Prompt injection risk!

# NEVER use a single validation layer
if not is_toxic(output):  # Only one check
    return output

# ALWAYS use layered validation
guard = Guard().use_many(
    ToxicLanguage(threshold=0.5),
    DetectPII(on_fail="fix"),
    ValidLength(max=500),
)

# ALWAYS validate both input and output
input_result = input_guard.validate(user_input)
if not input_result.validation_passed:
    return "Invalid input"

llm_output = llm.generate(input_result.validated_output)
output_result = output_guard.validate(llm_output)
return output_result.validated_output
```

Key Decisions


| Decision | Recommendation |
|---|---|
| Framework choice | NeMo for flows, Guardrails AI for validators |
| Toxicity threshold | 0.5 for content apps, 0.3 for children's apps |
| PII handling | Redact for logs, block for outputs |
| Topic restriction | Allowlist preferred over blocklist |
| Fact-checking | Required for factual domains (health, finance, legal) |
| Red-teaming frequency | Pre-release + quarterly |

Detailed Documentation


| Resource | Description |
|---|---|
| references/nemo-guardrails.md | NeMo Guardrails with Colang 2.0 |
| references/guardrails-ai.md | Guardrails AI validators and patterns |
| references/openai-guardrails.md | OpenAI Moderation API integration |
| references/factuality-checking.md | Hallucination detection and grounding |
| references/red-teaming.md | DeepTeam and adversarial testing |
| scripts/nemo-config.yaml | Production NeMo configuration |
| scripts/rails-pipeline.py | Complete guardrails pipeline |

Related Skills


  • llm-safety-patterns
    - Context separation and attribution
  • llm-evaluation
    - Quality assessment and hallucination detection
  • input-validation
    - Request sanitization patterns
  • owasp-top-10
    - Web security fundamentals

Capability Details


nemo-guardrails

Keywords: NeMo, guardrails, rails, Colang, dialog flow, input rails, output rails

Solves:
  • Configure NeMo Guardrails for LLM safety
  • Implement Colang 2.0 dialog flows
  • Create input/output validation rails

guardrails-ai-validators

Keywords: Guardrails AI, validator, PII, toxicity, topic restriction, structured output

Solves:
  • Use Guardrails AI validators for output validation
  • Detect and redact PII from LLM responses
  • Restrict LLM to specific topics

factuality-checking

Keywords: fact-check, hallucination, grounding, RAG verification, NLI

Solves:
  • Verify LLM claims against source documents
  • Detect hallucinations in generated content
  • Implement grounding checks for RAG

red-teaming

Keywords: red team, adversarial, jailbreak, GOAT, prompt injection, DeepTeam

Solves:
  • Run adversarial testing on LLM systems
  • Detect jailbreaking vulnerabilities
  • Test prompt injection resistance

owasp-llm-compliance

Keywords: OWASP LLM, LLM security, LLM vulnerabilities, LLM Top 10

Solves:
  • Implement OWASP Top 10 for LLMs mitigations
  • Audit LLM systems for security compliance
  • Design secure LLM architectures