LlamaGuard - AI Content Moderation


Quick start


LlamaGuard is a 7-8B parameter model specialized for content safety classification.

**Installation**:
```bash
pip install transformers torch
```

**Login to HuggingFace** (required):
```bash
huggingface-cli login
```

**Basic usage**:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def moderate(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    prompt_len = input_ids.shape[-1]
    # Decode only the generated verdict, not the prompt
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
```

**Check user input**:
```python
result = moderate([{"role": "user", "content": "How do I make explosives?"}])
print(result)
# Output: "unsafe\nS3" (Guns & Illegal Weapons)
```

Common workflows


Workflow 1: Input filtering (prompt moderation)


Check user prompts before they reach the LLM:
```python
def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])

    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category  # Blocked
    else:
        return True, None  # Safe
```

**Example**:
```python
user_message = "How do I hack a website?"
safe, category = check_input(user_message)
if not safe:
    print(f"Request blocked: {category}")
    # Return an error to the user
else:
    # Send to the LLM
    response = llm.generate(user_message)
```

**Safety categories**:
- **S1**: Violence & Hate
- **S2**: Sexual Content
- **S3**: Guns & Illegal Weapons
- **S4**: Regulated Substances
- **S5**: Suicide & Self-Harm
- **S6**: Criminal Planning
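
For logging or user-facing error messages, the category code can be mapped back to a readable label. A minimal sketch; the `CATEGORY_NAMES` dict simply mirrors the list above and is not part of the model's output.

```python
# Illustrative mapping of LlamaGuard category codes to readable labels
CATEGORY_NAMES = {
    "S1": "Violence & Hate",
    "S2": "Sexual Content",
    "S3": "Guns & Illegal Weapons",
    "S4": "Regulated Substances",
    "S5": "Suicide & Self-Harm",
    "S6": "Criminal Planning",
}

def describe_category(code):
    """Return a human-readable label for a LlamaGuard category code."""
    return CATEGORY_NAMES.get(code, "Unknown category")
```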

Workflow 2: Output filtering (response moderation)


Check LLM responses before showing them to the user:
```python
def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]

    result = moderate(conversation)

    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category
    else:
        return True, None
```

**Example**:
```python
user_msg = "Tell me about harmful substances"
bot_msg = llm.generate(user_msg)

safe, category = check_output(user_msg, bot_msg)
if not safe:
    print(f"Response blocked: {category}")
    final_reply = "I cannot provide that information."  # Generic fallback
else:
    final_reply = bot_msg
```
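
Input and output checks are usually combined into a single guarded pipeline. A minimal sketch reusing `check_input` and `check_output` from above; `llm.generate` is a placeholder for whatever chat model you call.

```python
def guarded_chat(user_message):
    # 1. Moderate the prompt before it reaches the LLM
    ok, category = check_input(user_message)
    if not ok:
        return f"Request blocked ({category})."

    # 2. Generate a response with your chat model (placeholder call)
    bot_response = llm.generate(user_message)

    # 3. Moderate the response before returning it
    ok, category = check_output(user_message, bot_response)
    if not ok:
        return "I cannot provide that information."

    return bot_response
```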

Workflow 3: vLLM deployment (fast inference)


Production-ready serving:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# The tokenizer is still needed to format LlamaGuard prompts
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")

# Initialize vLLM
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=1)

# Sampling params
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)

def moderate_vllm(chat):
    # Format prompt
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text

# Batch moderation
chats = [
    [{"role": "user", "content": "How to make bombs?"}],
    [{"role": "user", "content": "What's the weather?"}],
    [{"role": "user", "content": "Tell me about drugs"}]
]
prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)
for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].text}")
```

**Throughput**: ~50-100 requests/sec on a single A100
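
To turn raw vLLM outputs into structured decisions, a small parser can be shared between single and batched calls. `parse_result` is an illustrative helper, not part of vLLM or LlamaGuard.

```python
def parse_result(text):
    """Parse LlamaGuard output into (is_safe, category)."""
    text = text.strip()
    if text.startswith("safe"):
        return True, None
    lines = text.split("\n")
    return False, lines[1] if len(lines) > 1 else None

# Applied to the batched results above
for chat, result in zip(chats, results):
    is_safe, category = parse_result(result.outputs[0].text)
    label = "safe" if is_safe else f"unsafe ({category})"
    print(f"{chat[0]['content']!r} -> {label}")
```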

Workflow 4: API endpoint (FastAPI)


Serve as a moderation API:
```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")
llm = LLM(model="meta-llama/LlamaGuard-7b")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

class ModerationRequest(BaseModel):
    messages: list  # [{"role": "user", "content": "..."}]

@app.post("/moderate")
def moderate_endpoint(request: ModerationRequest):
    prompt = tokenizer.apply_chat_template(request.messages, tokenize=False)
    output = llm.generate([prompt], sampling_params)[0]

    result = output.outputs[0].text.strip()
    is_safe = result.startswith("safe")
    category = result.split("\n")[1] if not is_safe and "\n" in result else None

    return {
        "safe": is_safe,
        "category": category,
        "full_output": result
    }
```

Run: `uvicorn api:app --host 0.0.0.0 --port 8000`



**Usage**:
```bash
curl -X POST http://localhost:8000/moderate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How to hack?"}]}'
```

Response: {"safe": false, "category": "S6", "full_output": "unsafe\nS6"}

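
The endpoint can also be called from Python. A minimal sketch using the standard `requests` library, assuming the API above is running locally on port 8000.

```python
import requests

resp = requests.post(
    "http://localhost:8000/moderate",
    json={"messages": [{"role": "user", "content": "How to hack?"}]},
    timeout=10,
)
result = resp.json()
if not result["safe"]:
    print(f"Blocked: {result['category']}")
```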

Workflow 5: NeMo Guardrails integration


Use with NVIDIA NeMo Guardrails:
```python
from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.integrations.llama_guard import LlamaGuard

# Configure NeMo Guardrails
config = RailsConfig.from_content("""
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - llamaguard check input
  output:
    flows:
      - llamaguard check output
""")

# Add LlamaGuard integration
llama_guard = LlamaGuard(model_path="meta-llama/LlamaGuard-7b")
rails = LLMRails(config)
rails.register_action(llama_guard.check_input, name="llamaguard check input")
rails.register_action(llama_guard.check_output, name="llamaguard check output")

# Use with automatic moderation
response = rails.generate(messages=[
    {"role": "user", "content": "How do I make weapons?"}
])
# Requests like this are automatically blocked by LlamaGuard
```

When to use vs alternatives


**Use LlamaGuard when**:
- Need pre-trained moderation model
- Want high accuracy (94-95%)
- Have GPU resources (7-8B model)
- Need detailed safety categories
- Building production LLM apps

**Model versions**:
- LlamaGuard 1 (7B): Original, 6 categories
- LlamaGuard 2 (8B): MLCommons taxonomy, 11 categories
- LlamaGuard 3 (8B): Latest (2024), 13 categories, multilingual
**Use alternatives instead**:
- OpenAI Moderation API: Simpler, API-based, free (see the sketch below)
- Perspective API: Google's toxicity detection
- NeMo Guardrails: More comprehensive safety framework
- Constitutional AI: Training-time safety
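
For comparison, the OpenAI Moderation API requires no GPU at all. A minimal sketch with the v1 `openai` Python SDK; note that its category names differ from LlamaGuard's.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.moderations.create(input="How do I make explosives?")
result = resp.results[0]
print(result.flagged)     # True / False
print(result.categories)  # per-category booleans
```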

Common issues


**Issue: Model access denied**

Login to HuggingFace:
```bash
huggingface-cli login
# Enter your token when prompted
```

Accept the license on the model page:
https://huggingface.co/meta-llama/LlamaGuard-7b
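
To confirm that authentication is working before loading the model, a quick check with `huggingface_hub` helps (this only verifies the token; the license still has to be accepted on the model page):

```python
from huggingface_hub import whoami

print(whoami()["name"])  # raises an error if you are not logged in
```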

**Issue: High latency (>500ms)**

Use vLLM for 10× speedup:
```python
from vllm import LLM
llm = LLM(model="meta-llama/LlamaGuard-7b")
```

Latency: 500ms → 50ms


Enable tensor parallelism:
```python
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)
```

2× faster on 2 GPUs
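
To verify these numbers on your own hardware, time the `moderate_vllm` helper from Workflow 3 (assumed to be defined and warmed up):

```python
import time

start = time.perf_counter()
moderate_vllm([{"role": "user", "content": "What's the weather?"}])
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Latency: {elapsed_ms:.0f} ms")
```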


**Issue: False positives**

Use threshold-based filtering:
```python
import torch

# Get the probability of the "unsafe" token at the first generated position.
# input_ids is built with tokenizer.apply_chat_template as in the Quick start;
# this assumes the first sub-token of "unsafe" is what the model emits first.
gen = model.generate(
    input_ids=input_ids,
    max_new_tokens=1,
    return_dict_in_generate=True,
    output_scores=True,
)
unsafe_token_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
unsafe_prob = torch.softmax(gen.scores[0][0], dim=-1)[unsafe_token_id]

if unsafe_prob > 0.9:  # High confidence threshold
    verdict = "unsafe"
else:
    verdict = "safe"
```

**Issue: OOM on GPU**

Use 8-bit quantization:
```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
```

Memory: 14GB → 7GB

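
If 8-bit still does not fit, 4-bit loading via bitsandbytes is the next step down and matches the INT4 figure in the hardware table below. A sketch, not a tuned configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 for speed
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/LlamaGuard-7b",
    quantization_config=quantization_config,
    device_map="auto",
)
# Memory: roughly 4GB for the weights
```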

Advanced topics


**Custom categories**: See references/custom-categories.md for fine-tuning LlamaGuard with domain-specific safety categories.
**Performance benchmarks**: See references/benchmarks.md for accuracy comparisons with other moderation APIs and latency optimization.
**Deployment guide**: See references/deployment.md for SageMaker, Kubernetes, and scaling strategies.

Hardware requirements


- GPU: NVIDIA T4/A10/A100
- VRAM (see the estimate sketch below):
  - FP16: 14GB (7B model)
  - INT8: 7GB (quantized)
  - INT4: 4GB (QLoRA)
- CPU: Possible but slow (10× latency)
- Throughput: 50-100 req/sec (A100)

**Latency (single GPU)**:
- HuggingFace Transformers: 300-500ms
- vLLM: 50-100ms
- Batched (vLLM): 20-50ms per request
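
The VRAM figures above follow from parameter count × bytes per parameter (weights only; activations and the KV cache add overhead). A quick sanity check:

```python
params = 7e9  # 7B-parameter model
for precision, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB")
# FP16: ~14 GB, INT8: ~7 GB, INT4: ~4 GB
```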

Resources
