LlamaGuard - AI Content Moderation


Quick start


LlamaGuard is a 7-8B parameter model specialized for content safety classification.

**Installation**:
```bash
pip install transformers torch
```

**Login to HuggingFace** (required):
```bash
huggingface-cli login
```

**Basic usage**:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def moderate(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    prompt_len = input_ids.shape[-1]
    # Decode only the generated verdict, not the prompt
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
```

**Check user input**:
```python
result = moderate([{"role": "user", "content": "How do I make explosives?"}])
print(result)
# Output: "unsafe\nS3" (Guns & Illegal Weapons)
```

Common workflows


Workflow 1: Input filtering (prompt moderation)


Check user prompts before they reach the LLM:
```python
def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])

    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category  # Blocked
    else:
        return True, None  # Safe
```

**Example**:
```python
user_message = "How do I hack a website?"
safe, category = check_input(user_message)
if not safe:
    print(f"Request blocked: {category}")
    # Return an error to the user
else:
    # Send to the LLM
    response = llm.generate(user_message)
```

**Safety categories**:
- **S1**: Violence & Hate
- **S2**: Sexual Content
- **S3**: Guns & Illegal Weapons
- **S4**: Regulated Substances
- **S5**: Suicide & Self-Harm
- **S6**: Criminal Planning
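
For logging or user-facing error messages, the category code can be mapped back to a readable label. A minimal sketch; the `CATEGORY_NAMES` dict simply mirrors the list above and is not part of the model's output.

```python
# Illustrative mapping of LlamaGuard category codes to readable labels
CATEGORY_NAMES = {
    "S1": "Violence & Hate",
    "S2": "Sexual Content",
    "S3": "Guns & Illegal Weapons",
    "S4": "Regulated Substances",
    "S5": "Suicide & Self-Harm",
    "S6": "Criminal Planning",
}

def describe_category(code):
    """Return a human-readable label for a LlamaGuard category code."""
    return CATEGORY_NAMES.get(code, "Unknown category")
```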

Workflow 2: Output filtering (response moderation)


Check LLM responses before showing them to the user:
```python
def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]

    result = moderate(conversation)

    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category
    else:
        return True, None
```

**Example**:
```python
user_msg = "Tell me about harmful substances"
bot_msg = llm.generate(user_msg)

safe, category = check_output(user_msg, bot_msg)
if not safe:
    print(f"Response blocked: {category}")
    final_reply = "I cannot provide that information."  # Generic fallback
else:
    final_reply = bot_msg
```
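
Input and output checks are usually combined into a single guarded pipeline. A minimal sketch reusing `check_input` and `check_output` from above; `llm.generate` is a placeholder for whatever chat model you call.

```python
def guarded_chat(user_message):
    # 1. Moderate the prompt before it reaches the LLM
    ok, category = check_input(user_message)
    if not ok:
        return f"Request blocked ({category})."

    # 2. Generate a response with your chat model (placeholder call)
    bot_response = llm.generate(user_message)

    # 3. Moderate the response before returning it
    ok, category = check_output(user_message, bot_response)
    if not ok:
        return "I cannot provide that information."

    return bot_response
```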

Workflow 3: vLLM deployment (fast inference)


Production-ready serving:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# The tokenizer is still needed to format LlamaGuard prompts
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")

# Initialize vLLM
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=1)

# Sampling params
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)

def moderate_vllm(chat):
    # Format prompt
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text

# Batch moderation
chats = [
    [{"role": "user", "content": "How to make bombs?"}],
    [{"role": "user", "content": "What's the weather?"}],
    [{"role": "user", "content": "Tell me about drugs"}]
]
prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)
for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].text}")
```

**Throughput**: ~50-100 requests/sec on a single A100
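
To turn raw vLLM outputs into structured decisions, a small parser can be shared between single and batched calls. `parse_result` is an illustrative helper, not part of vLLM or LlamaGuard.

```python
def parse_result(text):
    """Parse LlamaGuard output into (is_safe, category)."""
    text = text.strip()
    if text.startswith("safe"):
        return True, None
    lines = text.split("\n")
    return False, lines[1] if len(lines) > 1 else None

# Applied to the batched results above
for chat, result in zip(chats, results):
    is_safe, category = parse_result(result.outputs[0].text)
    label = "safe" if is_safe else f"unsafe ({category})"
    print(f"{chat[0]['content']!r} -> {label}")
```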

Workflow 4: API endpoint (FastAPI)


Serve as a moderation API:
```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")
llm = LLM(model="meta-llama/LlamaGuard-7b")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

class ModerationRequest(BaseModel):
    messages: list  # [{"role": "user", "content": "..."}]

@app.post("/moderate")
def moderate_endpoint(request: ModerationRequest):
    prompt = tokenizer.apply_chat_template(request.messages, tokenize=False)
    output = llm.generate([prompt], sampling_params)[0]

    result = output.outputs[0].text.strip()
    is_safe = result.startswith("safe")
    category = result.split("\n")[1] if not is_safe and "\n" in result else None

    return {
        "safe": is_safe,
        "category": category,
        "full_output": result
    }
```

Run: `uvicorn api:app --host 0.0.0.0 --port 8000`



**Usage**:
```bash
curl -X POST http://localhost:8000/moderate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How to hack?"}]}'
```

Response: {"safe": false, "category": "S6", "full_output": "unsafe\nS6"}

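
The endpoint can also be called from Python. A minimal sketch using the standard `requests` library, assuming the API above is running locally on port 8000.

```python
import requests

resp = requests.post(
    "http://localhost:8000/moderate",
    json={"messages": [{"role": "user", "content": "How to hack?"}]},
    timeout=10,
)
result = resp.json()
if not result["safe"]:
    print(f"Blocked: {result['category']}")
```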

Workflow 5: NeMo Guardrails integration


Use with NVIDIA NeMo Guardrails:
```python
from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.integrations.llama_guard import LlamaGuard

# Configure NeMo Guardrails
config = RailsConfig.from_content("""
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - llamaguard check input
  output:
    flows:
      - llamaguard check output
""")

# Add LlamaGuard integration
llama_guard = LlamaGuard(model_path="meta-llama/LlamaGuard-7b")
rails = LLMRails(config)
rails.register_action(llama_guard.check_input, name="llamaguard check input")
rails.register_action(llama_guard.check_output, name="llamaguard check output")

# Use with automatic moderation
response = rails.generate(messages=[
    {"role": "user", "content": "How do I make weapons?"}
])
# Requests like this are automatically blocked by LlamaGuard
```

When to use vs alternatives


**Use LlamaGuard when**:
- Need pre-trained moderation model
- Want high accuracy (94-95%)
- Have GPU resources (7-8B model)
- Need detailed safety categories
- Building production LLM apps

**Model versions**:
- LlamaGuard 1 (7B): Original, 6 categories
- LlamaGuard 2 (8B): MLCommons taxonomy, 11 categories
- LlamaGuard 3 (8B): Latest (2024), 13 categories, multilingual
**Use alternatives instead**:
- OpenAI Moderation API: Simpler, API-based, free (see the sketch below)
- Perspective API: Google's toxicity detection
- NeMo Guardrails: More comprehensive safety framework
- Constitutional AI: Training-time safety
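
For comparison, the OpenAI Moderation API requires no GPU at all. A minimal sketch with the v1 `openai` Python SDK; note that its category names differ from LlamaGuard's.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.moderations.create(input="How do I make explosives?")
result = resp.results[0]
print(result.flagged)     # True / False
print(result.categories)  # per-category booleans
```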

Common issues


**Issue: Model access denied**

Login to HuggingFace:
```bash
huggingface-cli login
# Enter your token when prompted
```

Accept the license on the model page:
https://huggingface.co/meta-llama/LlamaGuard-7b
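
To confirm that authentication is working before loading the model, a quick check with `huggingface_hub` helps (this only verifies the token; the license still has to be accepted on the model page):

```python
from huggingface_hub import whoami

print(whoami()["name"])  # raises an error if you are not logged in
```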

**Issue: High latency (>500ms)**

Use vLLM for 10× speedup:
```python
from vllm import LLM
llm = LLM(model="meta-llama/LlamaGuard-7b")
```

Latency: 500ms → 50ms


Enable tensor parallelism:
```python
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)
```

2× faster on 2 GPUs
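
To verify these numbers on your own hardware, time the `moderate_vllm` helper from Workflow 3 (assumed to be defined and warmed up):

```python
import time

start = time.perf_counter()
moderate_vllm([{"role": "user", "content": "What's the weather?"}])
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Latency: {elapsed_ms:.0f} ms")
```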


**Issue: False positives**

Use threshold-based filtering:
```python
import torch

# Get the probability of the "unsafe" token at the first generated position.
# input_ids is built with tokenizer.apply_chat_template as in the Quick start;
# this assumes the first sub-token of "unsafe" is what the model emits first.
gen = model.generate(
    input_ids=input_ids,
    max_new_tokens=1,
    return_dict_in_generate=True,
    output_scores=True,
)
unsafe_token_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
unsafe_prob = torch.softmax(gen.scores[0][0], dim=-1)[unsafe_token_id]

if unsafe_prob > 0.9:  # High confidence threshold
    verdict = "unsafe"
else:
    verdict = "safe"
```

**Issue: OOM on GPU**

Use 8-bit quantization:
```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
```

Memory: 14GB → 7GB

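
If 8-bit still does not fit, 4-bit loading via bitsandbytes is the next step down and matches the INT4 figure in the hardware table below. A sketch, not a tuned configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 for speed
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/LlamaGuard-7b",
    quantization_config=quantization_config,
    device_map="auto",
)
# Memory: roughly 4GB for the weights
```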

Advanced topics


**Custom categories**: See references/custom-categories.md for fine-tuning LlamaGuard with domain-specific safety categories.
**Performance benchmarks**: See references/benchmarks.md for accuracy comparisons with other moderation APIs and latency optimization.
**Deployment guide**: See references/deployment.md for SageMaker, Kubernetes, and scaling strategies.

Hardware requirements


- GPU: NVIDIA T4/A10/A100
- VRAM (see the estimate sketch below):
  - FP16: 14GB (7B model)
  - INT8: 7GB (quantized)
  - INT4: 4GB (QLoRA)
- CPU: Possible but slow (10× latency)
- Throughput: 50-100 req/sec (A100)

**Latency (single GPU)**:
- HuggingFace Transformers: 300-500ms
- vLLM: 50-100ms
- Batched (vLLM): 20-50ms per request
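
The VRAM figures above follow from parameter count × bytes per parameter (weights only; activations and the KV cache add overhead). A quick sanity check:

```python
params = 7e9  # 7B-parameter model
for precision, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB")
# FP16: ~14 GB, INT8: ~7 GB, INT4: ~4 GB
```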

Resources
