LlamaGuard - AI Content Moderation
Quick start
LlamaGuard is a 7-8B parameter model specialized for content safety classification.
**Installation**:
```bash
pip install transformers torch
```

Login to HuggingFace (required):
```bash
huggingface-cli login
```
**Basic usage**:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def moderate(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```
Check user input:
```python
result = moderate([
    {"role": "user", "content": "How do I make explosives?"}
])
print(result)
# Output: "unsafe\nS3" (Guns & Illegal Weapons)
```
Common workflows
Workflow 1: Input filtering (prompt moderation)
Check user prompts before they reach the LLM:
```python
def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])
    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category  # Blocked
    else:
        return True, None  # Safe
```
Example:
```python
user_message = "How do I hack a website?"
safe, category = check_input(user_message)
if not safe:
    print(f"Request blocked: {category}")
    # Return an error to the user
else:
    # Forward to the LLM
    response = llm.generate(user_message)
```
**Safety categories**:
- **S1**: Violence & Hate
- **S2**: Sexual Content
- **S3**: Guns & Illegal Weapons
- **S4**: Regulated Substances
- **S5**: Suicide & Self-Harm
- **S6**: Criminal Planning
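LlamaGuard's raw output is plain text: `"safe"`, or `"unsafe"` followed by a category code on the next line. It is worth parsing that defensively; a minimal sketch, with the code-to-label mapping taken from the list above (`parse_verdict` and `CATEGORY_NAMES` are illustrative names, not part of any library):

```python
# Map LlamaGuard category codes to the human-readable labels listed above
CATEGORY_NAMES = {
    "S1": "Violence & Hate",
    "S2": "Sexual Content",
    "S3": "Guns & Illegal Weapons",
    "S4": "Regulated Substances",
    "S5": "Suicide & Self-Harm",
    "S6": "Criminal Planning",
}

def parse_verdict(result: str):
    """Parse raw model output into (is_safe, category_label)."""
    lines = result.strip().split("\n")
    if lines[0] == "safe":
        return True, None
    category = lines[1] if len(lines) > 1 else None
    return False, CATEGORY_NAMES.get(category, category)

print(parse_verdict("unsafe\nS3"))  # → (False, 'Guns & Illegal Weapons')
print(parse_verdict("safe"))        # → (True, None)
```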
Workflow 2: Output filtering (response moderation)
Check LLM responses before showing them to the user:
```python
def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]
    result = moderate(conversation)
    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category
    else:
        return True, None
```
Example:
```python
user_msg = "Tell me about harmful substances"
bot_msg = llm.generate(user_msg)

safe, category = check_output(user_msg, bot_msg)
if not safe:
    print(f"Response blocked: {category}")
    # Serve a generic refusal instead of bot_msg
    final_response = "I cannot provide that information."
else:
    final_response = bot_msg
```
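Workflows 1 and 2 are typically chained into one guarded pipeline. A minimal sketch of that control flow; the model call and moderator are passed in as callables, and the name `safe_chat` and the refusal strings are illustrative:

```python
def safe_chat(user_msg, llm_generate, moderate_fn):
    """Run input moderation, then the LLM, then output moderation."""
    # 1. Check the prompt before spending LLM compute on it
    pre = moderate_fn([{"role": "user", "content": user_msg}])
    if pre.startswith("unsafe"):
        return "Your request was blocked."
    # 2. Generate a candidate response
    bot_msg = llm_generate(user_msg)
    # 3. Check the full exchange before showing it to the user
    post = moderate_fn([
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": bot_msg},
    ])
    if post.startswith("unsafe"):
        return "I cannot provide that information."
    return bot_msg
```

In production, the `moderate` function from the quick start (or a vLLM-backed variant) would be passed as `moderate_fn`.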
Workflow 3: vLLM deployment (fast inference)
Production-ready serving:
```python
from vllm import LLM, SamplingParams

# Initialize vLLM
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=1)

# Sampling params
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)

def moderate_vllm(chat):
    # Format prompt
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text
```
Batch moderation:
```python
chats = [
    [{"role": "user", "content": "How to make bombs?"}],
    [{"role": "user", "content": "What's the weather?"}],
    [{"role": "user", "content": "Tell me about drugs"}]
]
prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)
for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].text}")
```

**Throughput**: ~50-100 requests/sec on a single A100
Workflow 4: API endpoint (FastAPI)
Serve as a moderation API:
```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")
llm = LLM(model="meta-llama/LlamaGuard-7b")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

class ModerationRequest(BaseModel):
    messages: list  # [{"role": "user", "content": "..."}]

@app.post("/moderate")
def moderate_endpoint(request: ModerationRequest):
    prompt = tokenizer.apply_chat_template(request.messages, tokenize=False)
    output = llm.generate([prompt], sampling_params)[0]
    result = output.outputs[0].text.strip()
    is_safe = result.startswith("safe")
    category = None if is_safe else (result.split("\n")[1] if "\n" in result else None)
    return {
        "safe": is_safe,
        "category": category,
        "full_output": result
    }
```
Run: `uvicorn api:app --host 0.0.0.0 --port 8000`
**Usage**:
```bash
curl -X POST http://localhost:8000/moderate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How to hack?"}]}'
```

Response: `{"safe": false, "category": "S6", "full_output": "unsafe\nS6"}`
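On the client side, that JSON body can be folded into a single allow/deny decision; a minimal sketch (the field names match the endpoint's response shape, and `decide` is an illustrative helper, not part of any API):

```python
def decide(moderation_response: dict) -> str:
    """Map a /moderate JSON body to an action: 'allow' or 'block:<category>'."""
    if moderation_response["safe"]:
        return "allow"
    # Fall back to 'unknown' if the model emitted "unsafe" without a category line
    category = moderation_response.get("category") or "unknown"
    return f"block:{category}"

print(decide({"safe": False, "category": "S6", "full_output": "unsafe\nS6"}))  # → block:S6
print(decide({"safe": True, "category": None, "full_output": "safe"}))         # → allow
```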
Workflow 5: NeMo Guardrails integration
Use with NVIDIA NeMo Guardrails:
```python
from nemoguardrails import RailsConfig, LLMRails
from nemoguardrails.integrations.llama_guard import LlamaGuard

# Configure NeMo Guardrails
config = RailsConfig.from_content(yaml_content="""
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - llamaguard check input
  output:
    flows:
      - llamaguard check output
""")

# Add the LlamaGuard integration
llama_guard = LlamaGuard(model_path="meta-llama/LlamaGuard-7b")
rails = LLMRails(config)
rails.register_action(llama_guard.check_input, name="llamaguard check input")
rails.register_action(llama_guard.check_output, name="llamaguard check output")

# Generate with automatic moderation
response = rails.generate(messages=[
    {"role": "user", "content": "How do I make weapons?"}
])
# Automatically blocked by LlamaGuard
```
When to use vs alternatives
Use LlamaGuard when:
- Need pre-trained moderation model
- Want high accuracy (94-95%)
- Have GPU resources (7-8B model)
- Need detailed safety categories
- Building production LLM apps
Model versions:
- LlamaGuard 1 (7B): Original, 6 categories
- LlamaGuard 2 (8B): Improved, 6 categories
- LlamaGuard 3 (8B): Latest (2024), enhanced
Use alternatives instead:
- OpenAI Moderation API: Simpler, API-based, free
- Perspective API: Google's toxicity detection
- NeMo Guardrails: More comprehensive safety framework
- Constitutional AI: Training-time safety
Common issues
**Issue: Model access denied**

Login to HuggingFace:
```bash
huggingface-cli login
# Enter your token
```

Then accept the license on the model page:
https://huggingface.co/meta-llama/LlamaGuard-7b
**Issue: High latency (>500ms)**

Use vLLM for a ~10× speedup:
```python
from vllm import LLM

llm = LLM(model="meta-llama/LlamaGuard-7b")
# Latency: 500ms → 50ms
```

Enable tensor parallelism:
```python
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)
# ~2× faster on 2 GPUs
```
**Issue: False positives**

Use threshold-based filtering on the "unsafe" token probability:
```python
# Get the probability of the "unsafe" token at the first generated position
outputs = model.generate(..., return_dict_in_generate=True, output_scores=True)
unsafe_prob = torch.softmax(outputs.scores[0][0], dim=-1)[unsafe_token_id]
if unsafe_prob > 0.9:  # High-confidence threshold
    return "unsafe"
else:
    return "safe"
```
**Issue: OOM on GPU**

Use 8-bit quantization:
```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
# Memory: 14GB → 7GB
```
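The memory figures above follow directly from parameter count × bytes per weight (weights only; activations and KV cache add overhead on top). Since 10⁹ parameters at 1 byte each is 1 GB, the arithmetic is a one-liner:

```python
def weight_vram_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM for model weights alone (no activations/KV cache)."""
    return n_params_billion * bytes_per_param

print(weight_vram_gb(7, 2.0))  # FP16 → 14.0 GB
print(weight_vram_gb(7, 1.0))  # INT8 → 7.0 GB
print(weight_vram_gb(7, 0.5))  # INT4 → 3.5 GB
```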
Advanced topics
Custom categories: See references/custom-categories.md for fine-tuning LlamaGuard with domain-specific safety categories.
Performance benchmarks: See references/benchmarks.md for accuracy comparison with other moderation APIs and latency optimization.
Deployment guide: See references/deployment.md for Sagemaker, Kubernetes, and scaling strategies.
Hardware requirements
- GPU: NVIDIA T4/A10/A100
- VRAM:
- FP16: 14GB (7B model)
- INT8: 7GB (quantized)
- INT4: 4GB (QLoRA)
- CPU: Possible but slow (10× latency)
- Throughput: 50-100 req/sec (A100)
Latency (single GPU):
- HuggingFace Transformers: 300-500ms
- vLLM: 50-100ms
- Batched (vLLM): 20-50ms per request
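These throughput numbers make rough capacity planning straightforward. A sketch using the conservative end of the A100 figure above; the 70% utilization headroom is an assumption of this example, not a number from this document:

```python
import math

def gpus_needed(target_qps: float, per_gpu_qps: float = 50.0, headroom: float = 0.7) -> int:
    """Estimate GPU count, keeping each GPU below `headroom` utilization."""
    return math.ceil(target_qps / (per_gpu_qps * headroom))

print(gpus_needed(200))  # → 6 GPUs for 200 req/sec
```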
Resources
- HuggingFace: https://huggingface.co/meta-llama/LlamaGuard-7b
- Paper: https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/
- Integration: vLLM, Sagemaker, NeMo Guardrails
- Accuracy: 94.5% (prompts), 95.3% (responses)