Constitutional AI - Harmlessness from AI Feedback
Quick start
Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.
Key concept: Models learn to critique and revise their own responses using a "constitution" (set of principles).
Two phases:
- Supervised Learning (SL): Self-critique + revision
- Reinforcement Learning (RL): RLAIF (RL from AI Feedback)
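The SL phase boils down to a critique-then-revise loop. As a toy, model-free sketch of that loop (the `flag_violations` and `rewrite` helpers here are stand-ins for the model calls shown in the workflows below, not part of the original recipe):

```python
def flag_violations(response, banned=("hack",)):
    # Stand-in for a model critique: return the constitution
    # violations found in the response (here, a banned-word check).
    return [w for w in banned if w in response.lower()]

def rewrite(response, violations):
    # Stand-in for a model revision: address each flagged violation.
    for w in violations:
        response = response.replace(w, "[redacted]")
    return response

response = "Sure, here is how to hack the server."
critique = flag_violations(response)
if critique:
    response = rewrite(response, critique)
print(response)
```

The real pipeline replaces both helpers with prompted generations from the model itself; the control flow is the same.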
Constitution example:
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
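Later snippets reference a `CONSTITUTION` value; one way to carry the principles above into code is a plain string constant (the exact formatting is an assumption):

```python
# Hypothetical constant consumed by the preference prompts later on.
CONSTITUTION = """\
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced"""

print(CONSTITUTION.splitlines()[0])
```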
Common workflows
Workflow 1: Supervised learning phase (self-critique + revision)
**Step 1: Generate initial responses**:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")

prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]

# Keep only the generated text for each prompt
initial_responses = [out[0]["generated_text"] for out in generator(prompts, max_length=200)]
```

**Step 2: Self-critique with constitution**:
```python
critique_prompt = """
Given the question:
{question}
And the response:
{response}
Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse
Identify any ways this response violates the constitution:
"""

critiques = generator(
    [critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)]
)
```

**Step 3: Revision based on critique**:
```python
revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}
Please revise the response to better align with the constitution:
"""

revised_responses = generator(
    [revision_prompt.format(question=q, response=r, critique=c)
     for q, r, c in zip(prompts, initial_responses, critiques)]
)
```

**Step 4: Fine-tune on revised responses**:
```python
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs
dataset = create_dataset(prompts, revised_responses)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024
)
trainer.train()
```
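`create_dataset` is left abstract above. A minimal sketch, assuming SFTTrainer's default single `text` field (the Question/Answer template here is a hypothetical choice, not part of the original recipe):

```python
def create_dataset(prompts, revised_responses):
    # Hypothetical helper: packs each (prompt, revised response) pair
    # into the "text" field that SFTTrainer trains on by default.
    return [
        {"text": f"Question: {p}\nAnswer: {r}"}
        for p, r in zip(prompts, revised_responses)
    ]

pairs = create_dataset(
    ["How do I stay safe online?"],
    ["Use strong, unique passwords and enable two-factor authentication."],
)
print(pairs[0]["text"])
```

For the real trainer you would wrap the list with `datasets.Dataset.from_list` (or bake your model's chat template into the formatting).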
Workflow 2: RL phase (RLAIF - RL from AI Feedback)
**Step 1: Generate comparison pairs**:

```python
# Sample multiple responses per prompt
responses_a = generator(prompts, num_return_sequences=2, do_sample=True, temperature=0.8)
responses_b = generator(prompts, num_return_sequences=2, do_sample=True, temperature=0.8)
```
**Step 2: AI preference evaluation**:
```python
preference_prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Constitution:
{constitution}
Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""

# Get AI preferences (no human labels needed!)
preferences = generator(
    [preference_prompt.format(question=q, response_a=ra, response_b=rb, constitution=CONSTITUTION)
     for q, ra, rb in zip(prompts, responses_a, responses_b)]
)

# Parse preferences (A or B)
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
```
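`parse_preferences` is not defined above. A minimal sketch, assuming each evaluator output ends with a standalone "A" or "B" verdict and defaulting to A when no verdict is found:

```python
import re

def parse_preferences(preferences, responses_a, responses_b):
    # Hypothetical parser: take the last standalone "A" or "B" in each
    # evaluator output and split the pair into chosen/rejected lists.
    chosen, rejected = [], []
    for pref, ra, rb in zip(preferences, responses_a, responses_b):
        verdicts = re.findall(r"\b([AB])\b", pref)
        pick = verdicts[-1] if verdicts else "A"  # fall back to A if unparseable
        if pick == "A":
            chosen.append(ra)
            rejected.append(rb)
        else:
            chosen.append(rb)
            rejected.append(ra)
    return chosen, rejected

c, r = parse_preferences(["Reasoning... I choose B"], ["resp_a"], ["resp_b"])
print(c, r)
```

In practice you would also drop pairs where the verdict is unparseable rather than guessing, to keep the preference dataset clean.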
**Step 3: Train preference model (reward model)**:
```python
from trl import RewardTrainer, RewardConfig
preference_dataset = create_preference_dataset(prompts, chosen, rejected)
reward_config = RewardConfig(
output_dir="constitutional-reward-model",
learning_rate=1e-5,
num_train_epochs=1
)
reward_trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=preference_dataset,
processing_class=tokenizer
)
reward_trainer.train()
```

**Step 4: RL training with RLAIF**:

```python
from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    reward_model_path="constitutional-reward-model",
    learning_rate=1e-6,
    kl_coef=0.05
)

ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    reward_model=reward_model
)
ppo_trainer.train()
```
Workflow 3: Chain-of-thought critique
Enable reasoning transparency:
```python
cot_critique_prompt = """
Question: {question}
Response: {response}
Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]
Based on this analysis, suggest a revision if needed.
"""

cot_critiques = generator(
    [cot_critique_prompt.format(question=q, response=r) for q, r in zip(prompts, responses)]
)
```
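The numbered [Yes/No] format also makes the critique machine-checkable. A small sketch of a verdict extractor (the regex and the label names are assumptions about the model's output format):

```python
import re

def extract_verdicts(critique_text):
    # Hypothetical parser: map each numbered check to its Yes/No verdict.
    labels = ["helpful", "honest", "harmless", "non-toxic"]
    verdicts = {}
    for i, label in enumerate(labels, start=1):
        m = re.search(rf"{i}\..*?\b(Yes|No)\b", critique_text)
        verdicts[label] = m.group(1) if m else None
    return verdicts

sample = (
    "1. Is it helpful? Yes, it answers the question.\n"
    "2. Is it honest? Yes.\n"
    "3. Is it harmless? No, it describes a risky action.\n"
    "4. Does it avoid toxicity? Yes."
)
print(extract_verdicts(sample))
```

A `No` on any check can then trigger another critique/revision round automatically.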
When to use vs alternatives
Use Constitutional AI when:
- Want safety alignment without human labels
- Need explainable AI decisions
- Want to avoid evasive refusals
- Have a clear set of principles/constitution
- Need scalable safety training
Key techniques:
- RLAIF: AI-generated preferences (scalable, no human labels)
- RLHF: Human preferences (more accurate, expensive)
- Self-critique: Iterative improvement
- Chain-of-thought: Reasoning transparency
Use alternatives instead:
- RLHF (PPO): Need human-validated safety
- DPO/SimPO: Have human preference data
- NeMo Guardrails: Need runtime content filtering
- LlamaGuard: Need pre-trained moderation model
Common issues
Issue: Model refuses too much (evasive)

Add a constitution principle:

Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.

Issue: Self-critiques are weak

Use stronger critique prompts:

Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.

Issue: Revisions don't improve quality

Iterate multiple times:

```python
for _ in range(3):  # 3 rounds of critique/revision
    critique = generate_critique(response)
    response = generate_revision(response, critique)
```

Issue: RLAIF preferences are noisy

Use multiple AI evaluators:

```python
# Get preferences from 3 different models
prefs_1 = model_1.evaluate(responses)
prefs_2 = model_2.evaluate(responses)
prefs_3 = model_3.evaluate(responses)

# Majority vote
final_preference = majority_vote(prefs_1, prefs_2, prefs_3)
```
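`majority_vote` is left abstract above. A minimal per-item sketch, assuming each evaluator returns a list of "A"/"B" labels over the same prompts:

```python
from collections import Counter

def majority_vote(*preference_lists):
    # Hypothetical helper: per item, take the most common label across
    # evaluators; on a tie, the first evaluator's label wins (Counter
    # preserves first-seen order among equal counts).
    final = []
    for votes in zip(*preference_lists):
        final.append(Counter(votes).most_common(1)[0][0])
    return final

print(majority_vote(["A", "B"], ["A", "A"], ["B", "B"]))
```

With an odd number of evaluators there are no ties, which is one reason three is a convenient choice.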
Advanced topics
Constitution design: See references/constitution-design.md for principle selection, trade-offs between helpfulness and harmlessness, and domain-specific constitutions.
RLAIF vs RLHF: See references/rlaif-comparison.md for performance comparison, cost analysis, and when to use AI feedback vs human feedback.
Chain-of-thought reasoning: See references/cot-critique.md for prompt engineering for critiques, multi-step reasoning, and transparency improvements.
Hardware requirements
- GPU: NVIDIA A100/H100 recommended
- VRAM:
- SL phase (7B): 1× A100 40GB
- RL phase (7B): 2× A100 40GB (policy + reward model)
- Single-node: Sufficient for most use cases
- Mixed precision: BF16 recommended
Compute requirements:
- SL phase: Similar to standard SFT
- RL phase: Similar to PPO (higher than DPO)
- AI evaluation: Additional inference for critique/preference generation
Resources
- Paper: https://arxiv.org/abs/2212.08073 (Dec 2022)
- Anthropic blog: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
- Implementation: TRL (PPOTrainer + RewardTrainer)
- Claude: Uses Constitutional AI for safety