Constitutional AI - Harmlessness from AI Feedback
Quick start
Constitutional AI (CAI) trains models to be harmless through self-critique and AI feedback, without requiring human labels for harmful outputs.
Key concept: Models learn to critique and revise their own responses using a "constitution" (set of principles).
Two phases:
- Supervised Learning (SL): Self-critique + revision
- Reinforcement Learning (RL): RLAIF (RL from AI Feedback)
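The SL phase boils down to a critique-then-revise loop. As a toy, model-free sketch of that loop (the `flag_violations` and `rewrite` helpers here are stand-ins for the model calls shown in the workflows below, not part of the original recipe):

```python
def flag_violations(response, banned=("hack",)):
    # Stand-in for a model critique: return the constitution
    # violations found in the response (here, a banned-word check).
    return [w for w in banned if w in response.lower()]

def rewrite(response, violations):
    # Stand-in for a model revision: address each flagged violation.
    for w in violations:
        response = response.replace(w, "[redacted]")
    return response

response = "Sure, here is how to hack the server."
critique = flag_violations(response)
if critique:
    response = rewrite(response, critique)
print(response)
```

The real pipeline replaces both helpers with prompted generations from the model itself; the control flow is the same.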
Constitution example:
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced
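Later snippets reference a `CONSTITUTION` value; one way to carry the principles above into code is a plain string constant (the exact formatting is an assumption):

```python
# Hypothetical constant consumed by the preference prompts later on.
CONSTITUTION = """\
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuanced"""

print(CONSTITUTION.splitlines()[0])
```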
Common workflows
Workflow 1: Supervised learning phase (self-critique + revision)
**Step 1: Generate initial responses**:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="base-model")

prompts = [
    "How do I hack a website?",
    "Write something offensive about a group.",
    "Help me cheat on my exam."
]

# Keep only the generated text for each prompt
initial_responses = [out[0]["generated_text"] for out in generator(prompts, max_length=200)]
```

**Step 2: Self-critique with constitution**:
```python
critique_prompt = """
Given the question:
{question}
And the response:
{response}
Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse
Identify any ways this response violates the constitution:
"""

critiques = generator(
    [critique_prompt.format(question=q, response=r)
     for q, r in zip(prompts, initial_responses)]
)
```

**Step 3: Revision based on critique**:
```python
revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}
Please revise the response to better align with the constitution:
"""

revised_responses = generator(
    [revision_prompt.format(question=q, response=r, critique=c)
     for q, r, c in zip(prompts, initial_responses, critiques)]
)
```

**Step 4: Fine-tune on revised responses**:
```python
from trl import SFTTrainer

# Create dataset of (prompt, revised_response) pairs
dataset = create_dataset(prompts, revised_responses)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=1024
)
trainer.train()
```
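`create_dataset` is left abstract above. A minimal sketch, assuming SFTTrainer's default single `text` field (the Question/Answer template here is a hypothetical choice, not part of the original recipe):

```python
def create_dataset(prompts, revised_responses):
    # Hypothetical helper: packs each (prompt, revised response) pair
    # into the "text" field that SFTTrainer trains on by default.
    return [
        {"text": f"Question: {p}\nAnswer: {r}"}
        for p, r in zip(prompts, revised_responses)
    ]

pairs = create_dataset(
    ["How do I stay safe online?"],
    ["Use strong, unique passwords and enable two-factor authentication."],
)
print(pairs[0]["text"])
```

For the real trainer you would wrap the list with `datasets.Dataset.from_list` (or bake your model's chat template into the formatting).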
Workflow 2: RL phase (RLAIF - RL from AI Feedback)
**Step 1: Generate comparison pairs**:

```python
# Sample multiple responses per prompt
responses_a = generator(prompts, num_return_sequences=2, do_sample=True, temperature=0.8)
responses_b = generator(prompts, num_return_sequences=2, do_sample=True, temperature=0.8)
```
**Step 2: AI preference evaluation**:
```python
preference_prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Constitution:
{constitution}
Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""

# Get AI preferences (no human labels needed!)
preferences = generator(
    [preference_prompt.format(question=q, response_a=ra, response_b=rb, constitution=CONSTITUTION)
     for q, ra, rb in zip(prompts, responses_a, responses_b)]
)

# Parse preferences (A or B)
chosen, rejected = parse_preferences(preferences, responses_a, responses_b)
```
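`parse_preferences` is not defined above. A minimal sketch, assuming each evaluator output ends with a standalone "A" or "B" verdict and defaulting to A when no verdict is found:

```python
import re

def parse_preferences(preferences, responses_a, responses_b):
    # Hypothetical parser: take the last standalone "A" or "B" in each
    # evaluator output and split the pair into chosen/rejected lists.
    chosen, rejected = [], []
    for pref, ra, rb in zip(preferences, responses_a, responses_b):
        verdicts = re.findall(r"\b([AB])\b", pref)
        pick = verdicts[-1] if verdicts else "A"  # fall back to A if unparseable
        if pick == "A":
            chosen.append(ra)
            rejected.append(rb)
        else:
            chosen.append(rb)
            rejected.append(ra)
    return chosen, rejected

c, r = parse_preferences(["Reasoning... I choose B"], ["resp_a"], ["resp_b"])
print(c, r)
```

In practice you would also drop pairs where the verdict is unparseable rather than guessing, to keep the preference dataset clean.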
**Step 3: Train preference model (reward model)**:
```python
from trl import RewardTrainer, RewardConfig
preference_dataset = create_preference_dataset(prompts, chosen, rejected)
reward_config = RewardConfig(
output_dir="constitutional-reward-model",
learning_rate=1e-5,
num_train_epochs=1
)
reward_trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=preference_dataset,
processing_class=tokenizer
)
reward_trainer.train()
```

**Step 4: RL training with RLAIF**:

```python
from trl import PPOTrainer, PPOConfig

ppo_config = PPOConfig(
    reward_model_path="constitutional-reward-model",
    learning_rate=1e-6,
    kl_coef=0.05
)

ppo_trainer = PPOTrainer(
    model=model,
    config=ppo_config,
    reward_model=reward_model
)
ppo_trainer.train()
```
Workflow 3: Chain-of-thought critique
Enable reasoning transparency:
```python
cot_critique_prompt = """
Question: {question}
Response: {response}
Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]
Based on this analysis, suggest a revision if needed.
"""

cot_critiques = generator(
    [cot_critique_prompt.format(question=q, response=r) for q, r in zip(prompts, responses)]
)
```
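The numbered [Yes/No] format also makes the critique machine-checkable. A small sketch of a verdict extractor (the regex and the label names are assumptions about the model's output format):

```python
import re

def extract_verdicts(critique_text):
    # Hypothetical parser: map each numbered check to its Yes/No verdict.
    labels = ["helpful", "honest", "harmless", "non-toxic"]
    verdicts = {}
    for i, label in enumerate(labels, start=1):
        m = re.search(rf"{i}\..*?\b(Yes|No)\b", critique_text)
        verdicts[label] = m.group(1) if m else None
    return verdicts

sample = (
    "1. Is it helpful? Yes, it answers the question.\n"
    "2. Is it honest? Yes.\n"
    "3. Is it harmless? No, it describes a risky action.\n"
    "4. Does it avoid toxicity? Yes."
)
print(extract_verdicts(sample))
```

A `No` on any check can then trigger another critique/revision round automatically.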
When to use vs alternatives
Use Constitutional AI when:
- Want safety alignment without human labels
- Need explainable AI decisions
- Want to avoid evasive refusals
- Have a clear set of principles/constitution
- Need scalable safety training
Key techniques:
- RLAIF: AI-generated preferences (scalable, no human labels)
- RLHF: Human preferences (more accurate, expensive)
- Self-critique: Iterative improvement
- Chain-of-thought: Reasoning transparency
Use alternatives instead:
- RLHF (PPO): Need human-validated safety
- DPO/SimPO: Have human preference data
- NeMo Guardrails: Need runtime content filtering
- LlamaGuard: Need pre-trained moderation model
Common issues
Issue: Model refuses too much (evasive)

Add a constitution principle:

Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.

Issue: Self-critiques are weak

Use stronger critique prompts:

Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.

Issue: Revisions don't improve quality

Iterate multiple times:

```python
for _ in range(3):  # 3 rounds of critique/revision
    critique = generate_critique(response)
    response = generate_revision(response, critique)
```

Issue: RLAIF preferences are noisy

Use multiple AI evaluators:

```python
# Get preferences from 3 different models
prefs_1 = model_1.evaluate(responses)
prefs_2 = model_2.evaluate(responses)
prefs_3 = model_3.evaluate(responses)

# Majority vote
final_preference = majority_vote(prefs_1, prefs_2, prefs_3)
```
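`majority_vote` is left abstract above. A minimal per-item sketch, assuming each evaluator returns a list of "A"/"B" labels over the same prompts:

```python
from collections import Counter

def majority_vote(*preference_lists):
    # Hypothetical helper: per item, take the most common label across
    # evaluators; on a tie, the first evaluator's label wins (Counter
    # preserves first-seen order among equal counts).
    final = []
    for votes in zip(*preference_lists):
        final.append(Counter(votes).most_common(1)[0][0])
    return final

print(majority_vote(["A", "B"], ["A", "A"], ["B", "B"]))
```

With an odd number of evaluators there are no ties, which is one reason three is a convenient choice.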
Advanced topics
Constitution design: See references/constitution-design.md for principle selection, trade-offs between helpfulness and harmlessness, and domain-specific constitutions.
RLAIF vs RLHF: See references/rlaif-comparison.md for performance comparison, cost analysis, and when to use AI feedback vs human feedback.
Chain-of-thought reasoning: See references/cot-critique.md for prompt engineering for critiques, multi-step reasoning, and transparency improvements.
Hardware requirements
- GPU: NVIDIA A100/H100 recommended
- VRAM:
- SL phase (7B): 1× A100 40GB
- RL phase (7B): 2× A100 40GB (policy + reward model)
- Single-node: Sufficient for most use cases
- Mixed precision: BF16 recommended
Compute requirements:
- SL phase: Similar to standard SFT
- RL phase: Similar to PPO (higher than DPO)
- AI evaluation: Additional inference for critique/preference generation
Resources
- Paper: https://arxiv.org/abs/2212.08073 (Dec 2022)
- Anthropic blog: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
- Implementation: TRL (PPOTrainer + RewardTrainer)
- Claude: Uses Constitutional AI for safety