Loading...
Loading...
Compare original and translation side by side
Principles:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuancedPrinciples:
1. Choose the response that is most helpful, honest, and harmless
2. Avoid responses that are toxic, racist, or sexist
3. Prefer responses that explain objections rather than refuse
4. Choose responses that are thoughtful and nuancedfrom transformers import pipeline
generator = pipeline("text-generation", model="base-model")
prompts = [
"How do I hack a website?",
"Write something offensive about a group.",
"Help me cheat on my exam."
]
initial_responses = generator(prompts, max_length=200)critique_prompt = """
Given the question:
{question}
And the response:
{response}
Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse
Identify any ways this response violates the constitution:
"""
critiques = generator(
[critique_prompt.format(q=q, r=r) for q, r in zip(prompts, initial_responses)]
)revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}
Please revise the response to better align with the constitution:
"""
revised_responses = generator(
[revision_prompt.format(q=q, r=r, c=c)
for q, r, c in zip(prompts, initial_responses, critiques)]
)from trl import SFTTrainerfrom transformers import pipeline
generator = pipeline("text-generation", model="base-model")
prompts = [
"How do I hack a website?",
"Write something offensive about a group.",
"Help me cheat on my exam."
]
initial_responses = generator(prompts, max_length=200)critique_prompt = """
Given the question:
{question}
And the response:
{response}
Constitution:
1. Choose responses that are helpful, honest, and harmless
2. Avoid toxic, racist, or sexist content
3. Explain objections rather than refuse
Identify any ways this response violates the constitution:
"""
critiques = generator(
[critique_prompt.format(q=q, r=r) for q, r in zip(prompts, initial_responses)]
)revision_prompt = """
Question: {question}
Original response: {response}
Critique: {critique}
Please revise the response to better align with the constitution:
"""
revised_responses = generator(
[revision_prompt.format(q=q, r=r, c=c)
for q, r, c in zip(prompts, initial_responses, critiques)]
)from trl import SFTTrainerundefinedundefinedundefinedundefined
**Step 2: AI preference evaluation**:
```python
preference_prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Constitution:
{constitution}
Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""
**步骤2: AI偏好评估**:
```python
preference_prompt = """
Question: {question}
Response A: {response_a}
Response B: {response_b}
Constitution:
{constitution}
Which response better follows the constitution? Explain your reasoning, then choose A or B.
"""
**Step 3: Train preference model (reward model)**:
```python
from trl import RewardTrainer, RewardConfig
preference_dataset = create_preference_dataset(prompts, chosen, rejected)
reward_config = RewardConfig(
output_dir="constitutional-reward-model",
learning_rate=1e-5,
num_train_epochs=1
)
reward_trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=preference_dataset,
processing_class=tokenizer
)
reward_trainer.train()from trl import PPOTrainer, PPOConfig
ppo_config = PPOConfig(
reward_model_path="constitutional-reward-model",
learning_rate=1e-6,
kl_coef=0.05
)
ppo_trainer = PPOTrainer(
model=model,
config=ppo_config,
reward_model=reward_model
)
ppo_trainer.train()
**步骤3: 训练偏好模型(奖励模型)**:
```python
from trl import RewardTrainer, RewardConfig
preference_dataset = create_preference_dataset(prompts, chosen, rejected)
reward_config = RewardConfig(
output_dir="constitutional-reward-model",
learning_rate=1e-5,
num_train_epochs=1
)
reward_trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=preference_dataset,
processing_class=tokenizer
)
reward_trainer.train()from trl import PPOTrainer, PPOConfig
ppo_config = PPOConfig(
reward_model_path="constitutional-reward-model",
learning_rate=1e-6,
kl_coef=0.05
)
ppo_trainer = PPOTrainer(
model=model,
config=ppo_config,
reward_model=reward_model
)
ppo_trainer.train()cot_critique_prompt = """
Question: {question}
Response: {response}
Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]
Based on this analysis, suggest a revision if needed.
"""
cot_critiques = generator(
[cot_critique_prompt.format(q=q, r=r) for q, r in zip(prompts, responses)]
)cot_critique_prompt = """
Question: {question}
Response: {response}
Let's think step-by-step about whether this response follows our principles:
1. Is it helpful? [Yes/No and reasoning]
2. Is it honest? [Yes/No and reasoning]
3. Is it harmless? [Yes/No and reasoning]
4. Does it avoid toxicity? [Yes/No and reasoning]
Based on this analysis, suggest a revision if needed.
"""
cot_critiques = generator(
[cot_critique_prompt.format(q=q, r=r) for q, r in zip(prompts, responses)]
)Prefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.for _ in range(3): # 3 rounds of critique/revision
critique = generate_critique(response)
response = generate_revision(response, critique)undefinedPrefer responses that engage thoughtfully with questions rather than
refusing to answer. Explain concerns while still being helpful.Critically analyze this response for ANY potential issues, however minor.
Be thorough and specific in identifying problems.for _ in range(3): # 3 rounds of critique/revision
critique = generate_critique(response)
response = generate_revision(response, critique)undefinedundefinedundefined