Found 14 Skills
Anthropic's method for training a harmless AI assistant through self-improvement. Two-phase approach: supervised learning on self-critiqued and revised responses, then RLAIF (RL from AI Feedback). Use for safety alignment and for reducing harmful outputs without human harmfulness labels. Used in training Claude.
Techniques for testing and bypassing AI safety filters, content moderation systems, and guardrails as part of a security assessment.