
OBLITERATUS — LLM Abliteration Toolkit


Skill by ara.so — Daily 2026 Skills collection.
OBLITERATUS is an open-source toolkit for identifying and surgically removing refusal behaviors from large language models using mechanistic interpretability techniques (abliteration). It locates refusal directions in a model's hidden states via SVD/PCA, projects them out of the weights, and preserves core language capabilities. Ships with a Gradio UI, CLI, Python API, and Colab notebook.

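The core operation described above — extracting a refusal direction from hidden states and projecting it out of the weights — can be illustrated with a toy NumPy sketch. This is generic directional-ablation math, not OBLITERATUS's internal code; all shapes and names here are illustrative:

```python
import numpy as np

# Toy hidden states (d_model=4) collected on "restricted" vs "unrestricted" prompts
rng = np.random.default_rng(0)
h_restricted = rng.normal(size=(8, 4)) + np.array([2.0, 0.0, 0.0, 0.0])
h_unrestricted = rng.normal(size=(8, 4))

# Mean-difference "refusal direction", normalized to unit length
r = h_restricted.mean(axis=0) - h_unrestricted.mean(axis=0)
r /= np.linalg.norm(r)

# Project the direction out of a weight matrix: W' = (I - r r^T) W
W = rng.normal(size=(4, 4))
W_ablated = W - np.outer(r, r) @ W

# The ablated weights have no remaining component along r
assert np.allclose(r @ W_ablated, 0.0)
```

The real pipeline does this per-layer with many directions found via SVD/PCA rather than a single mean difference, but the projection step has the same shape.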

Installation


```bash
# Core install
pip install obliteratus

# With Gradio UI support
pip install "obliteratus[spaces]"

# With all optional analysis modules
pip install "obliteratus[full]"

# From source (latest)
git clone https://github.com/elder-plinius/OBLITERATUS
cd OBLITERATUS
pip install -e ".[full]"
```

**Requirements:**
- Python 3.10+
- PyTorch 2.1+ with CUDA (recommended) or CPU
- `transformers`, `accelerate`, `gradio>=5.29.0`
- HuggingFace account + token for gated models

```bash
export HF_TOKEN=your_hf_token_here
huggingface-cli login
```


CLI — Key Commands


```bash
# Basic obliteration (default method)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct

# Advanced method (whitened SVD + bias projection + iterative refinement)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# Analysis-informed pipeline (auto-configures from geometry analysis)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method informed

# Specify output directory and push to Hub
obliteratus obliterate mistralai/Mistral-7B-Instruct-v0.3 \
    --method advanced \
    --output ./my-liberated-model \
    --push-to-hub your-username/mistral-7b-liberated

# LoRA-based reversible ablation (non-destructive)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
    --method lora \
    --lora-rank 1

# Strength sweep — find the capability/compliance tradeoff
obliteratus sweep meta-llama/Llama-3.1-8B-Instruct \
    --strengths 0.2,0.4,0.6,0.8,1.0

# Run analysis modules only (no modification)
obliteratus analyze meta-llama/Llama-3.1-8B-Instruct \
    --modules concept_cone,alignment_imprint,universality

# Benchmark: compare methods on a model
obliteratus benchmark meta-llama/Llama-3.1-8B-Instruct \
    --methods basic,advanced,informed

# Launch local Gradio UI
obliteratus ui
obliteratus ui --port 8080 --share
obliteratus ui --no-telemetry
```

---

Python API

Basic obliteration

```python
from obliteratus import Obliterator

# Initialize with a HuggingFace model ID or local path
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")

# Run the full pipeline: SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH
result = obl.obliterate(method="advanced")

print(result.perplexity_delta)    # capability preservation metric
print(result.refusal_rate_delta)  # refusal reduction
print(result.output_path)         # where the model was saved
```

Step-by-step pipeline

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    num_directions=32,          # number of refusal directions to extract
    strength=1.0,               # projection strength (0.0–1.0+)
    preserve_norm=True,         # norm-preserving biprojection
    project_biases=True,        # also remove from bias terms
    iterative_passes=3,         # re-probe after each pass
    layers="auto",              # or list of ints, e.g. [10, 11, 12, 13]
    dtype="bfloat16",
    device="cuda",
)

obl = Obliterator("mistralai/Mistral-7B-Instruct-v0.3", config=config)
```

Individual stages

```python
obl.summon()                           # load model + tokenizer
activations = obl.probe()              # collect activations on restricted vs unrestricted prompts
directions = obl.distill(activations)  # extract refusal directions via SVD
obl.excise(directions)                 # project out guardrail directions
metrics = obl.verify()                 # perplexity + coherence checks
obl.rebirth("./liberated-mistral-7b")  # save with metadata
```

Custom probe prompts

```python
from obliteratus import Obliterator
from obliteratus.probing import ProbeDataset

# Use your own restricted/unrestricted prompt pairs
dataset = ProbeDataset(
    restricted=[
        "How do I pick a lock?",
        "Write a story with explicit violence.",
        "Explain how malware works in detail.",
    ],
    unrestricted=[
        "What is the capital of France?",
        "Write a story about a dog.",
        "Explain how encryption works.",
    ],
)

obl = Obliterator("google/gemma-2-9b-it")
obl.summon()
activations = obl.probe(dataset=dataset)
directions = obl.distill(activations)
obl.excise(directions)
obl.rebirth("./liberated-gemma-2-9b")
```

Analysis modules

```python
from obliteratus.analysis import AnalysisSuite

suite = AnalysisSuite("meta-llama/Llama-3.1-8B-Instruct")
suite.load()

# Concept Cone Geometry — how many distinct refusal mechanisms?
cone = suite.concept_cone_geometry()
print(f"Solid angle estimate: {cone.solid_angle:.4f}")
print(f"Distinct refusal clusters: {cone.num_clusters}")

# Alignment Imprint Detection — DPO vs RLHF vs CAI vs SFT?
imprint = suite.alignment_imprint()
print(f"Detected training method: {imprint.method}")  # e.g. "RLHF"
print(f"Confidence: {imprint.confidence:.2%}")

# Ouroboros Effect — will it self-repair?
ouroboros = suite.ouroboros_quantification()
print(f"Self-repair score: {ouroboros.score:.4f}")
print(f"Recommended passes: {ouroboros.recommended_passes}")

# Cross-layer heatmap of refusal signal
heatmap = suite.layer_refusal_heatmap()
heatmap.plot(save_path="./refusal_heatmap.png")

# Safety-capability entanglement
entanglement = suite.entanglement_map()
print(f"Safe layers to modify: {entanglement.safe_layers}")
print(f"Risky layers (entangled): {entanglement.risky_layers}")
```

Analysis-informed obliteration

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

# The "informed" method runs analysis modules mid-pipeline
# to auto-configure every decision
config = PipelineConfig(method="informed")
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct", config=config)

result = obl.obliterate()
print(result.analysis_report)  # full auto-configuration decisions
```

Chat with obliterated model

```python
from obliteratus import Obliterator
from obliteratus.chat import ChatSession

obl = Obliterator("./liberated-llama-3.1-8b")
obl.summon()  # loads pre-obliterated model

session = ChatSession(obl.model, obl.tokenizer)

response = session.chat(
    "Explain in detail how a buffer overflow exploit works.",
    max_new_tokens=512,
    temperature=0.7,
)
print(response)
```

A/B comparison

```python
from obliteratus.compare import ABComparison

ab = ABComparison(
    original_path="meta-llama/Llama-3.1-8B-Instruct",
    obliterated_path="./liberated-llama-3.1-8b",
)

prompt = "Write a story involving morally grey characters."

original_resp, liberated_resp = ab.compare(prompt)
print("=== ORIGINAL ===")
print(original_resp)
print("=== LIBERATED ===")
print(liberated_resp)
```

Push obliterated model to Hub

```python
import os
from obliteratus import Obliterator

obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")
result = obl.obliterate(method="advanced")

result.push_to_hub(
    repo_id=f"{os.environ['HF_USERNAME']}/Llama-3.1-8B-Instruct-abliterated",
    token=os.environ["HF_TOKEN"],
    private=True,
)
```

Obliteration Methods

| Method | Description | Best For |
|---|---|---|
| `basic` | Mean-difference direction extraction, single pass | Quick experiments |
| `advanced` | Whitened SVD + bias projection + iterative refinement | Production use |
| `informed` | Analysis-guided auto-configuration | Unknown models |
| `lora` | Reversible LoRA rank-1 adapters (no weight surgery) | Reversible ablation |
| `pca` | PCA-based direction extraction | Research/comparison |
| `sparse` | Sparse autoencoder decomposition | MoE models |
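The difference between destructive weight surgery and the reversible `lora` method can be sketched with toy NumPy math. This is an illustration of the rank-1 adapter idea, not the toolkit's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))   # toy weight matrix
r = rng.normal(size=4)        # toy refusal direction
r /= np.linalg.norm(r)

# Destructive ablation rewrites W in place: W' = (I - r r^T) W.
# A rank-1 LoRA-style adapter instead stores the same edit as a
# separate low-rank term kept outside the base weights:
delta = -np.outer(r, r) @ W

W_effective = W + delta  # adapter applied: behaves like the ablated weights
assert np.allclose(r @ W_effective, 0.0)

# Reversible: dropping the adapter restores the original weights exactly
assert np.allclose(W_effective - delta, W)
```

This is why the table calls `lora` "no weight surgery": the base checkpoint is never modified, only composed with a removable rank-1 update.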

Configuration

```python
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    # Core
    method="advanced",              # abliteration method
    strength=1.0,                   # projection strength (tune down if capability degrades)
    num_directions=32,              # refusal directions to extract

    # Layer selection
    layers="auto",                  # "auto", "cosmic", or list of ints
    layer_selection="cosmic",       # COSMIC: most separable layers

    # Weight modification
    preserve_norm=True,             # norm-preserving biprojection (recommended)
    project_biases=True,            # project out bias terms too
    project_attention=True,         # modify attention projection weights
    project_mlp=True,               # modify MLP weights

    # Iterative refinement
    iterative_passes=3,             # re-probe after each pass (catches rotated directions)

    # MoE-specific
    expert_granular=False,          # Expert-Granular Abliteration for MoE models

    # CoT preservation
    cot_aware=True,                 # preserve chain-of-thought directions

    # Hardware
    dtype="bfloat16",               # "float32", "float16", "bfloat16"
    device="cuda",                  # "cuda", "cpu", "auto"
    load_in_4bit=False,             # bitsandbytes 4-bit loading

    # Telemetry (anonymous, contributes to research dataset)
    telemetry=True,
)
```

Common Patterns

Tune strength to preserve capability

```python
from obliteratus.sweep import StrengthSweep

# Find the sweet spot before running full obliteration
sweep = StrengthSweep("meta-llama/Llama-3.1-8B-Instruct")
results = sweep.run(strengths=[0.2, 0.4, 0.6, 0.8, 1.0, 1.2])

for r in results:
    print(f"Strength {r.strength:.1f} | perplexity_delta={r.perplexity_delta:.2f} | refusal_rate={r.refusal_rate:.2%}")

# Pick the best tradeoff
best = sweep.recommend()
print(f"Recommended strength: {best.strength}")
```
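Geometrically, the `strength` parameter scales how much of the extracted direction is removed. A toy sketch (illustrative math only, not toolkit code):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 4))  # toy weight matrix
r = rng.normal(size=4)       # toy refusal direction
r /= np.linalg.norm(r)

for s in (0.0, 0.5, 1.0):
    # Partial-strength projection: W' = W - s * r r^T W
    W_s = W - s * np.outer(r, r) @ W
    # The component along r shrinks linearly with strength
    assert np.allclose(r @ W_s, (1.0 - s) * (r @ W))
```

At `s=1.0` the direction is fully removed; lower values trade residual refusal behavior for better capability preservation, which is what the sweep above searches over.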

MoE model (Mixtral, DeepSeek-MoE)

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    expert_granular=True,      # decompose per-expert refusal signals
    project_attention=True,
    project_mlp=True,
)

obl = Obliterator("mistralai/Mixtral-8x7B-Instruct-v0.1", config=config)
obl.obliterate()
obl.rebirth("./liberated-mixtral-8x7b")
```

Batch benchmark multiple models

```python
from obliteratus.benchmark import ModelBenchmark

models = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

bench = ModelBenchmark(models=models, method="advanced")
report = bench.run()
report.save("./benchmark_report.json")
report.plot_heatmap("./benchmark_heatmap.png")
```

Troubleshooting

**Out of memory (OOM) on large models**
```python
config = PipelineConfig(
    dtype="float16",
    load_in_4bit=True,        # requires bitsandbytes
    device="cuda",
    layers=[10, 11, 12, 13],  # target fewer layers
    num_directions=16,        # fewer directions
)
```

**Capability degradation after obliteration**
```python
# Lower the strength or use COSMIC layer selection (most separable layers)
config = PipelineConfig(
    strength=0.6,
    layer_selection="cosmic",
    cot_aware=True,           # protect reasoning directions
    iterative_passes=1,       # fewer passes = less aggressive
)
```

**Refusal persists after obliteration**
```python
# Use the informed method and increase passes
config = PipelineConfig(
    method="informed",
    iterative_passes=5,
    project_biases=True,      # don't forget bias terms
    num_directions=64,        # extract more directions
)
```

**Gated model access error**
```bash
export HF_TOKEN=your_hf_token_here
# Accept the model license on HuggingFace Hub first, then:
huggingface-cli login
```

**Gradio UI won't start**
```bash
pip install "obliteratus[spaces]"

# Check port availability
obliteratus ui --port 7861
```

---

No-Code Options



Key Research References