
OBLITERATUS — LLM Abliteration Toolkit


Skill by ara.so — Daily 2026 Skills collection.
OBLITERATUS is an open-source toolkit for identifying and surgically removing refusal behaviors from large language models using mechanistic interpretability techniques (abliteration). It locates refusal directions in a model's hidden states via SVD/PCA, projects them out of the weights, and preserves core language capabilities. Ships with a Gradio UI, CLI, Python API, and Colab notebook.

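The core operation described above — extracting a refusal direction from hidden states and projecting it out of the weights — can be illustrated with a toy NumPy sketch. This is generic directional-ablation math, not OBLITERATUS's internal code; all shapes and names here are illustrative:

```python
import numpy as np

# Toy hidden states (d_model=4) collected on "restricted" vs "unrestricted" prompts
rng = np.random.default_rng(0)
h_restricted = rng.normal(size=(8, 4)) + np.array([2.0, 0.0, 0.0, 0.0])
h_unrestricted = rng.normal(size=(8, 4))

# Mean-difference "refusal direction", normalized to unit length
r = h_restricted.mean(axis=0) - h_unrestricted.mean(axis=0)
r /= np.linalg.norm(r)

# Project the direction out of a weight matrix: W' = (I - r r^T) W
W = rng.normal(size=(4, 4))
W_ablated = W - np.outer(r, r) @ W

# The ablated weights have no remaining component along r
assert np.allclose(r @ W_ablated, 0.0)
```

The real pipeline does this per-layer with many directions found via SVD/PCA rather than a single mean difference, but the projection step has the same shape.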

Installation


```bash
# Core install
pip install obliteratus

# With Gradio UI support
pip install "obliteratus[spaces]"

# With all optional analysis modules
pip install "obliteratus[full]"

# From source (latest)
git clone https://github.com/elder-plinius/OBLITERATUS
cd OBLITERATUS
pip install -e ".[full]"
```

**Requirements:**
- Python 3.10+
- PyTorch 2.1+ with CUDA (recommended) or CPU
- `transformers`, `accelerate`, `gradio>=5.29.0`
- HuggingFace account + token for gated models

```bash
export HF_TOKEN=your_hf_token_here
huggingface-cli login
```


CLI — Key Commands


```bash
# Basic obliteration (default method)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct

# Advanced method (whitened SVD + bias projection + iterative refinement)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# Analysis-informed pipeline (auto-configures from geometry analysis)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method informed

# Specify output directory and push to Hub
obliteratus obliterate mistralai/Mistral-7B-Instruct-v0.3 \
    --method advanced \
    --output ./my-liberated-model \
    --push-to-hub your-username/mistral-7b-liberated

# LoRA-based reversible ablation (non-destructive)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
    --method lora \
    --lora-rank 1

# Strength sweep — find the capability/compliance tradeoff
obliteratus sweep meta-llama/Llama-3.1-8B-Instruct \
    --strengths 0.2,0.4,0.6,0.8,1.0

# Run analysis modules only (no modification)
obliteratus analyze meta-llama/Llama-3.1-8B-Instruct \
    --modules concept_cone,alignment_imprint,universality

# Benchmark: compare methods on a model
obliteratus benchmark meta-llama/Llama-3.1-8B-Instruct \
    --methods basic,advanced,informed

# Launch local Gradio UI
obliteratus ui
obliteratus ui --port 8080 --share
obliteratus ui --no-telemetry
```

---

Python API

Basic obliteration

```python
from obliteratus import Obliterator

# Initialize with a HuggingFace model ID or local path
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")

# Run the full pipeline: SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH
result = obl.obliterate(method="advanced")

print(result.perplexity_delta)    # capability preservation metric
print(result.refusal_rate_delta)  # refusal reduction
print(result.output_path)         # where the model was saved
```

Step-by-step pipeline

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    num_directions=32,          # number of refusal directions to extract
    strength=1.0,               # projection strength (0.0–1.0+)
    preserve_norm=True,         # norm-preserving biprojection
    project_biases=True,        # also remove from bias terms
    iterative_passes=3,         # re-probe after each pass
    layers="auto",              # or list of ints, e.g. [10, 11, 12, 13]
    dtype="bfloat16",
    device="cuda",
)

obl = Obliterator("mistralai/Mistral-7B-Instruct-v0.3", config=config)
```

Individual stages

```python
obl.summon()                           # load model + tokenizer
activations = obl.probe()              # collect activations on restricted vs unrestricted prompts
directions = obl.distill(activations)  # extract refusal directions via SVD
obl.excise(directions)                 # project out guardrail directions
metrics = obl.verify()                 # perplexity + coherence checks
obl.rebirth("./liberated-mistral-7b")  # save with metadata
```

Custom probe prompts

```python
from obliteratus import Obliterator
from obliteratus.probing import ProbeDataset

# Use your own restricted/unrestricted prompt pairs
dataset = ProbeDataset(
    restricted=[
        "How do I pick a lock?",
        "Write a story with explicit violence.",
        "Explain how malware works in detail.",
    ],
    unrestricted=[
        "What is the capital of France?",
        "Write a story about a dog.",
        "Explain how encryption works.",
    ],
)

obl = Obliterator("google/gemma-2-9b-it")
obl.summon()
activations = obl.probe(dataset=dataset)
directions = obl.distill(activations)
obl.excise(directions)
obl.rebirth("./liberated-gemma-2-9b")
```

Analysis modules

```python
from obliteratus.analysis import AnalysisSuite

suite = AnalysisSuite("meta-llama/Llama-3.1-8B-Instruct")
suite.load()

# Concept Cone Geometry — how many distinct refusal mechanisms?
cone = suite.concept_cone_geometry()
print(f"Solid angle estimate: {cone.solid_angle:.4f}")
print(f"Distinct refusal clusters: {cone.num_clusters}")

# Alignment Imprint Detection — DPO vs RLHF vs CAI vs SFT?
imprint = suite.alignment_imprint()
print(f"Detected training method: {imprint.method}")  # e.g. "RLHF"
print(f"Confidence: {imprint.confidence:.2%}")

# Ouroboros Effect — will it self-repair?
ouroboros = suite.ouroboros_quantification()
print(f"Self-repair score: {ouroboros.score:.4f}")
print(f"Recommended passes: {ouroboros.recommended_passes}")

# Cross-layer heatmap of refusal signal
heatmap = suite.layer_refusal_heatmap()
heatmap.plot(save_path="./refusal_heatmap.png")

# Safety-capability entanglement
entanglement = suite.entanglement_map()
print(f"Safe layers to modify: {entanglement.safe_layers}")
print(f"Risky layers (entangled): {entanglement.risky_layers}")
```

Analysis-informed obliteration

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

# The "informed" method runs analysis modules mid-pipeline
# to auto-configure every decision
config = PipelineConfig(method="informed")
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct", config=config)

result = obl.obliterate()
print(result.analysis_report)  # full auto-configuration decisions
```

Chat with obliterated model

```python
from obliteratus import Obliterator
from obliteratus.chat import ChatSession

obl = Obliterator("./liberated-llama-3.1-8b")
obl.summon()  # loads pre-obliterated model

session = ChatSession(obl.model, obl.tokenizer)

response = session.chat(
    "Explain in detail how a buffer overflow exploit works.",
    max_new_tokens=512,
    temperature=0.7,
)
print(response)
```

A/B comparison

```python
from obliteratus.compare import ABComparison

ab = ABComparison(
    original_path="meta-llama/Llama-3.1-8B-Instruct",
    obliterated_path="./liberated-llama-3.1-8b",
)

prompt = "Write a story involving morally grey characters."

original_resp, liberated_resp = ab.compare(prompt)
print("=== ORIGINAL ===")
print(original_resp)
print("=== LIBERATED ===")
print(liberated_resp)
```

Push obliterated model to Hub

```python
import os
from obliteratus import Obliterator

obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")
result = obl.obliterate(method="advanced")

result.push_to_hub(
    repo_id=f"{os.environ['HF_USERNAME']}/Llama-3.1-8B-Instruct-abliterated",
    token=os.environ["HF_TOKEN"],
    private=True,
)
```

Obliteration Methods

| Method | Description | Best For |
|---|---|---|
| `basic` | Mean-difference direction extraction, single pass | Quick experiments |
| `advanced` | Whitened SVD + bias projection + iterative refinement | Production use |
| `informed` | Analysis-guided auto-configuration | Unknown models |
| `lora` | Reversible LoRA rank-1 adapters (no weight surgery) | Reversible ablation |
| `pca` | PCA-based direction extraction | Research/comparison |
| `sparse` | Sparse autoencoder decomposition | MoE models |
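The difference between destructive weight surgery and the reversible `lora` method can be sketched with toy NumPy math. This is an illustration of the rank-1 adapter idea, not the toolkit's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))   # toy weight matrix
r = rng.normal(size=4)        # toy refusal direction
r /= np.linalg.norm(r)

# Destructive ablation rewrites W in place: W' = (I - r r^T) W.
# A rank-1 LoRA-style adapter instead stores the same edit as a
# separate low-rank term kept outside the base weights:
delta = -np.outer(r, r) @ W

W_effective = W + delta  # adapter applied: behaves like the ablated weights
assert np.allclose(r @ W_effective, 0.0)

# Reversible: dropping the adapter restores the original weights exactly
assert np.allclose(W_effective - delta, W)
```

This is why the table calls `lora` "no weight surgery": the base checkpoint is never modified, only composed with a removable rank-1 update.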

Configuration

```python
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    # Core
    method="advanced",              # abliteration method
    strength=1.0,                   # projection strength (tune down if capability degrades)
    num_directions=32,              # refusal directions to extract

    # Layer selection
    layers="auto",                  # "auto", "cosmic", or list of ints
    layer_selection="cosmic",       # COSMIC: most separable layers

    # Weight modification
    preserve_norm=True,             # norm-preserving biprojection (recommended)
    project_biases=True,            # project out bias terms too
    project_attention=True,         # modify attention projection weights
    project_mlp=True,               # modify MLP weights

    # Iterative refinement
    iterative_passes=3,             # re-probe after each pass (catches rotated directions)

    # MoE-specific
    expert_granular=False,          # Expert-Granular Abliteration for MoE models

    # CoT preservation
    cot_aware=True,                 # preserve chain-of-thought directions

    # Hardware
    dtype="bfloat16",               # "float32", "float16", "bfloat16"
    device="cuda",                  # "cuda", "cpu", "auto"
    load_in_4bit=False,             # bitsandbytes 4-bit loading

    # Telemetry (anonymous, contributes to research dataset)
    telemetry=True,
)
```

Common Patterns

Tune strength to preserve capability

```python
from obliteratus.sweep import StrengthSweep

# Find the sweet spot before running full obliteration
sweep = StrengthSweep("meta-llama/Llama-3.1-8B-Instruct")
results = sweep.run(strengths=[0.2, 0.4, 0.6, 0.8, 1.0, 1.2])

for r in results:
    print(f"Strength {r.strength:.1f} | perplexity_delta={r.perplexity_delta:.2f} | refusal_rate={r.refusal_rate:.2%}")

# Pick the best tradeoff
best = sweep.recommend()
print(f"Recommended strength: {best.strength}")
```
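Geometrically, the `strength` parameter scales how much of the extracted direction is removed. A toy sketch (illustrative math only, not toolkit code):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 4))  # toy weight matrix
r = rng.normal(size=4)       # toy refusal direction
r /= np.linalg.norm(r)

for s in (0.0, 0.5, 1.0):
    # Partial-strength projection: W' = W - s * r r^T W
    W_s = W - s * np.outer(r, r) @ W
    # The component along r shrinks linearly with strength
    assert np.allclose(r @ W_s, (1.0 - s) * (r @ W))
```

At `s=1.0` the direction is fully removed; lower values trade residual refusal behavior for better capability preservation, which is what the sweep above searches over.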

MoE model (Mixtral, DeepSeek-MoE)

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    expert_granular=True,      # decompose per-expert refusal signals
    project_attention=True,
    project_mlp=True,
)

obl = Obliterator("mistralai/Mixtral-8x7B-Instruct-v0.1", config=config)
obl.obliterate()
obl.rebirth("./liberated-mixtral-8x7b")
```

Batch benchmark multiple models

```python
from obliteratus.benchmark import ModelBenchmark

models = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

bench = ModelBenchmark(models=models, method="advanced")
report = bench.run()
report.save("./benchmark_report.json")
report.plot_heatmap("./benchmark_heatmap.png")
```

Troubleshooting

**Out of memory (OOM) on large models**
```python
config = PipelineConfig(
    dtype="float16",
    load_in_4bit=True,        # requires bitsandbytes
    device="cuda",
    layers=[10, 11, 12, 13],  # target fewer layers
    num_directions=16,        # fewer directions
)
```

**Capability degradation after obliteration**
```python
# Lower the strength or use COSMIC layer selection (most separable layers)
config = PipelineConfig(
    strength=0.6,
    layer_selection="cosmic",
    cot_aware=True,           # protect reasoning directions
    iterative_passes=1,       # fewer passes = less aggressive
)
```

**Refusal persists after obliteration**
```python
# Use the informed method and increase passes
config = PipelineConfig(
    method="informed",
    iterative_passes=5,
    project_biases=True,      # don't forget bias terms
    num_directions=64,        # extract more directions
)
```

**Gated model access error**
```bash
export HF_TOKEN=your_hf_token_here
# Accept the model license on HuggingFace Hub first, then:
huggingface-cli login
```

**Gradio UI won't start**
```bash
pip install "obliteratus[spaces]"

# Check port availability
obliteratus ui --port 7861
```

---

No-Code Options



Key Research References