# nnsight: Transparent Access to Neural Network Internals

nnsight (/ɛn.saɪt/) enables researchers to interpret and manipulate the internals of any PyTorch model, with the unique capability of running the same code locally on small models or remotely on massive models (70B+) via NDIF.

## Key Value Proposition

**Write once, run anywhere**: The same interpretability code works on GPT-2 locally or Llama-3.1-405B remotely. Just toggle `remote=True`.

```python
# Local execution (small model)
with model.trace("Hello world"):
    hidden = model.transformer.h[5].output[0].save()

# Remote execution (massive model) - same code!
with model.trace("Hello world", remote=True):
    hidden = model.model.layers[40].output[0].save()
```

## When to Use nnsight

Use nnsight when you need to:

- Run interpretability experiments on models too large for local GPUs (70B, 405B)
- Work with any PyTorch architecture (transformers, Mamba, custom models)
- Perform multi-token generation interventions
- Share activations between different prompts
- Access full model internals without reimplementation

Consider alternatives when:

- You want a consistent API across models → Use TransformerLens
- You need declarative, shareable interventions → Use pyvene
- You're training SAEs → Use SAELens
- You only work with small models locally → TransformerLens may be simpler

## Installation

```bash
# Basic installation
pip install nnsight

# For vLLM support
pip install "nnsight[vllm]"
```

For remote NDIF execution, sign up at [login.ndif.us](https://login.ndif.us) for an API key.

## Core Concepts

### LanguageModel Wrapper

```python
from nnsight import LanguageModel

# Load model (uses HuggingFace under the hood)
model = LanguageModel("openai-community/gpt2", device_map="auto")

# For larger models
model = LanguageModel("meta-llama/Llama-3.1-8B", device_map="auto")
```

### Tracing Context

The `trace` context manager enables deferred execution: operations are collected into a computation graph.

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in") as tracer:
    # Access any module's output
    hidden_states = model.transformer.h[5].output[0].save()

    # Access attention patterns
    attn = model.transformer.h[5].attn.attn_dropout.input[0][0].save()

    # Modify activations
    model.transformer.h[8].output[0][:] = 0  # Zero out layer 8

    # Get final output
    logits = model.output.save()

# After the context exits, access saved values
print(hidden_states.shape)  # [batch, seq, hidden]
```

### Proxy Objects

Inside `trace`, module accesses return Proxy objects that record operations:

```python
with model.trace("Hello"):
    # These are all Proxy objects - operations are deferred
    h5_out = model.transformer.h[5].output[0]  # Proxy
    h5_mean = h5_out.mean(dim=-1)              # Proxy
    h5_saved = h5_mean.save()                  # Save for later access
```
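After the trace exits, saved proxies resolve to concrete tensors. A short follow-up under the same setup:

```python
# Outside the trace, h5_saved holds a real tensor
print(h5_saved.shape)  # [batch, seq] after the mean over the hidden dim
```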

## Workflow 1: Activation Analysis

### Step-by-Step

```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

prompt = "The capital of France is"

with model.trace(prompt) as tracer:
    # 1. Collect activations from multiple layers
    layer_outputs = []
    for i in range(12):  # GPT-2 has 12 layers
        layer_out = model.transformer.h[i].output[0].save()
        layer_outputs.append(layer_out)

    # 2. Get attention patterns
    attn_patterns = []
    for i in range(12):
        # Access attention weights (after softmax)
        attn = model.transformer.h[i].attn.attn_dropout.input[0][0].save()
        attn_patterns.append(attn)

    # 3. Get final logits
    logits = model.output.save()

# 4. Analyze outside the context
for i, layer_out in enumerate(layer_outputs):
    print(f"Layer {i} output shape: {layer_out.shape}")
    print(f"Layer {i} norm: {layer_out.norm().item():.3f}")

# 5. Find top predictions
probs = torch.softmax(logits[0, -1], dim=-1)
top_tokens = probs.topk(5)
for token, prob in zip(top_tokens.indices, top_tokens.values):
    print(f"{model.tokenizer.decode(token)}: {prob.item():.3f}")
```

### Checklist

- Load the model with the LanguageModel wrapper
- Use the trace context for operations
- Call `.save()` on values you need after the context
- Access saved values outside the context
- Use `.shape`, `.norm()`, etc. for analysis

## Workflow 2: Activation Patching

### Step-by-Step

```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

clean_prompt = "The Eiffel Tower is in"
corrupted_prompt = "The Colosseum is in"

# 1. Get clean activations
with model.trace(clean_prompt) as tracer:
    clean_hidden = model.transformer.h[8].output[0].save()

# 2. Patch clean into the corrupted run
with model.trace(corrupted_prompt) as tracer:
    # Replace layer 8 output with clean activations
    model.transformer.h[8].output[0][:] = clean_hidden
    patched_logits = model.output.save()

# 3. Compare predictions
paris_token = model.tokenizer.encode(" Paris")[0]
rome_token = model.tokenizer.encode(" Rome")[0]

patched_probs = torch.softmax(patched_logits[0, -1], dim=-1)
print(f"Paris prob: {patched_probs[paris_token].item():.3f}")
print(f"Rome prob: {patched_probs[rome_token].item():.3f}")
```

### Systematic Patching Sweep

```python
def patch_layer_position(layer, position, clean_cache, corrupted_prompt):
    """Patch a single layer/position from clean into the corrupted run."""
    with model.trace(corrupted_prompt) as tracer:
        # Get current activation
        current = model.transformer.h[layer].output[0]

        # Patch only the specific position
        current[:, position, :] = clean_cache[layer][:, position, :]

        logits = model.output.save()

    return logits

# Build a clean cache over all layers first
with model.trace(clean_prompt) as tracer:
    clean_cache = [model.transformer.h[i].output[0].save() for i in range(12)]

# Sweep over all layers and positions
seq_len = len(model.tokenizer.encode(corrupted_prompt))
results = torch.zeros(12, seq_len)
for layer in range(12):
    for pos in range(seq_len):
        logits = patch_layer_position(layer, pos, clean_cache, corrupted_prompt)
        results[layer, pos] = compute_metric(logits)
```
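`compute_metric` is left undefined above; a minimal hypothetical sketch, assuming a logit-difference metric between the " Paris" and " Rome" answers from the patching example:

```python
def compute_metric(logits, correct=" Paris", incorrect=" Rome"):
    """Hypothetical metric: logit difference between correct and incorrect answers."""
    correct_id = model.tokenizer.encode(correct)[0]
    incorrect_id = model.tokenizer.encode(incorrect)[0]
    last = logits[0, -1]  # logits at the final position
    return (last[correct_id] - last[incorrect_id]).item()
```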

## Workflow 3: Remote Execution with NDIF

Run the same experiments on massive models without local GPUs.

### Step-by-Step

```python
from nnsight import LanguageModel

# 1. Load large model (will run remotely)
model = LanguageModel("meta-llama/Llama-3.1-70B")

# 2. Same code, just add remote=True
with model.trace("The meaning of life is", remote=True) as tracer:
    # Access internals of 70B model!
    layer_40_out = model.model.layers[40].output[0].save()
    logits = model.output.save()

# 3. Results returned from NDIF
print(f"Layer 40 shape: {layer_40_out.shape}")

# 4. Generation with interventions
with model.trace(remote=True) as tracer:
    with tracer.invoke("What is 2+2?"):
        # Intervene during generation
        model.model.layers[20].output[0][:, -1, :] *= 1.5
        output = model.generate(max_new_tokens=50)
```

### NDIF Setup

1. Sign up at [login.ndif.us](https://login.ndif.us)
2. Get an API key
3. Set an environment variable or pass it to nnsight:

```python
import os
os.environ["NDIF_API_KEY"] = "your_key"

# Or configure directly
from nnsight import CONFIG
CONFIG.API_KEY = "your_key"
```

### Available Models on NDIF

- Llama-3.1-8B, 70B, 405B
- DeepSeek-R1 models
- Various open-weight models (check ndif.us for the current list)

## Workflow 4: Cross-Prompt Activation Sharing

Share activations between different inputs in a single trace.

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace() as tracer:
    # First prompt
    with tracer.invoke("The cat sat on the"):
        cat_hidden = model.transformer.h[6].output[0].save()

    # Second prompt - inject cat's activations
    with tracer.invoke("The dog ran through the"):
        # Replace with cat's activations at layer 6
        model.transformer.h[6].output[0][:] = cat_hidden
        dog_with_cat = model.output.save()

# The dog prompt now has cat's internal representations
```
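To see the effect of the injected activations, decode the patched run's top predictions, reusing the softmax/topk pattern from Workflow 1:

```python
import torch

# Top predictions for the "dog" prompt carrying the cat's layer-6 activations
probs = torch.softmax(dog_with_cat[0, -1], dim=-1)
top = probs.topk(5)
for token, prob in zip(top.indices, top.values):
    print(f"{model.tokenizer.decode(token)}: {prob.item():.3f}")
```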

## Workflow 5: Gradient-Based Analysis

Access gradients during the backward pass.

```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The quick brown fox") as tracer:
    # Save activations and enable gradient retention
    hidden = model.transformer.h[5].output[0].save()
    hidden.retain_grad()

    logits = model.output

    # Compute loss on a specific token
    target_token = model.tokenizer.encode(" jumps")[0]
    loss = -logits[0, -1, target_token]

    # Backward pass
    loss.backward()

# Access gradients
grad = hidden.grad
print(f"Gradient shape: {grad.shape}")
print(f"Gradient norm: {grad.norm().item():.3f}")
```

**Note**: Gradient access is not supported for vLLM or remote execution.
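One common use of these gradients is a simple gradient-times-activation attribution over token positions; a minimal sketch under the same setup (a generic technique, not an nnsight-specific API):

```python
# Per-token attribution: elementwise product of activation and gradient,
# summed over the hidden dimension -> one score per position
attribution = (hidden * hidden.grad).sum(dim=-1)  # [batch, seq]
for tok_id, score in zip(model.tokenizer.encode("The quick brown fox"), attribution[0]):
    print(f"{model.tokenizer.decode(tok_id)!r}: {score.item():.4f}")
```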

## Common Issues & Solutions

### Issue: Module path differs between models

```python
# GPT-2 structure
model.transformer.h[5].output[0]

# LLaMA structure
model.model.layers[5].output[0]

# Solution: check the model structure
print(model._model)  # See actual module names
```

### Issue: Forgetting to save

```python
# WRONG: Value not accessible outside trace
with model.trace("Hello"):
    hidden = model.transformer.h[5].output[0]  # Not saved!
print(hidden)  # Error or wrong value

# RIGHT: Call .save()
with model.trace("Hello"):
    hidden = model.transformer.h[5].output[0].save()
print(hidden)  # Works!
```

### Issue: Remote timeout

```python
# For long operations, increase the timeout
with model.trace("prompt", remote=True, timeout=300) as tracer:
    ...  # Long operation
```

### Issue: Memory with many saved activations

```python
# Only save what you need
with model.trace("prompt"):
    # Don't save everything:
    # for i in range(100):
    #     model.transformer.h[i].output[0].save()  # Memory heavy!

    # Better: save specific layers
    key_layers = [0, 5, 11]
    for i in key_layers:
        model.transformer.h[i].output[0].save()
```

### Issue: vLLM gradient limitation

```python
# vLLM doesn't support gradients
# Use standard execution for gradient analysis
model = LanguageModel("gpt2", device_map="auto")  # Not vLLM
```

## Key API Reference

| Method/Property | Purpose |
|---|---|
| `model.trace(prompt, remote=False)` | Start tracing context |
| `proxy.save()` | Save value for access after trace |
| `proxy[:]` | Slice/index proxy (assignment patches) |
| `tracer.invoke(prompt)` | Add prompt within trace |
| `model.generate(...)` | Generate with interventions |
| `model.output` | Final model output logits |
| `model._model` | Underlying HuggingFace model |
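A compact sketch tying several of these entries together; the prompt and layer index are illustrative, following the GPT-2 examples above:

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace("Paris is the capital of") as tracer:  # model.trace(...)
    h = model.transformer.h[5].output[0]   # proxy for a module output
    h[:, -1, :] *= 0.5                     # assignment through the proxy patches the run
    hidden = h.save()                      # proxy.save() keeps the value
    logits = model.output.save()           # model.output: final logits

print(hidden.shape, logits.shape)
```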

## Comparison with Other Tools

| Feature | nnsight | TransformerLens | pyvene |
|---|---|---|---|
| Any architecture | Yes | Transformers only | Yes |
| Remote execution | Yes (NDIF) | No | No |
| Consistent API | No | Yes | Yes |
| Deferred execution | Yes | No | No |
| HuggingFace native | Yes | Reimplemented | Yes |
| Shareable configs | No | No | Yes |

## Reference Documentation

For detailed API documentation, tutorials, and advanced usage, see the `references/` folder:

| File | Contents |
|---|---|
| `references/README.md` | Overview and quick start guide |
| `references/api.md` | Complete API reference for LanguageModel, tracing, proxy objects |
| `references/tutorials.md` | Step-by-step tutorials for local and remote interpretability |

## External Resources

### Tutorials

### Official Documentation

### Papers

## Architecture Support

nnsight works with any PyTorch model:

- Transformers: GPT-2, LLaMA, Mistral, etc.
- State Space Models: Mamba
- Vision Models: ViT, CLIP
- Custom architectures: any nn.Module

The key is knowing the module structure so you can access the right components; a sketch of how to inspect it follows.
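A minimal sketch for discovering module paths before tracing; `print(model._model)` appears in the troubleshooting section above, and `named_modules` is standard PyTorch:

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

# Print the full module tree to find paths like transformer.h[5]
print(model._model)

# Or list module names programmatically (standard PyTorch)
for name, module in model._model.named_modules():
    if name.count(".") <= 1:  # only the top couple of levels
        print(name, type(module).__name__)
```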