# nnsight: Transparent Access to Neural Network Internals

nnsight (/ɛn.saɪt/) enables researchers to interpret and manipulate the internals of any PyTorch model, with the unique capability of running the same code locally on small models or remotely on massive models (70B+) via NDIF.

## Key Value Proposition

**Write once, run anywhere**: The same interpretability code works on GPT-2 locally or Llama-3.1-405B remotely. Just toggle `remote=True`.

```python
# Local execution (small model)
with model.trace("Hello world"):
    hidden = model.transformer.h[5].output[0].save()

# Remote execution (massive model) - same code!
with model.trace("Hello world", remote=True):
    hidden = model.model.layers[40].output[0].save()
```

## When to Use nnsight

Use nnsight when you need to:

- Run interpretability experiments on models too large for local GPUs (70B, 405B)
- Work with any PyTorch architecture (transformers, Mamba, custom models)
- Perform multi-token generation interventions
- Share activations between different prompts
- Access full model internals without reimplementation

Consider alternatives when:

- You want a consistent API across models → Use TransformerLens
- You need declarative, shareable interventions → Use pyvene
- You're training SAEs → Use SAELens
- You only work with small models locally → TransformerLens may be simpler

## Installation

```bash
# Basic installation
pip install nnsight

# For vLLM support
pip install "nnsight[vllm]"
```

For remote NDIF execution, sign up at [login.ndif.us](https://login.ndif.us) for an API key.

## Core Concepts

### LanguageModel Wrapper

```python
from nnsight import LanguageModel

# Load model (uses HuggingFace under the hood)
model = LanguageModel("openai-community/gpt2", device_map="auto")

# For larger models
model = LanguageModel("meta-llama/Llama-3.1-8B", device_map="auto")
```

### Tracing Context

The `trace` context manager enables deferred execution: operations are collected into a computation graph.

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The Eiffel Tower is in") as tracer:
    # Access any module's output
    hidden_states = model.transformer.h[5].output[0].save()

    # Access attention patterns
    attn = model.transformer.h[5].attn.attn_dropout.input[0][0].save()

    # Modify activations
    model.transformer.h[8].output[0][:] = 0  # Zero out layer 8

    # Get final output
    logits = model.output.save()

# After the context exits, access saved values
print(hidden_states.shape)  # [batch, seq, hidden]
```

### Proxy Objects

Inside `trace`, module accesses return Proxy objects that record operations:

```python
with model.trace("Hello"):
    # These are all Proxy objects - operations are deferred
    h5_out = model.transformer.h[5].output[0]  # Proxy
    h5_mean = h5_out.mean(dim=-1)              # Proxy
    h5_saved = h5_mean.save()                  # Save for later access
```
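After the trace exits, saved proxies resolve to concrete tensors. A short follow-up under the same setup:

```python
# Outside the trace, h5_saved holds a real tensor
print(h5_saved.shape)  # [batch, seq] after the mean over the hidden dim
```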

## Workflow 1: Activation Analysis

### Step-by-Step

```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

prompt = "The capital of France is"

with model.trace(prompt) as tracer:
    # 1. Collect activations from multiple layers
    layer_outputs = []
    for i in range(12):  # GPT-2 has 12 layers
        layer_out = model.transformer.h[i].output[0].save()
        layer_outputs.append(layer_out)

    # 2. Get attention patterns
    attn_patterns = []
    for i in range(12):
        # Access attention weights (after softmax)
        attn = model.transformer.h[i].attn.attn_dropout.input[0][0].save()
        attn_patterns.append(attn)

    # 3. Get final logits
    logits = model.output.save()

# 4. Analyze outside the context
for i, layer_out in enumerate(layer_outputs):
    print(f"Layer {i} output shape: {layer_out.shape}")
    print(f"Layer {i} norm: {layer_out.norm().item():.3f}")

# 5. Find top predictions
probs = torch.softmax(logits[0, -1], dim=-1)
top_tokens = probs.topk(5)
for token, prob in zip(top_tokens.indices, top_tokens.values):
    print(f"{model.tokenizer.decode(token)}: {prob.item():.3f}")
```

### Checklist

- Load the model with the LanguageModel wrapper
- Use the trace context for operations
- Call `.save()` on values you need after the context
- Access saved values outside the context
- Use `.shape`, `.norm()`, etc. for analysis

## Workflow 2: Activation Patching

### Step-by-Step

```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

clean_prompt = "The Eiffel Tower is in"
corrupted_prompt = "The Colosseum is in"

# 1. Get clean activations
with model.trace(clean_prompt) as tracer:
    clean_hidden = model.transformer.h[8].output[0].save()

# 2. Patch clean into the corrupted run
with model.trace(corrupted_prompt) as tracer:
    # Replace layer 8 output with clean activations
    model.transformer.h[8].output[0][:] = clean_hidden
    patched_logits = model.output.save()

# 3. Compare predictions
paris_token = model.tokenizer.encode(" Paris")[0]
rome_token = model.tokenizer.encode(" Rome")[0]

patched_probs = torch.softmax(patched_logits[0, -1], dim=-1)
print(f"Paris prob: {patched_probs[paris_token].item():.3f}")
print(f"Rome prob: {patched_probs[rome_token].item():.3f}")
```

### Systematic Patching Sweep

```python
def patch_layer_position(layer, position, clean_cache, corrupted_prompt):
    """Patch a single layer/position from clean into the corrupted run."""
    with model.trace(corrupted_prompt) as tracer:
        # Get current activation
        current = model.transformer.h[layer].output[0]

        # Patch only the specific position
        current[:, position, :] = clean_cache[layer][:, position, :]

        logits = model.output.save()

    return logits

# Build a clean cache over all layers first
with model.trace(clean_prompt) as tracer:
    clean_cache = [model.transformer.h[i].output[0].save() for i in range(12)]

# Sweep over all layers and positions
seq_len = len(model.tokenizer.encode(corrupted_prompt))
results = torch.zeros(12, seq_len)
for layer in range(12):
    for pos in range(seq_len):
        logits = patch_layer_position(layer, pos, clean_cache, corrupted_prompt)
        results[layer, pos] = compute_metric(logits)
```
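`compute_metric` is left undefined above; a minimal hypothetical sketch, assuming a logit-difference metric between the " Paris" and " Rome" answers from the patching example:

```python
def compute_metric(logits, correct=" Paris", incorrect=" Rome"):
    """Hypothetical metric: logit difference between correct and incorrect answers."""
    correct_id = model.tokenizer.encode(correct)[0]
    incorrect_id = model.tokenizer.encode(incorrect)[0]
    last = logits[0, -1]  # logits at the final position
    return (last[correct_id] - last[incorrect_id]).item()
```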

## Workflow 3: Remote Execution with NDIF

Run the same experiments on massive models without local GPUs.

### Step-by-Step

```python
from nnsight import LanguageModel

# 1. Load large model (will run remotely)
model = LanguageModel("meta-llama/Llama-3.1-70B")

# 2. Same code, just add remote=True
with model.trace("The meaning of life is", remote=True) as tracer:
    # Access internals of 70B model!
    layer_40_out = model.model.layers[40].output[0].save()
    logits = model.output.save()

# 3. Results returned from NDIF
print(f"Layer 40 shape: {layer_40_out.shape}")

# 4. Generation with interventions
with model.trace(remote=True) as tracer:
    with tracer.invoke("What is 2+2?"):
        # Intervene during generation
        model.model.layers[20].output[0][:, -1, :] *= 1.5
        output = model.generate(max_new_tokens=50)
```

### NDIF Setup

1. Sign up at [login.ndif.us](https://login.ndif.us)
2. Get an API key
3. Set an environment variable or pass it to nnsight:

```python
import os
os.environ["NDIF_API_KEY"] = "your_key"

# Or configure directly
from nnsight import CONFIG
CONFIG.API_KEY = "your_key"
```

### Available Models on NDIF

- Llama-3.1-8B, 70B, 405B
- DeepSeek-R1 models
- Various open-weight models (check ndif.us for the current list)

## Workflow 4: Cross-Prompt Activation Sharing

Share activations between different inputs in a single trace.

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace() as tracer:
    # First prompt
    with tracer.invoke("The cat sat on the"):
        cat_hidden = model.transformer.h[6].output[0].save()

    # Second prompt - inject cat's activations
    with tracer.invoke("The dog ran through the"):
        # Replace with cat's activations at layer 6
        model.transformer.h[6].output[0][:] = cat_hidden
        dog_with_cat = model.output.save()

# The dog prompt now has cat's internal representations
```
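To see the effect of the injected activations, decode the patched run's top predictions, reusing the softmax/topk pattern from Workflow 1:

```python
import torch

# Top predictions for the "dog" prompt carrying the cat's layer-6 activations
probs = torch.softmax(dog_with_cat[0, -1], dim=-1)
top = probs.topk(5)
for token, prob in zip(top.indices, top.values):
    print(f"{model.tokenizer.decode(token)}: {prob.item():.3f}")
```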

## Workflow 5: Gradient-Based Analysis

Access gradients during the backward pass.

```python
from nnsight import LanguageModel
import torch

model = LanguageModel("gpt2", device_map="auto")

with model.trace("The quick brown fox") as tracer:
    # Save activations and enable gradient retention
    hidden = model.transformer.h[5].output[0].save()
    hidden.retain_grad()

    logits = model.output

    # Compute loss on a specific token
    target_token = model.tokenizer.encode(" jumps")[0]
    loss = -logits[0, -1, target_token]

    # Backward pass
    loss.backward()

# Access gradients
grad = hidden.grad
print(f"Gradient shape: {grad.shape}")
print(f"Gradient norm: {grad.norm().item():.3f}")
```

**Note**: Gradient access is not supported for vLLM or remote execution.
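One common use of these gradients is a simple gradient-times-activation attribution over token positions; a minimal sketch under the same setup (a generic technique, not an nnsight-specific API):

```python
# Per-token attribution: elementwise product of activation and gradient,
# summed over the hidden dimension -> one score per position
attribution = (hidden * hidden.grad).sum(dim=-1)  # [batch, seq]
for tok_id, score in zip(model.tokenizer.encode("The quick brown fox"), attribution[0]):
    print(f"{model.tokenizer.decode(tok_id)!r}: {score.item():.4f}")
```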

## Common Issues & Solutions

### Issue: Module path differs between models

```python
# GPT-2 structure
model.transformer.h[5].output[0]

# LLaMA structure
model.model.layers[5].output[0]

# Solution: check the model structure
print(model._model)  # See actual module names
```

### Issue: Forgetting to save

```python
# WRONG: Value not accessible outside trace
with model.trace("Hello"):
    hidden = model.transformer.h[5].output[0]  # Not saved!
print(hidden)  # Error or wrong value

# RIGHT: Call .save()
with model.trace("Hello"):
    hidden = model.transformer.h[5].output[0].save()
print(hidden)  # Works!
```

### Issue: Remote timeout

```python
# For long operations, increase the timeout
with model.trace("prompt", remote=True, timeout=300) as tracer:
    ...  # Long operation
```

### Issue: Memory with many saved activations

```python
# Only save what you need
with model.trace("prompt"):
    # Don't save everything:
    # for i in range(100):
    #     model.transformer.h[i].output[0].save()  # Memory heavy!

    # Better: save specific layers
    key_layers = [0, 5, 11]
    for i in key_layers:
        model.transformer.h[i].output[0].save()
```

### Issue: vLLM gradient limitation

```python
# vLLM doesn't support gradients
# Use standard execution for gradient analysis
model = LanguageModel("gpt2", device_map="auto")  # Not vLLM
```

## Key API Reference

| Method/Property | Purpose |
|---|---|
| `model.trace(prompt, remote=False)` | Start tracing context |
| `proxy.save()` | Save value for access after trace |
| `proxy[:]` | Slice/index proxy (assignment patches) |
| `tracer.invoke(prompt)` | Add prompt within trace |
| `model.generate(...)` | Generate with interventions |
| `model.output` | Final model output logits |
| `model._model` | Underlying HuggingFace model |
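A compact sketch tying several of these entries together; the prompt and layer index are illustrative, following the GPT-2 examples above:

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

with model.trace("Paris is the capital of") as tracer:  # model.trace(...)
    h = model.transformer.h[5].output[0]   # proxy for a module output
    h[:, -1, :] *= 0.5                     # assignment through the proxy patches the run
    hidden = h.save()                      # proxy.save() keeps the value
    logits = model.output.save()           # model.output: final logits

print(hidden.shape, logits.shape)
```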

## Comparison with Other Tools

| Feature | nnsight | TransformerLens | pyvene |
|---|---|---|---|
| Any architecture | Yes | Transformers only | Yes |
| Remote execution | Yes (NDIF) | No | No |
| Consistent API | No | Yes | Yes |
| Deferred execution | Yes | No | No |
| HuggingFace native | Yes | Reimplemented | Yes |
| Shareable configs | No | No | Yes |

## Reference Documentation

For detailed API documentation, tutorials, and advanced usage, see the `references/` folder:

| File | Contents |
|---|---|
| `references/README.md` | Overview and quick start guide |
| `references/api.md` | Complete API reference for LanguageModel, tracing, proxy objects |
| `references/tutorials.md` | Step-by-step tutorials for local and remote interpretability |

## External Resources

### Tutorials

### Official Documentation

### Papers

## Architecture Support

nnsight works with any PyTorch model:

- Transformers: GPT-2, LLaMA, Mistral, etc.
- State Space Models: Mamba
- Vision Models: ViT, CLIP
- Custom architectures: any nn.Module

The key is knowing the module structure so you can access the right components; a sketch of how to inspect it follows.
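A minimal sketch for discovering module paths before tracing; `print(model._model)` appears in the troubleshooting section above, and `named_modules` is standard PyTorch:

```python
from nnsight import LanguageModel

model = LanguageModel("gpt2", device_map="auto")

# Print the full module tree to find paths like transformer.h[5]
print(model._model)

# Or list module names programmatically (standard PyTorch)
for name, module in model._model.named_modules():
    if name.count(".") <= 1:  # only the top couple of levels
        print(name, type(module).__name__)
```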