
Model Merging: Combining Pre-trained Models


When to Use This Skill


Use Model Merging when you need to:
  • Combine capabilities from multiple fine-tuned models without retraining
  • Create specialized models by blending domain-specific expertise (math + coding + chat)
  • Improve performance beyond single models (often +5-10% on benchmarks)
  • Reduce training costs - no GPUs needed, merges run on CPU
  • Experiment rapidly - create new model variants in minutes, not days
  • Preserve multiple skills - merge without catastrophic forgetting
Success Stories: Marcoro14-7B-slerp (best on Open LLM Leaderboard 02/2024), many top HuggingFace models use merging
Tools: mergekit (Arcee AI), LazyMergekit, Model Soup

Installation


Install mergekit:

```bash
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .
```

Or via pip:

```bash
pip install mergekit
```

Optional: Transformers library

```bash
pip install transformers torch
```

Quick Start


Simple Linear Merge


config.yml - Merge two models with equal weights:

```yaml
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.5
dtype: bfloat16
```

Run the merge:

```bash
mergekit-yaml config.yml ./merged-model --cuda
```

Use the merged model (loaded like any other Transformers checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")
```

SLERP Merge (Best for 2 Models)


config.yml - Spherical interpolation:

```yaml
merge_method: slerp
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5  # Interpolation factor (0 = first model, 1 = second model)
dtype: bfloat16
```

Core Concepts


1. Merge Methods


**Linear (Model Soup)**
- Simple weighted average of parameters
- Fast, works well for similar models
- Can merge 2+ models

```python
# Weighted average, where w1 + w2 + w3 = 1
merged_weights = w1 * model1_weights + w2 * model2_weights + w3 * model3_weights
```
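As a toy sketch of this averaging, here is a linear merge over two tiny state dicts; `linear_merge` is an illustrative helper, not a mergekit API — real merges apply the same arithmetic to every tensor in a checkpoint:

```python
import numpy as np

def linear_merge(state_dicts, weights):
    """Weighted average of parameter tensors with matching keys (model soup)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# Toy "models" with a single parameter tensor each
m1 = {"layer.weight": np.array([1.0, 2.0, 3.0])}
m2 = {"layer.weight": np.array([3.0, 4.0, 5.0])}
merged = linear_merge([m1, m2], weights=[0.5, 0.5])
print(merged["layer.weight"])  # [2. 3. 4.]
```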


**SLERP (Spherical Linear Interpolation)**
- Interpolates along a sphere in weight space
- Preserves the magnitude of weight vectors
- Best for merging 2 models
- Smoother than linear

```python
# SLERP formula
merged = (sin((1-t)*θ) / sin(θ)) * model1 + (sin(t*θ) / sin(θ)) * model2
# where θ = arccos(dot(model1, model2)) and t ∈ [0, 1]
```

**Task Arithmetic**
- Extract "task vectors" (fine-tuned - base)
- Combine task vectors, add to base
- Good for merging multiple specialized models

```python
# Task vector
task_vector = finetuned_model - base_model

# Merge multiple task vectors
merged = base_model + α₁ * task_vector₁ + α₂ * task_vector₂
```
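A toy sketch of these two steps over state dicts (all model names hypothetical; real task vectors are computed per tensor across a full checkpoint):

```python
import numpy as np

def task_vector(finetuned, base):
    """Per-parameter delta introduced by fine-tuning."""
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, vectors, alphas):
    """merged = base + sum(alpha_i * task_vector_i)"""
    merged = {k: v.copy() for k, v in base.items()}
    for alpha, tv in zip(alphas, vectors):
        for k in merged:
            merged[k] += alpha * tv[k]
    return merged

base = {"w": np.array([1.0, 1.0])}
math_model = {"w": np.array([2.0, 1.0])}  # hypothetical math fine-tune
code_model = {"w": np.array([1.0, 3.0])}  # hypothetical code fine-tune

tvs = [task_vector(math_model, base), task_vector(code_model, base)]
merged = apply_task_vectors(base, tvs, alphas=[1.0, 0.5])
print(merged["w"])  # [2. 2.]
```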

**TIES-Merging**
- Task arithmetic + sparsification
- Resolves sign conflicts in parameters
- Best for merging many task-specific models

**DARE (Drop And REscale)**
- Randomly drops fine-tuned parameters
- Rescales remaining parameters
- Reduces redundancy, maintains performance
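A minimal sketch of DARE's drop-and-rescale on a task-vector delta, assuming a keep fraction `density`; dividing the surviving entries by `density` keeps the expected update unchanged:

```python
import numpy as np

def dare(delta, density, rng):
    """Randomly keep a `density` fraction of delta entries, rescale by 1/density."""
    mask = rng.random(delta.shape) < density
    return (delta * mask) / density

rng = np.random.default_rng(0)
delta = np.ones(10_000)
sparse = dare(delta, density=0.5, rng=rng)
print(sparse.mean())  # close to 1.0: the expected delta is preserved
```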

2. Configuration Structure


Basic structure:

```yaml
merge_method: <method>   # linear, slerp, ties, dare_ties, task_arithmetic
base_model: <path>       # Optional: base model for task arithmetic
models:
  - model: <path/to/model1>
    parameters:
      weight: <float>    # Merge weight
      density: <float>   # For TIES/DARE
  - model: <path/to/model2>
    parameters:
      weight: <float>
parameters:
  # Method-specific parameters
dtype: <dtype>           # bfloat16, float16, float32

# Optional
slices:      # Layer-wise merging
tokenizer:   # Tokenizer configuration
```

Merge Methods Guide


Linear Merge


Best for: Simple model combinations, equal weighting

```yaml
merge_method: linear
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.4
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      weight: 0.3
dtype: bfloat16
```

SLERP Merge


Best for: Two models, smooth interpolation

```yaml
merge_method: slerp
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5  # 0.0 = first model, 1.0 = second model
dtype: bfloat16
```

Layer-specific SLERP:

```yaml
merge_method: slerp
slices:
  - sources:
      - model: model_a
        layer_range: [0, 32]
      - model: model_b
        layer_range: [0, 32]
parameters:
  t:
    - filter: self_attn    # Attention layers
      value: 0.3
    - filter: mlp          # MLP layers
      value: 0.7
    - value: 0.5           # Default for other layers
dtype: bfloat16
```

Task Arithmetic


Best for: Combining specialized skills

```yaml
merge_method: task_arithmetic
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1  # Math
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B  # Chat
    parameters:
      weight: 0.3
  - model: ajibawa-2023/Code-Mistral-7B  # Code
    parameters:
      weight: 0.2
dtype: bfloat16
```

TIES-Merging


Best for: Many models, resolving conflicts

```yaml
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5  # Keep top 50% of parameters
      weight: 1.0
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 1.0
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      density: 0.5
      weight: 1.0
parameters:
  normalize: true
dtype: bfloat16
```

DARE Merge


Best for: Reducing redundancy

```yaml
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5    # Drop 50% of deltas
      weight: 0.6
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.4
parameters:
  int8_mask: true  # Use int8 for masks (saves memory)
dtype: bfloat16
```

Advanced Patterns


Layer-wise Merging


Different models for different layers:

```yaml
merge_method: passthrough
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 16]   # First half
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [16, 32]  # Second half
dtype: bfloat16
```
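Under the hood, passthrough slicing amounts to copying whole layers from each source model. A hypothetical sketch over parameter names (`passthrough_merge` is illustrative; the `model.layers.N.` naming follows Llama/Mistral-style checkpoints):

```python
import re

def passthrough_merge(sd_a, sd_b, split=16):
    """Take layers [0, split) from model A and [split, n) from model B."""
    merged = {}
    for key in sd_a:
        m = re.search(r"layers\.(\d+)\.", key)
        layer = int(m.group(1)) if m else -1
        # Non-layer tensors (embeddings etc.) default to model A here
        merged[key] = sd_a[key] if (m is None or layer < split) else sd_b[key]
    return merged

sd_a = {"model.layers.0.mlp.weight": "A0", "model.layers.20.mlp.weight": "A20"}
sd_b = {"model.layers.0.mlp.weight": "B0", "model.layers.20.mlp.weight": "B20"}
print(passthrough_merge(sd_a, sd_b))  # layer 0 from A, layer 20 from B
```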

MoE from Merged Models


Create a Mixture of Experts:

```yaml
merge_method: moe
base_model: mistralai/Mistral-7B-v0.1
experts:
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "math"
      - "calculate"
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat"
      - "conversation"
  - source_model: ajibawa-2023/Code-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
dtype: bfloat16
```

Tokenizer Merging


```yaml
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: custom/specialized-model

tokenizer:
  source: "union"  # Combine vocabularies from both models
  tokens:
    <|special_token|>:
      source: "custom/specialized-model"
```

Best Practices


1. Model Compatibility


```python
# ✅ Good: Same architecture
models = [
    "mistralai/Mistral-7B-v0.1",
    "teknium/OpenHermes-2.5-Mistral-7B",  # Both Mistral 7B
]

# ❌ Bad: Different architectures
models = [
    "meta-llama/Llama-2-7b-hf",   # Llama
    "mistralai/Mistral-7B-v0.1",  # Mistral (incompatible!)
]
```

2. Weight Selection


✅ Good: Weights sum to 1.0

```yaml
models:
  - model: model_a
    parameters:
      weight: 0.6
  - model: model_b
    parameters:
      weight: 0.4  # 0.6 + 0.4 = 1.0
```

⚠️ Acceptable: Weights don't sum to 1 (for task arithmetic)

```yaml
models:
  - model: model_a
    parameters:
      weight: 0.8
  - model: model_b
    parameters:
      weight: 0.8  # May boost performance
```

3. Method Selection


Choose a merge method based on use case:

```python
# 2 models, smooth blend → SLERP
merge_method = "slerp"

# 3+ models, simple average → Linear
merge_method = "linear"

# Multiple task-specific models → Task Arithmetic or TIES
merge_method = "ties"

# Want to reduce redundancy → DARE
merge_method = "dare_ties"
```

4. Density Tuning (TIES/DARE)


Start conservative (keep more parameters):

```yaml
parameters:
  density: 0.8  # Keep 80%
```

If performance is good, increase sparsity:

```yaml
parameters:
  density: 0.5  # Keep 50%
```

If performance degrades, reduce sparsity:

```yaml
parameters:
  density: 0.9  # Keep 90%
```
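For intuition, `density` controls what fraction of each task vector survives. TIES-style trimming keeps the largest-magnitude entries; a minimal numpy sketch (illustrative only, not mergekit's implementation):

```python
import numpy as np

def trim(delta, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, int(round(density * delta.size)))
    threshold = np.sort(np.abs(delta).ravel())[-k]
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

delta = np.array([0.1, -2.0, 0.3, 5.0, -0.05])
print(trim(delta, density=0.4))  # keeps only -2.0 and 5.0
```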

5. Layer-specific Merging


Preserve the base model's beginning and end:

```yaml
merge_method: passthrough
slices:
  - sources:
      - model: base_model
        layer_range: [0, 2]    # Keep first layers
  - sources:
      - model: merged_middle   # Merge middle layers
        layer_range: [2, 30]
  - sources:
      - model: base_model
        layer_range: [30, 32]  # Keep last layers
```

Evaluation & Testing


Benchmark Merged Models


```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Test on various tasks
test_prompts = {
    "math": "Calculate: 25 * 17 =",
    "code": "Write a Python function to reverse a string:",
    "chat": "What is the capital of France?",
}

for task, prompt in test_prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(f"{task}: {tokenizer.decode(outputs[0])}")
```

Common Benchmarks


  • Open LLM Leaderboard: General capabilities
  • MT-Bench: Multi-turn conversation
  • MMLU: Multitask accuracy
  • HumanEval: Code generation
  • GSM8K: Math reasoning

Production Deployment


Save and Upload


```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Upload to HuggingFace Hub
model.push_to_hub("username/my-merged-model")
tokenizer.push_to_hub("username/my-merged-model")
```

Quantize Merged Model


```bash
# Quantize with GGUF
python convert.py ./merged-model --outtype f16 --outfile merged-model.gguf

# Quantize with GPTQ
python quantize_gptq.py ./merged-model --bits 4 --group_size 128
```

Common Pitfalls


❌ Pitfall 1: Merging Incompatible Models


```yaml
# Wrong: Different architectures
models:
  - model: meta-llama/Llama-2-7b   # Llama architecture
  - model: mistralai/Mistral-7B    # Mistral architecture
```

**Fix**: Only merge models with the same architecture.

❌ Pitfall 2: Over-weighting One Model


```yaml
# Suboptimal: One model dominates
models:
  - model: model_a
    parameters:
      weight: 0.95  # Too high
  - model: model_b
    parameters:
      weight: 0.05  # Too low
```

**Fix**: Use more balanced weights (0.3-0.7 range).

❌ Pitfall 3: Not Evaluating


```bash
# Wrong: Merge and deploy without testing
mergekit-yaml config.yml ./merged-model
# Deploy immediately (risky!)
```

**Fix**: Always benchmark before deploying.

Resources


See Also


  • references/methods.md
    - Deep dive into merge algorithms
  • references/examples.md
    - Real-world merge configurations
  • references/evaluation.md
    - Benchmarking and testing strategies