Model Merging

Name: Model Merging
Author: orchestra-research

orchestra-research/ai-research-skills

393 installs
11.2k repo stars
Updated June 16, 2026

About

model-merging is an agent skill in orchestra-research/ai-research-skills that benchmarks merged Hugging Face models using research-grade evaluation suites. The guide centers on the Open LLM Leaderboard with six standard tasks—ARC (25-shot science reasoning), HellaSwag (10-shot commonsense), MMLU (5-shot across 57 subjects), TruthfulQA (0-shot factual accuracy), Winogrande (5-shot commonsense), and GSM8K (5-shot math)—plus lm_eval harness usage and MT-Bench-style multi-turn conversation scoring. Developers reach for model-merging after producing merged checkpoints (SLERP, TIES, DARE, or similar) when leaderboard scores, task-level regressions, and conversational layout compatibility must be verified before serving. The workflow documents metrics, comparison frameworks, and quality-assurance checks aligned with published merge research practices. Use when comparing two merge recipes or validating a new merged artifact against base models. Skip for training merges from scratch, dataset curation, or production vLLM deployment tuning without an evaluation pass.

Documents 6-task Open LLM Leaderboard suite: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
Provides lm_eval simple_evaluate Python example with few-shot and batch settings
Covers MT-Bench multi-turn conversation evaluation via FastChat tooling
Includes comparison framework and QA-oriented testing methodology sections

Model Merging by the numbers

393 all-time installs (skills.sh)
+37 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #514 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill model-merging

Security audit	2 / 3 scanners passed

How do you benchmark merged Hugging Face models?

Benchmark and compare merged Hugging Face models with Open LLM Leaderboard tasks, lm_eval, and MT-Bench-style conversation tests.

Who is it for?

ML engineers and researchers who merged Hugging Face LLM checkpoints and need standardized leaderboard and conversation benchmarks before promoting a candidate.

Skip if: Training new base models, curating fine-tuning datasets, or production inference optimization with no merge evaluation step.

When should I use this skill?

User merged Hugging Face models and asks for Open LLM Leaderboard, lm_eval, or MT-Bench comparison against base checkpoints.

What you get

Leaderboard task scores, lm_eval metrics, MT-Bench conversation ratings, and a structured merge comparison report.

Leaderboard task scores
Merge comparison matrix
Conversation benchmark notes

By the numbers

Covers 6 Open LLM Leaderboard benchmark tasks
MMLU evaluation spans 57 subject areas
Documents 98 skills in the parent ai-research-skills library

Files

SKILL.md (Markdown)

Model Merging: Combining Pre-trained Models

When to Use This Skill

Use Model Merging when you need to:

Combine capabilities from multiple fine-tuned models without retraining
Create specialized models by blending domain-specific expertise (math + coding + chat)
Improve performance beyond any single parent model on some benchmarks (the +5-10% gains sometimes reported depend heavily on the recipe and task)
Reduce training costs - no GPUs needed, merges run on CPU
Experiment rapidly - create new model variants in minutes, not days
Preserve multiple skills - merge without catastrophic forgetting

Success Stories: Marcoro14-7B-slerp topped the Open LLM Leaderboard in early 2024, and many top-ranked Hugging Face models are merges

Tools: mergekit (Arcee AI), LazyMergekit, Model Soup

Installation

# Install mergekit
git clone https://github.com/arcee-ai/mergekit.git
cd mergekit
pip install -e .

# Or via pip
pip install mergekit

# Optional: Transformers library
pip install transformers torch

Quick Start

Simple Linear Merge

# config.yml - Merge two models with equal weights
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.5
dtype: bfloat16

# Run merge
mergekit-yaml config.yml ./merged-model --cuda

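# Use merged model (quick smoke test via the Transformers pipeline API)
python -c "from transformers import pipeline; print(pipeline('text-generation', model='./merged-model')('Hello', max_new_tokens=20)[0]['generated_text'])"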

SLERP Merge (Best for 2 Models)

# config.yml - Spherical interpolation
merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1  # slerp requires a base model
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5  # Interpolation factor (0 = base model, 1 = other model)
dtype: bfloat16

Core Concepts

1. Merge Methods

Linear (Model Soup)

Simple weighted average of parameters
Fast, works well for similar models
Can merge 2+ models (w1 + w2 + ... = 1)
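
As a minimal sketch of the weighted average described above, the following uses plain Python lists in place of real tensors. The function name `linear_merge` and the toy "state dicts" are illustrative assumptions, not mergekit's API.

```python
def linear_merge(state_dicts, weights):
    """Weighted average of parameters; weights are normalized to sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]
    merged = {}
    for name in state_dicts[0]:
        length = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(norm, state_dicts))
            for i in range(length)
        ]
    return merged


# Toy parameters standing in for two same-shaped checkpoints (hypothetical values)
model_a = {"layer.weight": [1.0, 2.0, 3.0]}
model_b = {"layer.weight": [3.0, 2.0, 1.0]}
merged = linear_merge([model_a, model_b], weights=[0.5, 0.5])
print(merged)
```

Normalizing inside the function means callers can pass unnormalized weights such as `[2, 1, 1]` and still get a convex combination.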

SLERP (Spherical Linear Interpolation)

Interpolates along sphere in weight space
Preserves magnitude of weight vectors
Best for merging 2 models
Smoother than linear

# SLERP formula
merged = (sin((1-t)*θ) / sin(θ)) * model1 + (sin(t*θ) / sin(θ)) * model2
# where θ = arccos(dot(model1, model2) / (‖model1‖ * ‖model2‖)), the angle between the two weight vectors
# t ∈ [0, 1]
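
The formula can be sketched in a few lines of standard-library Python. This is an illustration on flat lists, not mergekit's implementation; the linear fallback for nearly parallel vectors and the `eps` threshold are assumptions of this sketch.

```python
import math


def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    # The angle is measured between the normalized vectors
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    theta = math.acos(cos_theta)
    if math.sin(theta) < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


halfway = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
print(halfway)
```

For two orthogonal unit vectors the halfway point keeps unit length, whereas a linear average would shrink it to about 0.71.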

Task Arithmetic

Extract "task vectors" (fine-tuned - base)
Combine task vectors, add to base
Good for merging multiple specialized models (merged = base + α₁·tv₁ + α₂·tv₂)
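
A small worked example of merged = base + α₁·tv₁ + α₂·tv₂, using lists in place of tensors. The helper names and the toy parameter values are hypothetical.

```python
def task_vector(finetuned, base):
    """Difference between fine-tuned and base parameters."""
    return [f - b for f, b in zip(finetuned, base)]


def task_arithmetic(base, finetuned_models, alphas):
    """Add each scaled task vector to the base parameters."""
    merged = list(base)
    for model, alpha in zip(finetuned_models, alphas):
        tv = task_vector(model, base)
        merged = [m + alpha * d for m, d in zip(merged, tv)]
    return merged


base = [1.0, 1.0, 1.0]
math_model = [2.0, 1.0, 1.0]  # changed only the first parameter
code_model = [1.0, 1.0, 3.0]  # changed only the last parameter
merged = task_arithmetic(base, [math_model, code_model], alphas=[0.5, 0.5])
print(merged)
```

Because the two task vectors touch different parameters here, both edits survive at half strength and the untouched middle parameter stays at its base value.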

TIES-Merging

Task arithmetic + sparsification
Resolves sign conflicts in parameters
Best for merging many task-specific models
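
The three TIES steps (trim, elect a sign, average the agreeing values) can be sketched on toy lists as follows. This is a simplified illustration of the published procedure, not mergekit's code; `ties_merge`, `trim`, and the example values are assumptions.

```python
def trim(tv, density):
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, round(len(tv) * density))
    keep = set(sorted(range(len(tv)), key=lambda i: abs(tv[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(tv)]


def ties_merge(base, finetuned_models, density=0.5, lam=1.0):
    """Trim task vectors, elect a sign per parameter, average agreeing values."""
    trimmed = [trim([f - b for f, b in zip(m, base)], density)
               for m in finetuned_models]
    merged = []
    for i, b in enumerate(base):
        column = [tv[i] for tv in trimmed]
        # Elected sign is the sign of the summed (trimmed) updates
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [v for v in column if v != 0.0 and v * sign > 0]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + lam * delta)
    return merged


base = [0.0, 0.0, 0.0, 0.0]
model_a = [0.9, -0.1, 0.5, 0.0]
model_b = [-0.3, 0.8, 0.4, 0.05]
merged = ties_merge(base, [model_a, model_b], density=0.75)
print(merged)
```

In the first two positions the models disagree on sign, so only the update matching the elected sign is kept instead of the two cancelling each other out.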

DARE (Drop And REscale)

Randomly drops a fraction of the delta parameters (fine-tuned minus base)
Rescales the surviving deltas by 1 / density so their expected sum is unchanged
Reduces redundancy, maintains performance
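
A minimal sketch of drop-and-rescale on a list of deltas, assuming `density` is the keep probability (as in the configs in this guide). The function name `dare` and the fixed seed are illustrative choices.

```python
import random


def dare(delta, density, seed=0):
    """Keep each delta with probability `density`; rescale survivors by 1/density."""
    rng = random.Random(seed)
    return [d / density if rng.random() < density else 0.0 for d in delta]


delta = [0.2] * 10_000
sparse = dare(delta, density=0.5)
kept_fraction = sum(1 for d in sparse if d != 0.0) / len(sparse)
mean_after = sum(sparse) / len(sparse)
print(f"kept {kept_fraction:.1%}, mean delta {mean_after:.3f} (was 0.200)")
```

Roughly half of the entries are zeroed, yet the mean delta stays close to its original value because each survivor is doubled.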

2. Configuration Structure

# Basic structure
merge_method: <method>  # linear, slerp, ties, dare_ties, task_arithmetic
base_model: <path>      # Optional: base model for task arithmetic

models:
  - model: <path/to/model1>
    parameters:
      weight: <float>   # Merge weight
      density: <float>  # For TIES/DARE

  - model: <path/to/model2>
    parameters:
      weight: <float>

parameters:
  # Method-specific parameters

dtype: <dtype>  # bfloat16, float16, float32

# Optional
slices:  # Layer-wise merging
tokenizer:  # Tokenizer configuration

Merge Methods Guide

Linear Merge

Best for: Simple model combinations, equal weighting

merge_method: linear
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.4
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      weight: 0.3
dtype: bfloat16

SLERP Merge

Best for: Two models, smooth interpolation

merge_method: slerp
base_model: mistralai/Mistral-7B-v0.1  # slerp requires a base model
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t: 0.5  # 0.0 = base model, 1.0 = other model
dtype: bfloat16

Layer-specific SLERP:

merge_method: slerp
base_model: model_a  # slerp requires a base model
slices:
  - sources:
      - model: model_a
        layer_range: [0, 32]
      - model: model_b
        layer_range: [0, 32]
parameters:
  t:
    - filter: self_attn    # Attention layers
      value: 0.3
    - filter: mlp          # MLP layers
      value: 0.7
    - value: 0.5           # Default for other layers
dtype: bfloat16

Task Arithmetic

Best for: Combining specialized skills

merge_method: task_arithmetic
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1  # Math
    parameters:
      weight: 0.5
  - model: teknium/OpenHermes-2.5-Mistral-7B  # Chat
    parameters:
      weight: 0.3
  - model: ajibawa-2023/Code-Mistral-7B  # Code
    parameters:
      weight: 0.2
dtype: bfloat16

TIES-Merging

Best for: Many models, resolving conflicts

merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5  # Keep top 50% of parameters
      weight: 1.0
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 1.0
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      density: 0.5
      weight: 1.0
parameters:
  normalize: true
dtype: bfloat16

DARE Merge

Best for: Reducing redundancy

merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5    # Drop 50% of deltas
      weight: 0.6
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 0.4
parameters:
  int8_mask: true  # Use int8 for masks (saves memory)
dtype: bfloat16

Advanced Patterns

Layer-wise Merging

# Different models for different layers
merge_method: passthrough
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 16]   # First half
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [16, 32]  # Second half
dtype: bfloat16

MoE from Merged Models

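# Create Mixture of Experts (run with: mergekit-moe config.yml ./moe-model)
base_model: mistralai/Mistral-7B-v0.1
gate_mode: hidden  # initialize router gates from hidden states of the prompts below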
experts:
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "math"
      - "calculate"
  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat"
      - "conversation"
  - source_model: ajibawa-2023/Code-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
dtype: bfloat16

Tokenizer Merging

merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: custom/specialized-model
    parameters:
      weight: 0.5
dtype: bfloat16

tokenizer:
  source: "union"  # Combine vocabularies from both models
  tokens:
    <|special_token|>:
      source: "custom/specialized-model"

Best Practices

1. Model Compatibility

# ✅ Good: Same architecture
models = [
    "mistralai/Mistral-7B-v0.1",
    "teknium/OpenHermes-2.5-Mistral-7B",  # Both Mistral 7B
]

# ❌ Bad: Different architectures
models = [
    "meta-llama/Llama-2-7b-hf",  # Llama
    "mistralai/Mistral-7B-v0.1",  # Mistral (incompatible!)
]
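
One way to check compatibility before merging is to compare the shape-defining fields of each model's config. The sketch below works on plain dicts; the key list and the hand-typed values are assumptions for illustration, so verify them against your own checkpoints.

```python
SHAPE_KEYS = (
    "model_type", "hidden_size", "num_hidden_layers",
    "num_attention_heads", "num_key_value_heads",
    "intermediate_size", "vocab_size",
)


def config_mismatches(config_a, config_b, keys=SHAPE_KEYS):
    """Return {field: (a, b)} for every shape-defining field that differs."""
    return {
        k: (config_a.get(k), config_b.get(k))
        for k in keys
        if config_a.get(k) != config_b.get(k)
    }


# Hand-typed values for illustration only; check each model's config.json
llama2_7b = {"model_type": "llama", "hidden_size": 4096, "num_hidden_layers": 32,
             "num_attention_heads": 32, "num_key_value_heads": 32,
             "intermediate_size": 11008, "vocab_size": 32000}
mistral_7b = {"model_type": "mistral", "hidden_size": 4096, "num_hidden_layers": 32,
              "num_attention_heads": 32, "num_key_value_heads": 8,
              "intermediate_size": 14336, "vocab_size": 32000}

mismatches = config_mismatches(llama2_7b, mistral_7b)
print(mismatches)
```

With real checkpoints the dicts could come from `transformers.AutoConfig.from_pretrained(...).to_dict()`. An empty result means the tensor shapes line up, though it does not guarantee the merge will be useful.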

2. Weight Selection

# ✅ Good: Weights sum to 1.0
models:
  - model: model_a
    parameters:
      weight: 0.6
  - model: model_b
    parameters:
      weight: 0.4  # 0.6 + 0.4 = 1.0

# ⚠️  Acceptable: Weights don't sum to 1 (for task arithmetic)
models:
  - model: model_a
    parameters:
      weight: 0.8
  - model: model_b
    parameters:
      weight: 0.8  # May boost performance

Unsupervised Coefficient Tuning (no labeled data needed)

Instead of manual search, use generation consistency: merge with several candidate coefficients, generate responses on a small unlabeled subset, and pick the coefficient whose outputs are most similar to those of its neighbors. Consistent outputs signal a stable, well-performing merge region (AdaMMS, arXiv:2503.23733).

# Pseudocode: see references/coefficient-tuning.md for the full implementation
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]
merged_paths, responses = {}, {}
for alpha in candidates:
    merged_paths[alpha] = merge_with_coefficient(alpha, model_a, model_b, output_dir)
    responses[alpha]    = generate_responses(merged_paths[alpha], eval_prompts)

# Score each alpha by similarity to its neighbors (alpha ± 0.1)
best_alpha = max(candidates, key=lambda a: generation_consistency(a, responses, candidates))

See [references/coefficient-tuning.md](references/coefficient-tuning.md) for the full algorithm, similarity metrics, multi-coefficient search, and end-to-end pipeline.

3. Method Selection

# Choose merge method based on use case:

# 2 models, smooth blend → SLERP
merge_method = "slerp"

# 3+ models, simple average → Linear
merge_method = "linear"

# Multiple task-specific models → Task Arithmetic or TIES
merge_method = "ties"

# Want to reduce redundancy → DARE
merge_method = "dare_ties"

4. Density Tuning (TIES/DARE)

# Start conservative (keep more parameters)
parameters:
  density: 0.8  # Keep 80%

# If performance is good, increase sparsity
parameters:
  density: 0.5  # Keep 50%

# If performance degrades, reduce sparsity
parameters:
  density: 0.9  # Keep 90%

5. Layer-specific Merging

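Preserve the base model's first and last layers (often best left untouched) and blend only the middle layers, for example by splitting the model into slices and applying the merge method with per-slice parameters. Note that merge_method: passthrough copies layers unchanged rather than averaging them, so it stacks layers from different models as in the Layer-wise Merging pattern above.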

Evaluation & Testing

Benchmark Merged Models

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Test on various tasks
test_prompts = {
    "math": "Calculate: 25 * 17 =",
    "code": "Write a Python function to reverse a string:",
    "chat": "What is the capital of France?",
}

for task, prompt in test_prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)  # cap generated tokens, not prompt + output
    print(f"{task}: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")

Common Benchmarks

Open LLM Leaderboard: General capabilities
MT-Bench: Multi-turn conversation
MMLU: Multitask accuracy
HumanEval: Code generation
GSM8K: Math reasoning

Production Deployment

Save and Upload

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./merged-model")
tokenizer = AutoTokenizer.from_pretrained("./merged-model")

# Upload to HuggingFace Hub
model.push_to_hub("username/my-merged-model")
tokenizer.push_to_hub("username/my-merged-model")

Quantize Merged Model

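# Convert to GGUF with llama.cpp (recent releases ship convert_hf_to_gguf.py; older ones used convert.py)
python convert_hf_to_gguf.py ./merged-model --outtype f16 --outfile merged-model.gguf

# Quantize the GGUF file with llama.cpp's llama-quantize tool
llama-quantize merged-model.gguf merged-model-Q4_K_M.gguf Q4_K_M

# Quantize with GPTQ (quantize_gptq.py is a placeholder for your own script)
python quantize_gptq.py ./merged-model --bits 4 --group_size 128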

Common Pitfalls

Mismatched architectures — only merge models that share the same architecture (e.g., don't mix Llama and Mistral).
Over-weighting one model (e.g., 0.95 / 0.05) — keep weights balanced, typically in the 0.3–0.7 range.
Skipping evaluation — always benchmark a merged model before deploying (see the Evaluation & Testing section above).

Resources

mergekit GitHub: https://github.com/arcee-ai/mergekit
HuggingFace Tutorial: https://huggingface.co/blog/mlabonne/merge-models
LazyMergekit: Automated merging notebook
TIES Paper: https://arxiv.org/abs/2306.01708
DARE Paper: https://arxiv.org/abs/2311.03099

Unsupervised Coefficient Tuning via Generation Consistency

Reference for the generation-consistency-based coefficient selection method introduced in:

"AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization"

Yiyang Du, Xiaochen Wang, Chi Chen, Jiabo Ye, Yiru Wang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Zhifang Sui, Maosong Sun, Yang Liu. CVPR 2025. arXiv:2503.23733

Proposes an unsupervised proxy—generation consistency—to automatically select merging coefficients without labeled data or manual search.

---

Problem: Coefficient Selection in Model Merging

Most merging methods (Task Arithmetic, TIES, DARE, SLERP) expose one or more scalar coefficients (e.g., weight, density, lambda, t) that strongly affect final quality. Common approaches:

Approach	Drawback
Manual/intuition	Unreliable, requires human expertise
Grid search with eval set	Requires labeled data; expensive (N merges × eval cost)
Bayesian optimization	Needs ground-truth signal per trial
Generation Consistency (this method)	Unsupervised, label-free, small unlabeled data subset only

---

Core Idea: Generation Consistency

Key insight: When a coefficient value is near-optimal, merged models at nearby coefficient values produce similar outputs—because the loss landscape is smooth around a good solution. Near poor solutions (too high or too low coefficient), outputs diverge sharply as the model is at an unstable point.

For a candidate coefficient α, compute:

ConsistencyScore(α) = ½ · [ avg_similarity(outputs(α), outputs(α - δ))
                          + avg_similarity(outputs(α), outputs(α + δ)) ]

where δ is a small step size (e.g., 0.1) and similarity is measured over a small unlabeled dataset. The coefficient with the highest consistency score is selected.
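
To make the scoring concrete, here is a toy run with hand-written outputs and Jaccard token overlap as the similarity. The outputs are invented for illustration, and averaging over whichever neighbors exist (so boundary candidates remain comparable) is a choice made in this sketch.

```python
def jaccard(a, b):
    """Token-set overlap between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def consistency(alpha, outputs, delta=0.1):
    """Mean similarity between outputs at alpha and at its available neighbors."""
    neighbors = [round(alpha + s * delta, 2) for s in (-1, 1)]
    neighbors = [n for n in neighbors if n in outputs]
    sims = [
        sum(jaccard(x, y) for x, y in zip(outputs[alpha], outputs[n])) / len(outputs[alpha])
        for n in neighbors
    ]
    return sum(sims) / len(sims) if sims else 0.0


# One invented response per coefficient; real runs use 50-200 prompts
outputs = {
    0.3: ["capital city lyon"],
    0.4: ["paris is the capital"],
    0.5: ["paris is the capital"],
    0.6: ["paris is the capital city"],
    0.7: ["the city of lyon maybe"],
}
scores = {a: consistency(a, outputs) for a in outputs}
best = max(scores, key=scores.get)
print(best, scores)
```

The outputs around 0.5 barely change between neighbors, so that coefficient gets the highest score, while the candidates near either end score lower because their neighbors diverge.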

---

Algorithm

Step 1: Define Candidate Coefficients

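# Example: searching over SLERP t or Task Arithmetic lambda
alpha_min, alpha_max = 0.2, 0.8
step = 0.1
# Build the grid from integer steps and round each value: float accumulation
# (e.g. 0.30000000000000004) would otherwise break the dictionary lookups in Step 4
num_steps = int(round((alpha_max - alpha_min) / step))
candidates = [round(alpha_min + i * step, 2) for i in range(num_steps + 1)]
# candidates = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]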

Step 2: Merge Models for Each Candidate

import os
import subprocess

import yaml  # pip install pyyaml

def merge_with_coefficient(alpha, model_a, model_b, output_dir, method="slerp"):
    """Merge models using mergekit with a specific coefficient.

    model_a is treated as the base model for both supported methods.
    """
    if method == "slerp":
        config = {
            "merge_method": "slerp",
            "base_model": model_a,  # slerp requires a base model; t=0 returns it
            "slices": [{"sources": [
                # layer_range assumes a 32-layer model such as Mistral-7B
                {"model": model_a, "layer_range": [0, 32]},
                {"model": model_b, "layer_range": [0, 32]}
            ]}],
            "parameters": {"t": alpha},
            "dtype": "bfloat16"
        }
    elif method == "task_arithmetic":
        config = {
            "merge_method": "task_arithmetic",
            "base_model": model_a,  # treat model_a as base
            "models": [{"model": model_b, "parameters": {"weight": alpha}}],
            "dtype": "bfloat16"
        }
    else:
        # Fail early instead of hitting an undefined `config` below
        raise ValueError(f"Unsupported merge method: {method}")

    config_path = f"/tmp/merge_config_{alpha:.2f}.yaml"
    with open(config_path, "w") as f:
        yaml.dump(config, f)

    out_path = os.path.join(output_dir, f"merged_alpha_{alpha:.2f}")
    subprocess.run(
        ["mergekit-yaml", config_path, out_path, "--cuda"],
        check=True
    )
    return out_path


# Merge all candidates
output_root = "/tmp/merge_candidates"
os.makedirs(output_root, exist_ok=True)

merged_paths = {}
for alpha in candidates:
    path = merge_with_coefficient(alpha, "model_a_path", "model_b_path", output_root)
    merged_paths[alpha] = path

Step 3: Run Inference on Unlabeled Subset

Only a small subset (~50–200 samples) of unlabeled text is needed. The data does not need labels.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def generate_responses(model_path, prompts, max_new_tokens=128, batch_size=8):
    """Generate responses for all prompts using the merged model."""
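    # Left padding keeps prompts adjacent to the generated tokens for decoder-only models
    tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
    if tokenizer.pad_token is None:
        # Many causal LMs (e.g. Mistral) ship without a pad token; reuse EOS for batching
        tokenizer.pad_token = tokenizer.eos_token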
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )
    model.eval()

    all_responses = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True).to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,       # greedy for determinism
                pad_token_id=tokenizer.eos_token_id
            )
        decoded = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        all_responses.extend(decoded)

    del model  # free GPU memory before loading next
    torch.cuda.empty_cache()
    return all_responses


# Small unlabeled evaluation prompts (no labels needed)
eval_prompts = [
    "Explain the concept of gradient descent.",
    "Write a Python function to find the maximum of a list.",
    # ... 50-200 prompts total
]

# Generate responses for each candidate
all_responses = {}
for alpha in candidates:
    print(f"Generating responses for alpha={alpha:.2f} ...")
    all_responses[alpha] = generate_responses(merged_paths[alpha], eval_prompts)

Step 4: Compute Generation Consistency Scores

from rouge_score import rouge_scorer

def text_similarity(text_a, text_b, metric="rougeL"):
    """Compute similarity between two text strings."""
    if metric == "rougeL":
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
        score = scorer.score(text_a, text_b)
        return score["rougeL"].fmeasure
    elif metric == "token_overlap":
        tokens_a = set(text_a.lower().split())
        tokens_b = set(text_b.lower().split())
        if not tokens_a or not tokens_b:
            return 0.0
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def generation_consistency(alpha, all_responses, candidates, delta=0.1, metric="rougeL"):
    """
    Compute generation consistency for a given alpha.

    Compares model at `alpha` against its nearest neighbors:
    alpha - delta and alpha + delta.
    """
    responses_curr = all_responses[alpha]
    neighbor_alphas = []

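    # Match neighbors on rounded keys so float noise (e.g. 0.30000000000000004) still resolves
    by_rounded = {round(k, 2): k for k in all_responses}
    left_alpha = round(alpha - delta, 2)
    right_alpha = round(alpha + delta, 2)

    if left_alpha in by_rounded:
        neighbor_alphas.append(by_rounded[left_alpha])
    if right_alpha in by_rounded:
        neighbor_alphas.append(by_rounded[right_alpha])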

    if not neighbor_alphas:
        return 0.0  # boundary case with no neighbors

    total_sim = 0.0
    count = 0
    for n_alpha in neighbor_alphas:
        responses_neighbor = all_responses[n_alpha]
        pair_sims = [
            text_similarity(r_curr, r_neighbor, metric=metric)
            for r_curr, r_neighbor in zip(responses_curr, responses_neighbor)
        ]
        total_sim += sum(pair_sims) / len(pair_sims)
        count += 1

    return total_sim / count


# Compute consistency scores for all candidates (boundary candidates have a single neighbor)
consistency_scores = {}
for alpha in candidates:
    score = generation_consistency(alpha, all_responses, candidates, delta=0.1)
    consistency_scores[alpha] = score
    print(f"alpha={alpha:.2f}  consistency={score:.4f}")

Step 5: Select Best Coefficient

best_alpha = max(consistency_scores, key=consistency_scores.get)
print(f"\nBest coefficient: alpha={best_alpha:.2f} (consistency={consistency_scores[best_alpha]:.4f})")

# Optionally visualize
import matplotlib.pyplot as plt

alphas_sorted = sorted(consistency_scores.keys())
scores_sorted = [consistency_scores[a] for a in alphas_sorted]

plt.figure(figsize=(8, 4))
plt.plot(alphas_sorted, scores_sorted, marker="o")
plt.axvline(best_alpha, color="red", linestyle="--", label=f"Best α={best_alpha:.2f}")
plt.xlabel("Merge Coefficient (α)")
plt.ylabel("Generation Consistency Score")
plt.title("Unsupervised Coefficient Selection via Generation Consistency")
plt.legend()
plt.tight_layout()
plt.savefig("consistency_curve.png", dpi=150)
plt.show()

---

Similarity Metrics

Three options depending on speed vs. quality trade-off:

Metric	Speed	Quality	Notes
Token overlap (Jaccard)	Fast	Low	Suitable for quick prototyping
ROUGE-L	Medium	Medium	Good balance; install `rouge-score`
BERTScore	Slow	High	Best semantic sensitivity; requires GPU

# BERTScore alternative (higher quality)
from bert_score import score as bert_score

def bertscore_similarity_batch(texts_a, texts_b, lang="en"):
    P, R, F1 = bert_score(texts_a, texts_b, lang=lang, verbose=False)
    return F1.mean().item()

# Replace text_similarity call with:
# score = bertscore_similarity_batch(responses_curr, responses_neighbor)
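
If installing `rouge-score` is not an option, a ROUGE-L-style F-measure can be approximated with a longest-common-subsequence routine over whitespace tokens. This is a dependency-free sketch; its tokenization is simpler than the library's, so values may differ slightly. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(text_a, text_b):
    """F-measure of LCS-based precision and recall over whitespace tokens."""
    ta, tb = text_a.lower().split(), text_b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(f"{score:.4f}")
```

Here the longest common subsequence is "the cat on the mat" (5 of 6 tokens on each side), so the score is 5/6.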

---

Applying to Different Merge Methods

SLERP (`t` parameter)

# Search t ∈ [0.0, 1.0]; interior candidates avoid boundary collapse
candidates = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

Task Arithmetic / TIES (`lambda` or `weight`)

# Lambda can exceed 1.0; search a wider range
candidates = [0.3, 0.5, 0.7, 1.0, 1.2, 1.5]

Multi-coefficient (e.g., two models with different weights)

When searching over a 2D coefficient space (e.g., w1 and w2 = 1 - w1), the 1D search over w1 is sufficient since w2 is determined:

# 1D search: w1 ∈ [0.2, 0.8], w2 = 1 - w1
candidates = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
# Map alpha -> (alpha, 1-alpha) for the merge config
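
The mapping from a single searched value to a two-model config could look like the sketch below. It builds plain dicts in the same shape as the linear merge configs shown in this guide; `linear_config` and the placeholder model paths are assumptions.

```python
def linear_config(w1, model_a, model_b, dtype="bfloat16"):
    """Build a two-model linear merge config where w2 = 1 - w1."""
    w2 = round(1.0 - w1, 4)  # rounding avoids values like 0.30000000000000004
    return {
        "merge_method": "linear",
        "models": [
            {"model": model_a, "parameters": {"weight": w1}},
            {"model": model_b, "parameters": {"weight": w2}},
        ],
        "dtype": dtype,
    }


candidates = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
configs = {w1: linear_config(w1, "model_a_path", "model_b_path") for w1 in candidates}
print(configs[0.3]["models"])
```

Each dict can then be written out with `yaml.dump` and passed to `mergekit-yaml`, as in Step 2.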

For higher-dimensional searches (3+ models), perform coordinate-wise consistency optimization:

def coordinate_wise_search(base_weights, coord_idx, candidates, all_responses_fn, delta=0.1):
    """Optimize one coefficient at a time, holding others fixed."""
    best_score = -1
    best_alpha = base_weights[coord_idx]
    for alpha in candidates:
        weights = base_weights.copy()
        weights[coord_idx] = alpha
        # Normalize if weights should sum to 1
        weights = [w / sum(weights) for w in weights]
        responses = all_responses_fn(weights)
        score = generation_consistency_from_responses(responses, delta)
        if score > best_score:
            best_score = score
            best_alpha = alpha
    return best_alpha, best_score

Note: coordinate_wise_search is illustrative — it calls user-supplied helpers (all_responses_fn, which merges and generates for a given weight vector, and generation_consistency_from_responses) that you assemble from the building blocks in Steps 2–4 above.

---

Full Pipeline (End-to-End)

import os
from typing import List, Dict

def unsupervised_coefficient_search(
    model_a: str,
    model_b: str,
    eval_prompts: List[str],
    method: str = "slerp",
    candidates: List[float] = None,
    delta: float = 0.1,
    similarity_metric: str = "rougeL",
    output_root: str = "/tmp/merge_candidates",
    max_new_tokens: int = 128,
) -> Dict:
    """
    Unsupervised coefficient search using generation consistency.

    Args:
        model_a: Path or HuggingFace ID of first model (or base model).
        model_b: Path or HuggingFace ID of second model.
        eval_prompts: Small set of unlabeled prompts (50-200 recommended).
        method: Merge method ('slerp' or 'task_arithmetic', as supported by merge_with_coefficient).
        candidates: List of coefficient values to search over.
        delta: Step size for neighbor comparison.
        similarity_metric: 'rougeL' or 'token_overlap' (as supported by text_similarity).
        output_root: Directory to store temporary merged models.
        max_new_tokens: Max tokens to generate per prompt.

    Returns:
        dict with 'best_alpha', 'best_path', 'scores', 'all_responses'.
    """
    if candidates is None:
        # Integer steps avoid float drift and an accidental extra endpoint
        candidates = [round(0.2 + 0.1 * i, 2) for i in range(7)]  # 0.2 ... 0.8

    os.makedirs(output_root, exist_ok=True)

    # Step 1-2: Merge all candidates
    merged_paths = {}
    for alpha in candidates:
        merged_paths[alpha] = merge_with_coefficient(alpha, model_a, model_b, output_root, method)

    # Step 3: Generate responses
    all_responses = {}
    for alpha in candidates:
        all_responses[alpha] = generate_responses(merged_paths[alpha], eval_prompts, max_new_tokens)

    # Step 4: Score consistency
    scores = {}
    for alpha in candidates:
        scores[alpha] = generation_consistency(alpha, all_responses, candidates, delta, similarity_metric)

    # Step 5: Select best
    best_alpha = max(scores, key=scores.get)

    return {
        "best_alpha": best_alpha,
        "best_path": merged_paths[best_alpha],
        "scores": scores,
        "all_responses": all_responses,
    }


# Usage
result = unsupervised_coefficient_search(
    model_a="mistralai/Mistral-7B-v0.1",
    model_b="teknium/OpenHermes-2.5-Mistral-7B",
    eval_prompts=eval_prompts,   # ~100 unlabeled prompts
    method="slerp",
    candidates=[0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
)
print(f"Best coefficient: {result['best_alpha']}")
print(f"Best model path: {result['best_path']}")

---

Practical Tips

Dataset selection: Any unlabeled text in the target domain works. Even 50 samples often suffice; diminishing returns beyond 200.

Step size δ: Use δ = 0.1 for a search grid with step 0.1. For finer grids (e.g., step 0.05), set δ = 0.05 accordingly.

Boundary candidates: Candidates at alpha_min or alpha_max have only one neighbor, so their scores are one-sided and tend to be underestimated. Consider excluding them from final selection, or widening the search range.
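
Excluding the boundary candidates from the final selection can be done with a small helper such as the one below. The scores are made up for illustration, and `select_interior_best` is a hypothetical name.

```python
def select_interior_best(scores):
    """Pick the highest-scoring alpha, ignoring the smallest and largest candidates."""
    alphas = sorted(scores)
    interior = alphas[1:-1] if len(alphas) > 2 else alphas
    return max(interior, key=lambda a: scores[a])


# Made-up consistency scores where a one-sided boundary score happens to be highest
scores = {0.2: 0.91, 0.3: 0.74, 0.4: 0.82, 0.5: 0.88, 0.6: 0.80, 0.7: 0.69, 0.8: 0.55}
best = select_interior_best(scores)
print(best)
```

A plain `max` over all candidates would return 0.2 here, whose score rests on a single neighbor; restricting to the interior selects 0.5 instead.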

Compute cost: N merges + N inference passes. Inference is the bottleneck; use greedy decoding and short outputs to keep it fast. With N=7 candidates and 100 prompts, this typically takes 10–30 minutes on a single GPU.

When consistency is flat: If the curve shows no clear peak, the two models may be too dissimilar or too similar. Try adjusting density/dropout parameters first, then re-search.

---

References

Paper: Du et al., AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization, CVPR 2025. arXiv:2503.23733
ROUGE: pip install rouge-score
BERTScore: pip install bert-score
mergekit: https://github.com/arcee-ai/mergekit

Model Merging Evaluation

Complete guide to benchmarking and testing merged models based on research best practices.

Benchmark Suites
Evaluation Metrics
Testing Methodology
Comparison Framework
Quality Assurance

Benchmark Suites

Open LLM Leaderboard

URL: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Tasks (6 benchmarks):

1. ARC (AI2 Reasoning Challenge): 25-shot, science questions
2. HellaSwag: 10-shot, commonsense reasoning
3. MMLU (Massive Multitask Language Understanding): 5-shot, 57 subjects
4. TruthfulQA: 0-shot, factual accuracy
5. Winogrande: 5-shot, commonsense reasoning
6. GSM8K: 5-shot, grade-school math

Running Evaluation:

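from lm_eval import evaluator

model = "path/to/merged/model"

# (task, few-shot count, metric key) following the leaderboard setup listed above.
# Task names and metric keys assume lm-evaluation-harness v0.4+; older releases use
# names such as "hendrycksTest-*" and "truthfulqa_mc". If a key is missing, inspect
# results["results"][task] to see what your version reports.
leaderboard_tasks = [
    ("arc_challenge", 25, "acc_norm,none"),
    ("hellaswag", 10, "acc_norm,none"),
    ("mmlu", 5, "acc,none"),
    ("truthfulqa_mc2", 0, "acc,none"),
    ("winogrande", 5, "acc,none"),
    ("gsm8k", 5, "exact_match,strict-match"),
]

scores = {}
for task, shots, metric in leaderboard_tasks:
    # One call per task so each benchmark gets its own few-shot count
    results = evaluator.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model},dtype=float16",
        tasks=[task],
        num_fewshot=shots,
        batch_size=8
    )
    scores[task] = results["results"][task][metric]

# Average score across the six tasks
avg_score = sum(scores.values()) / len(scores)
print(f"Average: {avg_score:.2%}")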

MT-Bench

Focus: Multi-turn conversation quality

Installation:

git clone https://github.com/lm-sys/FastChat
cd FastChat
pip install -e ".[model_worker,llm_judge]"

Running:

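# The MT-Bench scripts live in fastchat/llm_judge; judging needs OPENAI_API_KEY set
cd fastchat/llm_judge

# Generate responses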
python gen_model_answer.py \
  --model-path path/to/merged/model \
  --model-id merged_model

# Judge with GPT-4
python gen_judgment.py \
  --model-list merged_model \
  --judge-model gpt-4

# View scores
python show_result.py

Metrics:

Turn 1 score (1-10)
Turn 2 score (1-10)
Average score

MMLU (Detailed)

Subjects (57 total):

STEM: Math, Physics, Chemistry, Biology, Computer Science
Humanities: History, Philosophy, Law
Social Sciences: Economics, Psychology, Sociology
Other: Professional subjects (Medicine, Accounting, etc.)

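from lm_eval import evaluator

model = "path/to/merged/model"

# Run all MMLU subjects (task and metric names assume lm-evaluation-harness v0.4+;
# older releases use "hendrycksTest-*" and a plain "acc" key)
results = evaluator.simple_evaluate(
    model="hf",
    model_args=f"pretrained={model}",
    tasks=["mmlu"],
    num_fewshot=5
)

# Subject breakdown (entries named "mmlu_<name>" also include category aggregates such as mmlu_stem)
for task, score in results['results'].items():
    if task.startswith('mmlu_'):
        subject = task.replace('mmlu_', '')
        print(f"{subject}: {score['acc,none']:.2%}")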

HumanEval (Code)

Focus: Python code generation

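from human_eval.data import write_jsonl, read_problems
from human_eval.evaluation import evaluate_functional_correctness
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/merged/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
hf_model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

def complete(prompt, max_new_tokens=256):
    """Greedy completion; returns only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(hf_model.device)
    outputs = hf_model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Generate completions (one sample per task, enough for pass@1)
problems = read_problems()
samples = []

for task_id, problem in problems.items():
    samples.append({
        'task_id': task_id,
        'completion': complete(problem['prompt'])
    })

write_jsonl("samples.jsonl", samples)

# Evaluate (human-eval executes generated code; run it in a sandbox and
# enable execution as described in that project's README)
results = evaluate_functional_correctness("samples.jsonl")
print(f"Pass@1: {results['pass@1']:.2%}")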

Evaluation Metrics

Performance Metrics

Accuracy: Correct predictions / total predictions

def accuracy(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(predictions)

Perplexity: Language modeling quality (lower is better)

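import torch

def perplexity(model, tokenizer, text):
    tokens = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss
        loss = model(**tokens, labels=tokens['input_ids']).loss
    return torch.exp(loss).item()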

BLEU Score: Translation/generation quality

from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

score = sentence_bleu(reference, candidate)

Capability Retention

Test: Does merged model retain parent capabilities?

def test_capability_retention(merged_model, parent_models, test_suite):
    """Check if merged model maintains parent capabilities."""
    results = {}

    # Baseline: Test parent models
    for i, parent in enumerate(parent_models):
        parent_score = evaluate(parent, test_suite)
        results[f'parent_{i}'] = parent_score

    # Test merged model
    merged_score = evaluate(merged_model, test_suite)
    results['merged'] = merged_score

    # Retention percentage
    avg_parent_score = sum(s for k, s in results.items() if k.startswith('parent')) / len(parent_models)
    retention = merged_score / avg_parent_score

    print(f"Capability Retention: {retention:.1%}")
    return retention >= 0.95  # 95% retention threshold

Conflict Detection

Test: Does model show conflicting behaviors?

def test_conflicts(model, test_pairs):
    """Test for contradictory outputs."""
    conflicts = []

    for question_a, question_b, expected_consistency in test_pairs:
        answer_a = model.generate(question_a)
        answer_b = model.generate(question_b)

        # Check consistency
        is_consistent = check_semantic_similarity(answer_a, answer_b)

        if is_consistent != expected_consistency:
            conflicts.append((question_a, question_b, answer_a, answer_b))

    conflict_rate = len(conflicts) / len(test_pairs)
    print(f"Conflict Rate: {conflict_rate:.1%}")

    return conflict_rate < 0.05  # <5% conflicts acceptable

Testing Methodology

Pre-Merge Testing

Before merging, establish baselines:

# Test parent models
parent_1_scores = evaluate(parent_1, benchmark_suite)
parent_2_scores = evaluate(parent_2, benchmark_suite)

# Expected range for merged model
min_expected = min(parent_1_scores, parent_2_scores)
max_expected = max(parent_1_scores, parent_2_scores)

print(f"Expected merged score: {min_expected:.2f} - {max_expected:.2f}")
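
The snippet above assumes each parent yields a single scalar score. When `evaluate` returns one score per benchmark, the expected range is better computed per benchmark. A minimal sketch, assuming a hypothetical `{parent: {benchmark: score}}` layout with made-up numbers:

```python
def expected_ranges(parent_scores):
    """Per-benchmark (min, max) across parents.

    parent_scores: {parent_name: {benchmark_name: score}} (hypothetical layout).
    Only benchmarks that every parent was scored on are included.
    """
    benchmarks = set.intersection(*(set(s) for s in parent_scores.values()))
    return {
        b: (min(s[b] for s in parent_scores.values()),
            max(s[b] for s in parent_scores.values()))
        for b in sorted(benchmarks)
    }

# Made-up parent scores for illustration
parents = {
    "parent_1": {"mmlu": 0.62, "gsm8k": 0.35},
    "parent_2": {"mmlu": 0.58, "gsm8k": 0.51},
}
for name, (low, high) in expected_ranges(parents).items():
    print(f"{name}: {low:.2f} - {high:.2f}")
```

A merged score below the lower bound on any single benchmark is a more specific warning sign than a drop in the overall average.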

Post-Merge Testing

Comprehensive evaluation:

def comprehensive_eval(merged_model):
    """Full evaluation suite."""
    results = {}

    # 1. General capabilities
    results['open_llm'] = evaluate_open_llm(merged_model)

    # 2. Conversation
    results['mt_bench'] = evaluate_mt_bench(merged_model)

    # 3. Domain-specific
    results['math'] = evaluate_math(merged_model)  # GSM8K, MATH
    results['code'] = evaluate_code(merged_model)  # HumanEval
    results['reasoning'] = evaluate_reasoning(merged_model)  # ARC, HellaSwag

    # 4. Safety
    results['safety'] = evaluate_safety(merged_model)  # TruthfulQA

    return results

A/B Testing

Compare merged model vs parents:

def ab_test(model_a, model_b, test_prompts, n_users=100):
    """User preference testing."""
    preferences = {'a': 0, 'b': 0, 'tie': 0}

    for prompt in test_prompts:
        response_a = model_a.generate(prompt)
        response_b = model_b.generate(prompt)

        # Simulated user preference (or use GPT-4 as judge)
        preference = judge_responses(prompt, response_a, response_b)
        preferences[preference] += 1

    a_win_rate = preferences['a'] / (preferences['a'] + preferences['b'] + preferences['tie'])

    print(f"Model A Win Rate: {a_win_rate:.1%}")
    print(f"Tie Rate: {preferences['tie'] / len(test_prompts):.1%}")

    return a_win_rate
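
A raw win rate from a few dozen prompts is noisy, so it helps to report an interval alongside it. The sketch below uses the Wilson score interval; the function name is illustrative, and ties are counted in the total, as in `ab_test` above.

```python
import math

def wilson_interval(wins, total, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly 95% confidence)."""
    if total == 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(wins=50, total=100)
print(f"Win rate 50.0%, interval: {low:.1%} - {high:.1%}")
```

If the interval straddles the acceptance threshold, more prompts are needed before drawing a conclusion.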

Comparison Framework

Score Comparison Table

import pandas as pd

def compare_models(models, benchmarks):
    """Create comparison table."""
    results = {}

    for model_name, model_path in models.items():
        results[model_name] = {}

        for benchmark_name, benchmark_fn in benchmarks.items():
            score = benchmark_fn(model_path)
            results[model_name][benchmark_name] = score

    # Create DataFrame
    df = pd.DataFrame(results).T

    # Add average column
    df['Average'] = df.mean(axis=1)

    # Print as a Markdown table (DataFrame.to_markdown needs the tabulate package)
    print(df.to_markdown())

    return df

# Usage
models = {
    'Parent 1': 'path/to/parent1',
    'Parent 2': 'path/to/parent2',
    'Merged (SLERP t=0.5)': 'path/to/merged_0.5',
    'Merged (TIES)': 'path/to/merged_ties'
}

benchmarks = {
    'MMLU': evaluate_mmlu,
    'ARC': evaluate_arc,
    'GSM8K': evaluate_gsm8k
}

df = compare_models(models, benchmarks)

Statistical Significance

from scipy import stats

def is_improvement_significant(scores_a, scores_b, alpha=0.05):
    """Test if improvement is statistically significant."""
    # Paired t-test
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

    is_significant = p_value < alpha
    improvement = (sum(scores_b) - sum(scores_a)) / len(scores_a)

    print(f"Mean improvement: {improvement:.2f}")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant: {is_significant}")

    return is_significant
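
The paired t-test assumes roughly normal score differences, which is hard to justify with only a handful of benchmarks. An exact sign-flip permutation test is one alternative that avoids that assumption. This is a sketch using only the standard library; the function name is chosen here, and the approach is a substitute for, not a restatement of, the t-test above.

```python
from itertools import product

def paired_permutation_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired score differences."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Made-up scores on five benchmarks: model B is 0.1 higher everywhere
p_value = paired_permutation_pvalue([0.5] * 5, [0.6] * 5)
print(f"P-value: {p_value:.4f}")
```

With n paired scores the smallest possible p-value is 2 / 2^n, so at least six benchmarks are needed before this test can fall below 0.05. The enumeration grows as 2^n, which stays practical up to roughly 20 pairs.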

Quality Assurance

Regression Testing

Ensure no capability loss:

def regression_test(merged_model, parent_models, critical_tests):
    """Check for performance regressions."""
    regressions = []

    for test_name, test_fn in critical_tests.items():
        # Parent scores
        parent_scores = [test_fn(p) for p in parent_models]
        min_parent_score = min(parent_scores)

        # Merged score
        merged_score = test_fn(merged_model)

        # Regression if merged < min parent
        if merged_score < min_parent_score * 0.95:  # 5% tolerance
            regressions.append({
                'test': test_name,
                'parents': parent_scores,
                'merged': merged_score,
                'delta': merged_score - min_parent_score
            })

    if regressions:
        print(f"⚠️  {len(regressions)} regressions detected:")
        for r in regressions:
            print(f"  - {r['test']}: {r['delta']:.2%} drop")

    return len(regressions) == 0

Sanity Checks

def sanity_checks(model):
    """Basic functionality tests."""
    tests = {
        'generates': lambda: model.generate("Hello") != "",
        'coherent': lambda: len(model.generate("The capital of France is")) > 5,
        'follows_instruction': lambda: "paris" in model.generate("What is the capital of France?").lower(),
        'no_repetition': lambda: not has_repetition(model.generate("Tell me about AI", max_length=100))
    }

    results = {name: test() for name, test in tests.items()}

    passed = sum(results.values())
    total = len(results)

    print(f"Sanity Checks: {passed}/{total} passed")

    for name, result in results.items():
        status = "✓" if result else "✗"
        print(f"  {status} {name}")

    return passed == total
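
`has_repetition` is not defined in the snippet above. One possible stand-in is a word n-gram counter; the thresholds below are arbitrary starting points, not tuned values.

```python
def has_repetition(text, n=4, max_repeats=3):
    """Return True if any word n-gram occurs more than `max_repeats` times."""
    words = text.lower().split()
    counts = {}
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        counts[ngram] = counts.get(ngram, 0) + 1
        if counts[ngram] > max_repeats:
            return True
    return False

print(has_repetition("The cat sat on the mat and looked at the dog."))
print(has_repetition("I am a model " * 10))
```

Degenerate loops are a common symptom of a bad merge, so a cheap check like this can run before the longer benchmarks.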

Deployment Checklist

Before deploying merged model:

[ ] Open LLM Leaderboard score >= min(parent scores)
[ ] MT-Bench score >= avg(parent scores)
[ ] Domain-specific benchmarks pass
[ ] No regressions in critical tests
[ ] Sanity checks all pass
[ ] A/B test win rate >= 45%
[ ] Safety checks pass (TruthfulQA)
[ ] Manual testing with diverse prompts
[ ] Model size acceptable for deployment
[ ] Inference speed acceptable

Benchmark Interpretation

Open LLM Leaderboard Ranges

Score	Quality
<60	Poor - likely broken
60-65	Below average
65-70	Average
70-75	Good
75-80	Excellent
>80	State-of-art

MT-Bench Ranges

Score	Quality
<6.0	Poor conversation
6.0-7.0	Acceptable
7.0-8.0	Good
8.0-9.0	Excellent
>9.0	Near human-level

Resources

lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
MT-Bench: https://github.com/lm-sys/FastChat
HumanEval: https://github.com/openai/human-eval
Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Model Merging Examples

Real-world merge configurations from successful models on HuggingFace and research papers.

Successful Merges
Mixtral-based Merges
Llama-based Merges
Task-Specific Merges
Production Examples

Successful Merges

Marcoro14-7B-slerp

Achievement: #1 on Open LLM Leaderboard (February 2024)
Method: SLERP
Source: HuggingFace

# marcoro14-7b-slerp.yml
merge_method: slerp
base_model: AIDC-ai-business/Marcoroni-7B-v3  # slerp expects one of the two models as base_model
slices:
  - sources:
      - model: AIDC-ai-business/Marcoroni-7B-v3
        layer_range: [0, 32]
      - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.1
        layer_range: [0, 32]
parameters:
  t: 0.5  # Equal blend
dtype: bfloat16

Results:

Average: 74.32 on Open LLM Leaderboard
Strong across all tasks
Smooth capability combination

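goliath-120b

Method: Passthrough layer interleaving of two Llama-2 70B fine-tunes, according to its model card (it is not a Mixtral MoE)
Achievement: Top-performing 120B model

# Illustrative layer-wise SLERP config, not the published goliath-120b recipe
# layerwise-slerp.yml
merge_method: slerp
base_model: alpindale/c4ai-command-r-plus-GPTQ  # slerp expects one of the two models as base_model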
slices:
  - sources:
      - model: alpindale/c4ai-command-r-plus-GPTQ
        layer_range: [0, 40]
      - model: CohereForAI/c4ai-command-r-v01
        layer_range: [0, 40]
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]  # Layer-specific blending
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5  # Default
dtype: float16
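
The list values for `t` act as a gradient across the layer stack rather than as per-layer settings. The sketch below shows one way such a list could expand to a value per layer. Piecewise-linear interpolation over evenly spaced anchors is an assumption made for illustration, not a statement of mergekit's internals, and the function name is hypothetical.

```python
def expand_gradient(anchors, num_layers):
    """Piecewise-linear interpolation of anchor values across layer indices."""
    if num_layers == 1:
        return [float(anchors[0])]
    values = []
    for layer in range(num_layers):
        # Position of this layer on the anchor axis, from 0 to len(anchors) - 1
        pos = layer / (num_layers - 1) * (len(anchors) - 1)
        lower = min(int(pos), len(anchors) - 2)
        frac = pos - lower
        values.append(anchors[lower] * (1 - frac) + anchors[lower + 1] * frac)
    return values

# self_attn schedule from the config above, expanded over 40 layers
attn_t = expand_gradient([0, 0.5, 0.3, 0.7, 1], num_layers=40)
print([round(v, 2) for v in attn_t[:5]], "...", [round(v, 2) for v in attn_t[-3:]])
```

Under this reading, the `self_attn` and `mlp` schedules mirror each other: attention layers lean toward the first model early in the stack and toward the second model late, while MLP layers do the opposite.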

Mixtral-based Merges

Math + Code Specialist

Goal: Combine mathematical reasoning with code generation

# math-code-mixtral.yml
merge_method: task_arithmetic
base_model: mistralai/Mixtral-8x7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      weight: 0.6  # Emphasize math
  - model: ajibawa-2023/Code-Mixtral-8x7B
    parameters:
      weight: 0.4  # Add code
dtype: bfloat16

Expected capabilities:

Strong mathematical reasoning
Code generation and understanding
Technical problem-solving

Chat + Roleplay Merge

# chat-roleplay.yml
merge_method: slerp
base_model: teknium/OpenHermes-2.5-Mistral-7B  # slerp expects one of the two models as base_model
slices:
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
      - model: Undi95/MLewd-ReMM-L2-Chat-20B-Part1
        layer_range: [0, 32]
parameters:
  t: 0.5
dtype: bfloat16

Multi-Task TIES Merge

# multi-task-mixtral.yml
merge_method: ties
base_model: mistralai/Mixtral-8x7B-v0.1
models:
  - model: WizardLM/WizardMath-7B-V1.1
    parameters:
      density: 0.5
      weight: 1.0
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      density: 0.5
      weight: 1.0
  - model: ajibawa-2023/Code-Mixtral-8x7B
    parameters:
      density: 0.5
      weight: 1.0
parameters:
  normalize: true
dtype: bfloat16

Llama-based Merges

Platypus-Hermes Merge

Models: garage-bAInd/Platypus2-13B + WizardLM/WizardLM-13B-V1.2 + psmathur/orca_mini_v3_13b

# platypus-hermes-13b.yml
merge_method: linear
models:
  - model: garage-bAInd/Platypus2-13B
    parameters:
      weight: 0.5
  - model: WizardLM/WizardLM-13B-V1.2
    parameters:
      weight: 0.3
  - model: psmathur/orca_mini_v3_13b
    parameters:
      weight: 0.2
dtype: float16

DARE-TIES Llama Merge

Source: DARE paper (arXiv 2311.03099)

# dare-ties-llama.yml
merge_method: dare_ties
base_model: meta-llama/Llama-2-7b-hf
models:
  - model: WizardLM/WizardLM-7B-V1.0
    parameters:
      density: 0.5   # Fraction of delta weights randomly kept (drop rate = 1 - density)
      weight: 0.6
  - model: garage-bAInd/Platypus-7B
    parameters:
      density: 0.5
      weight: 0.4
parameters:
  int8_mask: true
dtype: bfloat16

Task-Specific Merges

Medical Domain

Goal: Create medical specialist model

# medical-specialist.yml
merge_method: task_arithmetic
base_model: mistralai/Mistral-7B-v0.1
models:
  - model: medalpaca/medalpaca-7b
    parameters:
      weight: 0.7  # Strong medical knowledge
  - model: teknium/OpenHermes-2.5-Mistral-7B
    parameters:
      weight: 0.3  # Add general chat ability
dtype: bfloat16

Legal Assistant

# legal-assistant.yml
merge_method: slerp
base_model: teknium/OpenHermes-2.5-Mistral-7B  # t=0 keeps this model, t=1 gives the legal model
slices:
  - sources:
      - model: law-ai/legal-bert-7b
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
parameters:
  t:
    - filter: self_attn
      value: 0.7  # Emphasize legal model in attention
    - filter: mlp
      value: 0.3  # More general chat in MLPs
    - value: 0.5
dtype: bfloat16

Multilingual Merge

# multilingual-merge.yml
merge_method: linear
models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.4  # English
  - model: CohereForAI/aya-23-7B
    parameters:
      weight: 0.3  # Multilingual
  - model: Qwen/Qwen3-7B
    parameters:
      weight: 0.3  # Asian languages
dtype: bfloat16

Production Examples

Gradual Merge (Safer)

Strategy: Merge incrementally, test at each step

# Step 1: Merge two models
# step1.yml
merge_method: slerp
base_model: base_model  # slerp expects one of the two models as base_model
slices:
  - sources:
      - model: base_model
        layer_range: [0, 32]
      - model: specialist_1
        layer_range: [0, 32]
parameters:
  t: 0.3  # Conservative blend
dtype: bfloat16

# Step 2: Add third model to result
# step2.yml
merge_method: slerp
base_model: ./merged_step1  # slerp expects one of the two models as base_model
slices:
  - sources:
      - model: ./merged_step1  # Previous merge
        layer_range: [0, 32]
      - model: specialist_2
        layer_range: [0, 32]
parameters:
  t: 0.3  # Conservative
dtype: bfloat16

Benefits:

Test after each merge
Easier to debug
Can stop if quality degrades
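
Chaining merges changes how much each model contributes to the final result. As a rough guide, treating each SLERP step as a linear blend (an approximation that is only reasonable when the weights are close to parallel), the effective shares after the two steps above can be computed as follows. The helper name is hypothetical.

```python
def chain_weights(steps):
    """Effective per-model weights after chained two-model blends.

    steps: list of (new_model_name, t). Each step keeps (1 - t) of the running
    result and takes t from the new model. Linear approximation of SLERP.
    """
    weights = {"base_model": 1.0}
    for name, t in steps:
        # Shrink everything merged so far, then add the newcomer's share
        weights = {k: v * (1 - t) for k, v in weights.items()}
        weights[name] = weights.get(name, 0.0) + t
    return weights

shares = chain_weights([("specialist_1", 0.3), ("specialist_2", 0.3)])
for name, share in shares.items():
    print(f"{name}: {share:.2f}")
```

Each later step dilutes the earlier specialists, so under this approximation `specialist_1` ends up with a smaller share than `specialist_2` even though both steps use t = 0.3.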

A/B Testing Setup

# variant_a.yml - Conservative
merge_method: slerp
base_model: base_model  # slerp expects one of the two models as base_model
slices:
  - sources:
      - model: base_model
        layer_range: [0, 32]
      - model: specialist
        layer_range: [0, 32]
parameters:
  t: 0.3  # 30% specialist
dtype: bfloat16

# variant_b.yml - Aggressive
merge_method: slerp
base_model: base_model  # slerp expects one of the two models as base_model
slices:
  - sources:
      - model: base_model
        layer_range: [0, 32]
      - model: specialist
        layer_range: [0, 32]
parameters:
  t: 0.7  # 70% specialist
dtype: bfloat16

Test both, choose best performer

Frankenmerge (Experimental)

Warning: Experimental, may not work

# frankenmerge.yml
merge_method: passthrough
slices:
  # First 8 layers from model A
  - sources:
      - model: model_a
        layer_range: [0, 8]

  # Middle 16 layers from model B
  - sources:
      - model: model_b
        layer_range: [8, 24]

  # Last 8 layers from model C
  - sources:
      - model: model_c
        layer_range: [24, 32]
dtype: bfloat16

Use case: Create models with non-standard layer counts

MoE from Merges

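# moe-from-merges.yml (built with the mergekit-moe command, not mergekit-yaml)
base_model: mistralai/Mistral-7B-v0.1
gate_mode: hidden  # Initialize router gates from hidden states of the positive prompts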
experts:
  - source_model: WizardLM/WizardMath-7B-V1.1
    positive_prompts:
      - "math"
      - "calculate"
      - "solve"
      - "equation"

  - source_model: ajibawa-2023/Code-Mistral-7B
    positive_prompts:
      - "code"
      - "python"
      - "function"
      - "programming"

  - source_model: teknium/OpenHermes-2.5-Mistral-7B
    positive_prompts:
      - "chat"
      - "conversation"
      - "help"
      - "question"
dtype: bfloat16

Result: Dynamic expert selection based on prompt

Command-Line Examples

Basic Merge

# Simple two-model SLERP
mergekit-yaml config.yml ./output-model \
  --cuda \
  --lazy-unpickle

Large Model Merge (Low VRAM)

# Merge on CPU (slow, but needs no GPU memory): omit --cuda
mergekit-yaml config.yml ./output-model \
  --lazy-unpickle  # Experimental lazy loader that lowers peak memory use

Merge and Upload

# Merge and push to HuggingFace
mergekit-yaml config.yml ./merged-model --cuda

cd merged-model
python << EOF
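from transformers import AutoModelForCausalLM, AutoTokenizer

# AutoModelForCausalLM keeps the language-modeling head; plain AutoModel would drop it
model = AutoModelForCausalLM.from_pretrained("./")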
tokenizer = AutoTokenizer.from_pretrained("./")

model.push_to_hub("username/my-merged-model")
tokenizer.push_to_hub("username/my-merged-model")
EOF

Batch Merging

# Merge multiple configs
for config in configs/*.yml; do
  output="./output/$(basename "$config" .yml)"
  mergekit-yaml "$config" "$output" --cuda
done

Tips from Successful Merges

1. Start Conservative: Use t=0.3-0.5 for SLERP, test before going higher
2. Match Architectures: Only merge models with the same base architecture
3. Test Extensively: Benchmark on multiple tasks before deploying
4. Layer-Specific Merging: Different t values for attention vs MLP often work better
5. DARE for Many Models: When merging 3+ models, DARE-TIES is often best
6. Gradual Merging: For production, merge incrementally and test

Resources

HuggingFace Models: Browse merged models for inspiration
Open LLM Leaderboard: See top-performing merges
mergekit Examples: https://github.com/arcee-ai/mergekit/tree/main/examples

Model Merging Methods: Deep Dive

Complete technical guide to model merging algorithms based on research papers.

TIES-Merging Algorithm
DARE (Drop And REscale)
Linear Merging
SLERP
Task Arithmetic
Comparison

TIES-Merging: Resolving Interference

Paper: "TIES-Merging: Resolving Interference When Merging Models" (NeurIPS 2023)
Authors: Prateek Yadav et al.
Code: https://github.com/prateeky2806/ties-merging

Algorithm Overview

TIES-Merging addresses two major sources of interference:

1. Redundant parameter values
2. Sign disagreement across models

Three-Step Process: TRIM, ELECT, MERGE

Step 1: TRIM (Reset Small Changes)

Remove parameters that changed minimally during fine-tuning.

def trim(task_vector, density=0.2):
    """Keep top-k% parameters by magnitude, reset rest to 0."""
    # Calculate magnitude
    magnitudes = torch.abs(task_vector)

    # Get threshold for top-k%
    k = int(density * task_vector.numel())
    threshold = torch.topk(magnitudes.flatten(), k).values.min()

    # Create mask: keep parameters above threshold
    mask = magnitudes >= threshold

    # Apply mask
    trimmed_vector = task_vector * mask

    return trimmed_vector

# Example
task_vector_1 = finetuned_model_1 - base_model
task_vector_2 = finetuned_model_2 - base_model

trimmed_1 = trim(task_vector_1, density=0.2)  # Keep top 20%
trimmed_2 = trim(task_vector_2, density=0.2)

Step 2: ELECT SIGN (Resolve Conflicts)

When parameters have conflicting signs, elect the dominant sign.

def elect_sign(task_vectors):
    """Resolve sign conflicts across multiple task vectors."""
    # Stack all task vectors
    stacked = torch.stack(task_vectors)  # (num_models, num_params)

    # Elect the sign with the larger total magnitude for each parameter
    # (the sign of the summed task vectors, as in the TIES paper)
    final_sign = torch.sign(stacked.sum(dim=0))

    # Where positive and negative mass cancel exactly, keep sign from first model
    tie_mask = (final_sign == 0)
    final_sign[tie_mask] = torch.sign(stacked[0][tie_mask])

    return final_sign

# Example
task_vectors = [trimmed_1, trimmed_2]
elected_sign = elect_sign(task_vectors)

Step 3: MERGE (Disjoint Merging)

Merge only parameters that agree with elected sign.

def ties_merge(base_model, task_vectors, density=0.2, lambda_param=1.0):
    """Complete TIES-Merging algorithm."""
    # Step 1: Trim each task vector
    trimmed_vectors = [trim(tv, density) for tv in task_vectors]

    # Step 2: Elect sign
    elected_sign = elect_sign(trimmed_vectors)

    # Step 3: Merge aligned parameters (disjoint mean)
    merged_task_vector = torch.zeros_like(task_vectors[0])
    agree_count = torch.zeros_like(task_vectors[0])

    for tv in trimmed_vectors:
        # Keep only nonzero parameters aligned with elected sign
        aligned_mask = (torch.sign(tv) == elected_sign) & (tv != 0)

        # Accumulate values and count contributing models per parameter
        merged_task_vector += tv * aligned_mask
        agree_count += aligned_mask

    # Average over the models that agree, not over all models
    merged_task_vector /= agree_count.clamp(min=1)

    # Add back to base model
    final_model = base_model + lambda_param * merged_task_vector

    return final_model

# Usage
base = load_model("mistralai/Mistral-7B-v0.1")
model_1 = load_model("WizardLM/WizardMath-7B-V1.1")
model_2 = load_model("teknium/OpenHermes-2.5-Mistral-7B")
model_3 = load_model("NousResearch/Nous-Hermes-2-Mistral-7B-DPO")

task_vectors = [
    model_1 - base,
    model_2 - base,
    model_3 - base
]

merged = ties_merge(base, task_vectors, density=0.5, lambda_param=1.0)
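
To make the three steps concrete, here is a toy walk-through on plain Python lists, one value per parameter, with no model loading. It elects signs by summed magnitude and averages only the agreeing values, following the TIES paper. The function name and the numbers are made up for illustration.

```python
def ties_toy(task_vectors, density=0.5):
    """TRIM, ELECT, MERGE on plain Python lists (one value per parameter)."""
    # TRIM: keep the top-k values by magnitude in each task vector
    trimmed = []
    for tv in task_vectors:
        k = max(1, int(density * len(tv)))
        threshold = sorted((abs(x) for x in tv), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= threshold else 0.0 for x in tv])

    merged = []
    for values in zip(*trimmed):
        # ELECT: sign of the summed values (larger total magnitude wins)
        total = sum(values)
        sign = (total > 0) - (total < 0)
        # MERGE: mean of the nonzero values that share the elected sign
        agreeing = [v for v in values if v != 0 and (v > 0) == (sign > 0)]
        merged.append(sum(agreeing) / len(agreeing) if agreeing and sign != 0 else 0.0)
    return merged

tv_a = [0.9, -0.1, 0.4, 0.0]
tv_b = [-0.3, 0.2, 0.6, 0.05]
print(ties_toy([tv_a, tv_b], density=0.5))
```

For the first parameter the two models disagree in sign. A plain average would give 0.3, while the disjoint mean keeps the dominant value 0.9 and discards the conflicting -0.3.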

Hyperparameters

density (ρ): Fraction of parameters to keep (default: 0.2)

Lower (0.1-0.3): More aggressive pruning, higher sparsity
Higher (0.5-0.8): Conservative pruning, denser result

lambda (λ): Scaling factor for merged task vector (default: 1.0)

Lower (<1.0): Less influence from fine-tuned models
Higher (>1.0): More influence from fine-tuned models

DARE: Drop And REscale

Paper: "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (arXiv 2311.03099, 2023) Authors: Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li

Algorithm

DARE randomly drops delta parameters and rescales remaining ones.

Mathematical Formulation

Given:

Base model parameters: θ₀
Fine-tuned model parameters: θₜ
Delta parameters: δₜ = θₜ - θ₀

Step 1: Random Drop

m_t ~ Bernoulli(p)  # Drop mask
δ̃_t = (1 - m_t) ⊙ δ_t  # Element-wise product

Step 2: Rescale

δ̂_t = δ̃_t / (1 - p)  # Rescale to preserve expectation

Final Model

θ̂_t = θ₀ + δ̂_t
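
The rescaling factor follows from a one-line expectation argument. Because each mask entry is drawn independently with drop probability p, the expected kept fraction is 1 - p, and dividing by it recovers the original delta on average:

```latex
\mathbb{E}\big[\hat{\delta}_t\big]
  = \frac{\mathbb{E}[\,1 - m_t\,] \odot \delta_t}{1 - p}
  = \frac{(1 - p)\,\delta_t}{1 - p}
  = \delta_t
```

This holds in expectation only. Any single draw of the mask adds variance, and that variance grows as p approaches 1.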

Implementation

def dare(base_model, finetuned_model, drop_rate=0.9):
    """DARE: Drop And REscale delta parameters."""
    # Compute delta
    delta = finetuned_model - base_model

    # Random drop mask (Bernoulli)
    drop_mask = torch.bernoulli(torch.full_like(delta, drop_rate))

    # Apply mask (keep 1-p, drop p)
    dropped_delta = delta * (1 - drop_mask)

    # Rescale to preserve expectation
    rescaled_delta = dropped_delta / (1 - drop_rate)

    # Reconstruct model
    result = base_model + rescaled_delta

    return result

# Example
base = load_model("mistralai/Mistral-7B-v0.1")
finetuned = load_model("WizardLM/WizardMath-7B-V1.1")

# Drop 90% of delta parameters
result = dare(base, finetuned, drop_rate=0.9)

DARE + TIES (DARE-TIES)

Combine both methods for best results.

def dare_ties(base_model, finetuned_models, drop_rate=0.9, density=0.5):
    """DARE + TIES-Merging."""
    # Step 1: Apply DARE to each model
    dare_deltas = []
    for model in finetuned_models:
        delta = model - base_model

        # DARE drop
        drop_mask = torch.bernoulli(torch.full_like(delta, drop_rate))
        dropped = delta * (1 - drop_mask)
        rescaled = dropped / (1 - drop_rate)

        dare_deltas.append(rescaled)

    # Step 2: Apply TIES to DARE-processed deltas
    merged = ties_merge(base_model, dare_deltas, density=density)

    return merged

Hyperparameters

drop_rate (p): Probability of dropping each parameter (default: 0.9)

Lower (0.5-0.7): Conservative, keeps more parameters
Higher (0.9-0.99): Aggressive, maximum sparsity
Works well even at 0.99 for large models

Observations:

Larger models tolerate higher drop rates
DARE works best when delta parameters have small absolute values (the paper reports ranges typically within 0.002 for fine-tuned models)
Performance improves with model size
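
The effect of dropping and rescaling can be checked numerically without loading a model. This is a small simulation on a plain list of made-up delta values, using a seeded standard-library RNG; the function name is hypothetical.

```python
import random

def dare_list(delta, drop_rate, seed=0):
    """Apply DARE to a plain list of delta values."""
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - drop_rate)
    # Drop each value with probability drop_rate, rescale the survivors
    return [0.0 if rng.random() < drop_rate else d * keep_scale for d in delta]

delta = [0.001] * 100_000
sparse = dare_list(delta, drop_rate=0.9)

kept_fraction = sum(1 for d in sparse if d != 0.0) / len(sparse)
mean_before = sum(delta) / len(delta)
mean_after = sum(sparse) / len(sparse)
print(f"kept fraction: {kept_fraction:.3f}")
print(f"mean before: {mean_before:.6f}, mean after: {mean_after:.6f}")
```

About one value in ten survives, each scaled up tenfold, so the mean stays close to the original even though 90% of the entries are zero.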

Linear Merging (Model Soup)

Simple weighted average.

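import math

def linear_merge(models, weights):
    """Weighted average of model parameters."""
    assert len(models) == len(weights)
    # Compare with a tolerance: exact float equality can fail for valid weights
    assert math.isclose(sum(weights), 1.0), "Weights should sum to 1"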

    merged = sum(w * model for w, model in zip(weights, models))

    return merged

# Example
models = [model_1, model_2, model_3]
weights = [0.4, 0.3, 0.3]
merged = linear_merge(models, weights)

SLERP: Spherical Linear Interpolation

Interpolate along sphere in weight space.

def slerp(model_1, model_2, t=0.5):
    """SLERP between two models."""
    # Flatten parameters
    p1 = torch.cat([p.flatten() for p in model_1.parameters()])
    p2 = torch.cat([p.flatten() for p in model_2.parameters()])

    # Normalize
    p1_norm = p1 / p1.norm()
    p2_norm = p2 / p2.norm()

    # Compute angle
    dot = (p1_norm * p2_norm).sum()
    theta = torch.acos(torch.clamp(dot, -1.0, 1.0))

    # SLERP formula
    if theta < 1e-6:
        # Vectors nearly parallel, use linear interpolation
        result = (1 - t) * p1 + t * p2
    else:
        # Spherical interpolation
        sin_theta = torch.sin(theta)
        result = (torch.sin((1 - t) * theta) / sin_theta) * p1 + \
                 (torch.sin(t * theta) / sin_theta) * p2

    # Reshape back to model
    merged_model = reshape_to_model(result, model_1)

    return merged_model

# Example
merged = slerp(model_1, model_2, t=0.5)  # 50-50 blend
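
The difference from linear interpolation is easiest to see on small vectors. The sketch below applies the same formula to Python lists and compares the norm of the midpoint under each scheme; the helper names are chosen here for illustration.

```python
import math

def slerp_vec(v1, v2, t):
    """SLERP between two equal-length vectors given as Python lists."""
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    dot = sum(a * b for a, b in zip(v1, v2)) / (norm1 * norm2)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v1, v2)]
    w1 = math.sin((1 - t) * theta) / math.sin(theta)
    w2 = math.sin(t * theta) / math.sin(theta)
    return [w1 * a + w2 * b for a, b in zip(v1, v2)]

def vec_norm(v):
    return math.sqrt(sum(x * x for x in v))

a, b = [1.0, 0.0], [0.0, 1.0]
spherical_mid = slerp_vec(a, b, 0.5)
linear_mid = [0.5 * x + 0.5 * y for x, y in zip(a, b)]
print(f"SLERP midpoint norm:  {vec_norm(spherical_mid):.4f}")
print(f"Linear midpoint norm: {vec_norm(linear_mid):.4f}")
```

For two orthogonal unit vectors the spherical midpoint keeps unit length, while the linear midpoint shrinks. That shrinkage is the magnitude loss SLERP is meant to avoid.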

Task Arithmetic

Add task vectors to base model.

def task_arithmetic(base_model, finetuned_models, lambdas):
    """Task arithmetic merging."""
    # Extract task vectors
    task_vectors = [model - base_model for model in finetuned_models]

    # Weighted sum
    combined_vector = sum(λ * tv for λ, tv in zip(lambdas, task_vectors))

    # Add to base
    merged = base_model + combined_vector

    return merged

# Example
base = load_model("mistralai/Mistral-7B-v0.1")
math_model = load_model("WizardLM/WizardMath-7B-V1.1")
code_model = load_model("ajibawa-2023/Code-Mistral-7B")

merged = task_arithmetic(
    base,
    [math_model, code_model],
    lambdas=[0.6, 0.4]
)

Method Comparison

Method	Pros	Cons	Best For
Linear	Simple, fast	Basic averaging	2-3 similar models
SLERP	Preserves magnitude	Only 2 models	Smooth blending
Task Arithmetic	Intuitive, flexible	Sign conflicts	Multiple specialized models
TIES	Resolves conflicts	More complex	Many task-specific models
DARE	High sparsity	Random variance	Reducing redundancy
DARE-TIES	Best performance	Most complex	Production (state-of-art)

Resources

TIES Paper: https://arxiv.org/abs/2306.01708
DARE Paper: https://arxiv.org/abs/2311.03099
mergekit: https://github.com/arcee-ai/mergekit

How it compares

Use model-merging for post-merge leaderboard evaluation; use fine-tuning or distributed-training skills when the task is producing the merged checkpoint itself.

FAQ

Which benchmarks does model-merging use?

model-merging centers on six Open LLM Leaderboard tasks—ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K—plus lm_eval runs and MT-Bench-style conversation tests.

When should model-merging run in the LLM workflow?

model-merging fits immediately after producing merged Hugging Face checkpoints, before serving, when leaderboard scores and per-task regressions must be compared to base models.

Does model-merging train new merge recipes?

model-merging focuses on evaluation and comparison methodology; training merges and dataset preparation require separate fine-tuning or post-training skills in the library.

Is Model Merging safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
