
Model Merging
Benchmark and compare merged Hugging Face models with Open LLM Leaderboard tasks, lm_eval, and MT-Bench-style conversation tests.
Overview
model-merging is an agent skill most often used in Ship (also Build, Operate) that benchmarks merged language models with standard leaderboards and conversation tests.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill model-merging
What is this skill?
- Documents 6-task Open LLM Leaderboard suite: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
- Provides lm_eval simple_evaluate Python example with few-shot and batch settings
- Covers MT-Bench multi-turn conversation evaluation via FastChat tooling
- Includes comparison framework and QA-oriented testing methodology sections
- 6 benchmarks in the Open LLM Leaderboard task list
- MT-Bench focuses on multi-turn conversation quality
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You merged two models but lack a reproducible benchmark plan to know if quality improved or regressed across reasoning, knowledge, and dialogue.
Who is it for?
Solo ML tinkerers with a merged HF model path who want copy-paste lm_eval and FastChat evaluation flows.
Skip if: you only need API prompting without local weights, or you do not yet have a Python/GPU setup for Hugging Face model evaluation.
When should I use this skill?
You have a merged model artifact and need standard benchmarks and comparison methodology before shipping or choosing a default weights path.
What do I get? / Deliverables
You run documented leaderboard and MT-Bench-style evaluations and compare scores against baselines before promoting a merged checkpoint.
- Leaderboard task scores and averaged summary
- MT-Bench-oriented evaluation plan
- Documented comparison against baseline models
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Merged models should be validated on standard suites before you ship weights or serve them, so the canonical placement is ship/testing, as a quality gate. The skill centers on evaluation metrics, benchmark suites, and comparison methodology, which is testing rather than training or merging code itself.
Where it fits
Run lm_eval on a candidate merged path before wiring it into your agent backend.
Execute the six leaderboard tasks and average scores against the parent models.
Tune batch_size and dtype in evaluation to match inference constraints.
Re-run TruthfulQA and GSM8K after a small merge tweak to catch regressions.
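The comparison in the steps above (average the scores, then check for regressions against the parent models) can be sketched in plain Python. This is a minimal sketch, not part of the skill itself: the function name `compare_to_parents`, the `tolerance` parameter, and every score below are hypothetical placeholders for illustration, not measured results.

```python
from statistics import mean

def compare_to_parents(merged, parents, tolerance=0.01):
    """Compare a merged model's per-task scores with its parents.

    merged:  {task: score}
    parents: {parent_name: {task: score}}
    Returns the averages plus every task where the merged model trails
    the best parent by more than `tolerance`.
    """
    regressions = {}
    for task, score in merged.items():
        best_parent = max(p[task] for p in parents.values())
        if best_parent - score > tolerance:
            regressions[task] = round(score - best_parent, 4)
    return {
        "merged_avg": mean(merged.values()),
        "parent_avgs": {name: mean(s.values()) for name, s in parents.items()},
        "regressions": regressions,
    }

# Placeholder scores for illustration only (not real benchmark results).
merged_scores = {"truthfulqa": 0.50, "gsm8k": 0.40}
parent_scores = {
    "parent_a": {"truthfulqa": 0.52, "gsm8k": 0.30},
    "parent_b": {"truthfulqa": 0.45, "gsm8k": 0.45},
}
report = compare_to_parents(merged_scores, parent_scores)
print(report["regressions"])
```

Comparing against the best parent per task, rather than against the parents' average, is a deliberately strict choice here; a looser gate could compare against the mean instead.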
How it compares
An evaluation playbook for merged checkpoints, not a merge recipe skill and not a hosted leaderboard MCP.
Common Questions / FAQ
Who is model-merging for?
It is for developers and researchers who produce merged LLMs locally or on Hugging Face and need standard benchmarks before deployment or release.
When should I use model-merging?
Use it in ship/testing before release; in build/integrations when picking a merge candidate; and in operate/iterate when re-benchmarking after a new merge.
Is model-merging safe to install?
The guide references cloning repositories, running pip installs, and downloading models; review the Security Audits panel on this Prism page and vet third-party eval code before running it with secrets.
SKILL.md
# Model Merging Evaluation

Complete guide to benchmarking and testing merged models based on research best practices.

## Table of Contents

- Benchmark Suites
- Evaluation Metrics
- Testing Methodology
- Comparison Framework
- Quality Assurance

## Benchmark Suites

### Open LLM Leaderboard

**URL**: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

**Tasks** (6 benchmarks):

1. **ARC** (AI2 Reasoning Challenge): 25-shot, science questions
2. **HellaSwag**: 10-shot, commonsense reasoning
3. **MMLU** (Massive Multitask Language Understanding): 5-shot, 57 subjects
4. **TruthfulQA**: 0-shot, factual accuracy
5. **Winogrande**: 5-shot, commonsense reasoning
6. **GSM8K**: 5-shot, grade-school math

**Running Evaluation**:

```python
from lm_eval import evaluator

model = "path/to/merged/model"

# Note: num_fewshot applies one shot count to every task. To match the
# per-task settings listed above, run each task separately with its own value.
# Task names depend on the lm_eval version; recent releases use
# "mmlu" and "truthfulqa_mc2" in place of the older names below.
results = evaluator.simple_evaluate(
    model="hf",
    model_args=f"pretrained={model},dtype=float16",
    tasks=[
        "arc_challenge",
        "hellaswag",
        "hendrycksTest-*",  # MMLU
        "truthfulqa_mc",
        "winogrande",
        "gsm8k"
    ],
    num_fewshot=5,
    batch_size=8
)

# Each entry in results['results'] is a dict of metrics, so pick one
# accuracy-like value per task before averaging. Metric key names vary
# by task and lm_eval version; adjust the prefixes if needed.
def primary_metric(metrics):
    for prefix in ("acc_norm", "acc", "mc2", "exact_match"):
        for name, value in metrics.items():
            if name.startswith(prefix) and "stderr" not in name:
                return value
    return None

# Average score
scores = [primary_metric(m) for m in results['results'].values()]
scores = [s for s in scores if s is not None]
avg_score = sum(scores) / len(scores)
print(f"Average: {avg_score:.2f}")
```

### MT-Bench

**Focus**: Multi-turn conversation quality

**Installation**:

```bash
git clone https://github.com/lm-sys/FastChat
cd FastChat
pip install -e ".[model_worker,llm_judge]"
```

**Running**:

```bash
# The MT-Bench scripts live in fastchat/llm_judge
cd fastchat/llm_judge

# Generate responses
python gen_model_answer.py \
  --model-path path/to/merged/model \
  --model-id merged_model

# Judge with GPT-4 (requires OPENAI_API_KEY to be set)
python gen_judgment.py \
  --model-list merged_model \
  --judge-model gpt-4

# View scores
python show_result.py
```

**Metrics**:

- Turn 1 score (1-10)
- Turn 2 score (1-10)
- Average score

### MMLU (Detailed)

**Subjects** (57 total):

- STEM: Math, Physics, Chemistry, Biology, Computer Science
- Humanities: History, Philosophy, Law
- Social Sciences: Economics, Psychology, Sociology
- Other: Professional subjects (Medicine, Accounting, etc.)
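Per-subject MMLU accuracies are easier to compare across merges once they are rolled up into the four categories listed above. The following is a minimal sketch under stated assumptions: `SUBJECT_CATEGORY` is a hypothetical, deliberately partial mapping (a full one would cover all 57 subjects), `category_averages` is an illustrative helper rather than an lm_eval function, and the accuracies are placeholders, not real results.

```python
from collections import defaultdict

# Illustrative, partial mapping from MMLU subject names to categories.
SUBJECT_CATEGORY = {
    "college_physics": "STEM",
    "high_school_biology": "STEM",
    "philosophy": "Humanities",
    "econometrics": "Social Sciences",
    "professional_medicine": "Other",
}

def category_averages(subject_acc, mapping=SUBJECT_CATEGORY):
    """Macro-average per-subject accuracy within each category."""
    buckets = defaultdict(list)
    for subject, acc in subject_acc.items():
        # Subjects missing from the mapping are grouped under "Unmapped".
        buckets[mapping.get(subject, "Unmapped")].append(acc)
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}

# Placeholder accuracies for illustration only (not real results).
example = {
    "college_physics": 0.40,
    "high_school_biology": 0.60,
    "philosophy": 0.70,
    "econometrics": 0.30,
    "professional_medicine": 0.55,
}
for category, acc in category_averages(example).items():
    print(f"{category}: {acc:.2%}")
```

A macro-average weights every subject equally regardless of how many questions it contains; weighting by question count would be a reasonable alternative.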
```python
from lm_eval import evaluator

model = "path/to/merged/model"

# Run all MMLU subjects
# (task names depend on the lm_eval version; recent releases use "mmlu")
results = evaluator.simple_evaluate(
    model="hf",
    model_args=f"pretrained={model}",
    tasks=["hendrycksTest-*"],  # All MMLU tasks
    num_fewshot=5
)

# Subject breakdown
for task, score in results['results'].items():
    subject = task.replace('hendrycksTest-', '')
    print(f"{subject}: {score['acc']:.2%}")
```

### HumanEval (Code)

**Focus**: Python code generation

```python
from human_eval.data import write_jsonl, read_problems
from human_eval.evaluation import evaluate_functional_correctness

# Generate completions
# `model` is a placeholder for any wrapper whose generate() takes a prompt
# string and returns the completion as a string.
problems = read_problems()
samples = []

for task_id, problem in problems.items():
    prompt = problem['prompt']
    completion = model.generate(prompt)
    samples.append({
        'task_id': task_id,
        'completion': completion
    })

write_jsonl("samples.jsonl", samples)

# Evaluate
results = evaluate_functional_correctness("samples.jsonl")
print(f"Pass@1: {results['pass@1']:.2%}")
```

## Evaluation Metrics

### Performance Metrics

**Accuracy**: Correct predictions / total predictions

```python
def accuracy(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(predictions)
```

**Perplexity**: Language modeling quality (lower is better)

```python
import torch

def perplexity(model, tokenizer, text):
    tokens = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        # Passing labels makes the model return the language-modeling loss
        loss = model(**tokens, labels=tokens['input_ids']).loss
    return torch.exp(loss).item()
```

**BLEU Score**: Translation/generation quality

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

score = sentence_bleu(reference, candidate)
```

### Capability Retention

**Test**: Does the merged model retain parent capabilities?

```python
def test_capability_retention(merged_model, parent_models, test_suite):
    """Check if merged mod