Model Merging via Multi-Teacher Knowledge Distillation
By: Seyed Arshan Dalili, Mehrdad Mahdavi
Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically precludes access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameters. Lacking a principled objective to guide their selection, these methods exhibit brittle performance and are highly sensitive to the scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk. Guided by the derived flatness-aware bound, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks. The code is available at https://github.com/arshandalili/SAMerging.
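To make the objective in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch (not the released SAMerging code) of multi-teacher distillation over the merging coefficients with a SAM update. It assumes a task-arithmetic parameterization (base weights plus scaled task vectors), a shared output space across teachers, and illustrative names (`merged_state`, `distill_loss`, `sam_step`, `rho`) that are not taken from the repository.

```python
# Sketch, assuming: `base` is the pretrained state dict, `task_vectors` are per-task
# weight deltas, `coeffs` is a learnable tensor of scaling coefficients, and
# `teachers` are the fine-tuned models evaluated on unlabeled batches.
import torch
import torch.nn.functional as F


def merged_state(base, task_vectors, coeffs):
    """Task-arithmetic merge: theta = theta_0 + sum_t lambda_t * tau_t."""
    return {k: base[k] + sum(coeffs[i] * tv[k] for i, tv in enumerate(task_vectors))
            for k in base}


def distill_loss(model, base, task_vectors, coeffs, teachers, batch, T=2.0):
    """Average student-teacher KL divergence on an unlabeled batch."""
    state = merged_state(base, task_vectors, coeffs)
    student_logits = torch.func.functional_call(model, state, (batch,))
    loss = 0.0
    for teacher in teachers:
        with torch.no_grad():
            teacher_logits = teacher(batch)
        loss = loss + F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                               F.softmax(teacher_logits / T, dim=-1),
                               reduction="batchmean") * (T * T)
    return loss / len(teachers)


def sam_step(model, base, task_vectors, coeffs, teachers, batch, optimizer, rho=0.05):
    """One Sharpness-Aware Minimization update of the merging coefficients."""
    # Ascent step: perturb the coefficients toward the worst case in an L2 ball of radius rho.
    loss = distill_loss(model, base, task_vectors, coeffs, teachers, batch)
    grad = torch.autograd.grad(loss, coeffs)[0]
    eps = rho * grad / (grad.norm() + 1e-12)
    # Descent step: compute the gradient at the perturbed coefficients, apply it to the originals.
    optimizer.zero_grad()
    distill_loss(model, base, task_vectors, coeffs + eps, teachers, batch).backward()
    optimizer.step()
```

A caller would typically initialize the coefficients uniformly (e.g. `coeffs = torch.full((num_tasks,), 0.3, requires_grad=True)`), pass an optimizer such as `torch.optim.Adam([coeffs])`, and iterate `sam_step` over the unlabeled batches.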
Similar Papers
StatsMerging: Statistics-Guided Model Merging via Task-Specific Teacher Distillation
Machine Learning (CS)
Merges task-specific models into one model, guided by task statistics and teacher distillation.
DivMerge: A divergence-based model merging method for multi-tasking
Machine Learning (CS)
Merges models for multi-task learning using a divergence-based objective.
MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation
Computer Vision and Pattern Recognition
Performs cross-modal knowledge distillation with a mixture of specialized teacher models.