Learning Task-Agnostic Representations through Multi-Teacher Distillation
By: Philippe Formont, Maxime Darrin, Banafsheh Karimian, and more
Potential Business Impact:
Makes computer models learn better from many teachers.
Casting complex inputs into tractable representations is a critical step across many fields. Diverse embedding models emerge from differences in architectures, loss functions, input modalities, and datasets, each capturing unique aspects of the input. Multi-teacher distillation leverages this diversity to enrich representations but often remains tailored to specific tasks. In this paper, we introduce a task-agnostic framework based on a "majority vote" objective function. We show that this objective is bounded by the mutual information between the student's and the teachers' embeddings, leading to a task-agnostic distillation loss that eliminates dependence on task-specific labels or prior knowledge. Our evaluations across text, vision, and molecular modeling show that our method effectively leverages teacher diversity, yielding representations that improve performance on a wide range of downstream tasks such as classification, clustering, and regression. Additionally, we train and release state-of-the-art embedding models, enhancing downstream performance across modalities.
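The abstract does not spell out the exact loss, but the core idea, maximizing mutual information between the student's embedding and each teacher's embedding, can be sketched with a standard InfoNCE-style lower bound. The snippet below is an illustrative approximation, not the paper's implementation; the function names, the per-teacher projection heads, and the temperature value are all assumptions for the sketch.

```python
# Hypothetical sketch of a task-agnostic multi-teacher distillation loss.
# Each teacher contributes an InfoNCE-style lower bound on the mutual
# information between the student's embedding and that teacher's embedding;
# minimizing the negative sum encourages the student to retain information
# shared with every teacher, without task labels.
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(student_emb, teacher_emb, proj, temperature=0.1):
    """InfoNCE estimate of MI between student and one teacher over a batch."""
    s = F.normalize(proj(student_emb), dim=-1)   # project student into this teacher's space
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(s.size(0), device=s.device)
    # Matching pairs (i, i) are positives; all other pairs act as negatives.
    return -F.cross_entropy(logits, labels)      # higher = more shared information

def multi_teacher_distillation_loss(student_emb, teacher_embs, projs, temperature=0.1):
    """Negative sum of per-teacher MI lower bounds (one projection head per teacher)."""
    mi_total = sum(
        infonce_mi_lower_bound(student_emb, t_emb, proj, temperature)
        for t_emb, proj in zip(teacher_embs, projs)
    )
    return -mi_total
```

In practice, each `proj` could be a small `torch.nn.Linear` head mapping the student dimension to the corresponding teacher's embedding dimension, trained jointly with the student; the teachers themselves stay frozen.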
Similar Papers
Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation
Machine Learning (CS)
Makes one computer model smarter than many.
In Good GRACEs: Principled Teacher Selection for Knowledge Distillation
Machine Learning (CS)
Finds the best AI teacher for smaller AI.
Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Machine Learning (CS)
Makes AI less biased by teaching it better.