Revisiting Knowledge Distillation: The Hidden Role of Dataset Size
By: Giulia Lanzillotta, Felix Sarnthein, Gil Kur and more
Potential Business Impact:
Makes AI learn better with less data.
The concept of knowledge distillation (KD) describes the training of a student model from a teacher model and is a widely adopted technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size and generalisation. In this work we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Finally, we analyse the impact of modelling factors such as the objective, scale and relative number of samples on the observed phenomenon. Ultimately, this work reveals that the dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
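For readers unfamiliar with the distillation objective the abstract refers to, the sketch below shows the standard Hinton-style KD loss: a cross-entropy term on the hard labels combined with a temperature-scaled KL term that pulls the student's softened output distribution toward the teacher's. This is a common formulation, not necessarily the exact objective used in the paper; the function name `distillation_loss` and the `temperature` and `alpha` defaults are illustrative choices.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Hinton-style knowledge distillation loss (illustrative sketch).

    alpha weights the soft (teacher) term against the hard-label term;
    temperature controls how much 'dark knowledge' in the teacher's
    non-target logits is exposed to the student.
    """
    # Softened teacher probabilities and student log-probabilities.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between softened distributions, scaled by T^2 so the
    # gradient magnitude stays comparable across temperatures.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * (temperature ** 2)

    # Standard supervised cross-entropy on the hard labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term


if __name__ == "__main__":
    # Toy usage on random logits for a 10-class problem.
    student_logits = torch.randn(8, 10)
    teacher_logits = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    print(distillation_loss(student_logits, teacher_logits, labels))
```

In a low-data experiment of the kind the paper describes, the same loss would simply be trained on a subsampled fraction of the original dataset, with the teacher's soft targets providing the extra signal that the abstract credits for the data efficiency of distillation.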
Similar Papers
Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions
Computation and Language
Makes big AI models smaller and faster.
Do Students Debias Like Teachers? On the Distillability of Bias Mitigation Methods
Machine Learning (CS)
Makes AI less biased by teaching it better.
Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models
Computation and Language
Makes smart computer programs smaller and faster.