Delta Knowledge Distillation for Large Language Models
By: Yihan Cao, Yanbin Kang, Zhengming Xing, and more
Potential Business Impact:
Makes small AI learn better from big AI.
Knowledge distillation (KD) is a widely adopted approach for compressing large neural networks by transferring knowledge from a large teacher model to a smaller student model. In the context of large language models, token-level KD, which typically minimizes the KL divergence between the student's and the teacher's output distributions, has shown strong empirical performance. However, prior work assumes that the student's and teacher's output distributions share the same optimal representation space, a premise that may not hold in many cases. To address this, we propose Delta Knowledge Distillation (Delta-KD), a novel extension of token-level KD that encourages the student to approximate an optimal representation space by explicitly preserving the distributional shift Delta introduced during the teacher's supervised fine-tuning (SFT). Empirical results on ROUGE metrics demonstrate that Delta-KD substantially improves student performance while preserving more of the teacher's knowledge.
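The abstract does not spell out the loss, so the following is only a minimal PyTorch sketch of the general idea: standard token-level KD next to a hypothetical Delta-KD variant in which the SFT-induced shift of the teacher (teacher_sft_logits minus teacher_base_logits) is transplanted onto the student's pre-SFT distribution to form the distillation target. All function names, arguments, and the exact way the delta is combined are assumptions made for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F


def token_level_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard token-level KD: KL(teacher || student) over the vocabulary,
    averaged across all tokens in the batch."""
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2


def delta_kd_loss(student_logits, student_base_logits,
                  teacher_sft_logits, teacher_base_logits, temperature=2.0):
    """Hypothetical Delta-KD sketch (assumption, not the paper's exact loss):
    take the shift that SFT induced in the teacher and apply it to the
    student's own pre-SFT distribution, then match the student to that
    shifted target instead of the raw teacher distribution."""
    delta = teacher_sft_logits - teacher_base_logits        # SFT-induced shift
    target_logits = student_base_logits.detach() + delta    # shifted target (assumed combination)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(target_logits / temperature, dim=-1)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2


if __name__ == "__main__":
    # Toy shapes: (batch * sequence_length, vocab_size)
    shape = (2 * 8, 100)
    student = torch.randn(shape, requires_grad=True)
    student_base = torch.randn(shape)
    teacher_sft = torch.randn(shape)
    teacher_base = torch.randn(shape)
    loss = delta_kd_loss(student, student_base, teacher_sft, teacher_base)
    loss.backward()
    print(f"sketched Delta-KD loss: {loss.item():.4f}")
```

The sketch keeps the usual temperature-scaled KL objective of token-level KD; the only change is the construction of the target distribution, which is where the paper's "delta" idea would plug in.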
Similar Papers
Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models
Computation and Language
Makes smart computer programs smaller and faster.
A Dual-Space Framework for General Knowledge Distillation of Large Language Models
Computation and Language
Makes big AI models work in smaller ones.
LLM-Oriented Token-Adaptive Knowledge Distillation
Computation and Language
Makes AI learn better by focusing on hard parts.