KL-based self-distillation for large language models
By: Max Rehman Linder
Potential Business Impact:
Teaches computers new words without forgetting old ones.
Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the teacher and the vocabulary-extended student use different tokenizations. This allows the student to inherit distributional knowledge from the teacher despite the mismatched vocabularies. We compare our KL-based distillation approach with conventional cross-entropy training, evaluating both methods across multiple strategies for initializing the new token embeddings. After embedding initialization, the models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2,000 code-generation tasks, where our approach consistently achieves the best performance. Finally, through mechanistic interpretability, we analyze how the models learn representations for the new tokens, offering an explanation for the observed gains and insight into the structure of the embedding space during vocabulary expansion.
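To make the distillation objective concrete, the sketch below shows one way a forward-KL loss between a frozen teacher and a vocabulary-extended student could be written in PyTorch. The `teacher_to_student_ids` mapping, and the assumption that each teacher token can simply be re-indexed into the extended student vocabulary, are illustrative simplifications and not the paper's alignment method for differing tokenizations.

```python
# Illustrative sketch only: forward-KL distillation from a frozen teacher to a
# vocabulary-extended student. The cross-tokenization alignment is reduced to a
# hypothetical `teacher_to_student_ids` lookup; the paper's alignment across
# differing tokenizations is more general than this.
import torch

def kl_distillation_loss(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor,
                         teacher_to_student_ids: torch.Tensor,
                         temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student), averaged over batch and sequence positions.

    student_logits: (batch, seq, V_student) from the extended-vocabulary model.
    teacher_logits: (batch, seq, V_teacher) from the frozen original model.
    teacher_to_student_ids: (V_teacher,) LongTensor mapping teacher token ids
        to their ids in the extended student vocabulary (assumed helper).
    """
    # Teacher distribution over its own, smaller vocabulary.
    teacher_probs = torch.softmax(teacher_logits / temperature, dim=-1)

    # Scatter teacher mass into the student's extended vocabulary; tokens that
    # exist only in the student receive zero target probability.
    target = torch.zeros_like(student_logits)
    target.index_add_(-1, teacher_to_student_ids, teacher_probs)

    student_log_probs = torch.log_softmax(student_logits / temperature, dim=-1)

    # KL(target || student) per position, with 0 * log 0 treated as 0.
    per_token_kl = (target * (target.clamp_min(1e-12).log()
                              - student_log_probs)).sum(-1)
    # Standard temperature-squared scaling from Hinton-style distillation.
    return per_token_kl.mean() * temperature ** 2
```

In this simplified view, new tokens that exist only in the student receive zero target mass from the teacher, which is why the subsequent fine-tuning stage described in the abstract is needed to integrate the new vocabulary.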
Similar Papers
Delta Knowledge Distillation for Large Language Models
Computation and Language
Makes small AI learn better from big AI.
A Dual-Space Framework for General Knowledge Distillation of Large Language Models
Computation and Language
Makes big AI models work in smaller ones.
Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions
Computation and Language
Makes big AI models smaller and faster.