Following the Teacher's Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs
By: Cheng Feng, Chaoliang Zhong, Jun Sun, et al.
Large language models (LLMs) are challenging to deploy for domain-specific tasks due to their massive scale. While distilling a fine-tuned LLM into a smaller student model is a promising alternative, the capacity gap between teacher and student often leads to suboptimal performance. This raises a key question: when and how can a student model match or even surpass its teacher on domain-specific tasks? In this work, we offer a theoretical insight: a student can outperform its teacher if its advantage on a Student-Favored Subdomain (SFS) outweighs its deficit on the Teacher-Favored Subdomain (TFS). Guided by this insight, we propose Scheduled Checkpoint Distillation (SCD), which reduces the TFS deficit by emulating the teacher's convergence process during supervised fine-tuning (SFT) on the domain task, together with a sample-wise Adaptive Weighting (AW) mechanism that preserves the student's strengths on the SFS. Experiments across diverse domain tasks--including QA, NER, and text classification in multiple languages--show that our method consistently outperforms existing distillation approaches, allowing the student model to match or even exceed the performance of its fine-tuned teacher.
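The abstract only names the two mechanisms, so a sketch may help make them concrete. Below is a minimal, hypothetical PyTorch illustration assuming HuggingFace-style models: the outer loop distills from a sequence of teacher SFT checkpoints (the "scheduled" part), and a per-sample weight stands in for AW by shrinking the distillation signal on samples where the student already beats the teacher, a rough proxy for the Student-Favored Subdomain. The checkpoint schedule, the sigmoid weighting rule, and every function name here are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of Scheduled Checkpoint Distillation (SCD) with
# sample-wise Adaptive Weighting (AW), based only on the abstract above.
import torch
import torch.nn.functional as F

def per_sample_ce(logits, labels):
    """Token-averaged cross-entropy per sample (ignore_index = -100).

    Shifting logits/labels for next-token prediction is elided here.
    """
    loss = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction="none", ignore_index=-100
    )  # (batch, seq_len)
    mask = (labels != -100).float()
    return (loss * mask).sum(-1) / mask.sum(-1).clamp(min=1)

def distill_step(student, teacher, batch, temperature=2.0):
    """One distillation step against a single teacher checkpoint."""
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits

    # Per-sample distillation loss: pointwise KL terms, summed over the
    # vocabulary and averaged over tokens, with the usual T^2 scaling.
    kl = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="none",
    ).sum(-1).mean(-1) * (temperature ** 2)  # -> one scalar per sample

    # Assumed AW rule (not from the paper): down-weight samples where the
    # student's own cross-entropy is already lower than the teacher's, so
    # distillation does not erase the student's existing strengths.
    s_ce = per_sample_ce(s_logits, batch["labels"])
    t_ce = per_sample_ce(t_logits, batch["labels"])
    weights = torch.sigmoid(s_ce - t_ce)  # small when student < teacher

    return (weights * kl).mean()

def scd_train(student, checkpoint_paths, loader, load_teacher, optimizer):
    """Outer SCD loop: walk the teacher's SFT trajectory in order, so the
    student follows the teacher's convergence path instead of chasing only
    its final distribution."""
    for path in checkpoint_paths:  # e.g. early, middle, and final SFT steps
        teacher = load_teacher(path).eval()
        for batch in loader:
            loss = distill_step(student, teacher, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

In the abstract's terms, the checkpoint schedule targets the TFS deficit, while the per-sample weighting protects the SFS advantage; the student wins overall when the weighted SFS gain exceeds the weighted TFS loss.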