PLD: A Choice-Theoretic List-Wise Knowledge Distillation
By: Ejafa Bassam, Dawei Zhu, Kaigui Bian
Potential Business Impact:
Teaches smaller computer brains to think like bigger ones.
Knowledge distillation is a model compression technique in which a compact "student" network is trained to replicate the predictive behavior of a larger "teacher" network. In logit-based knowledge distillation, the de facto approach is to augment the cross-entropy loss with a distillation term. Typically this term is either a KL divergence that matches marginal probabilities or a correlation-based loss that captures intra- and inter-class relationships; in every case it sits as an add-on to cross-entropy with its own weight that must be carefully tuned. In this paper we adopt a choice-theoretic perspective and recast knowledge distillation under the Plackett-Luce model by interpreting teacher logits as "worth" scores. We introduce Plackett-Luce Distillation (PLD), a weighted list-wise ranking loss in which the teacher transfers knowledge of its full ranking of classes, weighting each ranked choice by its own confidence. PLD directly optimizes a single teacher-optimal ranking: the true label first, followed by the remaining classes in descending teacher confidence. This yields a convex, translation-invariant surrogate that subsumes weighted cross-entropy. Empirically, on standard image classification benchmarks, PLD improves Top-1 accuracy by an average of +0.42% over DIST (arXiv:2205.10536) and +1.04% over KD (arXiv:1503.02531) in homogeneous settings, and by +0.48% and +1.09% over DIST and KD, respectively, in heterogeneous settings.
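To make the list-wise idea concrete, here is a minimal sketch of a Plackett-Luce distillation loss in PyTorch. It is an illustration based only on the description above, not the authors' reference implementation: the function name pld_loss, the temperature handling, and the assumption that each ranked position is weighted by the teacher's softmax probability of the class placed there are all assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def pld_loss(student_logits, teacher_logits, labels, temperature=1.0):
    """Sketch of a Plackett-Luce list-wise distillation loss.

    student_logits, teacher_logits: (batch, num_classes)
    labels: (batch,) ground-truth class indices
    """
    s = student_logits / temperature
    t = teacher_logits / temperature

    # Teacher-optimal ranking: the true label first, then the remaining
    # classes in descending order of teacher logit.
    priority = t.clone()
    priority.scatter_(1, labels.unsqueeze(1), float("inf"))
    ranking = priority.argsort(dim=1, descending=True)          # (batch, C)

    # Reorder student logits and teacher probabilities by that ranking.
    s_ranked = torch.gather(s, 1, ranking)
    w_ranked = torch.gather(F.softmax(t, dim=1), 1, ranking)

    # Plackett-Luce log-likelihood of the ranking under student scores:
    # log P(pi) = sum_k [ s_{pi_k} - logsumexp(s_{pi_k}, ..., s_{pi_C}) ].
    # Suffix logsumexp: flip, cumulative logsumexp, flip back.
    suffix_lse = torch.logcumsumexp(s_ranked.flip(dims=[1]), dim=1).flip(dims=[1])
    log_probs = s_ranked - suffix_lse                            # (batch, C)

    # Weight each ranked choice by the teacher's confidence at that position
    # (assumed weighting scheme for this sketch).
    loss = -(w_ranked * log_probs).sum(dim=1)
    return loss.mean()
```

Note how the first term of the sum (the true label at the top of the ranking) reduces to a teacher-weighted cross-entropy on the ground-truth class, which is consistent with the abstract's claim that PLD subsumes weighted cross-entropy.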
Similar Papers
Progressive Class-level Distillation
CV and Pattern Recognition
Teaches small computers to learn from big ones better.
Swapped Logit Distillation via Bi-level Teacher Alignment
Machine Learning (CS)
Makes small computers learn as well as big ones.
Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective
Machine Learning (CS)
Teaches computers to learn better from other computers.