Bayesian Self-Distillation for Image Classification
By: Anton Adelöw, Matteo Gamba, Atsuto Maki
Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model's own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.
Similar Papers
Dataset Distillation for Pre-Trained Self-Supervised Vision Models
CV and Pattern Recognition
Creates small, smart picture sets for AI.
Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation
CV and Pattern Recognition
Makes AI learn from less data, better.
Dataset Distillation for Super-Resolution without Class Labels and Pre-trained Models
CV and Pattern Recognition
Makes blurry pictures sharp with less data.