Score: 1

Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers

Published: November 10, 2025 | arXiv ID: 2511.06848v1

By: Huiyuan Tian, Bonan Xu, Shijian Li

Potential Business Impact:

Makes AI models smaller and faster to train.

Business Areas:
Image Recognition, Data and Analytics, Software

While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed "distillation dynamics", combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided in the supplementary material.
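To make the "distillation dynamics" idea concrete, below is a minimal sketch (not the authors' implementation) of probing per-layer ViT features with the three signals the abstract names: frequency spectrum, information entropy, and activation magnitude. It assumes PyTorch and a timm ViT; the helper `layer_statistics` and the specific entropy/spectrum definitions are illustrative assumptions, not the paper's exact metrics.

```python
# Hypothetical per-layer probe of ViT block outputs (sketch only).
import torch
import timm

def layer_statistics(feats: torch.Tensor):
    """feats: (batch, tokens, channels) output of one transformer block."""
    # Activation magnitude: mean L2 norm per token.
    magnitude = feats.norm(dim=-1).mean().item()

    # Frequency spectrum: FFT over the token dimension, energy per frequency bin.
    spectrum = torch.fft.rfft(feats, dim=1).abs().mean(dim=(0, 2))  # (freq_bins,)

    # Information entropy: treat normalized channel energies as a distribution.
    p = feats.pow(2).mean(dim=(0, 1))
    p = p / p.sum()
    entropy = -(p * (p + 1e-12).log()).sum().item()
    return magnitude, spectrum, entropy

model = timm.create_model("vit_base_patch16_224", pretrained=False).eval()
stats = []
hooks = [
    blk.register_forward_hook(lambda m, i, o: stats.append(layer_statistics(o)))
    for blk in model.blocks
]
with torch.no_grad():
    model(torch.randn(2, 3, 224, 224))
for h in hooks:
    h.remove()

# stats[l] holds (magnitude, spectrum, entropy) for block l; plotting entropy
# against depth is where a U-shaped compression-then-expansion pattern would
# appear under the paper's framing.
```

Comparing such per-layer statistics between a large teacher and a small student is one way to see the late-layer representational mismatch the abstract describes, since the student's limited channel capacity shows up directly in its spectrum and entropy profiles.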

Country of Origin
🇨🇳 🇭🇰 China, Hong Kong

Page Count
13 pages

Category
Computer Science:
CV and Pattern Recognition