Mechanisms of Non-Monotonic Scaling in Vision Transformers
By: Anantha Padmanaban Krishna Kumar
Potential Business Impact:
Shows why making vision AI models deeper can backfire, and offers a simple check for picking the right depth.
Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
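The exact definition of the Information Scrambling Index is given in the paper and the linked repository. As an illustration only, the sketch below shows one plausible way to quantify per-layer information mixing from a ViT attention map: a normalized attention-entropy proxy plus the share of attention directed at the [CLS] token. The function names and the entropy-based formulation are assumptions for this example, not the paper's metric.

```python
# Illustrative sketch, NOT the paper's definition of the Information Scrambling
# Index: a normalized attention-entropy proxy for information mixing, plus the
# fraction of attention that patch tokens direct at the [CLS] token.
import torch


def mixing_index(attn: torch.Tensor) -> float:
    """attn: (heads, tokens, tokens) attention weights for one layer; rows sum to 1.
    Returns mean attention entropy normalized to [0, 1]; 1 = fully diffuse mixing."""
    eps = 1e-12
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)           # (heads, tokens)
    max_entropy = torch.log(torch.tensor(float(attn.shape[-1])))  # uniform-attention bound
    return (entropy / max_entropy).mean().item()


def cls_attention_share(attn: torch.Tensor, cls_index: int = 0) -> float:
    """Average attention mass that patch tokens place on the [CLS] token."""
    patch_rows = torch.ones(attn.shape[-2], dtype=torch.bool)
    patch_rows[cls_index] = False                                 # exclude CLS->CLS row
    return attn[:, patch_rows, cls_index].mean().item()


if __name__ == "__main__":
    # Random stand-in attention for 12 heads and 197 tokens (ViT-B/16 at 224x224).
    attn = torch.randn(12, 197, 197).softmax(dim=-1)
    print(f"mixing index:          {mixing_index(attn):.3f}")
    print(f"[CLS] attention share: {cls_attention_share(attn):.4f}")
```

Tracking these two quantities layer by layer is one way to visualize the kind of depth-wise pattern the abstract describes: rising diffusion alongside a shrinking [CLS] role.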
Similar Papers
Rethinking the Use of Vision Transformers for AI-Generated Image Detection
CV and Pattern Recognition
Finds fake pictures better using more picture parts.
Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction
CV and Pattern Recognition
Makes computer vision models work better, faster.
Diminishing Returns in Self-Supervised Learning
CV and Pattern Recognition
Makes small AI models learn better with less data.