Mechanisms of Non-Monotonic Scaling in Vision Transformers

Published: November 26, 2025 | arXiv ID: 2511.21635v1

By: Anantha Padmanaban Krishna Kumar

Potential Business Impact:

Helps builders of image-recognition systems choose model depth more effectively, improving accuracy without simply making models bigger.

Business Areas:
Image Recognition, Data and Analytics, Software

Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify information mixing with an Information Scrambling Index and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B; these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.
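The abstract names two diagnostics, the Information Scrambling Index and the progressive marginalization of the [CLS] token, without spelling out how they are computed. The sketch below is a minimal illustration, not the authors' implementation: it assumes an attention-entropy formulation of the scrambling index and measures [CLS] marginalization as the attention mass patch tokens direct at the [CLS] token. The function names and the synthetic attention tensors are hypothetical stand-ins; in practice you would hook the attention probabilities out of a real ViT (e.g., via timm) and plot both quantities against layer depth.

```python
# Hypothetical sketch of layer-wise ViT diagnostics. The paper's exact
# definitions are not given in the summary above; this assumes:
#   - an attention-entropy formulation of the "Information Scrambling Index"
#   - [CLS] marginalization measured as patch-to-[CLS] attention mass
import torch

def scrambling_index(attn: torch.Tensor) -> float:
    """attn: (heads, tokens, tokens) attention weights for one layer,
    with each row summing to 1. Returns mean attention entropy,
    normalized to [0, 1]; 1 means every token attends uniformly to
    all tokens (maximal information mixing)."""
    eps = 1e-12
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)      # (heads, tokens)
    max_entropy = torch.log(torch.tensor(float(attn.shape[-1])))
    return (entropy / max_entropy).mean().item()

def cls_attention_share(attn: torch.Tensor) -> float:
    """Fraction of attention mass that patch tokens (rows 1:) direct at
    the [CLS] token (column 0). A curve that falls with depth would be
    consistent with the [CLS]-marginalization trend described above."""
    return attn[:, 1:, 0].mean().item()

# Demo on synthetic attention for a 12-layer model; swap in real
# attention probabilities extracted from a trained ViT to use this.
torch.manual_seed(0)
num_layers, heads, tokens = 12, 6, 197        # 196 patches + [CLS]
for layer in range(num_layers):
    attn = torch.randn(heads, tokens, tokens).softmax(dim=-1)
    print(f"layer {layer:2d}  ISI={scrambling_index(attn):.3f}  "
          f"CLS share={cls_attention_share(attn):.4f}")
```

Under this formulation, an index near 1 means a layer scrambles information almost uniformly across tokens; tracking where the curve's shape changes across depth is one plausible way to locate the Cliff-Plateau-Climb phase boundaries the paper reports.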

Country of Origin
🇺🇸 United States

Page Count
16 pages

Category
Computer Science:
Machine Learning (CS)