Diminishing Returns in Self-Supervised Learning
By: Oli Bridge, Huey Sun, Botond Branyicskai-Nagy, and more
Potential Business Impact:
Makes small AI models learn better with less data.
While transformer-based architectures have taken computer vision and NLP by storm, they often require a vast number of parameters and a large amount of training data to attain strong performance. In this work, we experiment with three distinct pre-training, intermediate fine-tuning, and downstream datasets and training objectives to explore their marginal benefits on a small 5M-parameter vision transformer. We find that while pre-training and fine-tuning consistently help our model, they exhibit diminishing returns, and intermediate fine-tuning can actually harm downstream performance, potentially due to dissimilarity in task mechanics. Taken together, our results suggest that small-scale ViTs benefit most from targeted pre-training and careful data selection, while indiscriminate stacking of intermediate tasks can waste compute and even degrade performance.
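To make the three-stage setup concrete, here is a minimal sketch of a pre-train / intermediate fine-tune / downstream fine-tune pipeline on a roughly 5M-parameter vision transformer, assuming PyTorch and timm. The model choice (ViT-Tiny), the placeholder data loaders, the supervised objectives, and all hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of the staged training pipeline described above (assumptions noted in comments).
import torch
import torch.nn as nn
import timm


def train_stage(model, loader, num_classes, epochs=1, lr=3e-4, device="cpu"):
    """Swap in a fresh classification head and fine-tune on one stage's data."""
    model.reset_classifier(num_classes)  # timm API: replace the classifier head
    model.to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model


# ViT-Tiny (~5-6M parameters) stands in for the paper's small vision transformer.
model = timm.create_model("vit_tiny_patch16_224", pretrained=False)

# Hypothetical DataLoaders for the three stages; substitute real datasets.
# pretrain_loader, intermediate_loader, downstream_loader = ...

# Stage 1: pre-training (supervised here for simplicity; the paper's
# self-supervised objective would replace this loss).
# model = train_stage(model, pretrain_loader, num_classes=1000)

# Stage 2: optional intermediate fine-tuning -- the stage the abstract
# reports can hurt downstream performance when task mechanics differ.
# model = train_stage(model, intermediate_loader, num_classes=100)

# Stage 3: downstream fine-tuning and evaluation.
# model = train_stage(model, downstream_loader, num_classes=10)
```

The point of the structure is that each stage reuses the same backbone while swapping the head and data, which is what lets the marginal benefit of each stage be measured in isolation.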
Similar Papers
Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction
CV and Pattern Recognition
Makes computer vision models work better, faster.
Mechanisms of Non-Monotonic Scaling in Vision Transformers
Machine Learning (CS)
Makes computer "eyes" learn better by changing how they see.
Infusing fine-grained visual knowledge to Vision-Language Models
CV and Pattern Recognition
Keeps AI smart while teaching new skills.