Stronger ViTs With Octic Equivariance
By: David Nordström, Johan Edstedt, Fredrik Kahl, and more
Potential Business Impact:
Cuts the compute cost of Vision Transformers (roughly 40% fewer FLOPs for ViT-H) while improving image classification and segmentation accuracy.
Recent efforts to scale computer vision models have established Vision Transformers (ViTs) as the leading architecture. ViTs incorporate weight sharing over image patches as an important inductive bias. In this work, we show that ViTs benefit from incorporating equivariance under the octic group, i.e., reflections and 90-degree rotations, as a further inductive bias. We develop new architectures, octic ViTs, that use octic-equivariant layers and put them to the test on both supervised and self-supervised learning. Through extensive experiments training DeiT-III and DINOv2 on ImageNet-1K, we show that octic ViTs yield more computationally efficient networks while also improving performance. In particular, we achieve an approximately 40% reduction in FLOPs for ViT-H while simultaneously improving both classification and segmentation results.
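The abstract's key idea, octic equivariance, can be made concrete with a small sketch. Below is a minimal PyTorch illustration, not the paper's actual architecture; the names `apply_g`, `apply_g_inv`, and `OcticSymmetrized` are hypothetical. It enumerates the 8 octic-group transforms (4 rotations by 90 degrees, each optionally composed with a horizontal flip) and wraps an arbitrary spatial layer in a group average, which is exactly equivariant to those transforms by construction.

```python
# Minimal sketch of octic equivariance (illustration only, assuming PyTorch;
# not the paper's octic-equivariant layer design).
import torch
import torch.nn as nn

def apply_g(x: torch.Tensor, k: int, flip: bool) -> torch.Tensor:
    """Apply an octic-group element: optional horizontal flip, then rotate by 90*k degrees."""
    if flip:
        x = torch.flip(x, dims=[-1])
    return torch.rot90(x, k, dims=[-2, -1])

def apply_g_inv(x: torch.Tensor, k: int, flip: bool) -> torch.Tensor:
    """Apply the inverse element: rotate by -90*k degrees, then optional flip."""
    x = torch.rot90(x, -k, dims=[-2, -1])
    if flip:
        x = torch.flip(x, dims=[-1])
    return x

class OcticSymmetrized(nn.Module):
    """Average a base spatial layer over the octic group (Reynolds operator),
    which makes the wrapped layer exactly equivariant to all 8 transforms."""
    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = [
            apply_g_inv(self.layer(apply_g(x, k, flip)), k, flip)
            for flip in (False, True)
            for k in range(4)
        ]
        return torch.stack(outs).mean(dim=0)

# Sanity check: rotating the input rotates the output identically.
layer = OcticSymmetrized(nn.Conv2d(3, 3, kernel_size=3, padding=1))
x = torch.randn(1, 3, 8, 8)
assert torch.allclose(layer(apply_g(x, 1, False)),
                      apply_g(layer(x), 1, False), atol=1e-5)
```

Note that this naive symmetrization runs the base layer 8 times, so it only demonstrates the equivariance constraint itself; per the abstract, the paper's octic-equivariant layers are instead designed so that the symmetry reduces compute, yielding the reported ~40% FLOPs savings for ViT-H.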
Similar Papers
ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages
CV and Pattern Recognition
Combines local attention with multi-scale convolutional stages to make Vision Transformers faster and more accurate.
Alias-Free ViT: Fractional Shift Invariance via Linear Attention
CV and Pattern Recognition
Uses linear attention to give Vision Transformers fractional shift invariance, making them robust to small input translations.
Vision Transformers: the threat of realistic adversarial patches
CV and Pattern Recognition
Shows that realistic adversarial patches can trick Vision Transformer detectors into seeing people who aren't there.