Exploring Compositionality in Vision Transformers using Wavelet Representations
By: Akshad Shyam Purushottamdas, Pranav K Nayak, Divya Mehul Rajparia, et al.
Potential Business Impact:
Helps computers understand pictures by breaking them down.
While insights into the workings of transformer models have largely emerged from analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining how well composed representations reproduce the representation of the original image, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.
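To make the setup concrete, below is a minimal sketch of such a compositionality test. It assumes PyWavelets (`dwt2`/`idwt2` with the Haar wavelet) for the one-level decomposition, a placeholder `encode` function standing in for the ViT encoder, and summation as the latent-space composition operator; these specifics are illustrative assumptions, not the paper's exact method.

```python
# A hedged sketch of a DWT-based compositionality test. Assumptions (not
# from the paper): Haar wavelet, sum as the composition operator, and a
# placeholder linear `encode` so the script runs end to end.
import numpy as np
import pywt


def encode(img: np.ndarray) -> np.ndarray:
    """Placeholder for a ViT encoder: flattens the image into a vector.
    Because this stand-in is linear, composition here is exact; with a
    real ViT encoder it would only be approximate."""
    return img.reshape(-1).astype(np.float64)


# Grayscale image (H x W); real experiments would use natural images.
rng = np.random.default_rng(0)
image = rng.random((224, 224))

# One-level 2D DWT: approximation (LL) plus horizontal, vertical, and
# diagonal detail sub-bands.
LL, (LH, HL, HH) = pywt.dwt2(image, "haar")

# Map each sub-band back to input space via the inverse DWT with the
# other sub-bands zeroed, yielding four input-dependent "primitives".
zeros = np.zeros_like(LL)
primitives = [
    pywt.idwt2((LL, (zeros, zeros, zeros)), "haar"),
    pywt.idwt2((zeros, (LH, zeros, zeros)), "haar"),
    pywt.idwt2((zeros, (zeros, HL, zeros)), "haar"),
    pywt.idwt2((zeros, (zeros, zeros, HH)), "haar"),
]

# Compose the primitives' representations in latent space and compare
# against the representation of the original image.
composed = sum(encode(p) for p in primitives)
original = encode(image)
cosine = composed @ original / (
    np.linalg.norm(composed) * np.linalg.norm(original)
)
print(f"cosine similarity between composed and original: {cosine:.4f}")
```

A high similarity between the composed and original representations would indicate that the encoder approximately respects compositionality over DWT primitives, which is the pattern the abstract reports for one-level decompositions.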
Similar Papers
Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform
Computation and Language
Makes computers' language representations smaller, faster, and better.
Hands-on Evaluation of Visual Transformers for Object Recognition and Detection
CV and Pattern Recognition
Helps computers see the whole picture, not just parts.
The Inductive Bottleneck: Data-Driven Emergence of Representational Sparsity in Vision Transformers
CV and Pattern Recognition
Makes computers understand pictures better by focusing on important parts.