Rethinking Vision Transformer for Large-Scale Fine-Grained Image Retrieval
By: Xin Jiang, Hao Tang, Yonghua Pan and more
Potential Business Impact:
Finds exact picture matches faster and better.
Large-scale fine-grained image retrieval (FGIR) aims to retrieve images belonging to the same subcategory as a given query by capturing subtle differences in a large-scale setting. Recently, Vision Transformers (ViT) have been employed in FGIR due to their powerful self-attention mechanism for modeling long-range dependencies. However, most Transformer-based methods focus primarily on leveraging self-attention to distinguish fine-grained details, while overlooking the high computational complexity and redundant dependencies inherent to these models, which limits their scalability and effectiveness in large-scale FGIR. In this paper, we propose an Efficient and Effective ViT-based framework, termed EET, which integrates a token pruning module with a discriminative transfer strategy to address these limitations. Specifically, we introduce a content-based token pruning scheme to enhance the efficiency of the vanilla ViT, progressively removing background or low-discriminative tokens at different stages by exploiting feature responses and the self-attention mechanism. To ensure the resulting efficient ViT retains strong discriminative power, we further present a discriminative transfer strategy comprising both discriminative knowledge transfer and discriminative region guidance. Using a distillation paradigm, these components transfer knowledge from a larger "teacher" ViT to a more efficient "student" model, guiding the latter to focus on subtle yet crucial regions in a cost-free manner. Extensive experiments on two widely used fine-grained datasets and four large-scale fine-grained datasets demonstrate the effectiveness of our method. Specifically, EET reduces the inference latency of ViT-Small by 42.7% and boosts the retrieval performance of 16-bit hash codes by 5.15% on the challenging NABirds dataset.
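The token pruning idea in the abstract can be illustrated with a minimal sketch. This is not the EET implementation: the abstract does not specify how tokens are scored, so the sketch assumes each patch token already has a single importance score (for example, the attention it receives from the class token) and keeps a fixed fraction of the highest-scoring tokens. The names prune_tokens, cls_attention, and keep_ratio are hypothetical and chosen only for this example.

```python
from typing import List, Sequence, Tuple


def prune_tokens(
    tokens: Sequence[Sequence[float]],
    cls_attention: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[List[float]], List[int]]:
    """Keep the highest-scoring patch tokens (illustrative sketch only).

    tokens: one feature vector per patch token (class token excluded).
    cls_attention: assumed importance score per patch token, e.g. the
        attention from the class token to that patch.
    keep_ratio: fraction of patch tokens to keep, in (0, 1].
    """
    if len(tokens) != len(cls_attention):
        raise ValueError("tokens and cls_attention must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []

    # Floor of the requested fraction, but never drop every token.
    num_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank indices by score, highest first; sorted() is stable, so ties
    # keep the token that appears earlier in the sequence.
    ranked = sorted(range(len(tokens)), key=lambda i: -cls_attention[i])
    # Restore the original order so later layers see a consistent sequence.
    kept = sorted(ranked[:num_keep])
    return [list(tokens[i]) for i in kept], kept


# Six patch tokens with one feature each; low scores stand in for background.
patch_tokens = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
scores = [0.05, 0.30, 0.10, 0.25, 0.02, 0.28]
kept_tokens, kept_indices = prune_tokens(patch_tokens, scores, keep_ratio=0.5)
print(kept_indices)  # [1, 3, 5]
```

Pruning "at different stages", as the abstract describes, would correspond to calling such a function after several Transformer blocks, each time with scores recomputed for the surviving tokens. The stage positions and keep ratios are not given in the abstract and would need to be taken from the paper.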
Similar Papers
GFT: Gradient Focal Transformer
CV and Pattern Recognition
Helps computers see tiny differences in pictures.
Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting
CV and Pattern Recognition
Makes computer vision see details better, faster.
Edge-Enhanced Vision Transformer Framework for Accurate AI-Generated Image Detection
CV and Pattern Recognition
Finds fake pictures made by computers.