Dynamic Granularity Matters: Rethinking Vision Transformers Beyond Fixed Patch Splitting
By: Qiyang Yu, Yu Fang, Tianrui Li, and more
Potential Business Impact:
Makes computer vision see details better, faster.
Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global dependencies but often struggle to efficiently represent fine-grained local details. Existing multi-scale approaches alleviate this issue by integrating hierarchical or hybrid features; however, they rely on fixed patch sizes and introduce redundant computation. To address these limitations, we propose the Granularity-driven Vision Transformer (Grc-ViT), a dynamic coarse-to-fine framework that adaptively adjusts visual granularity based on image complexity. It comprises two key stages: (1) a Coarse Granularity Evaluation module, which assesses visual complexity using edge density, entropy, and frequency-domain cues to estimate suitable patch and window sizes; (2) a Fine-grained Refinement module, which refines attention computation according to the selected granularity, enabling efficient and precise feature learning. Two learnable parameters, α and β, are optimized end-to-end to balance global reasoning and local perception. Comprehensive evaluations demonstrate that Grc-ViT enhances fine-grained discrimination while achieving a superior trade-off between accuracy and computational efficiency.
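The coarse-to-fine idea described above can be sketched in a few lines. The following is a minimal, illustrative NumPy implementation of a complexity score built from the three cues the abstract names (edge density, entropy, and frequency-domain energy) and a simple mapping from that score to a candidate patch size. The cue weights, thresholds, and candidate sizes are assumptions for illustration only; the paper's actual modules are learned components, not these hand-crafted heuristics.

```python
import numpy as np

def visual_complexity(img):
    """Score image complexity in [0, 1] from three cues.

    Cue definitions and the equal-weight average are illustrative
    assumptions, not the paper's learned evaluation module.
    """
    img = img.astype(np.float64)

    # Edge density: fraction of pixels whose gradient magnitude
    # exceeds the image's mean gradient magnitude.
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    edge_density = np.mean(mag > mag.mean())

    # Shannon entropy of the intensity histogram, normalized by the
    # 8-bit maximum of 8 bits.
    hist, _ = np.histogram(img, bins=256, range=(0, 256), density=True)
    p = hist[hist > 0]
    entropy = -np.sum(p * np.log2(p)) / 8.0

    # Frequency cue: share of 2D-FFT magnitude outside a small
    # low-frequency box around the (shifted) DC component.
    f = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = f.shape
    cy, cx = h // 2, w // 2
    low = f[cy - h // 8: cy + h // 8, cx - w // 8: cx + w // 8].sum()
    high_ratio = 1.0 - low / f.sum()

    return (edge_density + entropy + high_ratio) / 3.0

def select_patch_size(img, sizes=(16, 8, 4)):
    """Map complexity to a patch size: simple images get coarse
    patches, detailed images get fine ones (thresholds assumed)."""
    c = visual_complexity(img)
    if c < 0.33:
        return sizes[0]
    elif c < 0.66:
        return sizes[1]
    return sizes[2]
```

For example, a flat gray image scores near zero on all three cues and keeps the coarse 16-pixel patches, while a noisy, high-detail image is pushed toward the finer sizes, which is the accuracy/efficiency trade-off the dynamic granularity is meant to exploit.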
Similar Papers
Rethinking Vision Transformer for Large-Scale Fine-Grained Image Retrieval
Multimedia
Finds exact picture matches faster and better.
Next Visual Granularity Generation
CV and Pattern Recognition
Makes computers draw pictures by adding details.
GFT: Gradient Focal Transformer
CV and Pattern Recognition
Helps computers see tiny differences in pictures.