HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models
By: Xu Li, Yuxuan Liang, Xiaolei Chen and more
Potential Business Impact:
Makes AI see details in pictures faster.
By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
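The two components named in the abstract, content-adaptive budget allocation and function-aware token selection, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the proportional allocation rule, the even split between early-stage and late-stage CLS attention, and the names `allocate_budget`, `select_tokens`, `early_attn`, and `late_attn` are all assumptions made for illustration.

```python
import numpy as np


def allocate_budget(tile_importance, total_budget, tokens_per_tile):
    """Split a token budget across tiles in proportion to importance.

    Assumption: a simple proportional rule; the paper's actual
    importance estimate and allocation rule may differ.
    """
    w = np.asarray(tile_importance, dtype=float)
    w = w / w.sum()
    budget = np.floor(w * total_budget).astype(int)
    # Hand leftover tokens to the most important tiles first.
    leftover = total_budget - int(budget.sum())
    for i in np.argsort(-w)[:leftover]:
        budget[i] += 1
    # A tile cannot keep more tokens than it has.
    return np.minimum(budget, tokens_per_tile)


def select_tokens(early_attn, late_attn, k):
    """Keep k token indices within one tile.

    Assumption: half of the slots go to tokens favored by CLS
    attention at an early stage, and the rest go to tokens favored
    at a late stage, skipping tokens that were already chosen.
    """
    early_attn = np.asarray(early_attn, dtype=float)
    late_attn = np.asarray(late_attn, dtype=float)
    chosen = [int(j) for j in np.argsort(-early_attn)[: k // 2]]
    for j in np.argsort(-late_attn):
        if len(chosen) >= k:
            break
        if int(j) not in chosen:
            chosen.append(int(j))
    return np.sort(np.array(chosen, dtype=int))


# Hypothetical usage: three tiles of 16 tokens each, 8 tokens kept overall.
budgets = allocate_budget([4, 2, 2], total_budget=8, tokens_per_tile=16)
kept = select_tokens([0.9, 0.1, 0.8, 0.0], [0.0, 0.7, 0.9, 0.6], k=3)
```

Mixing the two attention stages in `select_tokens` reflects the abstract's third finding, that tokens emphasized at different stages play complementary roles; the 50/50 split is an arbitrary choice in this sketch. Because of the per-tile cap, the kept total can fall below `total_budget` when one tile dominates the importance scores.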
Similar Papers
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
CV and Pattern Recognition
Makes smart AI work on your phone.
Differentiable Hierarchical Visual Tokenization
CV and Pattern Recognition
Makes computer vision understand pictures better.
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
CV and Pattern Recognition
Lets computers understand pictures better, faster.