Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
By: Yeji Park, Minyoung Lee, Sanghyuk Chun, and more
Potential Business Impact:
Helps computers understand many pictures at once.
Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
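To make the decoding procedure concrete, below is a minimal sketch of one FOCUS-style decoding step. The `model_logits` wrapper, the random-noise masking, the mean aggregation, and the `alpha`/`beta` contrastive weights are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch

def focus_decode_step(model_logits, images, prompt, alpha=1.0, beta=0.5):
    """Sketch of one FOCUS-style decoding step.

    model_logits(images, prompt) -> next-token logits: a hypothetical wrapper
    around an LVLM forward pass. `alpha`/`beta` are illustrative weights.
    """
    num_images = len(images)

    # 1. For each target image, replace every *other* image with random noise
    #    so the model focuses on a single clean image at a time.
    per_image_logits = []
    for keep in range(num_images):
        masked = [
            img if i == keep else torch.rand_like(img)  # noise-mask the rest
            for i, img in enumerate(images)
        ]
        per_image_logits.append(model_logits(masked, prompt))

    # 2. Aggregate the logits obtained under the partially masked contexts
    #    (mean aggregation is an assumption of this sketch).
    aggregated = torch.stack(per_image_logits, dim=0).mean(dim=0)

    # 3. Contrastive refinement against a noise-only reference input to
    #    suppress predictions that do not depend on any clean image.
    noise_only = [torch.rand_like(img) for img in images]
    reference = model_logits(noise_only, prompt)

    return alpha * aggregated - beta * reference
```

Because the procedure only re-weights logits at inference time, it can in principle be wrapped around any LVLM's decoding loop without retraining or architectural changes, which is the training-free, architecture-agnostic property the abstract describes.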
Similar Papers
Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models
Hardware Architecture
Makes AI watch videos faster and use less power.
Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains
CV and Pattern Recognition
Helps computers understand many pictures at once.
MAP: Mitigating Hallucinations in Large Vision-Language Models with Map-Level Attention Processing
CV and Pattern Recognition
Makes AI pictures match real things better.