Score: 1

Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks

Published: August 19, 2025 | arXiv ID: 2508.13744v1

By: Yeji Park, Minyoung Lee, Sanghyuk Chun, and more

Potential Business Impact:

Helps AI models answer questions about several images at once without mixing them up.

Business Areas:
Image Recognition, Data and Analytics, Software

Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
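The abstract describes the decoding procedure only at a high level. The following is a minimal Python sketch of that idea, not the authors' implementation: the lvlm_logits(images, prompt) interface, the mean aggregation of per-image logits, the contrast strength alpha, and the exact contrastive formula (in the style of contrastive decoding) are all illustrative assumptions.

    # Minimal sketch of the FOCUS decoding idea from the abstract.
    # Assumptions (not confirmed by the paper): the `lvlm_logits` model
    # interface, mean aggregation, and the (1 + alpha) * agg - alpha * ref
    # contrastive refinement.

    from typing import Callable, List
    import torch

    def focus_next_token_logits(
        lvlm_logits: Callable[[List[torch.Tensor], str], torch.Tensor],
        images: List[torch.Tensor],   # each image as a float (C, H, W) tensor
        prompt: str,
        alpha: float = 1.0,           # contrast strength (assumed hyperparameter)
    ) -> torch.Tensor:
        """Next-token logits with cross-image leakage suppressed."""
        per_image_logits = []
        for keep in range(len(images)):
            # Mask every image except `keep` with random noise, so each
            # forward pass focuses on a single clean image.
            masked = [
                img if i == keep else torch.randn_like(img)
                for i, img in enumerate(images)
            ]
            per_image_logits.append(lvlm_logits(masked, prompt))

        # Aggregate logits from all partially masked contexts
        # (mean is an assumption; the abstract does not specify).
        aggregated = torch.stack(per_image_logits).mean(dim=0)

        # Noise-only reference: every image replaced by random noise.
        noise_only = [torch.randn_like(img) for img in images]
        reference = lvlm_logits(noise_only, prompt)

        # Contrastively refine against the noise-only reference to
        # suppress responses driven by leaked or spurious visual cues.
        return (1.0 + alpha) * aggregated - alpha * reference

In practice, such a correction would be applied at every decoding step, trading one forward pass per image (plus one noise-only pass) for reduced cross-image leakage, which is consistent with the abstract's claim of a training-free, architecture-agnostic method.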

Repos / Data Links

Page Count
9 pages

Category
Computer Science: Computer Vision and Pattern Recognition