Score: 1

Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks

Published: August 19, 2025 | arXiv ID: 2508.13744v1

By: Yeji Park, Minyoung Lee, Sanghyuk Chun, and more

Potential Business Impact:

Helps AI models answer questions about several images at once without mixing them up.

Business Areas:
Image Recognition, Data and Analytics, Software

Large Vision-Language Models (LVLMs) demonstrate strong performance on single-image tasks. However, we observe that their performance degrades significantly when handling multi-image inputs. This occurs because visual cues from different images become entangled in the model's output. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic decoding strategy that mitigates cross-image information leakage during inference. FOCUS sequentially masks all but one image with random noise, guiding the model to focus on the single clean image. We repeat this process across all target images to obtain logits under partially masked contexts. These logits are aggregated and then contrastively refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance across four multi-image benchmarks and diverse LVLM families. This demonstrates that FOCUS offers a general and practical solution for enhancing multi-image reasoning without additional training or architectural modifications.
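The abstract describes the decoding procedure only at a high level. The following is a minimal Python sketch of that idea, not the authors' implementation: the lvlm_logits(images, prompt) interface, the mean aggregation of per-image logits, the contrast strength alpha, and the exact contrastive formula (in the style of contrastive decoding) are all illustrative assumptions.

    # Minimal sketch of the FOCUS decoding idea from the abstract.
    # Assumptions (not confirmed by the paper): the `lvlm_logits` model
    # interface, mean aggregation, and the (1 + alpha) * agg - alpha * ref
    # contrastive refinement.

    from typing import Callable, List
    import torch

    def focus_next_token_logits(
        lvlm_logits: Callable[[List[torch.Tensor], str], torch.Tensor],
        images: List[torch.Tensor],   # each image as a float (C, H, W) tensor
        prompt: str,
        alpha: float = 1.0,           # contrast strength (assumed hyperparameter)
    ) -> torch.Tensor:
        """Next-token logits with cross-image leakage suppressed."""
        per_image_logits = []
        for keep in range(len(images)):
            # Mask every image except `keep` with random noise, so each
            # forward pass focuses on a single clean image.
            masked = [
                img if i == keep else torch.randn_like(img)
                for i, img in enumerate(images)
            ]
            per_image_logits.append(lvlm_logits(masked, prompt))

        # Aggregate logits from all partially masked contexts
        # (mean is an assumption; the abstract does not specify).
        aggregated = torch.stack(per_image_logits).mean(dim=0)

        # Noise-only reference: every image replaced by random noise.
        noise_only = [torch.randn_like(img) for img in images]
        reference = lvlm_logits(noise_only, prompt)

        # Contrastively refine against the noise-only reference to
        # suppress responses driven by leaked or spurious visual cues.
        return (1.0 + alpha) * aggregated - alpha * reference

In practice, such a correction would be applied at every decoding step, trading one forward pass per image (plus one noise-only pass) for reduced cross-image leakage, which is consistent with the abstract's claim of a training-free, architecture-agnostic method.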

Repos / Data Links

Page Count
9 pages

Category
Computer Science: Computer Vision and Pattern Recognition