Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains
By: Juntian Zhang, Chuanqi Cheng, Yuhan Liu, and more
Potential Business Impact:
Helps computers understand many pictures at once.
Vision-language models (VLMs) achieve remarkable success in single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Through this approach, we construct VISC-150K, a large-scale dataset with reasoning data in the form of Focus-Centric Visual Chains, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising general vision-language capabilities. Our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.
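The abstract does not spell out the data format, but the idea of a Focus-Centric Visual Chain suggests pairing each reasoning step with the specific image it attends to. The sketch below is a rough illustration of what such a training record might look like; all field names (focus_image, observation, chain) are hypothetical assumptions, not the actual VISC-150K schema.

```python
# Hypothetical sketch of a Focus-Centric Visual Chain training record.
# Field names are illustrative assumptions; the paper's actual VISC-150K
# schema is not described in this abstract.

from dataclasses import dataclass


@dataclass
class FocusStep:
    focus_image: int   # index of the image this reasoning step attends to
    observation: str   # the information extracted from that image


@dataclass
class VisualChainSample:
    images: list[str]        # paths or URLs of the multi-image input
    question: str
    chain: list[FocusStep]   # ordered, focus-centric reasoning path
    answer: str


sample = VisualChainSample(
    images=["img_0.jpg", "img_1.jpg", "img_2.jpg"],
    question="Which image shows the same landmark as image 0, at night?",
    chain=[
        FocusStep(0, "Image 0 shows the Eiffel Tower in daylight."),
        FocusStep(1, "Image 1 shows a different tower."),
        FocusStep(2, "Image 2 shows the Eiffel Tower illuminated at night."),
    ],
    answer="Image 2",
)
```

Structuring each step around a single focused image, rather than reasoning over all images at once, is one plausible way to realize the paper's goal of disentangling information scattered across multiple inputs.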
Similar Papers
Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models
Hardware Architecture
Makes AI watch videos faster and use less power.
Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation
CV and Pattern Recognition
Makes computers create clearer pictures from words.
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
CV and Pattern Recognition
Makes AI better at seeing and understanding pictures.