What do vision-language models see in the context? Investigating multimodal in-context learning
By: Gabriel O. dos Santos, Esther Colombini, Sandra Avila
Potential Business Impact:
Helps computers understand pictures and words together better.
In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on image-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
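The ICL setup the abstract describes, conditioning a model on demonstration pairs without any parameter updates, can be illustrated with a minimal prompt-assembly sketch. The `<image>` placeholder token and the "Caption:" formatting here are illustrative assumptions, not the paper's actual prompt template.

```python
# Minimal sketch of few-shot multimodal ICL prompt assembly (assumed format):
# each demonstration interleaves an image placeholder with its reference
# caption, and the query image is appended with an empty caption slot
# that the VLM is expected to complete.

def build_icl_prompt(demonstrations, query_image_token="<image>"):
    """Assemble an interleaved image-text prompt from (image, caption) pairs."""
    parts = []
    for image_token, caption in demonstrations:
        parts.append(f"{image_token} Caption: {caption}")
    # Query slot: the model generates the caption after this prefix.
    parts.append(f"{query_image_token} Caption:")
    return "\n".join(parts)

demos = [
    ("<image>", "A dog running on the beach."),
    ("<image>", "Two children playing chess."),
]
prompt = build_icl_prompt(demos)
print(prompt)
```

Varying the number of demonstration pairs in `demos` corresponds to the "increasing number of in-context demonstrations" whose attention patterns the paper analyzes.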
Similar Papers
Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models
CV and Pattern Recognition
Computers learn better by explaining their answers.
T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs
CV and Pattern Recognition
Helps AI understand different picture tasks together.
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning
CV and Pattern Recognition
Helps AI understand pictures and questions better.