Rethinking Visual Layer Selection in Multimodal LLMs
By: Haoran Chen, Junyan Lin, Xinhao Chen, and more
Potential Business Impact:
Helps computers understand pictures better for different jobs.
Multimodal large language models (MLLMs) have achieved impressive performance across a wide range of tasks, typically using CLIP-ViT as their visual encoder due to its strong text-image alignment capabilities. While prior studies suggest that different CLIP-ViT layers capture different types of information, with shallower layers focusing on fine visual details and deeper layers aligning more closely with textual semantics, most MLLMs still select visual features based on empirical heuristics rather than systematic analysis. In this work, we propose a Layer-wise Representation Similarity approach to group CLIP-ViT layers with similar behaviors into shallow, middle, and deep categories and assess their impact on MLLM performance. Building on this foundation, we revisit the visual layer selection problem in MLLMs at scale, training LLaVA-style models ranging from 1.4B to 7B parameters. Through extensive experiments across 10 datasets and 4 tasks, we find that: (1) deep layers are essential for OCR tasks; (2) shallow and middle layers substantially outperform deep layers on reasoning tasks involving counting, positioning, and object localization; (3) a lightweight fusion of features across shallow, middle, and deep layers consistently outperforms specialized fusion baselines and single-layer selections, achieving gains on 9 out of 10 datasets. Our work offers the first principled study of visual layer selection in MLLMs, laying the groundwork for deeper investigations into visual representation learning for MLLMs.
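The abstract does not spell out the fusion operator, so the following is a minimal PyTorch sketch under stated assumptions: features from one shallow, one middle, and one deep CLIP-ViT layer are concatenated along the channel dimension and mapped into the LLM embedding space by a single linear layer. The layer indices (6, 12, 23), the checkpoint name, and the class name MultiLayerFusion are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a lightweight shallow/middle/deep feature fusion for a
# LLaVA-style MLLM. Assumptions (not from the paper): concat-then-project as
# the fusion operator, layer indices (6, 12, 23), and ViT-L/14-336 weights.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel


class MultiLayerFusion(nn.Module):
    def __init__(self,
                 vision_tower: str = "openai/clip-vit-large-patch14-336",
                 layer_ids: tuple = (6, 12, 23),   # shallow, middle, deep (assumed)
                 llm_dim: int = 4096):
        super().__init__()
        self.vit = CLIPVisionModel.from_pretrained(vision_tower)
        self.layer_ids = layer_ids
        vit_dim = self.vit.config.hidden_size
        # Lightweight fusion: concatenate the selected layers' features and
        # project them into the LLM embedding space with one linear layer.
        self.fuse = nn.Linear(vit_dim * len(layer_ids), llm_dim)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        out = self.vit(pixel_values, output_hidden_states=True)
        # hidden_states[0] is the patch-embedding output; index i is the
        # output of encoder layer i, so layer_ids pick shallow/middle/deep.
        feats = [out.hidden_states[i] for i in self.layer_ids]
        fused = torch.cat(feats, dim=-1)   # (batch, tokens, vit_dim * n_layers)
        return self.fuse(fused)            # (batch, tokens, llm_dim)
```

In a LLaVA-style pipeline, the output of such a module would replace the usual single-layer (typically penultimate-layer) visual feature fed to the vision-language projector; the paper's own fusion may differ in operator and layer choice.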
Similar Papers
How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding
CV and Pattern Recognition
Shows how AI understands pictures and words.
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
CV and Pattern Recognition
Makes AI understand pictures faster and cheaper.
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
CV and Pattern Recognition
Makes AI understand pictures better by picking the best parts.