From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
By: Cheng Chen, Yuyu Guo, Pengpeng Zeng, et al.
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
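The abstract describes CLI's two components at a high level: AMP projects features from several vision-encoder layers into the LLM's embedding space, and AGF lets each LLM layer decide, from its current decoding context, how much of that visual signal to absorb. The sketch below illustrates one plausible reading of that design; the class names, projection-per-layer layout, cross-attention plus sigmoid-gate formulation, and all dimensions are assumptions for illustration, not the authors' exact implementation.

```python
# Illustrative sketch of the Cross-Layer Injection (CLI) idea from the abstract.
# Module names, the gating formulation, and all dimensions are assumptions,
# not the paper's actual architecture.
import torch
import torch.nn as nn


class AdaptiveMultiProjection(nn.Module):
    """Projects hidden states from several vision-encoder layers into the LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, num_vision_layers: int):
        super().__init__()
        # One lightweight projection per selected vision layer (assumed design).
        self.projections = nn.ModuleList(
            nn.Linear(vision_dim, llm_dim) for _ in range(num_vision_layers)
        )

    def forward(self, vision_feats: list[torch.Tensor]) -> list[torch.Tensor]:
        # vision_feats[i]: (batch, num_patches, vision_dim) from the i-th chosen vision layer.
        return [proj(feat) for proj, feat in zip(self.projections, vision_feats)]


class AdaptiveGatingFusion(nn.Module):
    """Lets an LLM layer pull in visual tokens, weighted by its current decoding context."""

    def __init__(self, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        # Scalar gate conditioned on the text hidden state (assumed gating form).
        self.gate = nn.Sequential(nn.Linear(llm_dim, 1), nn.Sigmoid())

    def forward(self, text_hidden: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden:   (batch, seq_len, llm_dim) hidden states of one LLM layer
        # visual_tokens: (batch, num_patches, llm_dim) projected features from one vision layer
        attended, _ = self.cross_attn(text_hidden, visual_tokens, visual_tokens)
        g = self.gate(text_hidden)          # (batch, seq_len, 1), values in [0, 1]
        return text_hidden + g * attended   # inject only as much visual signal as the gate allows


if __name__ == "__main__":
    amp = AdaptiveMultiProjection(vision_dim=1024, llm_dim=4096, num_vision_layers=3)
    agf = AdaptiveGatingFusion(llm_dim=4096)
    vision_feats = [torch.randn(2, 576, 1024) for _ in range(3)]  # three vision layers
    text_hidden = torch.randn(2, 32, 4096)                        # one LLM layer's hidden states
    projected = amp(vision_feats)
    fused = agf(text_hidden, projected[0])  # inject one projected vision layer into this LLM layer
    print(fused.shape)                      # torch.Size([2, 32, 4096])
```

In this reading, the "many-to-many" bridge comes from repeating the gated fusion at multiple LLM layers, each attending to whichever projected vision layer it finds most useful, while the gate keeps the injection parameter-efficient and context-dependent.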