DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding
By: Moulik Choraria, Xinbo Wu, Akhil Bhimaraju, and more
Potential Business Impact:
Makes AI learn faster and use less power.
The hyperscaling of data and parameter counts in transformer models is yielding diminishing performance improvements, especially when weighed against training costs. This plateauing underlines a growing need for more efficient finetuning and inference without sacrificing performance. The need is particularly pressing for multimodal learning, where the overhead of processing multimodal tokens alongside language data often limits the practical viability of these systems. In parallel, advances in representation learning and interpretability have deepened our understanding of how such models process and encode information. Notably, recent work has uncovered implicit cross-modal alignment in the deeper layers of large pretrained models. This aligns with our own observation that models naturally defer most cross-modal token interactions to deeper stages of computation. Building on this, we propose a simple modification: instead of concatenating multimodal tokens with the language prompt at the input, we insert them directly into the middle layers of the model, allowing them to bypass the early layers entirely. Our results across diverse modalities: 1) LLaVA & BLIP for vision, 2) LTU for audio, and 3) MolCA for molecular data, indicate that our method reduces computational costs during both training and inference while preserving, and in some cases surpassing, the performance of existing baselines. Our work has important implications for scaling and composing pretrained models in a resource-efficient manner.
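To make the mechanism concrete, the sketch below shows the early-layer-bypass idea in PyTorch: language tokens enter at layer 0 as usual, while multimodal tokens (already projected into the model's embedding space by a modality encoder) are concatenated into the hidden states only at a middle layer. This is a minimal illustration, not the paper's implementation; the class name `DeepInsertSketch`, the layer count, and the insertion depth `insert_at` are illustrative assumptions.

```python
# Minimal sketch of early-layer bypass (illustrative, not the authors' code).
import torch
import torch.nn as nn

class DeepInsertSketch(nn.Module):
    def __init__(self, d_model=512, n_layers=12, insert_at=6):
        super().__init__()
        self.insert_at = insert_at  # layer index where multimodal tokens join
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, text_tokens, mm_tokens):
        # text_tokens: (B, T_text, d); mm_tokens: (B, T_mm, d), assumed already
        # projected into the language model's embedding space.
        h = text_tokens
        for i, layer in enumerate(self.layers):
            if i == self.insert_at:
                # Multimodal tokens skip layers [0, insert_at) entirely and are
                # concatenated with the language hidden states mid-stack.
                h = torch.cat([mm_tokens, h], dim=1)
            h = layer(h)
        return h

# Usage: text is processed from layer 0; vision/audio/molecule tokens join at layer 6.
model = DeepInsertSketch()
text = torch.randn(2, 16, 512)   # e.g., embedded language prompt
mm = torch.randn(2, 32, 512)     # e.g., projected vision tokens
out = model(text, mm)            # (2, 48, 512)
```

The compute saving follows directly: the (often long) multimodal token sequence participates in attention and feedforward computation only in the deeper half of the stack, rather than in every layer as with input-level concatenation.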
Similar Papers
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
Computation and Language
Makes AI understand pictures and words together better.
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Computer Vision and Pattern Recognition
Makes AI understand pictures faster and cheaper.
Rethinking Visual Layer Selection in Multimodal LLMs
Computer Vision and Pattern Recognition
Helps computers understand pictures better for different jobs.