Score: 2

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

Published: October 9, 2025 | arXiv ID: 2510.08470v1

By: Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, and more

Potential Business Impact:

Enables models that jointly understand images and text to be trained on far less data, lowering the cost of building multimodal systems.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Training vision-language models on cognitively plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
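To make the core idea concrete, here is a minimal sketch of token-wise dynamic gating, assuming a single global image embedding per example (the Challenge setup the abstract describes). All names, dimensions, and the exact gate parameterisation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of token-wise dynamic gating for text/vision fusion.
# Assumes one global image embedding per example; not the paper's exact code.
import torch
import torch.nn as nn

class TokenWiseDynamicGate(nn.Module):
    def __init__(self, d_model: int, d_visual: int):
        super().__init__()
        # Project the global image embedding into the text hidden space.
        self.visual_proj = nn.Linear(d_visual, d_model)
        # One scalar gate per token, computed from the token state and the visual cue.
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, text_hidden: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, d_model); image_emb: (batch, d_visual)
        vis = self.visual_proj(image_emb).unsqueeze(1)      # (batch, 1, d_model)
        vis = vis.expand(-1, text_hidden.size(1), -1)       # broadcast to every token
        g = torch.sigmoid(self.gate(torch.cat([text_hidden, vis], dim=-1)))  # (batch, seq_len, 1)
        # g near 1 leans on visual cues (e.g. content words);
        # g near 0 leans on linguistic cues (e.g. function words).
        return g * vis + (1.0 - g) * text_hidden
```

Because the gate value is a per-token scalar, inspecting it directly is what makes the content-word versus function-word pattern reported in the abstract readable off the model without extra supervision.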

Country of Origin
🇬🇧 United Kingdom

Repos / Data Links

Page Count
26 pages

Category
Computer Science: Artificial Intelligence