Online In-Context Distillation for Low-Resource Vision Language Models

Published: October 20, 2025 | arXiv ID: 2510.18117v1

By: Zhiqi Kang, Rahaf Aljundi, Vaggelis Dorovatas, and more

Potential Business Impact:

Helps small AI models understand images better at lower cost.

Business Areas:
Visual Search, Internet Services

As the field continues its push for ever more resources, this work turns the spotlight on a critical question: how can vision-language models (VLMs) be adapted to thrive in low-resource, budget-constrained settings? While large VLMs offer strong performance, they are impractical to deploy in such settings. Small VLMs, on the other hand, are efficient but typically require costly fine-tuning to close the performance gap with larger models in the deployment domain. Inspired by the in-context learning (ICL) framework, we propose an online In-Context Distillation (ICD) method, in which a small VLM collaborates with a stronger teacher model at inference time, distilling its knowledge via sparse demonstrations to efficiently bridge the gap between them. Our method is built on an in-depth analysis that identifies the scale and the choice of models for which vision-language ICL is currently feasible, and demonstrates the advantage of ICL over fine-tuning under constrained compute budgets. We enhance our method with a novel cross-modal demonstration selection strategy, teacher test-time scaling to reduce noise, and student uncertainty conditioning to dynamically populate a demonstration pool and minimize teacher queries. Our ICD method significantly boosts the performance of small models (by up to 33%) using scarce teacher annotations (as low as 4%), and competes with the teacher's zero-shot performance.
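The abstract outlines an inference-time loop: cross-modal demonstration selection, teacher test-time scaling, and uncertainty-gated teacher queries that populate a growing demonstration pool. Below is a minimal Python sketch of that loop, assuming hypothetical interfaces; `embed`, `student`, `teacher`, and the threshold `tau` are illustrative stand-ins, not the paper's actual components.

```python
import random

def embed(sample):
    # Placeholder cross-modal embedding (image + text -> vector);
    # a real system would use a joint vision-language encoder.
    rng = random.Random(hash(sample))
    return [rng.random() for _ in range(8)]

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den if den else 0.0

def select_demonstrations(pool, query, k=4):
    # Cross-modal demonstration selection: rank pooled teacher-labeled
    # examples by embedding similarity to the incoming query.
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(embed(d["input"]), q), reverse=True)
    return ranked[:k]

def teacher_annotate(teacher, sample, n_votes=5):
    # Teacher test-time scaling: sample the teacher several times and keep
    # the majority answer to reduce annotation noise.
    votes = [teacher(sample) for _ in range(n_votes)]
    return max(set(votes), key=votes.count)

def online_icd(stream, student, teacher, tau=0.5, k=4):
    # Student uncertainty conditioning: the teacher is queried only when
    # the student's confidence falls below tau, keeping annotations sparse.
    pool, outputs, n_queries = [], [], 0
    for sample in stream:
        demos = select_demonstrations(pool, sample, k)
        answer, confidence = student(sample, demos)  # in-context inference
        if confidence < tau:
            answer = teacher_annotate(teacher, sample)
            pool.append({"input": sample, "label": answer})  # grow the pool
            n_queries += 1
        outputs.append(answer)
    return outputs, n_queries

# Toy run with stub models: the student grows more confident as the
# demonstration pool fills, so teacher queries taper off.
if __name__ == "__main__":
    stream = [f"img_{i}" for i in range(20)]
    student = lambda s, demos: ("cat", min(1.0, 0.2 + 0.15 * len(demos)))
    teacher = lambda s: "cat"
    preds, n_queries = online_icd(stream, student, teacher)
    print(f"teacher queried on {n_queries}/{len(stream)} samples")
```

In this stub run the student's confidence rises with pool size, so teacher queries stop after the first few samples, mirroring the sparse-annotation regime (teacher labels on only a few percent of samples) that the abstract reports.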

Page Count
29 pages

Category
Computer Science:
Computer Vision and Pattern Recognition