Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
By: Sukrit Sriratanawilai, Jhayahgrit Thongwat, Romrawin Chumpu, and more
Potential Business Impact:
Makes AI understand many languages better, even when the model is made smaller.
Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While knowledge distillation (KD) has shown promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingual settings remains underexplored. This paper presents a controlled empirical study of KD behavior across five distillation formulations, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We apply these formulations to CLIP and SigLIP2 and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, while others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
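To make the setting concrete, below is a minimal sketch of one common KD formulation for CLIP-style dual encoders: matching the teacher's image-text similarity distribution with a KL divergence. The paper studies five formulations that may differ from this one; the function name, the temperature value, and the assumption of precomputed, batch-aligned embeddings are illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def similarity_kd_loss(student_img, student_txt, teacher_img, teacher_txt, tau=2.0):
    """KL distillation between teacher and student image-text similarity matrices.

    All inputs are (batch, dim) embedding tensors from paired image-caption
    batches. Embeddings are L2-normalized so dot products are cosine
    similarities; `tau` is a softening temperature (hypothetical default).
    """
    s_img = F.normalize(student_img, dim=-1)
    s_txt = F.normalize(student_txt, dim=-1)
    t_img = F.normalize(teacher_img, dim=-1)
    t_txt = F.normalize(teacher_txt, dim=-1)

    # (batch, batch) image-to-text similarity logits for student and teacher.
    s_logits = s_img @ s_txt.t() / tau
    t_logits = t_img @ t_txt.t() / tau

    # Match the student's row-wise distribution to the teacher's;
    # the tau**2 factor keeps gradient scale comparable across temperatures.
    return F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
```

In practice this term would be added to the student's usual contrastive objective; multilingual variants typically apply it over captions in many languages so that the compressed student inherits the teacher's cross-lingual alignment rather than only its English behavior.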
Similar Papers
When Better Teachers Don't Make Better Students: Revisiting Knowledge Distillation for CLIP Models in VQA
CV and Pattern Recognition
Makes smart AI models smaller and faster.
Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions
Computation and Language
Makes big AI models smaller and faster.
EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens
CV and Pattern Recognition
Makes AI understand pictures better without using more power.