Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text

Published: January 15, 2026 | arXiv ID: 2601.10096v1

By: Piyush Singh Pasi

Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely heavily on machine translation, while advances in multilingual text modeling remain underutilized. We introduce METAL, a lightweight alignment method that learns only a few linear layers using English text alone to map multilingual text embeddings into a multimodal space. Despite its simplicity, METAL matches baseline performance in English (94.9% Recall@10) and achieves strong zero-shot transfer (89.5% Recall@10 averaged across 11 languages, 10 unseen) on XTD text-to-image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, METAL generalizes to audio-text retrieval and cross-lingual text-to-image generation. We release code and checkpoints at https://github.com/m2m-codebase/M2M, as well as multilingual evaluation datasets including MSCOCO Multilingual 30K (https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k), AudioCaps Multilingual (https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual), and Clotho Multilingual (https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual), to facilitate further research.
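The retrieval scores in the abstract are reported as Recall at 10: the fraction of text queries whose matching image appears among the ten highest-scoring candidates. The sketch below illustrates that metric on toy vectors. It is not taken from the released code. The function names, the use of cosine similarity, and the convention that caption i pairs with image i are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(text_embs, image_embs, k=10):
    """Fraction of captions whose paired image ranks in the top k.

    Assumption (hypothetical setup): caption i is paired with image i,
    and both lists already live in the same embedding space.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, image) for image in image_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(image_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy 2-D embeddings, invented purely to exercise the function.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.8, 0.6], [0.7, 0.7]]

print(recall_at_k(text_embs, image_embs, k=1))
print(recall_at_k(text_embs, image_embs, k=3))
```

With three candidates, k=3 always retrieves the paired image, so the second call only confirms the upper bound. On a real benchmark the candidate pool is far larger than k.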

CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning

Computation and Language

Adds new senses to AI without retraining.

29 Nov 2025 1

MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

Computation and Language

Translates speech to text in 70 languages faster.

1 Dec 2025 2

MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Computation and Language

Tests AI that understands talking, seeing, and reading.

25 Jul 2025 2
