Towards Language-Independent Face-Voice Association with Multimodal Foundation Models
By: Aref Farhadipour, Teodora Vukovic, Volker Dellwo
Potential Business Impact:
Lets computers match faces to voices, even in new languages.
This paper describes the UZH-CL system submitted to the FAME2026 Challenge. The challenge focuses on cross-modal verification under unique multilingual conditions, specifically unseen and unheard languages. Our approach investigates two distinct architectures: a baseline dual-encoder system trained from scratch with contrastive and orthogonal projection losses, and a foundation-model approach that fine-tunes ImageBind with LoRA. To address the data scarcity and language constraints of the challenge, we curated an external Arabic dataset from VoxBlink. Our best-performing system, ImageBind-LoRA, demonstrates remarkable cross-lingual generalization: despite being fine-tuned exclusively on Arabic audio, it achieves an equal error rate (EER) of 24.73% on the English and German evaluation set, securing 2nd place in the competition.
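The abstract names two training objectives for the from-scratch dual encoder: a contrastive loss and an orthogonal projection loss. Below is a minimal PyTorch sketch of one common formulation of each; the function names, the temperature value, and the exact loss formulations are illustrative assumptions, since the abstract does not specify the paper's actual definitions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(face_emb, voice_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of paired
    face/voice embeddings (a common choice; the paper's exact
    formulation may differ)."""
    face_emb = F.normalize(face_emb, dim=-1)
    voice_emb = F.normalize(voice_emb, dim=-1)
    # (B, B) matrix of cosine similarities between all face/voice pairs
    logits = face_emb @ voice_emb.t() / temperature
    # Matching pairs sit on the diagonal
    targets = torch.arange(face_emb.size(0), device=face_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def orthogonal_projection_loss(embeddings, labels):
    """Orthogonal-projection-style loss: pull same-identity embeddings
    together and push different-identity embeddings toward orthogonality
    in the shared space. Assumes the batch contains both same- and
    different-identity pairs."""
    embeddings = F.normalize(embeddings, dim=-1)
    sim = embeddings @ embeddings.t()                 # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1) # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos = sim[same & ~eye].mean()                     # intra-identity similarity -> 1
    neg = sim[~same].abs().mean()                     # cross-identity similarity -> 0
    return (1.0 - pos) + neg
```

In a dual-encoder setup, the two terms would typically be summed (possibly weighted) over face and voice embeddings after projection into the shared space; the weighting is another detail the abstract leaves open.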
Similar Papers
Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan
Computer Vision and Pattern Recognition
Matches faces to voices, even across different languages.
Shared Multi-modal Embedding Space for Face-Voice Association
Sound
Matches voices to faces, even in new languages.
English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM
Computation and Language
Helps computers judge and fix speaking mistakes.