Shared Multi-modal Embedding Space for Face-Voice Association
By: Christopher Simic, Korbinian Riedhammer, Tobias Bocklet
Potential Business Impact:
Matches voices to faces, even in new languages.
The FAME 2026 challenge combines two demanding tasks: learning face-voice associations, and doing so in a multilingual setting that includes testing on languages the model was not trained on. Our approach consists of separate uni-modal processing pipelines with general face and voice feature extraction, complemented by additional age and gender feature extraction to support the prediction. The resulting single-modal features are projected into a shared embedding space and trained with an Additive Angular Margin (AAM) loss. Our approach achieved first place in the FAME 2026 challenge, with an average Equal Error Rate (EER) of 23.99%.
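The abstract describes the training setup only at a high level; a minimal PyTorch sketch of that setup follows, in which unimodal face and voice features are projected into a shared, L2-normalized embedding space and trained with an Additive Angular Margin (ArcFace-style) softmax over identity classes. All names, dimensions, and hyperparameters here (SharedProjection, AAMSoftmax, shared_dim=256, margin=0.2, scale=30.0) are illustrative assumptions, not the authors' implementation.

```python
# Sketch (assumed, not the authors' code): project unimodal features into a
# shared embedding space and train with an additive angular margin softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjection(nn.Module):
    """Maps a unimodal feature vector into the shared embedding space."""
    def __init__(self, in_dim: int, shared_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, shared_dim),
            nn.ReLU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-norm embeddings so that similarity is purely angular.
        return F.normalize(self.proj(x), dim=-1)

class AAMSoftmax(nn.Module):
    """Additive Angular Margin (ArcFace-style) classification loss."""
    def __init__(self, shared_dim: int, num_ids: int,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_ids, shared_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity to each identity's class center.
        cos = F.linear(emb, F.normalize(self.weight, dim=-1))
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
        # Add the angular margin only on the true-class logit.
        target = torch.cos(torch.acos(cos) + self.margin)
        onehot = F.one_hot(labels, cos.size(1)).float()
        logits = self.scale * (onehot * target + (1 - onehot) * cos)
        return F.cross_entropy(logits, labels)

# Illustrative dimensions (assumed): 512-d face and 192-d voice features,
# a shared 256-d space, and 100 training identities.
face_proj, voice_proj = SharedProjection(512), SharedProjection(192)
aam = AAMSoftmax(shared_dim=256, num_ids=100)
face_feat, voice_feat = torch.randn(8, 512), torch.randn(8, 192)
ids = torch.randint(0, 100, (8,))
loss = aam(face_proj(face_feat), ids) + aam(voice_proj(voice_feat), ids)
```

At test time, a face-voice pair would be scored by the cosine similarity of its two projected embeddings, and the equal error rate would be computed over these scores.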
Similar Papers
Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan
CV and Pattern Recognition
Matches faces to voices, even across different languages.
Towards Language-Independent Face-Voice Association with Multimodal Foundation Models
Audio and Speech Processing
Lets computers recognize voices in new languages.
Equitable Electronic Health Record Prediction with FAME: Fairness-Aware Multimodal Embedding
Machine Learning (CS)
Makes medical AI fairer for all patients.