Face-voice Association in Multilingual Environments (FAME) 2026 Challenge Evaluation Plan

Published: August 6, 2025 | arXiv ID: 2508.04592v1

By: Marta Moscati, Ahmed Abdullah, Muhammad Saad Saeed and more

Potential Business Impact:

Matches faces to voices, even with different languages.

Advances in technology have led to the use of multimodal systems in a range of real-world applications, and audio-visual systems are among the most widely used of these. In recent years, associating a person's face with their voice has gained attention owing to the unique correlation between the two. The Face-voice Association in Multilingual Environments (FAME) 2026 Challenge focuses on face-voice association under the specific condition of a multilingual scenario. This condition is motivated by the fact that half of the world's population is bilingual and that people often communicate in multilingual settings. The challenge uses the Multilingual Audio-Visual (MAV-Celeb) dataset to explore face-voice association in multilingual environments. This report describes the challenge, the dataset, the baseline models, and the tasks.

Shared Multi-modal Embedding Space for Face-Voice Association

Sound

Matches voices to faces, even in new languages.

4 Dec 2025

90%

Towards Language-Independent Face-Voice Association with Multimodal Foundation Models

Audio and Speech Processing

Lets computers recognize voices in new languages.

2 Dec 2025

87%

A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages

CV and Pattern Recognition

Makes talking videos work in many languages.

8 Oct 2025

Page Count

4 pages
