Adaptive Multimodal Person Recognition: A Robust Framework for Handling Missing Modalities
By: Aref Farhadipour, Teodora Vukovic, Volker Dellwo and more
Potential Business Impact:
Identifies people even when voice, face, or gesture input is missing.
Person recognition systems often rely on audio, visual, or behavioral cues, but real-world conditions frequently result in missing or degraded modalities. To address this challenge, we propose a trimodal person identification framework that integrates voice, face, and gesture modalities while remaining robust to modality loss. Our approach uses multi-task learning to process each modality independently, followed by cross-attention and gated fusion mechanisms that let the modalities interact. A confidence-weighted fusion strategy dynamically adapts to missing and low-quality data, so that classification remains reliable even in unimodal or bimodal scenarios. We evaluate our method on CANDOR, a newly introduced interview-based multimodal dataset, which we benchmark for the first time. The proposed trimodal system achieves 99.18% Top-1 accuracy on person identification, outperforming conventional unimodal and late-fusion approaches. On the VoxCeleb1 benchmark, the model reaches 99.92% accuracy in bimodal mode. The system also maintains high accuracy when one or two modalities are unavailable, making it a robust solution for real-world person recognition applications. The code and data for this work are publicly available.
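The idea of confidence-weighted fusion under missing modalities can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`softmax`, `fuse`, `predict`), the use of the maximum softmax probability as the confidence score, and the convention that a missing modality is passed as `None` are all assumptions made for illustration. The sketch also omits the cross-attention and gated fusion stages and only shows how per-modality class scores might be combined.

```python
import math
from typing import Dict, List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(scores: Dict[str, Optional[List[float]]]) -> List[float]:
    """Confidence-weighted average of per-modality class probabilities.

    Assumptions (hypothetical, not from the paper):
    - each modality supplies raw class logits, or None if it is missing;
    - a modality's confidence is its maximum softmax probability.
    """
    weighted: Optional[List[float]] = None
    total_conf = 0.0
    for logits in scores.values():
        if logits is None:  # missing modality: skip it entirely
            continue
        probs = softmax(logits)
        conf = max(probs)
        if weighted is None:
            weighted = [0.0] * len(probs)
        for i, p in enumerate(probs):
            weighted[i] += conf * p
        total_conf += conf
    if weighted is None:
        raise ValueError("at least one modality must be present")
    # Normalise by the summed confidence so the result is a distribution.
    return [w / total_conf for w in weighted]


def predict(scores: Dict[str, Optional[List[float]]]) -> int:
    """Index of the highest-scoring identity after fusion."""
    fused = fuse(scores)
    return max(range(len(fused)), key=fused.__getitem__)


# Bimodal case: the face stream is unavailable.
example = {
    "voice": [2.0, 0.1, 0.1],
    "face": None,
    "gesture": [0.2, 0.3, 0.1],
}
print(predict(example))
```

In this sketch a confident modality (here, voice) contributes more to the fused distribution than an uncertain one, and a missing modality simply drops out of the weighted sum, so the same function covers trimodal, bimodal, and unimodal inputs.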
Similar Papers
Multi-modal expressive personality recognition in data non-ideal audiovisual based on multi-scale feature enhancement and modal augment
Sound
Computer guesses your personality from voice and face.
Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition
CV and Pattern Recognition
Lets computers understand actions by watching, listening, and feeling.
Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition
CV and Pattern Recognition
Helps computers understand feelings from mixed clues.