Video Joint-Embedding Predictive Architectures for Facial Expression Recognition
By: Lennart Eing, Cristina Luna-Jiménez, Silvan Mertes, et al.
This paper introduces a novel application of Video Joint-Embedding Predictive Architectures (V-JEPAs) to Facial Expression Recognition (FER). Departing from conventional pre-training methods for video understanding that rely on pixel-level reconstruction, V-JEPAs learn by predicting the embeddings of masked regions from the embeddings of unmasked regions. This allows the trained encoder to discard information that is irrelevant to understanding a given video, such as the color of a patch of background pixels. Using a pre-trained V-JEPA video encoder, we train shallow classifiers on the RAVDESS and CREMA-D datasets, achieving state-of-the-art performance on RAVDESS and outperforming all other vision-based methods on CREMA-D (+1.48 WAR). Furthermore, cross-dataset evaluations reveal strong generalization capabilities, demonstrating the potential of purely embedding-based pre-training to advance FER. We release our code at https://github.com/lennarteingunia/vjepa-for-fer.
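For intuition, here is a minimal PyTorch sketch of the two ideas the abstract describes: a JEPA-style objective that predicts the embeddings of masked tokens from the embeddings of unmasked ones (with no pixel decoding), followed by a shallow classifier trained on the frozen encoder's features. All module names, dimensions, and the linear stand-ins for the encoders are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions; the real V-JEPA encoder is a ViT over spatio-temporal patches.
B, T, D, C = 2, 16, 64, 8  # batch, tokens per clip, embedding dim, emotion classes

context_encoder = nn.Linear(D, D)  # stand-in for the video encoder
target_encoder = nn.Linear(D, D)   # an EMA copy of the encoder in the real method
predictor = nn.Linear(D, D)        # predicts masked-token embeddings from context

tokens = torch.randn(B, T, D)                # patch tokens of one video clip
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, T // 2:] = True                      # mask out half of the tokens

# Pre-training objective: predict target embeddings of masked tokens
# from the embeddings of unmasked (context) tokens -- no pixel reconstruction.
ctx = context_encoder(tokens * (~mask).unsqueeze(-1))
with torch.no_grad():
    tgt = target_encoder(tokens)             # targets carry no gradient
jepa_loss = F.l1_loss(predictor(ctx)[mask], tgt[mask])

# Downstream FER: freeze the encoder and train only a shallow probe on top.
clip_embedding = ctx.mean(dim=1).detach()    # pooled clip representation
probe = nn.Linear(D, C)
fer_loss = F.cross_entropy(probe(clip_embedding), torch.randint(0, C, (B,)))
```

Because the loss lives entirely in embedding space, the encoder is never forced to model low-level pixel statistics, which is the property the paper exploits for FER.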