VIBE: Video-Input Brain Encoder for fMRI Response Modeling
By: Daniel Carlstrom Schad, Shrey Dixit, Janis Keck, and more
Potential Business Impact:
Predicts brain activity from a movie's video, audio, and dialogue.
We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase 1 and placing second overall in the Algonauts 2025 Challenge.
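To make the two-stage design concrete, below is a minimal PyTorch sketch of the idea: per-modality features are projected to a shared width, fused across modalities by a small transformer at each time step, and then decoded along the time axis by an attention block with rotary position embeddings before a linear readout to parcels. The layer sizes, number of layers, 1000-parcel output, and names such as VIBESketch and RotarySelfAttention are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of a two-stage (fusion -> temporal prediction) encoder.
# All hyperparameters and class names are illustrative assumptions.
import torch
import torch.nn as nn


def apply_rotary(x: torch.Tensor) -> torch.Tensor:
    """Apply rotary position embeddings to a (batch, heads, time, dim) tensor."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(0, half, device=x.device) / half))
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class RotarySelfAttention(nn.Module):
    """Multi-head self-attention with rotary embeddings on queries and keys."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, T, dim)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.heads, self.head_dim)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        q, k = apply_rotary(q), apply_rotary(k)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, t, -1))


class VIBESketch(nn.Module):
    def __init__(self, modality_dims, dim=256, n_parcels=1000):
        super().__init__()
        # Stage 1: project each modality and fuse the modality tokens per time step.
        self.proj = nn.ModuleList([nn.Linear(d, dim) for d in modality_dims])
        fusion_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        # Stage 2: temporal prediction with rotary attention, then a parcel readout.
        self.temporal = nn.Sequential(RotarySelfAttention(dim), nn.LayerNorm(dim))
        self.head = nn.Linear(dim, n_parcels)

    def forward(self, features):                 # list of (B, T, d_m) feature tensors
        b, t = features[0].shape[:2]
        tokens = torch.stack([p(f) for p, f in zip(self.proj, features)], dim=2)
        fused = self.fusion(tokens.flatten(0, 1)).mean(dim=1).view(b, t, -1)
        return self.head(self.temporal(fused))   # (B, T, n_parcels) fMRI prediction


# Example with two hypothetical modality streams of width 512 and 768.
feats = [torch.randn(2, 30, 512), torch.randn(2, 30, 768)]
print(VIBESketch([512, 768])(feats).shape)       # torch.Size([2, 30, 1000])
```

In the actual model, the per-modality inputs would be embeddings from Qwen2.5, BEATs, Whisper, SlowFast, and V-JEPA, with predictions ensembled across 20 seeds; the sketch only illustrates the fusion-then-temporal-decoding structure described in the abstract.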
Similar Papers
Semantic Matters: Multimodal Features for Affective Analysis
CV and Pattern Recognition
Helps computers understand emotions from voice and face.
VibeVoice Technical Report
Computation and Language
Creates long, natural-sounding conversations with many voices.
Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition
Sound
Helps computers understand emotions even with background noise.