AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization
By: Christos Koutlis, Symeon Papadopoulos
Potential Business Impact:
Finds fake videos by checking if sound and lips match.
As synthetic audio-visual content grows more sophisticated, including subtle malicious manipulations, ensuring the integrity of digital media has become paramount. This work presents a novel approach to temporal localization of deepfakes by leveraging Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations of one modality (e.g., lip movements) from the other (e.g., audio waveform). Cross-modal reconstruction is significantly more challenging in manipulated video segments, which amplifies the reconstruction discrepancies there and provides robust discriminative cues for precise temporal forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC on an in-the-wild experiment. Code is available at https://github.com/mever-team/auvire.
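The idea of turning reconstruction discrepancies into time segments can be illustrated with a toy sketch. This is not the AuViRe implementation (see the repository above for that): the features are synthetic stand-ins for learned speech representations, and the function names, the fixed threshold, and the 25 fps frame rate are all assumptions made for illustration. It computes a per-frame discrepancy, groups consecutive high-discrepancy frames into segments, and includes the temporal IoU that metrics such as AP@0.5 and AP@0.95 threshold on.

```python
import numpy as np

def frame_discrepancy(reconstructed, target):
    """Per-frame L2 distance between two (T, D) feature arrays; returns shape (T,)."""
    return np.linalg.norm(reconstructed - target, axis=1)

def localize_segments(scores, threshold, fps=25.0):
    """Group consecutive frames with score > threshold into (start_sec, end_sec) pairs."""
    flagged = scores > threshold
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], flagged, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)   # first flagged frame of each run
    ends = np.flatnonzero(edges == -1)    # first unflagged frame after each run
    return [(s / fps, e / fps) for s, e in zip(starts, ends)]

def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# Synthetic demo: 100 frames of 8-dimensional "visual" features (assumed sizes).
rng = np.random.default_rng(0)
target = rng.normal(size=(100, 8))

# Pretend the audio-to-visual reconstruction is close on genuine frames
# and far off on frames 40..59, which play the role of a manipulated segment.
offset = np.full(100, 0.01)
offset[40:60] = 1.0
reconstructed = target + offset[:, None]

scores = frame_discrepancy(reconstructed, target)
segments = localize_segments(scores, threshold=1.0)  # threshold is hypothetical
ground_truth = (40 / 25.0, 60 / 25.0)
iou = temporal_iou(segments[0], ground_truth)
```

In this toy setting the manipulated frames are separable with a fixed threshold by construction. A real system would have to learn both the reconstruction and the decision rule from data, so the sketch only conveys the intuition that amplified discrepancies mark the forged span.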
Similar Papers
SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
CV and Pattern Recognition
Spots fake videos by listening to voices.
Investigating self-supervised representations for audio-visual deepfake detection
CV and Pattern Recognition
Finds fake videos by listening and watching.
KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features
Audio and Speech Processing
Finds fake videos by listening to the sound.