Detecting Lip-Syncing Deepfakes: Vision Temporal Transformer for Analyzing Mouth Inconsistencies
By: Soumyya Kanti Datta, Shan Jia, Siwei Lyu
Potential Business Impact:
Finds fake videos where mouths don't match sound.
Deepfakes are AI-generated media in which the original content is digitally altered to create convincing but manipulated images, videos, or audio. Among the various types of deepfakes, lip-syncing deepfakes are some of the most challenging to detect. In these videos, AI models synthesize a person's lip movements to match altered or entirely new audio. Unlike in other types of deepfakes, the artifacts in lip-syncing deepfakes are therefore confined to the mouth region, making them more subtle and thus harder to discern. In this paper, we propose LIPINC-V2, a novel detection framework that combines a vision temporal transformer with multihead cross-attention to detect lip-syncing deepfakes by identifying spatiotemporal inconsistencies in the mouth region. These inconsistencies appear across adjacent frames and persist throughout the video. Our model captures both short-term and long-term variations in mouth movement, which enhances its ability to detect these inconsistencies. Additionally, we created a new lip-syncing deepfake dataset, LipSyncTIMIT, generated using five state-of-the-art lip-syncing models to simulate real-world scenarios. Extensive experiments on our proposed LipSyncTIMIT dataset and two other benchmark deepfake datasets demonstrate that our model achieves state-of-the-art performance. The code and the dataset are available at https://github.com/skrantidatta/LIPINC-V2 .
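To make "multihead cross-attention" concrete, here is a minimal NumPy sketch of the generic mechanism. It is not the LIPINC-V2 implementation. The function name, the random projection weights (stand-ins for learned parameters), the feature sizes, and the pairing of short-term with long-term mouth features as query and context are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multihead_cross_attention(query_feats, context_feats, num_heads, rng):
    """Generic multihead cross-attention (hypothetical helper).

    query_feats:   (Tq, D) features that ask the questions.
    context_feats: (Tk, D) features that are attended to.
    Returns the fused (Tq, D) features and the (H, Tq, Tk) attention weights.
    """
    tq, d = query_feats.shape
    tk, _ = context_feats.shape
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    dh = d // num_heads

    # Random projections stand in for learned weights (assumption).
    wq, wk, wv, wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

    # Project, then split the feature axis into heads: (H, T, dh).
    q = (query_feats @ wq).reshape(tq, num_heads, dh).transpose(1, 0, 2)
    k = (context_feats @ wk).reshape(tk, num_heads, dh).transpose(1, 0, 2)
    v = (context_feats @ wv).reshape(tk, num_heads, dh).transpose(1, 0, 2)

    # Scaled dot-product attention: queries from one stream, keys/values from the other.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (H, Tq, Tk)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                               # (H, Tq, dh)

    # Merge the heads back into one feature vector per query position.
    merged = heads.transpose(1, 0, 2).reshape(tq, d)
    return merged @ wo, weights


rng = np.random.default_rng(0)
short_term = rng.standard_normal((5, 32))  # assumed: features of adjacent mouth frames
long_term = rng.standard_normal((8, 32))   # assumed: features of frames spread over the video
fused, attn = multihead_cross_attention(short_term, long_term, num_heads=4, rng=rng)
print(fused.shape, attn.shape)  # (5, 32) (4, 5, 8)
```

Each row of `attn` sums to 1, so every short-term position forms a weighted mixture of the long-term positions. That is one plausible way for a detector to compare local mouth motion against the rest of the video.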
Similar Papers
Audio-Visual Deepfake Detection With Local Temporal Inconsistencies
CV and Pattern Recognition
Spots fake videos by checking sound and video match.
OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
CV and Pattern Recognition
Makes talking videos match the sound perfectly.
SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild
CV and Pattern Recognition
Makes videos speak any language perfectly.