Audio-Visual Cross-Modal Compression for Generative Face Video Coding
By: Youmin Xu, Mengxi Guo, Shijie Zhao, and more
Generative face video coding (GFVC) is vital for modern applications like video conferencing, yet existing methods primarily focus on video motion while neglecting the significant bitrate contribution of audio. Despite the well-established correlation between audio and lip movements, this cross-modal coherence has not been systematically exploited for compression. To address this, we propose an Audio-Visual Cross-Modal Compression (AVCC) framework that jointly compresses audio and video streams. Our framework extracts motion information from video and tokenizes audio features, then aligns them through a unified audio-video diffusion process. This allows synchronized reconstruction of both modalities from a shared representation. In extremely low-rate scenarios, AVCC can even reconstruct one modality from the other. Experiments show that AVCC significantly outperforms the Versatile Video Coding (VVC) standard and state-of-the-art GFVC schemes in rate-distortion performance, paving the way for more efficient multimodal communication systems.
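The core idea in the abstract — project both modalities into one shared low-rate representation, then reconstruct each modality from it — can be caricatured in a few lines. Everything below is hypothetical: the dimensions are invented, the learned motion/audio encoders are replaced by random linear maps, and the unified audio-video diffusion decoder is stood in for by a simple least-squares decode. It is a sketch of the shared-representation principle, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
N_FRAMES, MOTION_DIM = 8, 16   # per-frame video motion features
AUDIO_DIM = 12                 # per-frame audio token features
LATENT_DIM = 6                 # shared low-rate latent

# Random linear "encoders" stand in for the learned networks.
W_motion = rng.standard_normal((MOTION_DIM, LATENT_DIM))
W_audio = rng.standard_normal((AUDIO_DIM, LATENT_DIM))

def encode(motion, audio):
    """Fuse both modalities into one shared latent per frame
    (stand-in for the unified audio-video alignment step)."""
    return 0.5 * (motion @ W_motion + audio @ W_audio)

def decode(z):
    """Reconstruct both modalities from the shared latent via
    pseudo-inverses (stand-in for the diffusion-based decoder)."""
    motion_hat = z @ np.linalg.pinv(W_motion)
    audio_hat = z @ np.linalg.pinv(W_audio)
    return motion_hat, audio_hat

motion = rng.standard_normal((N_FRAMES, MOTION_DIM))
audio = rng.standard_normal((N_FRAMES, AUDIO_DIM))

z = encode(motion, audio)
motion_hat, audio_hat = decode(z)

# The single shared latent is much smaller than the two raw streams,
# which is where the rate saving would come from.
print(motion.size + audio.size, "->", z.size)
```

Note that `decode` recovers both streams from `z` alone, which mirrors the abstract's claim that in extremely low-rate scenarios one modality can be reconstructed from the shared representation without its own dedicated bits.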
Similar Papers
Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding
CV and Pattern Recognition
Makes video calls look better with less data.
Audio-Visual Driven Compression for Low-Bitrate Talking Head Videos
Image and Video Processing
Makes talking videos smaller, clearer, and in sync.
Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning
CV and Pattern Recognition
Fixes blurry videos using sound and face shapes.