Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup
By: Seokun Kang, Taehwan Kim
Potential Business Impact:
Teaches computers to understand videos using sight and sound.
Video action recognition is a challenging but important task for understanding what happens in a video. However, acquiring annotations for videos is costly, and semi-supervised learning (SSL) has been studied to improve performance with only a small amount of labeled data. Prior studies on semi-supervised video action recognition have mostly focused on a single modality, the visuals, but video is inherently multi-modal, so utilizing both visuals and audio should improve performance further; this direction has not been explored well. Therefore, we propose audio-visual SSL for video action recognition, which uses visuals and audio together even with very limited labeled data, a challenging setting. In addition, to maximize the information from both modalities, we propose a novel audio source localization-guided mixup method that considers inter-modal relations between the video and audio modalities. In experiments on the UCF-51, Kinetics-400, and VGGSound datasets, our model demonstrates the superior performance of the proposed semi-supervised audio-visual action recognition framework and of the audio source localization-guided mixup.
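To make the core idea of localization-guided mixup concrete, here is a minimal sketch in PyTorch. It assumes a per-pixel audio source localization map `loc_map` in [0, 1] for sample A and blends two labeled/pseudo-labeled samples with it; the tensor shapes, the use of the map's mean as the label mixing ratio, and the function name are illustrative assumptions, not the authors' exact formulation.

```python
import torch


def localization_guided_mixup(video_a, video_b, audio_a, audio_b,
                              label_a, label_b, loc_map):
    """Hypothetical sketch of audio source localization-guided mixup.

    Assumes `loc_map` has shape [T, 1, H, W] with values in [0, 1],
    marking where sample A's sound source appears in its frames.
    This is an illustrative assumption, not the paper's exact method.
    """
    # Blend frames: keep sample A where its sound source is localized,
    # and fill the remaining spatial regions with sample B.
    mixed_video = loc_map * video_a + (1.0 - loc_map) * video_b

    # Use the average localization mass as a global mixing ratio so the
    # audio and label mixing stay consistent with the visual blend.
    lam = loc_map.mean()
    mixed_audio = lam * audio_a + (1.0 - lam) * audio_b

    # Soft label proportional to how much of each sample survives the mix.
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_video, mixed_audio, mixed_label
```

In this sketch, the sounding region of one clip is preserved while the background is replaced, so the mixed sample keeps the inter-modal correspondence between what is seen and what is heard.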
Similar Papers
Improving Sound Source Localization with Joint Slot Attention on Image and Audio
CV and Pattern Recognition
Finds where sounds come from in pictures.
Multimodal Fusion with Semi-Supervised Learning Minimizes Annotation Quantity for Modeling Videoconference Conversation Experience
Audio and Speech Processing
Helps video calls feel more natural and fun.
Learning to Highlight Audio by Watching Movies
CV and Pattern Recognition
Makes videos sound better by watching them.