MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors
By: Binhua Huang, Ni Wang, Arjun Pakrashi and more
Potential Business Impact:
Lets computers understand videos faster and cheaper.
Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models such as Contrastive Language-Image Pre-training (CLIP) offer strong zero-shot capabilities on static images, while motion vectors (MVs) provide efficient temporal information directly from compressed video streams. To combine the strengths of both, we propose MoCLIP-Lite, a simple two-stream late-fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MVs. During fusion, both backbones are frozen and only a small Multi-Layer Perceptron (MLP) head is trained, which keeps the training cost low. On the UCF101 dataset, our method achieves 89.2% Top-1 accuracy, outperforming the zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, efficient baseline for video understanding that bridges the gap between large static image models and dynamic, low-cost motion cues. Our code and models are available at https://github.com/microa/MoCLIP-Lite.
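To make the late-fusion idea concrete, here is a minimal sketch of the forward pass of such a fusion head, written in plain Python rather than taken from the authors' repository. The abstract only states that features from two frozen backbones feed a small MLP head. The rest is assumed for illustration: fusion by concatenation, L2 normalization of each stream, one hidden layer, and the feature sizes (512 for the CLIP stream, 128 for the MV stream, 64 hidden units). Random vectors stand in for the real backbone outputs, and `FusionHead` is a hypothetical name. The only figure tied to the abstract is the 101 output classes, matching UCF101.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, not from the paper)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def linear(x, weights, bias):
    """Dense layer: weights holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(rng, n_in, n_out):
    """Small uniform random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class FusionHead:
    """Hypothetical two-layer MLP over concatenated CLIP and MV features."""

    def __init__(self, clip_dim, mv_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(rng, clip_dim + mv_dim, hidden_dim)
        self.w2, self.b2 = init_linear(rng, hidden_dim, num_classes)

    def forward(self, clip_feat, mv_feat):
        # Late fusion (assumed form): normalize each stream, then concatenate.
        fused = l2_normalize(clip_feat) + l2_normalize(mv_feat)
        hidden = relu(linear(fused, self.w1, self.b1))
        return softmax(linear(hidden, self.w2, self.b2))


# Random stand-ins for the outputs of the two frozen backbones.
rng = random.Random(1)
clip_feat = [rng.gauss(0, 1) for _ in range(512)]  # assumed CLIP feature size
mv_feat = [rng.gauss(0, 1) for _ in range(128)]    # assumed MV feature size

head = FusionHead(clip_dim=512, mv_dim=128, hidden_dim=64, num_classes=101)
probs = head.forward(clip_feat, mv_feat)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

In a setup like this, only the head's weights (`w1`, `b1`, `w2`, `b2`) would be updated during training. Both feature extractors stay fixed, so their outputs could in principle be computed once and cached.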
Similar Papers
MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation
CV and Pattern Recognition
Makes computer characters move like real people.
AnimalMotionCLIP: Embedding motion in CLIP for Animal Behavior Analysis
CV and Pattern Recognition
Helps computers understand animal movements and actions.
MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
CV and Pattern Recognition
Makes phone apps understand videos faster.