UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
By: Hewen Pan, Cong Wei, Dashuang Liang, and more
Potential Business Impact:
Lets one computer model understand videos at many levels: whole scenes, exact moments, and specific objects.
With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been developed to perform both holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks and fail to achieve comprehensive, multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel, and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates a textual response, temporal localization, or grounded mask as appropriate. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct UFVideo-Bench, which consists of three distinct collaborative tasks spanning these scales and demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model on 9 public benchmarks covering common video understanding tasks, providing valuable insights for future Video LLMs.
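The abstract describes one model that, depending on the task, returns a textual answer (global scale), a time span (temporal scale), or a grounded mask (pixel scale). The paper's implementation is not shown here; the following is a minimal hypothetical Python sketch of such a unified multi-task interface. All names (UnifiedVideoLLM, UnifiedOutput) and the placeholder encoding are our own assumptions, not the authors' code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

@dataclass
class UnifiedOutput:
    text: Optional[str] = None                   # global scale: textual response
    span: Optional[Tuple[float, float]] = None   # temporal scale: (start_s, end_s)
    mask: Optional[np.ndarray] = None            # pixel scale: per-frame mask (T, H, W)

class UnifiedVideoLLM:
    """Toy stand-in for a single model serving global, temporal, and pixel tasks."""

    def encode(self, frames: np.ndarray, prompt: str) -> np.ndarray:
        # Placeholder joint visual-language encoding: mean-pool the frames and
        # fold a crude prompt signature into the same feature space.
        vis = frames.reshape(frames.shape[0], -1).mean(axis=0)
        txt = np.full_like(vis, (hash(prompt) % 97) / 97.0)
        return vis + txt

    def generate(self, frames: np.ndarray, prompt: str, task: str) -> UnifiedOutput:
        feat = self.encode(frames, prompt)
        if task == "caption":        # global scale -> textual response
            return UnifiedOutput(text=f"(caption; feat norm={np.linalg.norm(feat):.2f})")
        if task == "ground_time":    # temporal scale -> localized time span
            return UnifiedOutput(span=(1.0, 3.5))
        if task == "segment":        # pixel scale -> grounded mask per frame
            t, h, w = frames.shape[:3]
            return UnifiedOutput(mask=np.zeros((t, h, w), dtype=bool))
        raise ValueError(f"unknown task: {task}")

video = np.random.rand(8, 32, 32, 3)  # 8 dummy RGB frames
model = UnifiedVideoLLM()
print(model.generate(video, "What happens in the clip?", task="caption").text)
print(model.generate(video, "When does the dog jump?", task="ground_time").span)
print(model.generate(video, "Segment the dog.", task="segment").mask.shape)
```

The point of the sketch is the shared encoder feeding three output heads, which mirrors the "single model, multiple scales" claim; a real system would replace the placeholder encoding and constant outputs with learned components.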
Similar Papers
UniVideo: Unified Understanding, Generation, and Editing for Videos
CV and Pattern Recognition
Makes videos from words, pictures, and edits them.
VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models
CV and Pattern Recognition
Helps computers understand fast actions in videos.
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
CV and Pattern Recognition
Finds exact moments in videos from descriptions.