VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding
By: Zhihong Zhang, Xiaojian Huang, Jin Xu, and more
Potential Business Impact:
Helps test how well AI judges answers about videos.
Multimodal reward models (MRMs) play a crucial role in the training, inference, and evaluation of Large Vision-Language Models (LVLMs) by assessing response quality. However, existing benchmarks for evaluating MRMs in the video domain suffer from a limited number and diversity of questions, a lack of comprehensive evaluation dimensions, and inadequate coverage of diverse types of MRMs. To address these gaps, we introduce VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. Through our AI-assisted data pipeline, we curate a high-quality preference dataset of 1,563 annotated samples, including 1,482 unique videos and 1,559 distinct questions, 15 times the number found in the most question-rich prior benchmark. Each sample is a triplet consisting of a video-text prompt, a chosen response, and a rejected response. We also conduct a comprehensive evaluation across 28 multimodal reward models spanning three categories: generative, discriminative, and semi-scalar. Results show that even the top-performing model, GPT-4o, achieves only 57.0% overall accuracy, and the state-of-the-art open-source model Qwen2.5-VL-72B reaches merely 53.3%. Our analysis further reveals three key insights: (i) MRMs trained with reinforcement learning (RL) do not necessarily exhibit stronger cross-modal generalization than those trained without RL; (ii) except for discriminative MRMs, all other types of MRMs across varying model capacities can benefit from inference-time scaling; and (iii) variations in the input video frame count affect different types of MRMs differently. We believe VideoRewardBench offers a challenging and valuable benchmark for advancing the evaluation and development of MRMs in the video domain.
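The abstract does not spell out the scoring protocol, but the core metric behind the reported numbers, accuracy over preference triplets, is straightforward to sketch: a reward model is correct on a sample when it scores the chosen response above the rejected one. In the minimal Python sketch below, the Triplet field names and the score_fn wrapper are hypothetical stand-ins (not from the paper) for however a given MRM maps a (video, question, response) input to a scalar reward:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Triplet:
        """One VideoRewardBench-style sample (field names are illustrative)."""
        video_path: str   # path to the video clip
        question: str     # the video-text prompt
        chosen: str       # the preferred response
        rejected: str     # the dispreferred response

    def preference_accuracy(
        triplets: List[Triplet],
        score_fn: Callable[[str, str, str], float],
    ) -> float:
        """Fraction of triplets where the MRM ranks chosen above rejected.

        score_fn(video_path, question, response) -> scalar reward; a
        hypothetical wrapper around whatever reward model is under test.
        """
        correct = sum(
            score_fn(t.video_path, t.question, t.chosen)
            > score_fn(t.video_path, t.question, t.rejected)
            for t in triplets
        )
        return correct / len(triplets)

Under this reading, the reported overall accuracy (e.g., 57.0% for GPT-4o) would correspond to preference_accuracy over all 1,563 samples, with per-aspect scores presumably computed on the perception, knowledge, reasoning, and safety subsets.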
Similar Papers
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Computation and Language
Helps AI understand pictures and words together better.
SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
Computer Vision and Pattern Recognition
Tests AI's ability to understand science videos.
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos
Computer Vision and Pattern Recognition
Tests how well AI reasons about what videos imply but don't show.