Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents
By: Shuting Wang, Yunqi Liu, Zixin Yang, and more
Potential Business Impact:
Helps AI make videos to answer questions.
Querying generative AI models, e.g., large language models (LLMs), has become a prevalent method for information acquisition. However, existing query-answer datasets focus primarily on textual responses, making it difficult to address complex user queries that require visual demonstrations or explanations for better understanding. To bridge this gap, we construct a benchmark, RealVideoQuest, designed to evaluate the abilities of text-to-video (T2V) models in answering real-world, visually grounded queries. It identifies 7.5K real user queries with video-response intents from Chatbot-Arena and builds 4.5K high-quality query-video pairs through a multistage video retrieval and refinement process. We further develop a multi-angle evaluation system to assess the quality of generated video answers. Experiments indicate that current T2V models struggle to effectively address real user queries, pointing to key challenges and future research opportunities in multimodal AI.
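The abstract's "multistage video retrieval and refinement" step can be illustrated with a minimal sketch: retrieve candidate videos per query, then filter them by a quality threshold before pairing. All function names, scoring (keyword overlap), and thresholds below are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of a multistage query-video pairing pipeline,
# loosely modeled on the RealVideoQuest construction described above.
# The relevance score (keyword overlap) and quality threshold are
# stand-ins for the paper's unspecified retrieval and refinement stages.

def retrieve_candidates(query, video_index, top_k=3):
    """Stage 1: rank videos by a toy relevance score (caption keyword overlap)."""
    q_terms = set(query.lower().split())
    scored = [
        (len(q_terms & set(meta["caption"].lower().split())), vid)
        for vid, meta in video_index.items()
    ]
    scored.sort(reverse=True)
    return [vid for score, vid in scored[:top_k] if score > 0]

def refine(candidates, video_index, min_quality=0.5):
    """Stage 2: keep only candidates passing a quality threshold."""
    return [v for v in candidates if video_index[v]["quality"] >= min_quality]

def build_pairs(queries, video_index):
    """Pair each query with its top surviving video, if any."""
    pairs = []
    for q in queries:
        kept = refine(retrieve_candidates(q, video_index), video_index)
        if kept:
            pairs.append((q, kept[0]))
    return pairs

# Toy data standing in for the real query/video corpus.
video_index = {
    "vid_01": {"caption": "how to tie a bowline knot", "quality": 0.9},
    "vid_02": {"caption": "cat playing piano", "quality": 0.8},
    "vid_03": {"caption": "tie a tie step by step", "quality": 0.3},
}
queries = ["show me how to tie a knot", "how do I tie a tie"]
print(build_pairs(queries, video_index))
# → [('show me how to tie a knot', 'vid_01'), ('how do I tie a tie', 'vid_01')]
```

In the sketch, vid_03 is retrieved for both queries but dropped at the refinement stage for low quality, mirroring how a real pipeline would discard relevant-but-unusable footage.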
Similar Papers
Can Vision-Language Models Answer Face to Face Questions in the Real-World?
CV and Pattern Recognition
AI talks about live video like a person.
Video-Bench: Human-Aligned Video Generation Benchmark
CV and Pattern Recognition
Tests AI videos to match what people like.
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
CV and Pattern Recognition
Tests how well computers understand videos.