VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models
By: Zefan Zhang, Kehua Zhu, Shijie Jiang, and more
Video Large Language Models (VideoLLMs) exhibit various types of hallucination. Existing research has primarily focused on hallucinations about the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce VERHallu, a novel benchmark for evaluating video event relation hallucination. The benchmark focuses on causal, temporal, and subevent relations between events and comprises three task types: relation classification, question answering, and counterfactual question answering, enabling a comprehensive evaluation of event relation hallucination. It further features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure-language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense event relation reasoning, often relying on prior knowledge because they make insufficient use of frame-level cues. Although these models ground key events well, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To address this, we propose a Key-Frame Propagating (KFP) strategy that reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show that KFP effectively mitigates event relation hallucination without affecting inference speed.
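The abstract only describes KFP at a high level (reallocating frame-level attention in intermediate layers so that frames around the key event are not ignored). The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the tensor shapes, the `propagate_key_frame_attention` function, and the neighbour-spreading rule are all assumptions introduced for illustration.

```python
# Hypothetical sketch: redistribute attention mass from over-attended
# "key" frame tokens to their temporal neighbours inside one intermediate
# layer. Shapes and the propagation rule are assumptions, not the paper's
# actual KFP implementation.
import torch


def propagate_key_frame_attention(attn: torch.Tensor,
                                  top_k: int = 2,
                                  spread: float = 0.3) -> torch.Tensor:
    """attn: (num_queries, num_frames) attention weights over frame tokens
    from one intermediate layer (each row sums to 1).
    top_k: how many key frames per query to treat as over-attended.
    spread: fraction of each key frame's mass moved to its neighbours."""
    attn = attn.clone()
    num_frames = attn.size(-1)
    # Indices of the most-attended (key) frames for each query token.
    key_idx = attn.topk(top_k, dim=-1).indices  # (num_queries, top_k)
    for q in range(attn.size(0)):
        for f in key_idx[q].tolist():
            neighbours = [i for i in (f - 1, f + 1) if 0 <= i < num_frames]
            if not neighbours:
                continue
            moved = attn[q, f] * spread
            attn[q, f] -= moved
            for n in neighbours:
                attn[q, n] += moved / len(neighbours)
    # Renormalise so each query still distributes a total mass of 1.
    return attn / attn.sum(dim=-1, keepdim=True)


if __name__ == "__main__":
    # Toy example: 1 query token, 6 frame tokens, attention peaked on frame 2.
    toy = torch.tensor([[0.02, 0.05, 0.80, 0.05, 0.05, 0.03]])
    print(propagate_key_frame_attention(toy, top_k=1, spread=0.4))
```

Because this operates only on already-computed attention weights inside the forward pass, it would add no extra decoding steps, which is consistent with the abstract's claim that inference speed is unaffected.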
Similar Papers
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
CV and Pattern Recognition
Fixes AI videos that make up wrong stories.
Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding
CV and Pattern Recognition
Teaches computers to watch videos better, without making things up.