MMLongCite: A Benchmark for Evaluating Fidelity of Long-Context Vision-Language Models
By: Keyan Zhou, Zecheng Tang, Lingfeng Ming, and more
Potential Business Impact:
Tests whether AI can faithfully use information from long videos and large sets of images.
The rapid advancement of large vision-language models (LVLMs) has led to a significant expansion of their context windows. However, an extended context window does not guarantee effective utilization of the context, posing a critical challenge for real-world applications. Current evaluations of long-context faithfulness are predominantly focused on the text-only domain, while multimodal assessments remain limited to short contexts. To bridge this gap, we introduce MMLongCite, a comprehensive benchmark designed to evaluate the fidelity of LVLMs in long-context scenarios. MMLongCite comprises 8 distinct tasks spanning 6 context length intervals and incorporates diverse modalities, including text, images, and videos. Our evaluation of state-of-the-art LVLMs reveals their limited faithfulness in handling long multimodal contexts. Furthermore, we provide an in-depth analysis of how context length and the position of crucial content affect the faithfulness of these models.
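The abstract describes analyzing how context length and the position of crucial content affect faithfulness. Below is a minimal sketch of such a sweep, assuming a hypothetical `query_lvlm` callable that wraps whichever LVLM is being evaluated; the item counts, position grid, and substring-match scoring are illustrative placeholders, not the benchmark's actual tasks or metrics.

```python
import random
from typing import Callable, Dict, List, Tuple

def build_context(filler_items: List[dict], key_item: dict,
                  length: int, position: float) -> List[dict]:
    """Assemble a multimodal context of `length` items (text / image / video
    references), inserting the crucial item at relative `position` in [0, 1].
    Assumes len(filler_items) >= length - 1."""
    items = random.sample(filler_items, k=length - 1)
    items.insert(round(position * (length - 1)), key_item)
    return items

def run_sweep(query_lvlm: Callable[[List[dict], str], str],
              filler_items: List[dict], key_item: dict,
              question: str, expected_answer: str,
              lengths: Tuple[int, ...] = (8, 16, 32, 64, 128, 256),
              positions: Tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
              ) -> Dict[Tuple[int, float], bool]:
    """For each (context length, key-item position), query the model and record
    whether its answer contains the expected string. Substring matching is a
    crude stand-in for a real fidelity metric."""
    results = {}
    for n in lengths:
        for p in positions:
            ctx = build_context(filler_items, key_item, n, p)
            answer = query_lvlm(ctx, question)
            results[(n, p)] = expected_answer.lower() in answer.lower()
    return results
```

Plotting the resulting grid (accuracy versus length and position) is what surfaces effects such as degradation at longer contexts or when the crucial content sits in the middle of the input.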
Similar Papers
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
CV and Pattern Recognition
Tests computers that understand many pictures and words.
AcademicEval: Live Long-Context LLM Benchmark
Computation and Language
Tests if computers can understand long, complex writing.
LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
Computation and Language
Tests if computers can understand very long texts.