Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models
By: Shaonan Liu, Guo Yu, Xiaoling Luo, and more
Medical Multimodal Large Language Models (Med-MLLMs) require egocentric clinical intent understanding for real-world deployment, yet existing benchmarks fail to evaluate this critical capability. To address this gap, we introduce MedGaze-Bench, the first benchmark that leverages clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. The benchmark targets three fundamental challenges: the visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols. We propose a Three-Dimensional Clinical Intent Framework that evaluates: (1) Spatial Intent: discriminating precise targets amid visual noise; (2) Temporal Intent: inferring causal rationale through retrospective and prospective reasoning; and (3) Standard Intent: verifying protocol compliance through safety checks. Beyond accuracy metrics, we introduce Trap QA mechanisms that stress-test clinical reliability by penalizing hallucinations and cognitive sycophancy. Experiments reveal that current MLLMs struggle with egocentric intent because they over-rely on global features, leading to fabricated observations and uncritical acceptance of invalid instructions.
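The abstract does not specify how Trap QA penalties are combined with per-dimension accuracy, so the sketch below is only a rough illustration of one plausible scoring scheme, not the authors' released evaluation code. All names (QAItem, score, trap_penalty, the example answers) are hypothetical; the only assumptions taken from the abstract are the three intent dimensions and the idea that accepting an invalid instruction on a trap item should be penalized rather than merely scored as incorrect.

```python
# Hypothetical sketch of a Trap-QA-aware scorer over the three intent dimensions.
from dataclasses import dataclass

@dataclass
class QAItem:
    dimension: str   # "spatial" | "temporal" | "standard"
    is_trap: bool    # trap items contain a false premise or invalid instruction
    gold: str        # expected answer; for traps, e.g. "reject"
    prediction: str  # normalized model output

def score(items: list[QAItem], trap_penalty: float = 1.0) -> dict[str, float]:
    """Per-dimension mean score; trap items subtract a penalty when the model
    accepts the invalid premise (cognitive sycophancy) instead of rejecting it."""
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for it in items:
        counts[it.dimension] = counts.get(it.dimension, 0) + 1
        if it.is_trap:
            delta = 1.0 if it.prediction == it.gold else -trap_penalty
        else:
            delta = 1.0 if it.prediction == it.gold else 0.0
        totals[it.dimension] = totals.get(it.dimension, 0.0) + delta
    return {d: totals[d] / counts[d] for d in counts}

# Example: one spatial item answered correctly, one temporal trap item where
# the model uncritically follows the invalid instruction.
items = [
    QAItem("spatial", False, "left renal artery", "left renal artery"),
    QAItem("temporal", True, "reject", "proceed with clamping"),
]
print(score(items))  # {'spatial': 1.0, 'temporal': -1.0}
```

Under this kind of scheme, a model that hallucinates or defers to invalid instructions can score below zero on a dimension, which is one way the reliability stress-test described above could be made explicit; the actual MedGaze-Bench metric may differ.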
Similar Papers
In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting
CV and Pattern Recognition
AI watches where you look to help you better.
MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine
Artificial Intelligence
Helps doctors trust AI to read medical pictures.
EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
CV and Pattern Recognition
Helps robots understand how things change when used.