RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze

Published: July 12, 2025 | arXiv ID: 2507.09097v1

By: Yunsoo Kim, Jinge Wu, Honghan Wu

Potential Business Impact:

Helps AI understand X-rays by watching doctors' eyes.

Business Areas:
Image Recognition, Data and Analytics, Software

Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists' eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists' eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method on CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% on the report generation task and by 15.2% on average across both tasks under scaled evaluation metrics. Notably, RadEyeVideo enabled an open-domain LVLM, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent, which are trained on large-scale chest X-ray data. This work highlights that domain experts' knowledge (eye-gaze information in this case), when effectively integrated with LVLMs, can significantly enhance the capabilities of general-domain models on clinical tasks. RadEyeVideo is a step toward a scalable, human-centered approach to utilizing LVLMs in medical image analytics.
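
To make the core idea concrete, here is a minimal sketch, not the authors' released code, of how an eye-fixation sequence might be rendered as a gaze video that a video-capable LVLM can consume alongside the chest X-ray. The fixation format (x, y, duration), file names, and parameters below are illustrative assumptions.

```python
# Minimal sketch: render a radiologist's fixation sequence as a video clip,
# preserving the order of gaze and weighting dwell time by frame count.
# Assumes fixations are (x, y, duration_seconds) tuples in image coordinates.
import cv2
import numpy as np

def fixations_to_video(cxr_path, fixations, out_path="gaze.mp4", fps=10):
    """Overlay each fixation on the CXR for a number of frames
    proportional to its duration, keeping the scan-path order."""
    img = cv2.imread(cxr_path)  # HxWx3 BGR chest X-ray
    if img is None:
        raise FileNotFoundError(cxr_path)
    h, w = img.shape[:2]
    writer = cv2.VideoWriter(out_path,
                             cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (w, h))
    for x, y, dur in fixations:
        frame = img.copy()
        # Mark the current fixation point; a soft heatmap blob would also work.
        cv2.circle(frame, (int(x), int(y)), 25, (0, 0, 255), 3)
        n_frames = max(1, int(round(dur * fps)))  # longer dwell -> more frames
        for _ in range(n_frames):
            writer.write(frame)
    writer.release()
    return out_path

# Illustrative usage: three fixations, in the order the radiologist scanned.
video = fixations_to_video("cxr.png",
                           [(310, 220, 0.4), (180, 350, 0.7), (300, 400, 0.3)])
```

Encoding gaze as video rather than a static heatmap is what lets the model see both where the radiologist looked and in what order, which is the distinction the paper emphasizes.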

Country of Origin
🇬🇧 United Kingdom

Page Count
13 pages

Category
Computer Science:
Computer Vision and Pattern Recognition