Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA
By: Itbaan Safwan, Muhammad Annas Shaikh, Muhammad Haaris, and more
Potential Business Impact:
Helps doctors interpret gastrointestinal images by answering questions about them, explaining the reasoning, and pointing to the relevant regions.
We present a multi-task framework for the MediaEval Medico 2025 challenge, leveraging a LoRA-tuned Florence-2 model for simultaneous visual question answering (VQA), explanation generation, and visual grounding. The proposed system integrates three curated datasets: (1) Kvasir-VQA-x1 for question-answer learning, (2) a synthetically enriched explanation dataset offering structured medical reasoning, and (3) text-to-region pairs linking visual features with segmentation masks. This multi-task setup enables the model to jointly learn visual grounding, reasoning, and interpretation, producing responses that are both accurate and interpretable. Extensive evaluation demonstrates that our approach substantially improves over single-task baselines in both answer accuracy and visual localization, highlighting the effectiveness of grounded multi-task learning for medical VQA applications.
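To make the multi-task setup concrete, below is a minimal sketch of how a Florence-2 backbone could be wrapped with LoRA adapters and fed examples from the three tasks through shared weights. It assumes the Hugging Face transformers and peft libraries and the public microsoft/Florence-2-base checkpoint; the task-prefix strings (<MedVQA>, <MedExplain>, <MedGround>), LoRA hyperparameters, and target module names are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of LoRA-tuning Florence-2 for multi-task prompts.
# Assumptions (not from the paper): Hugging Face `transformers` + `peft`,
# the public "microsoft/Florence-2-base" checkpoint, and illustrative
# task prefixes and LoRA hyperparameters.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "microsoft/Florence-2-base"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Wrap the backbone with LoRA adapters so only small low-rank matrices are
# trained; rank and target-module names are illustrative and checkpoint-dependent.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# One shared model, three task prefixes: each training example is routed by a
# hypothetical prefix so VQA, explanation, and grounding share the same weights.
TASK_PREFIXES = {
    "vqa": "<MedVQA>",          # question answering on Kvasir-VQA-x1
    "explain": "<MedExplain>",  # structured medical reasoning
    "ground": "<MedGround>",    # text-to-region pairs with segmentation masks
}

def build_inputs(image, task, text):
    """Tokenize one (image, task, text) example for the shared model."""
    prompt = TASK_PREFIXES[task] + text
    return processor(text=prompt, images=image, return_tensors="pt")
```

A training loop would then draw mixed batches from the three datasets and minimize the usual sequence loss (e.g. model(**inputs, labels=labels).loss), so answering, explanation, and grounding all update the same small set of LoRA parameters.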
Similar Papers
Medico 2025: Visual Question Answering for Gastrointestinal Imaging
CV and Pattern Recognition
Helps doctors understand stomach pictures better.
MedXplain-VQA: Multi-Component Explainable Medical Visual Question Answering
CV and Pattern Recognition
Shows doctors why AI suggests a diagnosis.
No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
CV and Pattern Recognition
AI learns to see and think better.