Score: 1

MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

Published: August 11, 2025 | arXiv ID: 2508.08177v1

By: Zhonghao Yan, Muxi Diao, Yuxuan Yang, and more

Potential Business Impact:

Helps doctors pinpoint problem areas in medical images from plain-language clinical questions.

Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.
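The abstract describes training the MLLM reasoner with a reinforcement-learning signal built from two parts: a format reward that checks the output follows the expected reasoning-plus-spatial-prompt template, and an accuracy reward that scores the mask produced by the frozen segmentation expert against the ground truth. The sketch below illustrates one plausible way such a combined reward could be computed; the tag names, Dice-based accuracy metric, and weighting are assumptions for illustration, not the authors' actual implementation.

```python
import re
import numpy as np

def format_reward(response: str) -> float:
    """1.0 if the reasoner's output matches the expected template:
    a reasoning trace in <think>...</think> followed by a spatial prompt
    in <answer>...</answer>. (Tag names are assumptions.)"""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Dice overlap between the mask returned by the frozen segmentation
    expert (given the predicted spatial prompt) and the ground-truth mask.
    (The paper's exact accuracy metric may differ.)"""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * intersection / denom if denom > 0 else 0.0

def total_reward(response: str, pred_mask: np.ndarray, gt_mask: np.ndarray,
                 w_format: float = 0.5, w_acc: float = 0.5) -> float:
    """Scalar RL signal for the MLLM reasoner: weighted sum of the two
    rewards. (Weights are illustrative assumptions.)"""
    return w_format * format_reward(response) + w_acc * accuracy_reward(pred_mask, gt_mask)
```

In this setup, only the reasoner's parameters receive gradient updates; the segmentation expert stays frozen and acts purely as a mask generator inside the reward computation.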

Country of Origin
🇨🇳 China

Page Count
37 pages

Category
Computer Science:
Computer Vision and Pattern Recognition