
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation

Published: January 6, 2026 | arXiv ID: 2601.03054v1

By: Yankai Jiang, Qiaoru Li, Binlu Xu and more

Potential Business Impact:

Helps doctors see tiny details in medical images.

Business Areas:

Image Recognition, Data and Analytics, Software

Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model's robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.
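The abstract describes segmentation as a multi-step loop: the model emits reasoning text interleaved with text-based click actions, a segmentation tool turns each click into an updated mask, and the model inspects the result before deciding on the next step. The sketch below illustrates only that control flow, under loud assumptions. The `click(row, col, pos|neg)` action syntax, the function names, the flood-fill "tool", and the scripted policy are all hypothetical stand-ins chosen for illustration; the abstract does not specify the action format, the segmentation tool, or the stopping rule, and this is not the authors' implementation.

```python
import re
from collections import deque

# Hypothetical action syntax; the abstract does not give the real format.
ACTION_RE = re.compile(r"click\((\d+),\s*(\d+),\s*(pos|neg)\)")


def parse_action(text):
    """Return (row, col, is_positive) if the text contains a click, else None."""
    match = ACTION_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3) == "pos"


def region(image, seed):
    """Toy stand-in for a segmentation tool: the 4-connected set of pixels
    that share the seed pixel's value."""
    rows, cols = len(image), len(image[0])
    target = image[seed[0]][seed[1]]
    seen, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and image[nr][nc] == target:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return seen


def apply_click(image, mask, click):
    """Positive clicks add the clicked region to the mask; negative clicks remove it."""
    row, col, positive = click
    clicked = region(image, (row, col))
    return mask | clicked if positive else mask - clicked


def run_episode(image, policy, max_steps=5):
    """Alternate between policy output and tool calls until the policy stops clicking."""
    mask = set()
    for step in range(max_steps):
        output = policy(image, mask, step)  # reasoning text plus an optional action
        click = parse_action(output)
        if click is None:  # no click in the output: treat as a stop decision
            break
        mask = apply_click(image, mask, click)
    return mask


# Tiny 4x4 "image"; the target is the 2x2 block of non-zero pixels.
IMAGE = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 2, 0],
    [0, 0, 0, 0],
]


def scripted_policy(image, mask, step):
    """Hypothetical stand-in for the MLLM: replays a fixed reasoning trace."""
    script = [
        "The target is bright near the centre. click(1, 1, pos)",
        "The mask misses the bottom-right corner. click(2, 2, pos)",
        "The mask now covers the target. done",
    ]
    return script[step]


final_mask = run_episode(IMAGE, scripted_policy)
```

In this toy run the first click yields a three-pixel mask, the second click adds the missed pixel, and the third output contains no action, which ends the episode. A real system would replace `scripted_policy` with the model and `region` with an actual interactive segmentation tool.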

RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

CV and Pattern Recognition

Helps computers find and cut out objects from pictures.

30 Dec 2025

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

CV and Pattern Recognition

Helps computers understand pictures by pointing to details.

22 Sep 2025

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

CV and Pattern Recognition

Teaches computers to perfectly outline objects in pictures.

11 Mar 2025


Country of Origin

🇨🇳 China

Page Count

25 pages
