Score: 0

M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Published: January 13, 2026 | arXiv ID: 2601.08758v1

By: Juntao Jiang , Jiangning Zhang , Yali Bi and more

Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.

MM-CoT:A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

CV and Pattern Recognition

Tests if AI truly sees and thinks.

9 Dec 2025 0

93%

From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

Computation and Language

Helps AI "think step-by-step" to solve harder problems.

17 Nov 2025 0

93%

From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

Computation and Language

Helps AI "think" step-by-step to solve harder problems.

17 Nov 2025 0

View PDF Login to Bookmark

M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Technical Abstract

MM-CoT:A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models

From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models