Score: 1

SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis

Published: October 14, 2025 | arXiv ID: 2510.12267v1

By: Chenghanyu Zhang, Zekun Li, Peipei Li, and more

Potential Business Impact:

Helps AI understand spine problems from X-rays.

Business Areas:
A/B Testing, Data and Analytics

With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance across medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks and inadequately capture performance in nuanced areas like the spine, which rely heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs drawn from 40,263 spine images covering 11 spinal diseases, organized around two critical clinical tasks in multiple-choice format: spinal disease diagnosis and spinal lesion localization. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and it samples hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating challenging real-world scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models perform poorly on spinal tasks, highlighting the limitations of current MLLMs in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at https://zhangchenghanyu.github.io/SpineBench.github.io/.
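The abstract describes the hard-negative sampling only at a high level (distractor options chosen by visual similarity to the query image). As a minimal sketch of one way such sampling could be implemented, assuming a generic image-embedding model, cosine similarity, and three distractors per question, none of which are specified in the abstract:

```python
# Hypothetical sketch of visual-similarity-based hard-negative sampling.
# The embedding source, similarity metric, and option count are assumptions,
# not details taken from the SpineBench paper.
import numpy as np

def sample_hard_negatives(query_embedding: np.ndarray,
                          candidate_embeddings: np.ndarray,
                          candidate_labels: list[str],
                          true_label: str,
                          num_options: int = 3) -> list[str]:
    """Return disease labels of the most visually similar candidate images
    whose label differs from the ground-truth label."""
    # Cosine similarity between the query image and every candidate image.
    sims = candidate_embeddings @ query_embedding / (
        np.linalg.norm(candidate_embeddings, axis=1)
        * np.linalg.norm(query_embedding) + 1e-8)

    # Rank candidates from most to least similar.
    order = np.argsort(-sims)

    negatives: list[str] = []
    for idx in order:
        label = candidate_labels[idx]
        # Keep only "similar but not the same disease" distractors,
        # without repeating a disease name among the options.
        if label != true_label and label not in negatives:
            negatives.append(label)
        if len(negatives) == num_options:
            break
    return negatives
```

A multiple-choice VQA item would then pair the ground-truth disease with the returned distractors, which is what makes the options "challenging" rather than randomly drawn.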

Country of Origin
United States, China

Page Count
8 pages

Category
Computer Science: Computer Vision and Pattern Recognition