Score: 2

SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding

Published: August 30, 2025 | arXiv ID: 2509.00357v1

By: Zhen Chen , Xingjian Luo , Kun Yuan and more

Potential Business Impact:

Helps doctors understand surgery videos better.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist, including inadequate visual content perception and insufficient temporal awareness in surgical videos, and hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.

SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

CV and Pattern Recognition

Helps AI understand surgery better for safer operations.

26 Nov 2025 0

91%

SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

CV and Pattern Recognition

Helps surgeons by understanding surgery videos.

3 Jun 2025 0

89%

Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model

CV and Pattern Recognition

Helps surgeons see inside bodies in 3D.

11 Aug 2025 1

View PDF Login to Bookmark

Country of Origin

🇩🇪 Germany

Repos / Data Links

github.com

Page Count

22 pages

SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding

Helps doctors understand surgery videos better.

Technical Abstract

SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence

Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model