
Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Published: December 24, 2025 | arXiv ID: 2512.21284v1

By: Shihao Zou, Jingjing Li, Wei Ji and more

Potential Business Impact:

Helps surgeons see better during operations.

Business Areas:

Image Recognition, Data and Analytics, Software

Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, the performance of SNNs is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mean intersection-over-union (mIoU) comparable to state-of-the-art (SOTA) models based on artificial neural networks (ANNs) while reducing inference latency by at least 8×. Notably, it delivers over 20× acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.
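The abstract mentions tube masking for masked autoencoding pretraining. As a rough illustration only, the sketch below shows plain tube masking as it is commonly used for video masked autoencoders: one spatial mask is sampled and repeated across all frames, so each masked patch position forms a "tube" through time. The function name `tube_mask`, the patch grid size, and the mask ratio are assumptions made for this example and are not taken from the paper; the layer-wise variant the abstract refers to is not described there and is not modeled here.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=None):
    """Return a num_frames x (grid_h * grid_w) list of booleans (True = masked).

    Hypothetical helper for illustration: a single spatial mask is sampled
    once and copied to every frame, so the same patch positions are hidden
    at every time step.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    # Sample the patch positions to hide, without replacement.
    masked_positions = set(rng.sample(range(num_patches), num_masked))
    spatial_mask = [p in masked_positions for p in range(num_patches)]
    # Repeat the same spatial mask for every frame to form temporal tubes.
    return [list(spatial_mask) for _ in range(num_frames)]


# Example: 4 frames, a 3x4 patch grid, 75% of patch positions masked.
mask = tube_mask(num_frames=4, grid_h=3, grid_w=4, mask_ratio=0.75, seed=0)
```

Because the mask is identical in every frame, a model cannot recover a hidden patch simply by copying it from a neighboring frame, which is the usual motivation for tube masking over per-frame random masking.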

Surg-SegFormer: A Dual Transformer-Based Model for Holistic Surgical Scene Segmentation

Image and Video Processing

Helps robot surgeons see and explain operations.

6 Jul 2025 1

89%

Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization

CV and Pattern Recognition

Makes drones find their way using less power.

22 Dec 2025 1

88%

Data-Efficient Learning for Generalizable Surgical Video Understanding

Image and Video Processing

Helps doctors learn and improve surgery with AI.

13 Aug 2025 1


Country of Origin

🇨🇳 🇨🇦 🇺🇸 Canada, United States, China

Page Count

15 pages
