Score: 2

OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation

Published: September 18, 2025 | arXiv ID: 2509.15096v1

By: Bo-Wen Yin, Jiao-Long Cao, Xuying Zhang, and more

Potential Business Impact:

Enables systems to segment scenes more robustly by combining standard images with other visual sensor data (such as depth, thermal, and event cameras), which is relevant to robotics, autonomous driving, and other perception-heavy applications.

Business Areas:
Image Recognition, Data and Analytics, Software

Recent research on representation learning has demonstrated the merits of multi-modal cues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt, which contains five popular visual modalities. 2) We provide an efficient pretraining scheme that endows the model with the capacity to encode information from the different modalities in ImageNeXt. For the first time, we introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of which combination of modalities is involved. Remarkably, OmniSegmentor sets new state-of-the-art records on a wide range of multi-modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360.
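
The abstract does not spell out how arbitrary modality combinations are handled at finetune time. The sketch below is a minimal, hypothetical illustration (not the paper's implementation) of one common design for this kind of flexibility: a per-modality stem feeding a shared encoder, with random modality dropout during pretraining so the model tolerates any subset of inputs later. The modality names, module structure, and dropout strategy are assumptions made for illustration only.

# Hypothetical sketch of a segmentation encoder that accepts an arbitrary
# subset of modalities. Each modality has its own lightweight stem; features
# are summed and passed through a shared backbone. Randomly dropping
# modalities during pretraining is one simple way to tolerate missing inputs.
import random
import torch
import torch.nn as nn

MODALITIES = ["rgb", "depth", "thermal", "event", "lidar"]  # assumed names

class FlexibleMultiModalEncoder(nn.Module):
    def __init__(self, dim=64, num_classes=19):
        super().__init__()
        # One stem per modality, all mapping into a shared feature space.
        self.stems = nn.ModuleDict({
            m: nn.Conv2d(3 if m == "rgb" else 1, dim, kernel_size=4, stride=4)
            for m in MODALITIES
        })
        self.backbone = nn.Sequential(  # stand-in for a real hierarchical encoder
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
        )
        self.head = nn.Conv2d(dim, num_classes, 1)  # per-pixel class logits

    def forward(self, inputs: dict) -> torch.Tensor:
        # Fuse whichever modalities are present by summing their stem features.
        feats = [self.stems[m](x) for m, x in inputs.items()]
        fused = torch.stack(feats, dim=0).sum(dim=0)
        logits = self.head(self.backbone(fused))
        # Upsample back to the input resolution for dense prediction.
        h, w = next(iter(inputs.values())).shape[-2:]
        return nn.functional.interpolate(
            logits, size=(h, w), mode="bilinear", align_corners=False
        )

def random_modality_subset(batch: dict) -> dict:
    # Keep a random non-empty subset of modalities during pretraining so the
    # encoder learns to work with any combination at finetune time.
    keep = random.sample(MODALITIES, k=random.randint(1, len(MODALITIES)))
    return {m: x for m, x in batch.items() if m in keep}

if __name__ == "__main__":
    model = FlexibleMultiModalEncoder()
    batch = {
        "rgb": torch.randn(2, 3, 128, 128),
        "depth": torch.randn(2, 1, 128, 128),
        "thermal": torch.randn(2, 1, 128, 128),
        "event": torch.randn(2, 1, 128, 128),
        "lidar": torch.randn(2, 1, 128, 128),
    }
    out = model(random_modality_subset(batch))
    print(out.shape)  # (2, num_classes, 128, 128)

The key design point illustrated here is that the shared backbone never needs to know which modalities were supplied, which is what makes finetuning on datasets with different sensor suites (depth-only, thermal, event, etc.) possible without architectural changes.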

Repos / Data Links

Page Count
12 pages

Category
Computer Science:
CV and Pattern Recognition