
SimToken: A Simple Baseline for Referring Audio-Visual Segmentation

Published: September 22, 2025 | arXiv ID: 2509.17537v2

By: Dian Jin, Yanghao Zhou, Jinxing Zhou, and more

Potential Business Impact:

Lets computers find and outline specific objects in videos from a written description, using both what is seen and heard.

Business Areas:
Semantic Search, Internet Services

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objects across video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings from different expressions that refer to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods.
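To make the target-consistent semantic alignment idea concrete, below is a minimal, hypothetical PyTorch sketch of such a loss. It assumes each referring expression yields one semantic-token embedding and an integer id of the object it refers to; embeddings sharing an object id are pulled together with an InfoNCE-style objective, while other pairs act as negatives. The function name, the contrastive formulation, and the temperature value are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def target_consistent_alignment_loss(token_embs: torch.Tensor,
                                     object_ids: torch.Tensor,
                                     temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical contrastive alignment over semantic-token embeddings.

    token_embs: (N, D) semantic-token embeddings, one per referring expression.
    object_ids: (N,) integer id of the object each expression refers to.
    Embeddings of expressions that share an object id are treated as positives;
    all remaining pairs serve as negatives (an InfoNCE-style stand-in for the
    paper's target-consistent semantic alignment loss).
    """
    z = F.normalize(token_embs, dim=-1)            # compare in cosine-similarity space
    sim = z @ z.t() / temperature                  # (N, N) pairwise logits
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (object_ids.unsqueeze(0) == object_ids.unsqueeze(1)) & ~eye

    # Exclude self-similarity, then average log-probability over positive pairs.
    logits = sim.masked_fill(eye, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos.sum(dim=1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos).sum(dim=1) / pos_counts

    # Only anchors with at least one positive pair contribute to the loss.
    has_pos = pos.any(dim=1)
    if has_pos.any():
        return loss_per_anchor[has_pos].mean()
    return token_embs.sum() * 0.0                  # differentiable zero fallback
```

In this reading, the loss encourages different expressions ("the guitar being played" vs. "the instrument on the left") to map to nearby token embeddings whenever they name the same object, which in turn should make the SAM prompt more stable across phrasings.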

Page Count
5 pages

Category
Computer Science:
CV and Pattern Recognition