
Referring Video Object Segmentation with Cross-Modality Proxy Queries

Published: November 26, 2025 | arXiv ID: 2511.21139v1

By: Baoli Sun, Xinzhu Ma, Ning Wang and more

Potential Business Impact:

Helps computers find specific things in videos.

Business Areas:

Image Recognition, Data and Analytics, Software

Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response mechanism built on a transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of ProxyFormer over state-of-the-art methods.
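As a rough illustration of the idea of queries that absorb text semantics early and are carried from frame to frame, here is a toy sketch in plain Python. It is not ProxyFormer's actual design: the function names (`cross_attend`, `propagate_proxy_queries`), the single-head attention without learned weights, the residual update, and the text-then-frame ordering are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Toy single-head dot-product cross-attention with a residual update.

    Hypothetical helper: no learned projections, purely illustrative.
    queries: list of N vectors; tokens: list of M vectors of the same size.
    """
    d = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(d) for t in tokens]
        weights = softmax(scores)
        context = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def propagate_proxy_queries(init_queries, text_tokens, video_frames):
    """Carry one set of queries through the frames in order (assumed scheme).

    For each frame, the queries first attend to the text tokens and then to
    that frame's visual tokens, so later frames see queries shaped by earlier ones.
    """
    queries = init_queries
    per_frame = []
    for frame_tokens in video_frames:
        queries = cross_attend(queries, text_tokens)   # inject language first
        queries = cross_attend(queries, frame_tokens)  # then gather visual evidence
        per_frame.append(queries)
    return per_frame


# Tiny usage example with made-up 2-D features.
init_queries = [[0.1, 0.0], [0.0, 0.1]]
text_tokens = [[1.0, 0.0]]
video_frames = [[[0.0, 1.0], [1.0, 1.0]], [[1.0, 0.0]]]
states = propagate_proxy_queries(init_queries, text_tokens, video_frames)
```

Because the queries returned for one frame are the starting point for the next, each frame's state depends on what came before, which is the kind of inter-frame dependency the abstract says conditional queries lack.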

Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

CV and Pattern Recognition

Helps computers find objects in videos by text.

19 Aug 2025 2

91%

Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching

CV and Pattern Recognition

Lets computers find and track many things in videos.

18 Apr 2025 0

91%

Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation

CV and Pattern Recognition

Lets computers find any object in videos using words.

6 Sep 2025 1


Country of Origin

🇨🇳 China

Page Count

11 pages
