VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning
By: Zishan Xu, Yifu Guo, Yuquan Lu, and more
Potential Business Impact:
Teaches computers to understand and cut out moving objects in videos.
Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and provides no explicit reasoning process. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation, and comprises three stages: (1) a hierarchical text-guided frame sampler that emulates human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance on complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
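To make the decoupled three-stage flow concrete, here is a minimal Python sketch of how the pieces could fit together. Everything below is an assumption for illustration: the function names, the difficulty-to-token schedule, and the SAM2/XMem stand-ins are hypothetical placeholders, not the authors' released API.

```python
# Hypothetical sketch of the three-stage VideoSeg-R1 pipeline described above.
# All names are illustrative placeholders; SAM2 and XMem calls are stubbed.

from dataclasses import dataclass


@dataclass
class SpatialCue:
    """Spatial prompt emitted by the reasoning stage for one key frame."""
    frame_idx: int                      # index of the key frame
    box: tuple[int, int, int, int]      # (x1, y1, x2, y2) suggested region


def sample_key_frames(num_frames: int, query: str, budget: int = 8) -> list[int]:
    # Stage 1 (stand-in): the paper uses a hierarchical, text-guided sampler;
    # uniform sampling here only marks where that component would plug in.
    step = max(1, num_frames // budget)
    return list(range(0, num_frames, step))[:budget]


def reason(frames: list[str], query: str, difficulty: float) -> tuple[str, SpatialCue]:
    # Stage 2 (stand-in): an RL-trained model produces an explicit reasoning
    # chain plus a spatial cue. The difficulty score caps reasoning length,
    # mirroring the difficulty-aware mechanism (this schedule is invented).
    max_tokens = int(64 + 448 * difficulty)
    chain = f"<think max_tokens={max_tokens}>locate '{query}'...</think>"
    return chain, SpatialCue(frame_idx=0, box=(10, 10, 100, 100))


def segment_and_propagate(frames: list[str], cue: SpatialCue) -> list[str]:
    # Stage 3 (stand-in): segment the cued key frame (SAM2 in the paper),
    # then propagate the mask across the clip (XMem in the paper).
    key_mask = f"mask(frame={cue.frame_idx}, box={cue.box})"
    return [key_mask] * len(frames)


def videoseg_r1(frames: list[str], query: str, difficulty: float = 0.5):
    key_ids = sample_key_frames(len(frames), query)
    chain, cue = reason([frames[i] for i in key_ids], query, difficulty)
    return chain, segment_and_propagate(frames, cue)


if __name__ == "__main__":
    video = [f"frame{i}" for i in range(32)]
    chain, masks = videoseg_r1(video, "the dog jumping over the fence")
    print(chain)
    print(len(masks), "masks")
```

One consequence of this decoupling, as the abstract describes it, is that reinforcement learning only needs to train the reasoning stage, while segmentation and propagation can lean on existing models such as SAM2 and XMem.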
Similar Papers
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
CV and Pattern Recognition
Helps computers understand moving objects in videos.
Reinforcing Video Reasoning Segmentation to Think Before It Segments
CV and Pattern Recognition
Helps computers understand what you want to see in videos.
ViSS-R1: Self-Supervised Reinforcement Video Reasoning
CV and Pattern Recognition
Makes computers understand videos by watching them closely.