The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
By: Quanzhu Niu, Dengxian Gong, Shihao Chen, et al.
Potential Business Impact:
Helps computers find and follow any object you describe in a video.
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $J\&F$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and our ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/magic-research/Sa2VA.
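The selective-averaging idea in the abstract can be illustrated with a minimal sketch. This is a hypothetical helper, not the paper's exact procedure: masks predicted by multiple inference runs are optionally filtered to the runs that agree most with the consensus, then averaged per pixel and thresholded into a final binary mask.

```python
import numpy as np

def selective_average_masks(masks, agreement_thresh=0.5, select_top_k=None):
    """Combine per-run binary masks of shape (H, W) by selective averaging.

    Hypothetical sketch: score each run by an IoU-like agreement with the
    mean mask, keep the top-k most consistent runs, then average and
    threshold. The real SaSaSa2VA selection rule may differ.
    """
    stacked = np.stack([m.astype(np.float32) for m in masks])  # (N, H, W)
    if select_top_k is not None:
        mean = stacked.mean(axis=0)
        # IoU-like agreement of each run with the consensus (soft) mask
        scores = [
            (m * mean).sum() / max(np.maximum(m, mean).sum(), 1e-6)
            for m in stacked
        ]
        keep = np.argsort(scores)[-select_top_k:]
        stacked = stacked[keep]
    avg = stacked.mean(axis=0)
    return (avg >= agreement_thresh).astype(np.uint8)

# Two runs agree, one outlier disagrees; selecting the top 2 drops the outlier.
m1 = np.array([[1, 1], [0, 0]])
m2 = np.array([[1, 1], [0, 0]])
m3 = np.array([[0, 0], [1, 1]])
fused = selective_average_masks([m1, m2, m3], select_top_k=2)
```

With `select_top_k=2`, the dissenting run is excluded and the fused mask matches the majority prediction.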
Similar Papers
Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track
CV and Pattern Recognition
Finds specific things in videos using words.
4th PVUW MeViS 3rd Place Report: Sa2VA
CV and Pattern Recognition
Helps computers find objects in videos using words.
3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference
CV and Pattern Recognition
Helps computers find objects in videos better.