Guideline-Consistent Segmentation via Multi-Agent Refinement
By: Vanshika Vats, Ashwani Rathee, James Davis
Potential Business Impact:
Helps computers label images according to long, detailed rulebooks without retraining.
Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.
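The Worker-Supervisor loop described in the abstract can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: the names `refine`, `Critique`, `toy_worker`, `toy_supervisor`, and `toy_stop` are hypothetical, the vision-language models are replaced by toy functions over a small label grid, and the learned reinforcement learning stop policy is replaced by a hand-written rule that stops once the Supervisor reports no violations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Mask = List[List[int]]  # toy binary mask; a real system would use arrays


@dataclass
class Critique:
    # Guideline rules the Supervisor believes the current mask violates.
    violations: List[str]


def refine(
    image,
    guidelines: List[str],
    worker: Callable[[object, List[str], Optional[Mask], Optional[Critique]], Mask],
    supervisor: Callable[[object, Mask, List[str]], Critique],
    should_stop: Callable[[int, Critique], bool],
    max_rounds: int = 5,
) -> Tuple[Optional[Mask], int]:
    """Alternate Worker and Supervisor until the stop policy fires or the budget runs out."""
    mask: Optional[Mask] = None
    critique: Optional[Critique] = None
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        mask = worker(image, guidelines, mask, critique)  # propose or revise a mask
        critique = supervisor(image, mask, guidelines)    # check it against the guidelines
        if should_stop(rounds, critique):                 # stand-in for the learned stop policy
            break
    return mask, rounds


# Toy stand-ins (assumptions for illustration only).
# The "image" is a grid of labels: 0 = background, 1 = target, 2 = excluded by guideline.
def toy_worker(image, guidelines, prev_mask, critique):
    if prev_mask is None:
        # First pass: naively mask everything that is not background.
        return [[1 if px != 0 else 0 for px in row] for row in image]
    if critique is not None and critique.violations:
        # Revision: keep only pixels with the target label.
        return [[1 if px == 1 else 0 for px in row] for row in image]
    return prev_mask


def toy_supervisor(image, mask, guidelines):
    covers_excluded = any(
        m == 1 and px == 2
        for img_row, mask_row in zip(image, mask)
        for px, m in zip(img_row, mask_row)
    )
    return Critique(["mask covers an excluded region"] if covers_excluded else [])


def toy_stop(round_idx, critique):
    # Hand-written rule: stop as soon as no violations remain.
    return not critique.violations


image = [[0, 1, 2], [1, 1, 2]]
guidelines = ["Do not include regions with label 2."]
final_mask, rounds_used = refine(image, guidelines, toy_worker, toy_supervisor, toy_stop)
print(final_mask, rounds_used)
```

Passing the Worker, Supervisor, and stop policy as plain callables keeps the loop independent of any particular model, which mirrors the training-free framing: swapping guidelines changes the inputs, not the loop. The `max_rounds` cap is an assumption standing in for the resource budget the abstract mentions.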
Similar Papers
Towards Agentic AI for Multimodal-Guided Video Object Segmentation
CV and Pattern Recognition
Helps computers find objects in videos using words.
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes
CV and Pattern Recognition
Teaches computers to understand satellite images better.
VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning
CV and Pattern Recognition
Teaches computers to understand and cut out moving objects.