Unified Open-World Segmentation with Multi-Modal Prompts
By: Yang Liu, Yufei Yin, Chenchen Jing, and more
Potential Business Impact:
Lets computers see anything you describe.
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and its multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and produce masks specified by the input prompts across different granularities. In this way, COSINE overcomes the architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements on both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between visual and textual prompts leads to significantly improved generalization over single-modality approaches.
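To make the abstract's description of the SegDecoder more concrete (prompt and image representations are aligned, their interaction is modeled, and a mask is produced per prompt), the sketch below shows one plausible cross-attention wiring. It is not the authors' implementation: the class name SegDecoderSketch, the feature dimensions, and the single cross-attention layer are assumptions made purely for illustration.

```python
# Minimal sketch (assumed architecture, not the authors' code) of a
# prompt-driven segmentation decoder: prompt tokens (text embeddings or
# in-context visual features) cross-attend to image features, and each
# refined prompt token is dotted against per-pixel features to get a mask.
import torch
import torch.nn as nn


class SegDecoderSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mask_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, prompt_tokens, image_tokens, spatial_shape):
        # Prompt tokens attend to flattened image features (alignment + interaction).
        refined, _ = self.cross_attn(prompt_tokens, image_tokens, image_tokens)
        mask_embed = self.mask_mlp(refined)                      # (B, P, C)
        h, w = spatial_shape
        pixels = image_tokens.transpose(1, 2).reshape(
            image_tokens.shape[0], mask_embed.shape[-1], h, w)  # (B, C, H, W)
        # One mask logit map per prompt token.
        return torch.einsum("bpc,bchw->bphw", mask_embed, pixels)


if __name__ == "__main__":
    B, P, H, W, C = 2, 3, 16, 16, 256
    prompts = torch.randn(B, P, C)          # e.g. text or visual prompt embeddings
    image_feats = torch.randn(B, H * W, C)  # flattened backbone features
    masks = SegDecoderSketch(C)(prompts, image_feats, (H, W))
    print(masks.shape)                      # torch.Size([2, 3, 16, 16])
```

In practice the prompt and image features would come from frozen foundation-model encoders, with multiple decoder layers rather than the single block shown here.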
Similar Papers
COS3D: Collaborative Open-Vocabulary 3D Segmentation
CV and Pattern Recognition
Helps robots understand and grab any object.
Text-guided Visual Prompt DINO for Generic Segmentation
CV and Pattern Recognition
Lets computers see and name anything in pictures.
Prompt-based Multimodal Semantic Communication for Multi-spectral Image Segmentation
Image and Video Processing
Improves scene segmentation from multi-spectral images for safer driving.