Score: 2

HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment

Published: June 16, 2025 | arXiv ID: 2506.13925v2

By: Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad and more

Potential Business Impact:

Helps computers understand pictures with fewer examples.

Business Areas:

Semantic Search, Internet Services

In this paper, we address Semi-supervised Semantic Segmentation (SSS) under domain shift by leveraging domain-invariant semantic knowledge from the text embeddings of Vision-Language Models (VLMs). We propose a unified Hierarchical Vision-Language framework (HVL) that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network to improve generalization and reduce misclassification under limited supervision. These textual queries are used to group pixels with shared semantics under SSS. HVL is designed to (1) generate textual queries that maximally encode domain-invariant semantics from the VLM while capturing intra-class variations; and (2) align these queries with spatial visual features to enhance their segmentation ability and improve the semantic clarity of the visual features. We also introduce targeted regularization losses that maintain vision-language alignment throughout training to reinforce semantic understanding. HVL establishes a new state of the art, achieving a +9.3% improvement in mean Intersection over Union (mIoU) on COCO using 232 labelled images, +3.1% on Pascal VOC using 92 labels, +4.8% on ADE20K using 316 labels, and +3.4% on Cityscapes using 100 labels, demonstrating superior performance with less than 1% supervision on four benchmark datasets. Our results show that language-guided segmentation bridges the label-efficiency gap and enables new levels of fine-grained generalization.
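As a rough illustration of two ideas the abstract mentions, the sketch below groups pixels by their best-matching class query and then scores the result with mIoU. It is not the authors' implementation: HVL uses a transformer-based network, hierarchical query generation, and regularization losses, none of which appear here. The function names `group_pixels` and `mean_iou`, the cosine-similarity matching rule, and the toy two-dimensional vectors standing in for VLM text embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def group_pixels(pixel_features, text_queries):
    """Assign each pixel to the class query it is most similar to.

    pixel_features: H x W nested list of D-dimensional feature vectors.
    text_queries:   list of C D-dimensional vectors, one per class
                    (toy stand-ins for VLM text embeddings; an assumption).
    Returns an H x W nested list of class indices.
    """
    num_classes = len(text_queries)
    return [
        [max(range(num_classes), key=lambda c: cosine(f, text_queries[c])) for f in row]
        for row in pixel_features
    ]


def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union, averaged over classes present in pred or target."""
    flat_pred = [c for row in pred for c in row]
    flat_target = [c for row in target for c in row]
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(flat_pred, flat_target) if p == c and t == c)
        union = sum(1 for p, t in zip(flat_pred, flat_target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy usage: two class queries and a 1 x 2 "image" of pixel features.
queries = [[1.0, 0.0], [0.0, 1.0]]
features = [[[0.9, 0.1], [0.2, 0.8]]]
prediction = group_pixels(features, queries)
score = mean_iou(prediction, [[0, 0]], num_classes=2)
```

In this toy case the two pixels are assigned to different classes, so comparing against a ground truth where both pixels belong to class 0 yields a low mIoU; a real pipeline would compute the metric over a full validation set rather than a single image.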

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

CV and Pattern Recognition

Helps computers understand pictures better by focusing on important parts.

14 Mar 2025

89%

SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation

CV and Pattern Recognition

Helps computers understand pictures better with words.

8 Apr 2025

89%

Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation

CV and Pattern Recognition

Helps computers see and name anything, anywhere.

11 Jun 2025


Country of Origin

🇨🇦 🇦🇺 Canada, Australia

Page Count

13 pages
