Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

Published: November 11, 2025 | arXiv ID: 2511.07710v2

By: Jiale Liu, Haoming Zhou, Yishu Zhu and more

Potential Business Impact:

Helps computers understand pictures and words better.

Business Areas:

Image Recognition, Data and Analytics, Software

Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, and it is often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: (1) the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, which leads to poor generalization in complex scenes; and (2) the absence of fine-grained uncertainty modeling, which leaves the one-to-many and many-to-one nature of region-word correspondences uncaptured. To address these issues, we propose a unified approach that combines significance-aware and granularity-aware modeling with region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
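To make the idea of representing a region feature as a mixture of Gaussian distributions more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the diagonal covariance, the softmax mixture weights, the Monte Carlo cosine score, and the names `sample_region_mixture` and `expected_cosine` are all assumptions chosen for illustration.

```python
import numpy as np

def sample_region_mixture(means, log_vars, logits, n_samples, rng):
    """Draw samples from a diagonal-covariance Gaussian mixture (illustrative).

    means, log_vars: (K, D) arrays, one row per mixture component.
    logits: (K,) unnormalized mixture weights.
    Returns an (n_samples, D) array of sampled region embeddings.
    """
    # Softmax over the logits gives mixture weights that sum to 1.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Pick one component per sample, then add noise scaled by its std.
    comp = rng.choice(len(weights), size=n_samples, p=weights)
    std = np.exp(0.5 * log_vars[comp])
    return means[comp] + std * rng.standard_normal(std.shape)

def expected_cosine(samples, word):
    """Monte Carlo estimate of the mean cosine similarity to one word vector."""
    samples = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    word = word / np.linalg.norm(word)
    return float((samples @ word).mean())

# Hypothetical toy region: two components in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
means = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
log_vars = np.full((2, 4), -2.0)
logits = np.array([0.0, 0.0])
region_samples = sample_region_mixture(means, log_vars, logits, 256, rng)
score = expected_cosine(region_samples, np.array([1.0, 0.0, 0.0, 0.0]))
```

Under these assumptions, a region with several well-separated components can score reasonably against more than one word, which is one simple way to picture the one-to-many correspondences the abstract mentions.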

Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

CV and Pattern Recognition

Connects words to exact picture parts better.

11 Nov 2025 1

100%

Cross Modal Fine-grained Alignment via Granularity-aware and Region-uncertain Modeling

CV and Pattern Recognition

Helps computers understand pictures and words better.

11 Nov 2025 1

89%

Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval

CV and Pattern Recognition

Finds exact pictures from text descriptions.

10 Apr 2025 1

Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

10 pages
