Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments

Published: October 29, 2025 | arXiv ID: 2510.25070v1

By: Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi

Potential Business Impact:

Lets computers understand new pictures without training.

Business Areas:

Image Recognition (Data and Analytics, Software)

Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets demonstrate significant gains over state-of-the-art zero-shot models in object recognition, activity detection, and scene captioning. The proposed system achieves up to 18% improvement in top-1 accuracy and notable gains in semantic coherence metrics, highlighting the effectiveness of cross-modal alignment and language grounding in enhancing generalization for real-world scene understanding.
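The shared-embedding idea in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it only shows the generic zero-shot pattern of scoring one image embedding against several text-prompt embeddings by cosine similarity and turning the scores into probabilities. The function names, the toy 3-dimensional vectors, and the `temperature` default are all assumptions made for illustration; in practice the embeddings would come from pre-trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Return a probability per text prompt for one image embedding.

    `temperature` is a hypothetical default; smaller values sharpen the distribution.
    """
    labels = list(prompt_embeddings)
    sims = [cosine_similarity(image_embedding, prompt_embeddings[label]) for label in labels]
    probs = softmax([s / temperature for s in sims])
    return dict(zip(labels, probs))


# Toy embeddings standing in for encoder outputs (hypothetical values).
prompt_embeddings = {
    "a photo of a kitchen": [0.9, 0.1, 0.0],
    "a photo of a street": [0.1, 0.9, 0.1],
    "a photo of a beach": [0.0, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, prompt_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Because the class set is defined only by the text prompts, recognizing a new category in this sketch amounts to adding another prompt embedding to the dictionary, with no labeled images for that category.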

Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios

CV and Pattern Recognition

Helps computers understand new places without being taught.

30 Oct 2025 0

91%

Unifying Vision-Language Latents for Zero-label Image Caption Enhancement

CV and Pattern Recognition

Helps computers describe pictures without seeing labels.

14 Oct 2025 1

91%

From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics

CV and Pattern Recognition

Lets robots understand what they see and do.

4 Nov 2025 0

Page Count

8 pages
