CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

Published: November 4, 2025 | arXiv ID: 2511.02360v1

By: Jizheng Ma, Xiaofei Zhou, Yanlong Song and more

Potential Business Impact:

Helps computers understand pictures like people do.

Business Areas:

Computer Vision Hardware, Software

In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, in which a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show that CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
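The iterative cycle described in the abstract (select salient visual tokens, fuse them into the current latent thought, repeat) can be illustrated with a toy sketch. This is not the paper's LQ-Former: the cosine-similarity top-k selection, the single softmax-attention update, the fixed `mix` weight, and every function name below are assumptions chosen only to make the control flow concrete, and no training objective is included.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom > 0 else 0.0

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_tokens(thought, visual_tokens, k):
    """Assumed selection rule: keep the k tokens most similar to the thought."""
    ranked = sorted(visual_tokens, key=lambda t: cosine(thought, t), reverse=True)
    return ranked[:k]

def refine(thought, selected, mix=0.5):
    """One attention-style update: blend the thought with a weighted token mean."""
    scale = math.sqrt(len(thought))
    weights = softmax([dot(thought, t) / scale for t in selected])
    context = [
        sum(w * t[i] for w, t in zip(weights, selected))
        for i in range(len(thought))
    ]
    return [(1 - mix) * x + mix * c for x, c in zip(thought, context)]

def latent_chain(query, visual_tokens, steps=3, k=2):
    """Build a chain of latent thought vectors, re-selecting tokens each step."""
    chain = [query]
    for _ in range(steps):
        selected = select_tokens(chain[-1], visual_tokens, k)
        chain.append(refine(chain[-1], selected))
    return chain

# Hypothetical 2-D "visual tokens" and query embedding, for illustration only.
visual_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
chain = latent_chain([0.5, 0.5], visual_tokens, steps=3, k=2)
```

Each entry of `chain` stays a continuous vector throughout the loop and is never decoded into text between steps, which is the property that distinguishes this style of reasoning from token-by-token chain-of-thought.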

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

CV and Pattern Recognition

Lets computers see and understand pictures better.

24 Nov 2025 0

92%

L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention

Computation and Language

Lets AI understand complex pictures by copying thinking steps.

22 Nov 2025 0

91%

Latent Chain-of-Thought for Visual Reasoning

Artificial Intelligence

Makes AI think step-by-step better for new problems.

27 Oct 2025 2

Page Count

31 pages
