Context-Aware Multi-Turn Visual-Textual Reasoning in LVLMs via Dynamic Memory and Adaptive Visual Guidance

Published: September 6, 2025 | arXiv ID: 2509.05669v1

By: Weijie Shen, Xinrui Wang, Yuanqi Nie and more

Potential Business Impact:

Helps AI remember past conversations to answer better.

Business Areas:

Computer Vision Hardware, Software

Current Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) excel in single-turn tasks but face significant challenges in multi-turn interactions requiring deep contextual understanding and complex visual reasoning, often leading to fragmented reasoning, context loss, and hallucinations. To address these limitations, we propose Context-Aware Multi-Turn Visual Reasoning (CAMVR), a novel framework designed to empower LVLMs with robust and coherent multi-turn visual-textual inference capabilities. CAMVR introduces two key innovations: a Visual-Textual Context Memory Unit (VCMU), a dynamic read-write memory network that stores and manages critical visual features, textual semantic representations, and their cross-modal correspondences from each interaction turn; and an Adaptive Visual Focus Guidance (AVFG) mechanism, which leverages the VCMU's context to dynamically adjust the visual encoder's attention to contextually relevant image regions. Our multi-level reasoning integration strategy ensures that response generation is deeply coherent with both current inputs and accumulated historical context. Extensive experiments on challenging datasets, including VisDial, an adapted A-OKVQA, and our novel Multi-Turn Instruction Following (MTIF) dataset, demonstrate that CAMVR consistently achieves state-of-the-art performance.
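The abstract does not give implementation details, so the following is only a toy sketch of the general idea: a read-write memory that stores one feature vector per turn, and a focus step that reweights image regions using the retrieved context. Every name here (`ToyContextMemory`, `focus_weights`, the oldest-first eviction policy, the dot-product similarity) is a hypothetical choice made for illustration, not the paper's VCMU or AVFG.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class ToyContextMemory:
    """Hypothetical read-write context memory (an assumption, not the paper's VCMU)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.slots = []  # one feature vector per stored turn

    def write(self, turn_vector):
        # Store the newest turn; evict the oldest when full (assumed policy).
        self.slots.append(list(turn_vector))
        if len(self.slots) > self.capacity:
            self.slots.pop(0)

    def read(self, query):
        # Attention-style read: softmax over query-slot similarity,
        # then a weighted sum of the stored vectors.
        if not self.slots:
            return [0.0] * len(query)
        weights = softmax([dot(query, s) for s in self.slots])
        return [
            sum(w * s[i] for w, s in zip(weights, self.slots))
            for i in range(len(query))
        ]


def focus_weights(region_features, context):
    """Weight image regions by similarity to the retrieved context (assumed scheme)."""
    return softmax([dot(r, context) for r in region_features])


memory = ToyContextMemory(capacity=2)
memory.write([1.0, 0.0])            # features from turn 1
memory.write([0.0, 1.0])            # features from turn 2
context = memory.read([5.0, 0.0])   # current query resembles turn 1
regions = [[1.0, 0.0], [0.0, 1.0]]  # two toy image-region features
weights = focus_weights(regions, context)
```

In this toy setup the query resembles the first stored turn, so the retrieved context, and in turn the region weights, lean toward the first region. A real system would use learned encoders and gating in place of the fixed dot products.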

ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

Computation and Language

Helps AI understand and follow long, visual instructions.

21 Aug 2025 1

90%

MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning

Artificial Intelligence

Helps AI understand conversations with many pictures.

24 Mar 2025 0

89%

Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models

Computation and Language

Helps computers understand pictures and think step-by-step.

4 Aug 2025 1

Page Count

14 pages
