VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making
By: Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani and more
Potential Business Impact:
Lets robots follow written instructions to do tasks using what they see.
While Large Language Models (LLMs) excel at reasoning over text and Vision-Language Models (VLMs) are highly effective for visual perception, applying these models to visual instruction-based planning remains a largely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline in which a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent's decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of its perception and reasoning components.
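The abstract describes a modular perceive-describe-reason-act loop. The sketch below illustrates that control flow in Python under stated assumptions: the class names (FrozenVLM, LLMPolicy, ToyEnv) and the environment interface are hypothetical stand-ins for illustration, not the paper's released code or the actual ALFWorld API.

```python
# Minimal sketch of the pipeline from the abstract: a frozen VLM turns image
# observations into text, and an LLM policy maps the task goal plus that text
# to an action. All components here are toy stand-ins, not the authors' code.

from dataclasses import dataclass, field
from typing import List, Tuple


class FrozenVLM:
    """Stand-in for the frozen vision-language model (perception module)."""

    def describe(self, image) -> str:
        # The real module would caption the image observation; here we return a fixed string.
        return "You are in a kitchen. You see a countertop with a mug and a closed drawer."


@dataclass
class LLMPolicy:
    """Stand-in for the LLM reasoning module (fine-tuned with behavioral cloning + RL)."""

    history: List[str] = field(default_factory=list)

    def act(self, goal: str, description: str) -> str:
        # The real policy conditions on the goal, the textual scene description,
        # and the interaction history to predict the next action.
        self.history.append(description)
        return "open drawer 1"


class ToyEnv:
    """Tiny ALFWorld-like environment, included only to make the loop runnable."""

    def reset(self):
        return "image_0"  # placeholder for an image observation

    def step(self, action: str) -> Tuple[str, bool]:
        return "image_1", True  # next observation, episode done


def run_episode(env, vlm, policy, goal: str, max_steps: int = 30) -> List[str]:
    """Roll out one episode: observe -> describe -> reason -> act."""
    image = env.reset()
    actions = []
    for _ in range(max_steps):
        description = vlm.describe(image)       # perception rendered as text
        action = policy.act(goal, description)  # decision based on goal + description
        actions.append(action)
        image, done = env.step(action)
        if done:
            break
    return actions


if __name__ == "__main__":
    print(run_episode(ToyEnv(), FrozenVLM(), LLMPolicy(), goal="put the mug in the drawer"))
```

The intermediate textual description is what makes each decision inspectable, which is the explainability benefit the abstract highlights.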
Similar Papers
More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models
CV and Pattern Recognition
Makes AI better at seeing and thinking.
Interleaved Latent Visual Reasoning with Selective Perceptual Modeling
Computation and Language
Makes AI understand pictures and words better, faster.
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
CV and Pattern Recognition
Helps computers see details better.