More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models

Published: December 13, 2025 | arXiv ID: 2512.12487v1

By: Hoang Anh Just, Yifei Fan, Handong Zhao and more

Potential Business Impact:

Makes AI better at seeing and thinking.

Business Areas:

Image Recognition, Data and Analytics, Software

Reinforcement learning from verifiable rewards (RLVR) has recently been extended from text-only LLMs to vision-language models (VLMs) to elicit long-chain multimodal reasoning. However, RLVR-trained VLMs still exhibit two persistent failure modes: inaccurate visual extraction (missing or hallucinating details) and logically inconsistent chains-of-thought, largely because verifiable signals supervise only the final answer. We propose PeRL-VL (Perception and Reasoning Learning for Vision-Language Models), a decoupled framework that separately improves visual perception and textual reasoning on top of RLVR. For perception, PeRL-VL introduces a VLM-based description reward that scores the model's self-generated image descriptions for faithfulness and sufficiency. For reasoning, PeRL-VL adds a text-only Reasoning SFT stage on logic-rich chain-of-thought data, enhancing coherence and logical consistency independently of vision. Across diverse multimodal benchmarks, PeRL-VL improves average Pass@1 accuracy from 63.3% (base Qwen2.5-VL-7B) to 68.8%, outperforming standard RLVR, text-only reasoning SFT, and naive multimodal distillation from GPT-4o.
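The abstract says only that the description reward scores self-generated image descriptions for faithfulness and sufficiency, and that verifiable signals otherwise supervise the final answer alone. It does not say how the two signals are combined. The sketch below is one plausible reading, not the paper's implementation: the names `DescriptionScores`, `answer_reward`, `description_reward`, `combined_reward`, the `judge` callable, the equal averaging of the two scores, and the `weight` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DescriptionScores:
    """Hypothetical output of a VLM judge; both scores assumed to lie in [0, 1]."""
    faithfulness: float  # does the description avoid hallucinated details?
    sufficiency: float   # does it cover the details needed to answer?


def answer_reward(predicted: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final answers match after normalization."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def description_reward(scores: DescriptionScores) -> float:
    """Average the two judge scores, clamping each to [0, 1] (an assumption)."""
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))
    return 0.5 * (clamp(scores.faithfulness) + clamp(scores.sufficiency))


def combined_reward(
    predicted: str,
    reference: str,
    description: str,
    judge: Callable[[str], DescriptionScores],
    weight: float = 0.5,  # hypothetical trade-off between the two signals
) -> float:
    """Add a weighted perception term to the final-answer reward."""
    perception = description_reward(judge(description))
    return answer_reward(predicted, reference) + weight * perception
```

In this reading, a rollout with a correct answer but an unfaithful description earns less than one that is correct and well grounded, which is the kind of signal a final-answer check alone cannot provide.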

Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models

CV and Pattern Recognition

Helps computers see and think better.

16 Sep 2025 1

93%

PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning

CV and Pattern Recognition

Teaches computers to understand pictures better together.

17 Jun 2025 2

93%

Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

Machine Learning (CS)

Teaches computers to see and think better.

8 Jun 2025 2

Page Count

24 pages
