VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism
By: Congzhi Zhang, Jiawei Peng, Zhenglin Wang, and more
Potential Business Impact:
Helps computers solve tricky math problems better.
Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.
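The search procedure described above (a tree whose nodes are reasoning steps, scored by a self-reward that combines sub-question utility, answer correctness, and vision-language clue relevance) follows the general Monte Carlo Tree Search pattern. The sketch below is illustrative only, not the paper's implementation: the `propose_steps` and `score_step` callbacks stand in for LVLM calls, and the reward weights are invented placeholders, not values from VReST.

```python
import math
import random

class Node:
    """A node in the reasoning search tree; each node holds one reasoning step."""
    def __init__(self, step, parent=None):
        self.step = step
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def ucb(self, c=1.4):
        # Standard UCT score: exploit average reward, explore unvisited nodes first.
        if self.visits == 0:
            return float("inf")
        return (self.total_reward / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def self_reward(utility, correctness, relevance, weights=(0.3, 0.4, 0.3)):
    """Toy multimodal self-reward: a weighted sum of sub-question utility,
    answer correctness, and vision-language clue relevance (each in [0, 1]).
    The weights here are arbitrary placeholders, not the paper's."""
    return sum(w * s for w, s in zip(weights, (utility, correctness, relevance)))

def mcts(propose_steps, score_step, iterations=50, seed=0):
    """Minimal MCTS loop: select a leaf via UCT, expand it with proposed
    reasoning steps, evaluate with the self-reward, and back-propagate."""
    rng = random.Random(seed)
    root = Node(step=None)
    for _ in range(iterations):
        # Selection: descend to a leaf by repeatedly taking the best-UCB child.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # Expansion: ask the (mocked) LVLM for candidate next reasoning steps.
        for step in propose_steps(node):
            node.children.append(Node(step, parent=node))
        leaf = rng.choice(node.children) if node.children else node
        # Evaluation and backpropagation of the self-reward up to the root.
        reward = score_step(leaf)
        while leaf is not None:
            leaf.visits += 1
            leaf.total_reward += reward
            leaf = leaf.parent
    # Return the most-visited first step (the "robust child" criterion).
    return max(root.children, key=lambda n: n.visits).step
```

A toy usage, with two hypothetical candidate steps where one scores higher on every reward component, so the search should concentrate visits on it:

```python
def propose(node):
    # Only the root proposes steps in this toy example.
    return ["compute the area", "guess randomly"] if node.step is None else []

def score(node):
    u, c, r = (0.9, 0.8, 0.9) if node.step == "compute the area" else (0.2, 0.1, 0.3)
    return self_reward(u, c, r)

best = mcts(propose, score, iterations=20)
```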
Similar Papers
Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger
CV and Pattern Recognition
Helps computers answer questions about pictures better.
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
CV and Pattern Recognition
Teaches computers to solve math problems better.
Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions
CV and Pattern Recognition
Finds hidden answers in old AI models.