Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering
By: Ao Zhou, Zebo Gu, Tenghao Sun, and more
Potential Business Impact:
Helps computers read and answer questions about Japanese PDF documents.
Multimodal Large Language Models (MLLMs) have demonstrated strong multimodal understanding in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still show significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models are strongly biased toward English training data, resulting in suboptimal performance in Japanese and other non-English scenarios. To address these challenges, this paper proposes a Japanese PDF document understanding framework that combines a multimodal hierarchical reasoning mechanism with ColQwen-optimized retrieval, and introduces a semantic verification strategy based on sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model's deep semantic parsing of complex documents, but also exhibits superior robustness in practical application scenarios.
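To make the pipeline concrete, here is a minimal sketch of the retrieve-then-reason flow the abstract describes. Everything in it is an illustrative assumption, not the authors' released code: the late-interaction MaxSim scoring mirrors how ColQwen/ColPali-style retrievers rank document pages, while `ask_mllm` and `decompose` are hypothetical callables standing in for the paper's unspecified prompts and model interface.

```python
# Sketch of: ColQwen-style page retrieval -> provisional answer ->
# sub-question verification. Function names and prompts are illustrative.
from typing import Callable, List, Sequence, Tuple
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: match each query-token embedding to
    its most similar page-patch embedding, then sum those maxima."""
    sims = query_vecs @ page_vecs.T  # (n_query_tokens, n_page_patches)
    return float(sims.max(axis=1).sum())

def retrieve_pages(query_vecs: np.ndarray,
                   page_index: Sequence[Tuple[str, np.ndarray]],
                   k: int = 3) -> List[str]:
    """Rank every PDF page by MaxSim relevance and keep the top-k page ids."""
    ranked = sorted(page_index,
                    key=lambda page: maxsim_score(query_vecs, page[1]),
                    reverse=True)
    return [page_id for page_id, _ in ranked[:k]]

def answer_with_verification(question: str,
                             choices: List[str],
                             pages: List[str],
                             ask_mllm: Callable[[str, List[str]], str],
                             decompose: Callable[[str], List[str]]) -> str:
    """Hierarchical reasoning with sub-question verification (hypothetical
    control flow): get a provisional choice over the retrieved pages, then
    check each sub-question; on a conflict, re-ask with that sub-question
    appended as context."""
    prompt = question + "\nChoices:\n" + "\n".join(choices)
    provisional = ask_mllm(prompt, pages)
    for sub_q in decompose(question):
        check = ask_mllm(
            f"Sub-question: {sub_q}\nIs the answer '{provisional}' consistent "
            f"with the page evidence? Reply yes or no.", pages)
        if check.strip().lower().startswith("no"):
            # Evidence conflicts: retry with the failed sub-question up front.
            provisional = ask_mllm(prompt + f"\nConsider first: {sub_q}", pages)
    return provisional
```

One design note on the retrieval side: late interaction keeps one embedding per page patch instead of a single pooled vector, which is what lets layout-heavy PDF pages match fine-grained queries; this is the property the abstract's ColQwen-based retriever relies on.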
Similar Papers
Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering
CV and Pattern Recognition
Helps computers answer questions using pictures and facts.
Unexplored flaws in multiple-choice VQA evaluations
CV and Pattern Recognition
Shows AI answers can change just by rewording the question.
Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs
CV and Pattern Recognition
Helps computers understand pictures by asking themselves questions.