Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

Published: September 25, 2025 | arXiv ID: 2509.21251v1

By: You-Won Jang, Yu-Jung Heo, Jaeseok Kim, and more

Potential Business Impact:

Helps computers understand pictures by asking questions.

Business Areas:
Semantic Search, Internet Services

The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, it still struggles with problems that require multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers, but this approach has two drawbacks: 1) the fine-grained visual content of images is unavailable to LLMs that cannot read visual information, and 2) the internal mechanisms of black-box LLMs are inaccessible and difficult to reproduce. To solve these problems, we propose SQ-InstructBLIP (Self-Questioning InstructBLIP), which improves inference performance by iteratively generating image-aware, informative sub-questions and sub-answers. SQ-InstructBLIP consists of a Questioner, an Answerer, and a Reasoner that share the same architecture. The Questioner and Answerer generate sub-questions and sub-answers that help answer the main question, and the Reasoner answers the main question considering the generated sub-question information. Our experiments show that SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than previous works.
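
The abstract describes an iterative loop in which the Questioner and Answerer accumulate sub-question/sub-answer pairs before the Reasoner commits to a final answer. Below is a minimal Python sketch of that control flow only; the module interfaces, argument shapes, and the `num_rounds` parameter are assumptions for illustration, not the paper's actual InstructBLIP implementation.

```python
# Sketch of the self-questioning loop from the abstract. The questioner,
# answerer, and reasoner callables are hypothetical stand-ins for the
# paper's three InstructBLIP-based modules.
from typing import Callable, List, Tuple

def self_questioning_vqa(
    image,                             # image input, in whatever form the modules expect
    main_question: str,
    questioner: Callable[..., str],    # assumed: (image, main_q, history) -> sub-question
    answerer: Callable[..., str],      # assumed: (image, sub_q) -> sub-answer
    reasoner: Callable[..., str],      # assumed: (image, main_q, history) -> final answer
    num_rounds: int = 3,               # assumed number of sub-QA iterations
) -> str:
    history: List[Tuple[str, str]] = []
    for _ in range(num_rounds):
        sub_q = questioner(image, main_question, history)  # image-aware sub-question
        sub_a = answerer(image, sub_q)                     # visually grounded sub-answer
        history.append((sub_q, sub_a))
    # The Reasoner answers the main question conditioned on the sub-QA pairs.
    return reasoner(image, main_question, history)
```

Per the abstract, all three roles share the same architecture, which is what keeps the pipeline image-aware and reproducible compared with black-box LLM approaches.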

Country of Origin
🇰🇷 Korea, Republic of

Page Count
5 pages

Category
Computer Science:
Computer Vision and Pattern Recognition