Instruction-tuned Self-Questioning Framework for Multimodal Reasoning
By: You-Won Jang, Yu-Jung Heo, Jaeseok Kim, and more
Potential Business Impact:
Helps computers understand pictures by asking questions.
The field of vision-language understanding has advanced rapidly in recent years, thanks to the development of Large Language Models (LLMs). However, these systems still struggle with problems that require multi-step reasoning, even for very simple questions. Recent studies tackle this problem by adopting LLMs to iteratively generate sub-questions and answers. However, this approach has two disadvantages: 1) the fine-grained visual content of images is unavailable to LLMs that cannot read visual information, and 2) with black-box LLMs, the internal mechanisms are inaccessible and the results are difficult to reproduce. To address these problems, we propose SQ (Self-Questioning)-InstructBLIP, which improves inference performance by iteratively generating image-aware, informative sub-questions and sub-answers. SQ-InstructBLIP consists of a Questioner, an Answerer, and a Reasoner that share the same architecture. The Questioner and Answerer generate sub-questions and sub-answers that help answer the main question, and the Reasoner answers the main question using the generated sub-question information. Our experiments show that SQ-InstructBLIP, which uses the generated sub-questions as additional context when solving the VQA task, reasons more accurately than previous works.
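The abstract describes a loop in which a Questioner and an Answerer iteratively build up sub-question/sub-answer pairs before a Reasoner produces the final answer. The sketch below illustrates that control flow only, with the three roles as plain callables; the class name, signatures, and round count are hypothetical stand-ins, not the paper's actual interface or models.

```python
# A minimal sketch of the iterative self-questioning loop described above.
# In the paper, the three roles are InstructBLIP-based models sharing one
# architecture; here they are illustrative callables, not the authors' API.
from dataclasses import dataclass
from typing import Callable, List, Tuple

QAHistory = List[Tuple[str, str]]  # accumulated (sub-question, sub-answer) pairs

@dataclass
class SelfQuestioningVQA:
    questioner: Callable[[str, str, QAHistory], str]  # (image, main_q, history) -> sub-question
    answerer: Callable[[str, str], str]               # (image, sub_q) -> sub-answer
    reasoner: Callable[[str, str, QAHistory], str]    # (image, main_q, history) -> final answer
    num_rounds: int = 3  # number of sub-QA iterations (assumed hyperparameter)

    def answer(self, image: str, main_question: str) -> str:
        history: QAHistory = []
        for _ in range(self.num_rounds):
            # Questioner proposes an image-aware sub-question conditioned on
            # the main question and the sub-QA pairs gathered so far.
            sub_q = self.questioner(image, main_question, history)
            # Answerer grounds the sub-question in the image itself, which a
            # text-only black-box LLM could not do.
            sub_a = self.answerer(image, sub_q)
            history.append((sub_q, sub_a))
        # Reasoner answers the main question using the accumulated sub-QA
        # pairs as additional context.
        return self.reasoner(image, main_question, history)

# Toy stand-ins so the sketch runs end to end.
vqa = SelfQuestioningVQA(
    questioner=lambda img, q, hist: f"Sub-question {len(hist) + 1}: what is near the dog?",
    answerer=lambda img, sq: "a red ball",
    reasoner=lambda img, q, hist: f"Answer to '{q}' using {len(hist)} sub-QA pairs",
)
print(vqa.answer("photo.jpg", "What is the dog playing with?"))
```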
Similar Papers
Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs
CV and Pattern Recognition
Helps computers understand pictures by asking themselves questions.
Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment
CV and Pattern Recognition
Helps computers judge picture quality better.
Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment
CV and Pattern Recognition
Makes AI judge picture quality more like people.