Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
By: Pu Jian, Donglei Yu, Wen Yang and more
Potential Business Impact:
Helps computers ask for help when confused.
In the context of visual question answering (VQA), users often pose ambiguous questions to vision-language models (VLMs) due to differing expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) there is no benchmark for assessing VLMs' capacity to resolve ambiguities through interaction; (2) VLMs are trained to prefer answering over asking, which prevents them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses a variety of VQA scenarios.
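To make the interactive clarification setting concrete, the sketch below shows one way a VLM could be wrapped in a clarify-then-answer loop: the model either answers directly or asks the user a clarification question and retries with the feedback. This is a minimal illustration, not the paper's method; the function `vlm_respond`, the `Turn` structure, and the "reply ending in '?' means clarification" convention are all assumptions made for this example.

```python
# Minimal sketch of an interactive clarification loop for ambiguous VQA.
# `vlm_respond` is a hypothetical stand-in for any VLM backend (API or local
# model); it is NOT part of the ClearVQA benchmark's released code.
from dataclasses import dataclass


@dataclass
class Turn:
    role: str   # "user" or "model"
    text: str


def vlm_respond(image_path: str, dialogue: list[Turn]) -> str:
    """Hypothetical VLM call: returns either an answer, or a clarification
    question ending in '?' when the model decides the query is ambiguous."""
    raise NotImplementedError("plug in your own VLM backend here")


def answer_with_clarification(image_path: str, question: str,
                              get_user_feedback, max_turns: int = 3) -> str:
    """Query the VLM; if it asks a clarification question, forward it to the
    user and retry with the feedback appended to the dialogue."""
    dialogue = [Turn("user", question)]
    reply = ""
    for _ in range(max_turns):
        reply = vlm_respond(image_path, dialogue)
        dialogue.append(Turn("model", reply))
        if not reply.strip().endswith("?"):
            return reply  # model committed to an answer
        # Model asked for clarification: collect user feedback and continue.
        feedback = get_user_feedback(reply)
        dialogue.append(Turn("user", feedback))
    return reply  # fall back to the last reply after max_turns
```

In practice, `get_user_feedback` could be a real user prompt or, for benchmark-style evaluation, a simulated user that answers from the intended (unambiguous) question.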
Similar Papers
CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering
Computation and Language
Helps computers understand what you mean.
The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering
Computer Vision and Pattern Recognition
Computers can now answer questions about pictures.
Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue
Robotics
Robot asks questions to do tasks better.