Can Vision-Language Models Answer Face to Face Questions in the Real-World?
By: Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya and more
Potential Business Impact:
AI talks about live video like a person.
AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in their ability to converse with users in real time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources of the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.
Similar Papers
Respond Beyond Language: A Benchmark for Video Generation in Response to Realistic User Intents
Artificial Intelligence
Helps AI make videos to answer questions.
Probing the Gaps in ChatGPT Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired
Human-Computer Interaction
AI helps blind people see with live video.
Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos
CV and Pattern Recognition
Makes learning videos ask you questions.