VQ-VA World: Towards High-Quality Visual Question-Visual Answering

Published: November 25, 2025 | arXiv ID: 2511.20573v1

By: Chenhui Gou, Zilong Chen, Zeyu Wang and more

Potential Business Impact:

Makes computers draw pictures from questions.

Business Areas:

Visual Search, Internet Services

This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question -- an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To bring this capability to open-source models as well, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Deployed at web scale, this pipeline crawls ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap to leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.

Visual question answering: from early developments to recent advances -- a survey

CV and Pattern Recognition

Lets computers answer questions about pictures.

7 Jan 2025 2

The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering

CV and Pattern Recognition

Computers can now answer questions about pictures.

13 Jan 2025 0

ViInfographicVQA: A Benchmark for Single and Multi-image Visual Question Answering on Vietnamese Infographics

CV and Pattern Recognition

Helps computers understand Vietnamese infographics better.

13 Dec 2025 1

Page Count

54 pages
