A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA

Published: September 24, 2025 | arXiv ID: 2509.20119v1

By: Belal Shoer, Yova Kementchedjhieva

Potential Business Impact:

Helps computers understand science pictures and text.

Business Areas:
Text Analytics, Data and Analytics, Software

Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this "text-in-image" format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.
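
The conversion of separate image-text pairs into unified images can be illustrated with a minimal layout sketch. The abstract does not describe the authors' rendering pipeline, so everything below is an assumption made for illustration: the function name `plan_unified_image`, the `Layout` container, the figure-above-text arrangement, and all default sizes are hypothetical. The sketch only plans where the figure and the wrapped question and answer options would sit on a single canvas; actual pixel rendering would be handled by an imaging library of your choice.

```python
import string
import textwrap
from dataclasses import dataclass


@dataclass
class Layout:
    canvas_width: int
    canvas_height: int
    figure_box: tuple  # (left, top, right, bottom) in pixels
    text_lines: list   # [(x, y, line_text), ...], one entry per rendered line


def plan_unified_image(fig_w, fig_h, question, options,
                       margin=20, line_height=24,
                       chars_per_line=60, char_width=10):
    """Plan one canvas with the figure on top and the question plus
    lettered answer options below it.

    All defaults are hypothetical values chosen for illustration, and
    char_width is a rough per-character pixel estimate, not a font metric.
    """
    # Wrap the question, then each option prefixed with its letter (A., B., ...).
    lines = textwrap.wrap(question, width=chars_per_line)
    for letter, option in zip(string.ascii_uppercase, options):
        lines.extend(textwrap.wrap(f"{letter}. {option}",
                                   width=chars_per_line,
                                   subsequent_indent="   "))

    # The canvas must be wide enough for both the figure and the text column.
    content_width = max(fig_w, chars_per_line * char_width)
    canvas_width = content_width + 2 * margin

    # The text block starts one margin below the bottom edge of the figure.
    text_top = margin + fig_h + margin
    canvas_height = text_top + len(lines) * line_height + margin

    figure_box = (margin, margin, margin + fig_w, margin + fig_h)
    text_lines = [(margin, text_top + i * line_height, line)
                  for i, line in enumerate(lines)]
    return Layout(canvas_width, canvas_height, figure_box, text_lines)


layout = plan_unified_image(400, 300, "Which gas is produced?",
                            ["Oxygen", "Hydrogen"])
```

A renderer would paste the figure into `layout.figure_box` and draw each entry of `layout.text_lines` at its coordinates, producing one image that carries the figure, the question, and the answer options together.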

Country of Origin
🇦🇪 United Arab Emirates

Page Count
6 pages

Category
Computer Science: Computer Vision and Pattern Recognition