A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA
By: Belal Shoer, Yova Kementchedjhieva
Potential Business Impact:
Helps computers understand science pictures and text.
Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this "text-in-image" format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable average gains across 13 languages and demonstrates cross-lingual transfer.
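The core conversion step can be sketched roughly as follows. The snippet below is a minimal illustration of the general idea rather than the authors' pipeline: it assumes Pillow, a default font, a simple top-to-bottom layout, and placeholder file names and multiple-choice formatting.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_in_image(figure_path, question, options, out_path,
                         canvas_width=1024, margin=20, line_height=32):
    """Compose a figure and its question/answer options into a single image."""
    figure = Image.open(figure_path).convert("RGB")

    # Scale the figure to the canvas width while keeping its aspect ratio.
    target_width = canvas_width - 2 * margin
    scale = target_width / figure.width
    figure = figure.resize((target_width, max(1, int(figure.height * scale))))

    # Question on top, lettered answer options below it.
    lines = [question] + [f"{chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    text_height = line_height * len(lines)

    canvas = Image.new("RGB",
                       (canvas_width, text_height + figure.height + 3 * margin),
                       "white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()

    y = margin
    for line in lines:
        draw.text((margin, y), line, fill="black", font=font)
        y += line_height

    # Paste the figure below the rendered text and save the unified image.
    canvas.paste(figure, (margin, y + margin))
    canvas.save(out_path)

# Example: turn one (figure, question, options) triple into a unified image.
render_text_in_image("figure.png",
                     "Which process is illustrated in the diagram?",
                     ["Photosynthesis", "Respiration", "Fermentation", "Osmosis"],
                     "unified.png")
```

Looping such a function over an existing VQA dataset with separate image-text pairs produces training examples in the same text-in-image format as EXAMS-V.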
Similar Papers
Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations
CV and Pattern Recognition
Helps computers answer questions about Vietnamese pictures.
Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
CV and Pattern Recognition
Helps computers answer science questions from pictures.
Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis
CV and Pattern Recognition
Computers can now answer questions about text in pictures.