Score: 2

Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Published: March 18, 2025 | arXiv ID: 2503.14140v1

By: Zining Wang , Tongkun Guan , Pei Fu and more

BigTech Affiliations: Meituan

Potential Business Impact:

Helps computers understand pictures and words together.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Multi-modal Large Language Models (MLLMs) have introduced a novel dimension to document understanding, i.e., they endow large language models with visual comprehension capabilities; however, how to design a suitable image-text pre-training task for bridging the visual and language modality in document-level MLLMs remains underexplored. In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. The former allows the model to implicitly align images and text at the semantic level. The latter introduces an additional mask generator (discarded during inference) to explicitly ensure alignment between visual texts within images and their corresponding image regions at a spatially-aware level. Together, they can prevent model hallucinations when parsing visual text and effectively promote spatially-aware feature representation learning. To support the proposed VQAMask task, we construct a comprehensive image-mask generation pipeline and provide a large-scale dataset with 6M data (MTMask6M). Subsequently, we demonstrate that introducing the proposed mask generation task yields competitive document-level understanding performance. Leveraging the proposed VQAMask, we introduce Marten, a training-efficient MLLM tailored for document-level understanding. Extensive experiments show that our Marten consistently achieves significant improvements among 8B-MLLMs in document-centric tasks. Code and datasets are available at https://github.com/PriNing/Marten.

VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

CV and Pattern Recognition

Helps computers answer questions using text and pictures.

11 Apr 2025 2

89%

Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering

Information Retrieval

Helps computers understand Japanese documents better.

22 Aug 2025 0

89%

AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering

CV and Pattern Recognition

Helps computers understand many pictures better.

25 Aug 2025 0

View PDF Login to Bookmark

Country of Origin

🇨🇳 China

Repos / Data Links

github.com

Page Count

17 pages

Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Helps computers understand pictures and words together.

Technical Abstract

VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering

Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering

AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering