MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning
By: Xuhui Zheng, Kang An, Ziliang Wang, and more
Potential Business Impact:
Teaches AI to understand pictures, not just words.
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, which leads models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in multimodal large language models (MLLMs). We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, providing learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency from attention over visual tokens and masking highly vision-dependent segments; the model then reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
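The masking step described in the abstract can be pictured with a short sketch. The Python snippet below is not from the paper; the function names, tensor shapes, the mean-attention dependency score, and the top-k masking rule are all illustrative assumptions. It shows one plausible way to turn per-token attention over visual tokens into sentence-level visual-dependency scores and to mask the most vision-dependent sentences so the model must reconstruct them.

# Minimal sketch (assumed, not the authors' implementation): score each
# sentence by how much attention its tokens place on visual tokens, then
# mask the highest-scoring sentences for vision-grounded reconstruction.
import numpy as np

MASK_TOKEN = "<mask>"  # placeholder span the model must reconstruct

def sentence_visual_dependency(attn, sentence_spans):
    """attn: (num_text_tokens, num_visual_tokens) cross-attention weights.
    sentence_spans: list of (start, end) token indices for each sentence.
    Returns one score per sentence: the mean attention mass its tokens
    place on visual tokens (an assumed proxy for visual dependency)."""
    scores = []
    for start, end in sentence_spans:
        scores.append(attn[start:end].sum(axis=1).mean())
    return np.array(scores)

def mask_vision_dependent_sentences(sentences, scores, mask_ratio=0.3):
    """Replace the most vision-dependent sentences with a mask token."""
    k = max(1, int(round(mask_ratio * len(sentences))))
    masked_idx = set(np.argsort(scores)[-k:])  # top-k by dependency score
    return [MASK_TOKEN if i in masked_idx else s for i, s in enumerate(sentences)]

if __name__ == "__main__":
    # Toy example: 3 sentences, 6 text tokens, 4 visual tokens.
    sentences = ["A dog runs.", "The sky is blue.", "It chases a red ball."]
    spans = [(0, 2), (2, 4), (4, 6)]
    rng = np.random.default_rng(0)
    attn = rng.random((6, 4))
    attn[4:6] *= 3.0  # pretend the last sentence attends heavily to vision
    scores = sentence_visual_dependency(attn, spans)
    print(mask_vision_dependent_sentences(sentences, scores, mask_ratio=0.34))

In MMRPT the reconstruction of these masked spans is then rewarded by a semantic-visual signal rather than trained purely by caption imitation; the reward itself is not sketched here.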
Similar Papers
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
CV and Pattern Recognition
Teaches computers to solve math problems better.
Reinforced Visual Perception with Tools
CV and Pattern Recognition
Teaches computers to understand pictures like people.
ViSS-R1: Self-Supervised Reinforcement Video Reasoning
CV and Pattern Recognition
Makes computers understand videos by watching them closely.