STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning
By: Jie Qin, Jiancheng Huang, Limeng Qiao, et al.
Multimodal large language models (MLLMs) play a pivotal role in advancing the quest for general artificial intelligence. However, achieving a unified objective for multimodal understanding and generation remains challenging due to optimization conflicts and performance trade-offs. To effectively enhance generative performance while preserving existing comprehension capabilities, we introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. By freezing the parameters of the foundational autoregressive (AR) model and progressively stacking isomorphic AR modules, it avoids cross-task interference while expanding the model's capabilities. Concurrently, we introduce a high-capacity vector quantizer (VQ) to enhance the granularity of image representations and employ an implicit reasoning mechanism to improve generation quality under complex conditions. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34), validating its efficacy for unified multimodal learning.
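The core idea of the task-progressive scheme (freeze what is already trained, stack a new isomorphic module for each new task stage) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class names `ARModule` and `StackedAR` are hypothetical, and the real AR modules are transformer stacks rather than empty placeholders.

```python
class ARModule:
    """Stand-in for one isomorphic autoregressive module (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.frozen = False

    def freeze(self):
        # In practice: stop gradient updates to this module's parameters.
        self.frozen = True


class StackedAR:
    """Progressively stack AR modules, one per task stage."""
    def __init__(self):
        self.stages = []

    def add_stage(self, name):
        # Freeze every previously trained stage before stacking a new one,
        # so training the new task cannot interfere with earlier capabilities.
        for module in self.stages:
            module.freeze()
        self.stages.append(ARModule(name))


model = StackedAR()
for stage in ["understanding", "generation", "editing"]:
    model.add_stage(stage)

# Only the most recently stacked stage remains trainable.
trainable = [m.name for m in model.stages if not m.frozen]
```

Under this scheme, `trainable` contains only `"editing"`: the understanding and generation stages were frozen when later stages were added, which is how cross-task interference is avoided.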