Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
By: Linyu Ou
Potential Business Impact:
Teaches small AI to think better with examples.
While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent RL stage. We conclude that a SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.
Similar Papers
Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models
Computation and Language
Makes AI think smarter, not longer.
Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners
Computation and Language
Makes AI smarter by learning from mistakes.
CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
Computation and Language
Makes AI better at thinking through problems.