Empowering Lightweight MLLMs with Reasoning via Long CoT SFT
By: Linyu Ou
Potential Business Impact:
Teaches small AI to think better with examples.
While Reinforcement Learning with Verifiable Rewards has enhanced the reasoning of large-scale language models (LLMs), its efficacy for lightweight multimodal language models (MLLMs) with fewer than seven billion parameters remains underexplored. This paper investigates the role of long Chain-of-Thought (long CoT) data in enhancing the reasoning abilities of such MLLMs. Our findings demonstrate that Supervised Fine-Tuning (SFT) with long CoT data significantly improves MLLM reasoning. Furthermore, we observe that after this initial SFT phase, MLLMs can achieve additional performance gains through a subsequent reinforcement learning (RL) stage. We conclude that an SFT stage with long CoT data is a critical prerequisite for developing the reasoning capabilities of lightweight MLLMs.
Similar Papers
Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models
Computation and Language
Makes AI think smarter, not longer.
The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Computation and Language
Makes AI better at thinking, but not always together.
Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks
Machine Learning (CS)
Teaches computers to solve hard math problems better.