Motion-R1: Chain-of-Thought Reasoning and Reinforcement Learning for Human Motion Generation
By: Runqi Ouyang, Haoyun Li, Zhenyuan Zhang and more
Potential Business Impact:
Makes characters move realistically from text descriptions.
Recent advances in large language models, especially in natural language understanding and reasoning, have opened new possibilities for text-to-motion generation. Although existing approaches have made notable progress in semantic alignment and motion synthesis, they often rely on end-to-end mapping strategies that fail to capture deep linguistic structures and logical reasoning. Consequently, generated motions tend to lack controllability, consistency, and diversity. To address these limitations, we propose Motion-R1, a unified motion-language modeling framework that integrates a Chain-of-Thought mechanism. By explicitly decomposing complex textual instructions into logically structured action paths, Motion-R1 provides high-level semantic guidance for motion generation, significantly enhancing the model's ability to interpret and execute multi-step, long-horizon, and compositionally rich commands. To train our model, we adopt Group Relative Policy Optimization, a reinforcement learning algorithm designed for large models, which leverages motion quality feedback to optimize reasoning chains and motion synthesis jointly. Extensive experiments across multiple benchmark datasets demonstrate that Motion-R1 achieves competitive or superior performance compared to state-of-the-art methods, particularly in scenarios requiring nuanced semantic understanding and long-term temporal coherence. The code, model and data will be publicly available.
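As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below normalizes rewards within one group of candidates sampled for the same prompt, so each candidate is scored relative to its siblings rather than by a learned value function. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example; the abstract does not describe the reward design or training code used by Motion-R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical motion-quality scores for four candidate outputs
# sampled from a single text instruction.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates scoring above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages. If every candidate in a group gets the same reward, all advantages are zero and that group contributes no learning signal.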
Similar Papers
Strong and Controllable 3D Motion Generation
CV and Pattern Recognition
Makes computer characters move faster and better.
RM-R1: Reward Modeling as Reasoning
Computation and Language
Makes AI explain its answers better.
IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation
CV and Pattern Recognition
Makes computer-made movements look more real.