Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives
By: Shuanghao Bai, Wenxuan Song, Jiayi Chen, and more
Recent advances in vision, language, and multimodal learning have substantially accelerated progress in robotic foundation models, with robot manipulation remaining a central and challenging problem. This survey examines robot manipulation from an algorithmic perspective and organizes recent learning-based approaches within a unified abstraction of high-level planning and low-level control. At the high level, we extend the classical notion of task planning to include reasoning over language, code, motion, affordances, and 3D representations, emphasizing their role in structured and long-horizon decision making. At the low level, we propose a training-paradigm-oriented taxonomy for learning-based control, organizing existing methods along input modeling, latent representation learning, and policy learning. Finally, we identify open challenges and prospective research directions related to scalability, data efficiency, multimodal physical interaction, and safety. Together, these analyses aim to clarify the design space of modern foundation models for robotic manipulation.