Language-Guided Long Horizon Manipulation with LLM-based Planning and Visual Perception
By: Changshi Zhou, Haichuan Xu, Ningquan Gu, and more
Potential Business Impact:
Robots learn to fold clothes by following language instructions.
Language-guided long-horizon manipulation of deformable objects presents significant challenges due to high degrees of freedom, complex dynamics, and the need for accurate vision-language grounding. In this work, we focus on multi-step cloth folding, a representative deformable-object manipulation task that requires both structured long-horizon planning and fine-grained visual perception. To this end, we propose a unified framework that integrates a Large Language Model (LLM)-based planner, a Vision-Language Model (VLM)-based perception system, and a task execution module. Specifically, the LLM-based planner decomposes high-level language instructions into low-level action primitives, bridging the semantic-execution gap, aligning perception with action, and enhancing generalization. The VLM-based perception module employs a SigLIP2-driven architecture with a bidirectional cross-attention fusion mechanism and weight-decomposed low-rank adaptation (DoRA) fine-tuning to achieve language-conditioned fine-grained visual grounding. Experiments in both simulation and real-world settings demonstrate the method's effectiveness. In simulation, it outperforms state-of-the-art baselines by 2.23, 1.87, and 33.3 on seen instructions, unseen instructions, and unseen tasks, respectively. On a real robot, it robustly executes multi-step folding sequences from language instructions across diverse cloth materials and configurations, demonstrating strong generalization in practical scenarios. Project page: https://language-guided.netlify.app/
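To make the two perception-side ingredients named in the abstract more concrete, below is a minimal PyTorch sketch of (1) a DoRA-adapted linear layer (weight-decomposed low-rank adaptation) and (2) a bidirectional cross-attention fusion block between image and text tokens. This is an illustrative sketch, not the authors' released code: the class names, dimensions, rank, and initialization choices are assumptions, and only the general DoRA/cross-attention formulations are taken from the literature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DoRALinear(nn.Module):
    """Weight-decomposed low-rank adaptation (DoRA) of a frozen linear layer.

    The pretrained weight W0 is split into a magnitude vector m and a direction;
    only m and the low-rank factors A, B are trained:
        W' = m * (W0 + B @ A) / ||W0 + B @ A||_c   (column-wise norm)
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # freeze pretrained weights
            p.requires_grad_(False)

        out_f, in_f = base.weight.shape
        self.scale = alpha / rank
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # low-rank factors
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        # magnitude initialised to the column norms of the pretrained weight
        self.m = nn.Parameter(base.weight.detach().norm(p=2, dim=0, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta = self.B @ self.A * self.scale                     # low-rank update
        v = self.base.weight + delta                             # updated direction
        w = self.m * v / v.norm(p=2, dim=0, keepdim=True)        # re-apply magnitude
        return F.linear(x, w, self.base.bias)


class BiCrossAttentionFusion(nn.Module):
    """Bidirectional cross-attention between image patch tokens and text tokens
    (dimensions and layer layout are illustrative assumptions)."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # image tokens query the language tokens, and vice versa
        img_ctx, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_ctx, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        img_fused = self.norm_img(img_tokens + img_ctx)          # residual + norm
        txt_fused = self.norm_txt(txt_tokens + txt_ctx)
        return img_fused, txt_fused
```

In a setup along these lines, one would typically wrap selected projection layers of a frozen SigLIP2-style backbone with `DoRALinear` and train only the DoRA parameters together with the fusion block, keeping the bulk of the pretrained vision-language weights unchanged.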
Similar Papers
ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models
Robotics
Robots learn to explore and do tasks better.
Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation
Robotics
Robots learn to do tricky jobs with speed and accuracy.
Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation
Robotics
Robots learn to do many steps in a row.