Domain-Conditioned Scene Graphs for State-Grounded Task Planning
By: Jonas Herzog, Jiangpin Liu, Yue Wang
Potential Business Impact:
Helps robots understand and plan tasks better.
Recent robotic task planning frameworks have integrated large multimodal models (LMMs) such as GPT-4o. To address the grounding issues of such models, prior work has proposed splitting the pipeline into perceptual state grounding and subsequent state-based planning. As we show in this work, the state grounding ability of LMM-based approaches is still limited by weaknesses in granular, structured, domain-specific scene understanding. To address this shortcoming, we develop a more structured state grounding framework that features a domain-conditioned scene graph as its scene representation. We show that this representation is actionable in nature, as it maps directly to a symbolic state in planning languages such as the Planning Domain Definition Language (PDDL). We provide an instantiation of our state grounding framework in which domain-conditioned scene graph generation is implemented with a lightweight vision-language approach that classifies domain-specific predicates on top of domain-relevant object detections. Evaluated across three domains, our approach achieves significantly higher state grounding accuracy and task planning success rates than LMM-based approaches.
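The abstract's key claim is that a domain-conditioned scene graph maps directly to a PDDL symbolic state: detected objects become typed object declarations, and classified predicate edges become ground atoms in the :init section. The sketch below illustrates one plausible form of that mapping in Python; all class and function names are hypothetical, as the paper's actual data structures are not given in the abstract.

```python
from dataclasses import dataclass

# Hypothetical minimal scene-graph structures; the paper's actual
# data model is not specified in the abstract.

@dataclass(frozen=True)
class SceneObject:
    name: str        # e.g. "cup1"
    category: str    # e.g. "cup", from domain-relevant object detection

@dataclass(frozen=True)
class PredicateEdge:
    predicate: str           # domain-specific predicate, e.g. "on"
    args: tuple[str, ...]    # argument object names, e.g. ("cup1", "table1")

def scene_graph_to_pddl_init(objects, edges):
    """Map a domain-conditioned scene graph to a PDDL problem fragment.

    Each detected object becomes a typed object declaration, and each
    classified predicate edge becomes a ground atom in :init.
    """
    obj_decls = " ".join(f"{o.name} - {o.category}" for o in objects)
    init_atoms = "\n    ".join(
        f"({e.predicate} {' '.join(e.args)})" for e in edges
    )
    return f"(:objects {obj_decls})\n(:init\n    {init_atoms}\n)"

# Example: a tabletop scene with one classified spatial predicate.
objects = [SceneObject("cup1", "cup"), SceneObject("table1", "table")]
edges = [PredicateEdge("on", ("cup1", "table1"))]
print(scene_graph_to_pddl_init(objects, edges))
```

In the framework described, the predicate edges would be produced by the lightweight vision-language component, which scores domain-specific predicates over pairs of detected objects rather than being hand-specified as above.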
Similar Papers
Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs
Robotics
Robots work together to follow spoken commands.
MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
CV and Pattern Recognition
Helps robots understand and use household objects.
Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning
Robotics
Robots learn to do tasks even when things change.