Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding
By: Oriol Barbany, Adrià Colomé, Carme Torras
Potential Business Impact:
Helps robots fold clothes by seeing and remembering.
Manipulating clothes is challenging due to their complex dynamics, high deformability, and frequent self-occlusions. Garments exhibit a nearly infinite number of configurations, making explicit state representations difficult to define. In this paper, we analyze BiFold, a model that predicts language-conditioned pick-and-place actions from visual observations, while implicitly encoding garment state through end-to-end learning. To address scenarios such as crumpled garments or recovery from failed manipulations, BiFold leverages temporal context to improve state estimation. We examine the internal representations of the model and present evidence that its fine-tuning and temporal context enable effective alignment between text and image regions, as well as temporal consistency.
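To make the idea described in the abstract concrete, here is a minimal, illustrative sketch in PyTorch of one way a language-conditioned pick-and-place predictor could incorporate temporal context: per-frame visual features and an instruction embedding are fused by a small transformer, and normalized pick/place pixel coordinates are regressed from the fused representation. This is not the BiFold implementation; the class name, feature dimensions, and the use of a generic transformer encoder are all assumptions made purely for illustration.

```python
# Illustrative sketch only -- NOT the BiFold architecture. It shows the general
# pattern of conditioning pick-and-place predictions on a language instruction
# plus a short history of visual observations. All names and dimensions are
# assumptions chosen for clarity.

import torch
import torch.nn as nn


class TemporalPickPlaceModel(nn.Module):
    """Predicts pick/place pixel coordinates from frame features + an instruction embedding."""

    def __init__(self, img_feat_dim=512, text_feat_dim=512, hidden_dim=256):
        super().__init__()
        # Stand-ins for features from pretrained vision/text encoders
        # (e.g., a CLIP-style backbone), projected to a shared width.
        self.visual_proj = nn.Linear(img_feat_dim, hidden_dim)
        self.text_proj = nn.Linear(text_feat_dim, hidden_dim)
        # Temporal context: a small transformer over the sequence of frame
        # tokens plus one language token.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Regress normalized (x, y) for pick and place from the language token.
        self.action_head = nn.Linear(hidden_dim, 4)

    def forward(self, frame_feats, text_feat):
        # frame_feats: (B, T, img_feat_dim) -- one feature vector per past frame
        # text_feat:   (B, text_feat_dim)   -- embedding of the folding instruction
        frames = self.visual_proj(frame_feats)               # (B, T, H)
        text = self.text_proj(text_feat).unsqueeze(1)        # (B, 1, H)
        tokens = torch.cat([text, frames], dim=1)            # (B, T+1, H)
        encoded = self.temporal_encoder(tokens)              # (B, T+1, H)
        # Use the language token's output as the action query.
        actions = torch.sigmoid(self.action_head(encoded[:, 0]))  # (B, 4) in [0, 1]
        pick_xy, place_xy = actions[:, :2], actions[:, 2:]
        return pick_xy, place_xy


if __name__ == "__main__":
    model = TemporalPickPlaceModel()
    frame_feats = torch.randn(2, 4, 512)   # batch of 2, history of 4 frames
    text_feat = torch.randn(2, 512)        # instruction embeddings
    pick_xy, place_xy = model(frame_feats, text_feat)
    print(pick_xy.shape, place_xy.shape)   # torch.Size([2, 2]) torch.Size([2, 2])
```

In this toy setup, feeding several past frames rather than a single image is what gives the model a chance to disambiguate crumpled configurations or recover after a failed grasp, mirroring the role temporal context plays in the paper's analysis.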
Similar Papers
Real-Time Per-Garment Virtual Try-On with Temporal Consistency for Loose-Fitting Garments
Graphics
Lets you try on clothes virtually, even baggy ones.
Language-Guided Long Horizon Manipulation with LLM-based Planning and Visual Perception
Robotics
Robots learn to fold clothes from instructions.
MetaFold: Language-Guided Multi-Category Garment Folding Framework via Trajectory Generation and Foundation Model
Robotics
Teaches robots to fold clothes from any instruction.