Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction
By: Boran Wen, Ye Lu, Keyan Wan and more
Potential Business Impact:
Robots learn to copy human actions from videos.
Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. Thus, in this work, we introduce 4DHOISolver, a novel and efficient optimization framework that constrains the ill-posed 4D HOI reconstruction problem by leveraging sparse, human-in-the-loop contact point annotations, while maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 144 object types and 103 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. However, a comprehensive benchmark of existing 3D foundation models indicates that automatically predicting precise human-object contact correspondences remains an unsolved problem, underscoring the immediate necessity of our human-in-the-loop strategy while posing an open challenge to the community. Data and code will be publicly available at https://wenboran2002.github.io/open4dhoi/
Similar Papers
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
CV and Pattern Recognition
Makes videos of hands touching objects realistic.
InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
CV and Pattern Recognition
Makes robots better at picking up and using things.
MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
CV and Pattern Recognition
Helps computers understand how people use things.