Reconstructing Objects along Hand Interaction Timelines in Egocentric Video
By: Zhifan Zhu, Siddhant Bansal, Shashank Tripathi, and more
Potential Business Impact:
Helps computers guess object shapes from videos.
We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is initially static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before the object is released and becomes static again w.r.t. the scene. We model these pose constraints over the HIT and propose to propagate the object's pose along the HIT, enabling superior reconstruction with our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps, i.e. where the hand holds the object stably, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets: HOT3D and the in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps, covering 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.
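To make the core constraint concrete, below is a minimal sketch, not the authors' released code, of why a stable grasp helps: because the hand-to-object transform is constant while the grasp holds, a single reliable object pose can be propagated to every frame from tracked hand poses, and the result can be scored with 2D projection error when no 3D ground truth exists. The function names, the 4x4 homogeneous-transform convention, and the camera-frame assumption are all illustrative assumptions, not the paper's API.

```python
import numpy as np

def propagate_object_poses(cam_T_hand, t_ref, cam_T_obj_ref):
    """Propagate an object pose across a stable grasp.

    Under a stable grasp the hand-to-object transform is constant, so a
    single reliable object pose at frame t_ref determines the pose at
    every frame of the grasp:
        cam_T_obj[t] = cam_T_hand[t] @ hand_T_obj

    cam_T_hand: list of 4x4 camera-frame hand poses, one per frame.
    cam_T_obj_ref: 4x4 object pose estimated at frame t_ref.
    """
    hand_T_obj = np.linalg.inv(cam_T_hand[t_ref]) @ cam_T_obj_ref
    return [c_T_h @ hand_T_obj for c_T_h in cam_T_hand]

def projection_error_px(cam_T_obj, obj_points, K, annotated_uv):
    """Mean 2D reprojection error in pixels, computable without 3D GT.

    obj_points:   (N, 3) points on the object model, in object frame.
    K:            (3, 3) camera intrinsics.
    annotated_uv: (N, 2) annotated 2D locations of the same points.
    """
    pts_h = np.hstack([obj_points, np.ones((len(obj_points), 1))])
    pts_cam = (cam_T_obj @ pts_h.T).T[:, :3]   # object frame -> camera frame
    uv = (K @ pts_cam.T).T                      # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]                 # perspective divide
    return float(np.linalg.norm(uv - annotated_uv, axis=1).mean())
```

In this toy formulation, improving the single reference pose (for instance, by optimising it against constraints over the whole grasp) immediately improves every propagated frame, which conveys the intuition behind propagating poses along the HIT.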
Similar Papers
Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
CV and Pattern Recognition
Makes videos of hands touching objects realistic.
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views
CV and Pattern Recognition
Finds exact moments hands touch objects.
ECHO: Ego-Centric modeling of Human-Object interactions
CV and Pattern Recognition
Tracks what you're doing with your hands.