EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos

Published: January 3, 2026 | arXiv ID: 2601.01050v1

By: Hongming Fu, Wenjia Wang, Xiaozhen Qiao, and more

Potential Business Impact:

Lets robots understand how people grab things.

Business Areas:
Image Recognition, Data and Analytics, Software

We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos captured by dynamic cameras in the wild. Accurate W-HOI reconstruction is critical for understanding human behavior and enabling applications in embodied intelligence and virtual reality. However, existing hand-object interaction (HOI) methods are limited to single images or camera coordinates, failing to model temporal dynamics or consistent global trajectories. Some recent approaches attempt world-space hand estimation but overlook object poses and HOI constraints; their performance also degrades under the severe camera motion and frequent occlusions common in egocentric in-the-wild videos. To address these challenges, we introduce a multi-stage framework comprising a robust pre-processing pipeline built on newly developed spatial intelligence models, a whole-body HOI prior model based on decoupled diffusion models, and a multi-objective test-time optimization paradigm. Our HOI prior model is template-free and scales to multiple objects. In experiments, we show that our method achieves state-of-the-art performance in W-HOI reconstruction.
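To make the abstract's "multi-objective test-time optimization" idea concrete, here is a minimal illustrative sketch (not the paper's actual implementation): refining a world-space trajectory by gradient descent on a weighted sum of a hypothetical data term (fidelity to noisy per-frame estimates) and a temporal-smoothness term. The loss terms, weights, and `optimize_trajectory` function are assumptions for illustration only.

```python
import numpy as np

def optimize_trajectory(obs, w_data=1.0, w_smooth=10.0, lr=0.01, iters=500):
    """Refine a (T, 3) trajectory of noisy per-frame world-space estimates.

    Minimizes  w_data * ||x - obs||^2 + w_smooth * sum_t ||x_{t+1} - x_t||^2
    by plain gradient descent. Both terms are hypothetical stand-ins for the
    paper's multi-objective losses.
    """
    x = obs.copy()
    for _ in range(iters):
        # Data-term gradient: pull each frame toward its observation.
        g = w_data * 2.0 * (x - obs)
        # Smoothness-term gradient: penalize frame-to-frame jitter.
        diff = np.diff(x, axis=0)          # (T-1, 3)
        g[:-1] -= w_smooth * 2.0 * diff
        g[1:] += w_smooth * 2.0 * diff
        x -= lr * g
    return x

# Toy usage: a straight-line motion corrupted by Gaussian noise.
rng = np.random.default_rng(0)
true = np.stack([np.linspace(0.0, 1.0, 30)] * 3, axis=1)
obs = true + 0.05 * rng.standard_normal(true.shape)
refined = optimize_trajectory(obs)

def jitter(a):
    """Mean absolute frame-to-frame change, a proxy for temporal noise."""
    return np.abs(np.diff(a, axis=0)).mean()
```

After optimization, the refined trajectory should exhibit lower frame-to-frame jitter than the raw observations while staying close to them, which is the basic trade-off any such multi-objective test-time optimization balances.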

Page Count
11 pages

Category
Computer Science:
CV and Pattern Recognition