AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation
By: Sisi Dai, Kai Xu
Potential Business Impact:
Enables computers to generate realistic human-object interaction animations from text.
Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, their scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, these approaches distill only minimal interaction cues during generation, which restricts their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that thoroughly exploits hybrid priors by incorporating video diffusion models alongside image diffusion models to advance 4D HOI generation. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then leverages them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods in both diversity and generalization.
Similar Papers
GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects
CV and Pattern Recognition
Creates realistic human-object actions for computers.
Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors
Graphics
Creates realistic 3D actions from text descriptions.
VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification
CV and Pattern Recognition
Makes videos of people interacting with objects.