
AnchorHOI: Zero-shot Generation of 4D Human-Object Interaction via Anchor-based Prior Distillation

Published: December 16, 2025 | arXiv ID: 2512.14095v1

By: Sisi Dai, Kai Xu

Potential Business Impact:

Makes computers create realistic human-object videos.

Business Areas:

Image Recognition Data and Analytics, Software

Despite significant progress in text-driven 4D human-object interaction (HOI) generation with supervised methods, scalability remains limited by the scarcity of large-scale 4D HOI datasets. To overcome this, recent approaches attempt zero-shot 4D HOI generation with pre-trained image diffusion models. However, these approaches distill few interaction cues during generation, which restricts their applicability across diverse scenarios. In this paper, we propose AnchorHOI, a novel framework that exploits hybrid priors by incorporating video diffusion models in addition to image diffusion models. Nevertheless, directly optimizing high-dimensional 4D HOI with such priors remains challenging, particularly for human pose and compositional motion. To address this challenge, AnchorHOI introduces an anchor-based prior distillation strategy, which constructs interaction-aware anchors and then uses them to guide generation in a tractable two-step process. Specifically, two tailored anchors are designed for 4D HOI generation: anchor Neural Radiance Fields (NeRFs) for expressive interaction composition, and anchor keypoints for realistic motion synthesis. Extensive experiments demonstrate that AnchorHOI outperforms previous methods in diversity and generalization.

GenHOI: Generalizing Text-driven 4D Human-Object Interaction Synthesis for Unseen Objects

CV and Pattern Recognition

Creates realistic human-object actions for computers.

18 Jun 2025 1

92%

Zero-Shot Human-Object Interaction Synthesis with Multimodal Priors

Graphics

Creates realistic 3D actions from text descriptions.

25 Mar 2025 0

91%

VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification

CV and Pattern Recognition

Makes videos of people interacting with objects.

10 Dec 2025 1


Page Count

9 pages
