Learning to Generate Human-Human-Object Interactions from Textual Descriptions
By: Jeonghyeon Na, Sangwon Baik, Inhee Lee and more
Potential Business Impact:
Teaches computers to show people interacting with objects.
The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem: modeling the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOI dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interactions (HOIs) and human-human interactions (HHIs) from the HHOIs, and with these data, we train text-to-HOI and text-to-HHI models using score-based diffusion models. Finally, we present a unified generative framework that integrates the two individual models and is capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.
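The abstract does not spell out how the two diffusion models are integrated, so the following is only a toy illustration of one common way to combine score-based models, not the authors' sampling procedure. It assumes two hypothetical 1-D Gaussian "models" standing in for the HOI and HHI models, sums their score functions (which corresponds to sampling from the product of the two densities), and draws samples with plain Langevin dynamics. All function names and parameter values are invented for illustration.

```python
import math
import random


def gaussian_score(x, mean, var):
    """Score (gradient of the log-density) of a 1-D Gaussian at x."""
    return -(x - mean) / var


def composed_score(x):
    """Sum of two scores: a stand-in for combining two trained models.

    Assumption: model A is N(0, 1) and model B is N(2, 1). Their product
    is proportional to a Gaussian with mean 1 and variance 0.5.
    """
    return gaussian_score(x, 0.0, 1.0) + gaussian_score(x, 2.0, 1.0)


def langevin_sample(score_fn, rng, n_steps=500, step=0.01):
    """Run unadjusted Langevin dynamics driven by score_fn; return the final x."""
    x = rng.gauss(0.0, 1.0)  # start from a standard normal draw
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x += step * score_fn(x) + math.sqrt(2.0 * step) * noise
    return x


def sample_many(n, seed=0):
    """Draw n independent samples using one seeded random generator."""
    rng = random.Random(seed)
    return [langevin_sample(composed_score, rng) for _ in range(n)]


if __name__ == "__main__":
    samples = sample_many(500)
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(f"sample mean: {mean:.3f}, sample variance: {var:.3f}")
```

In this toy setting the samples should concentrate around the point both "models" find plausible (mean near 1, variance near 0.5). A real text-conditioned system would replace the Gaussian scores with learned, noise-level-dependent networks.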
Similar Papers
MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
CV and Pattern Recognition
Helps computers understand how people use things.
UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space
CV and Pattern Recognition
Helps computers understand how people use things.
Modeling the Multivariate Relationship with Contextualized Representations for Effective Human-Object Interaction Detection
CV and Pattern Recognition
Helps computers understand how people use tools.