Explicit Multimodal Graph Modeling for Human-Object Interaction Detection
By: Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang
Potential Business Impact:
Helps computers understand how people interact with objects.
Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM), which leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level vision and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.
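The core idea of the abstract — propagating information over a graph of human and object nodes and scoring human-object pairs against language features — can be illustrated with a minimal sketch. This is not the authors' MGNM architecture (the paper's four-stage structure and feature-interaction mechanism are not reproduced here); it is a generic NumPy toy showing one message-passing round over a bipartite human-object graph, followed by cosine scoring of pair features against hypothetical verb text embeddings:

```python
import numpy as np

def normalize(x, axis=-1, eps=1e-8):
    """L2-normalize feature vectors along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def message_passing_step(node_feats, adj, W):
    """One GNN round: sum neighbor features along edges, project, squash."""
    agg = adj @ node_feats                      # (N, D) aggregated messages
    return np.tanh((node_feats + agg) @ W)      # residual update + projection

def score_pairs(human_idx, object_idx, node_feats, lang_embeds):
    """Score each human-object pair against verb text embeddings
    via cosine similarity of the fused pair feature."""
    pair = normalize(node_feats[human_idx] + node_feats[object_idx])
    return pair @ normalize(lang_embeds).T      # (pairs, verbs)

rng = np.random.default_rng(0)
N, D, V = 4, 8, 3                 # 2 humans + 2 objects, feature dim, 3 verbs
node_feats = rng.normal(size=(N, D))
# Bipartite adjacency: humans 0,1 connect to objects 2,3.
adj = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [1, 1, 0, 0],
                [1, 1, 0, 0]], dtype=float)
W = rng.normal(size=(D, D)) / np.sqrt(D)
lang_embeds = rng.normal(size=(V, D))  # stand-in for verb text features

for _ in range(2):                     # two propagation rounds
    node_feats = message_passing_step(node_feats, adj, W)

scores = score_pairs([0, 1], [2, 3], node_feats, lang_embeds)
print(scores.shape)                    # (2, 3): 2 pairs x 3 candidate verbs
```

In a real HOI detector the node features would come from a detector backbone and the language embeddings from a pretrained text encoder; the cosine-score matrix would then be calibrated into per-verb interaction probabilities.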
Similar Papers
Geometric Visual Fusion Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos
CV and Pattern Recognition
Helps computers understand what people do with objects.
Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs
CV and Pattern Recognition
Lets computers understand any action between people and things.
MMHOI: Modeling Complex 3D Multi-Human Multi-Object Interactions
CV and Pattern Recognition
Helps computers understand how people use things.