HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models

Published: March 24, 2025 | arXiv ID: 2503.19157v1

By: Mingzhen Huang, Fu-Jen Chu, Bugra Tekin and more

Potential Business Impact:

Lets computers create and describe 3D hand-object actions.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences). At its core, HOIGPT utilizes a large language model to predict the bidirectional transformation between HOI sequences and natural language descriptions. Given text inputs, HOIGPT generates a sequence of hand and object meshes; given (partial) HOI sequences, HOIGPT generates text descriptions and completes the sequences. To facilitate HOI understanding with a large language model, this paper introduces two key innovations: (1) a novel physically grounded HOI tokenizer, the hand-object decomposed VQ-VAE, for discretizing HOI sequences, and (2) a motion-aware language model trained to process and generate both text and HOI tokens. Extensive experiments demonstrate that HOIGPT sets new state-of-the-art performance on both text generation (+2.01% R Precision) and HOI generation (-2.56 FID) across multiple tasks and benchmarks.
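
The abstract describes discretizing HOI sequences with a VQ-VAE tokenizer. As a rough illustration only, the sketch below shows the generic vector-quantization step such tokenizers build on: each per-frame feature vector is replaced by the index of its nearest codebook entry. It is not HOIGPT's tokenizer. The codebooks, the 2-D features, and the use of separate hand and object codebooks are all assumptions made for this example, and the encoder, decoder, and physical grounding from the paper are omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each frame feature to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look up the codebook vector for each token index."""
    return [codebook[t] for t in tokens]


# Hypothetical toy codebooks (assumption: one for hands, one for objects).
HAND_CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
OBJECT_CODEBOOK = [(0.0, 0.0), (2.0, 2.0)]

# Made-up per-frame features standing in for encoder outputs.
hand_frames = [(0.1, -0.1), (0.9, 0.2), (0.1, 0.8)]
object_frames = [(0.2, 0.1), (1.8, 2.1), (1.9, 1.7)]

hand_tokens = quantize(hand_frames, HAND_CODEBOOK)
object_tokens = quantize(object_frames, OBJECT_CODEBOOK)
```

In this toy setup the resulting integer sequences play the role of discrete tokens that a language model could consume alongside text tokens; `dequantize` recovers the approximate features from them.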

HOIGaze: Gaze Estimation During Hand-Object Interactions in Extended Reality Exploiting Eye-Hand-Head Coordination

CV and Pattern Recognition

Tracks where you look when you use your hands.

28 Apr 2025 0

88%

HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model

CV and Pattern Recognition

Helps robots understand what people do with things.

15 Aug 2025 1

88%

Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation

CV and Pattern Recognition

Makes videos of hands touching objects realistic.

1 Dec 2025 1


Page Count

11 pages
