Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

Published: August 11, 2025 | arXiv ID: 2508.07804v1

By: Bao Li, Xiaomei Zhang, Miao Xu and more

Potential Business Impact:

Makes computers create 3D body poses from pictures.

Generating 3D human poses from multimodal inputs such as images or text requires models to capture rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle both to model the task's inherent ambiguity and to achieve the task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization over sampled responses to guide joint optimization of discrete and continuous actions. Pose-RFT further incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of hybrid action reinforcement fine-tuning for 3D pose generation.
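
The abstract mentions group-wise reward normalization over sampled responses and joint optimization of discrete and continuous actions, but gives no formulas. The following is a minimal sketch of what those two ingredients could look like, not the paper's implementation. It assumes the common GRPO-style normalization (subtract the group mean, divide by the group standard deviation) and assumes the continuous pose action is scored with a diagonal Gaussian density. The function names `group_normalized_advantages` and `hybrid_log_prob`, the `eps` parameter, and the Gaussian choice are all hypothetical.

```python
import math
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Assumption: GRPO-style normalization, (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def hybrid_log_prob(token_log_probs, pose, pose_mean, pose_std):
    """Joint log-probability of one hybrid action (text tokens plus a pose vector).

    Assumption: the discrete and continuous parts are treated as independent,
    and the pose is modeled by a diagonal Gaussian. Neither detail is stated
    in the abstract.
    """
    discrete = sum(token_log_probs)
    continuous = sum(
        -0.5 * ((x - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for x, m, s in zip(pose, pose_mean, pose_std)
    )
    return discrete + continuous


# Three sampled responses for one prompt, each with a scalar task reward.
advantages = group_normalized_advantages([1.0, 2.0, 3.0])
```

In a policy-gradient update built on this sketch, each response's advantage would weight its hybrid log-probability, so responses that score above their group's average are reinforced in both the language and the pose outputs.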

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

CV and Pattern Recognition

Makes AI understand videos better, like a detective.

9 Apr 2025 1

88%

RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control

Robotics

Robots learn new moves from written instructions.

15 Jun 2025 0

88%

VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

CV and Pattern Recognition

Teaches computers to understand videos like people.

18 May 2025 2

Country of Origin

🇨🇳 China

Page Count

15 pages
