Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

Published: December 27, 2025 | arXiv ID: 2512.22464v1

By: Sukhyun Jeong, Yong-Hoon Choi

Potential Business Impact:

Creates realistic character movements from text descriptions.

Business Areas:

Motion Capture, Media and Entertainment, Video

Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
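The abstract's core idea, a coarse code refined by successive residual codes, can be illustrated with a minimal residual vector quantization (RVQ) sketch in NumPy. This is a toy under stated assumptions, not the authors' tokenizer: the codebooks are random rather than learned, the first stage merely stands in for the pose latent (in the paper, pose codes come from quantifiable pose attributes, which this sketch does not model), and the names `nearest_code`, `rvq_encode`, `rvq_decode`, and `keep_stages` are hypothetical. Each codebook is given a zero vector so that adding a stage can never increase the reconstruction error in this toy setup.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return, for each row of x, the index of the closest codebook row (L2)."""
    # x: (n, d), codebook: (k, d) -> squared distances: (n, k)
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def rvq_encode(x, codebooks):
    """Quantize stage by stage; each stage encodes what earlier stages missed."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        idx = nearest_code(residual, cb)
        indices.append(idx)
        residual = residual - cb[idx]  # pass the leftover to the next stage
    return indices

def rvq_decode(indices, codebooks, keep_stages=None):
    """Sum the selected code vectors; keep_stages drops the later stages."""
    n_stages = len(codebooks) if keep_stages is None else keep_stages
    out = np.zeros((len(indices[0]), codebooks[0].shape[1]))
    for idx, cb in zip(indices[:n_stages], codebooks[:n_stages]):
        out += cb[idx]
    return out

# Toy setup (all sizes are illustrative assumptions, not values from the paper).
rng = np.random.default_rng(0)
dim, codes_per_stage, n_stages = 8, 16, 3
codebooks = []
for s in range(n_stages):
    # Later stages use smaller-scale codes, since residuals tend to shrink.
    cb = rng.normal(scale=1.0 / (2 ** s), size=(codes_per_stage, dim))
    cb[0] = 0.0  # zero code: choosing it leaves the residual unchanged
    codebooks.append(cb)

motion = rng.normal(size=(5, dim))  # 5 toy "frames" of latent features
indices = rvq_encode(motion, codebooks)

coarse = rvq_decode(indices, codebooks, keep_stages=1)  # first stage only
full = rvq_decode(indices, codebooks)                   # all stages

err_coarse = np.linalg.norm(motion - coarse)
err_full = np.linalg.norm(motion - full)
print("first-stage error:", err_coarse)
print("all-stage error:  ", err_full)
```

Decoding with `keep_stages=1` shows what the coarse code alone can reconstruct. Randomly discarding later stages during training is one plausible reading of the residual dropout mentioned in the abstract, but the paper's exact scheme is not specified here.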

Making Pose Representations More Expressive and Disentangled via Residual Vector Quantization

CV and Pattern Recognition

Makes computer-made people move more realistically.

20 Aug 2025 0

89%

MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

CV and Pattern Recognition

Makes videos follow real-world physics rules.

3 Dec 2025 4

89%

Dynamic Motion Blending for Versatile Motion Editing

CV and Pattern Recognition

Makes animated characters move how you describe.

26 Mar 2025 1

Country of Origin

🇰🇷 Korea, Republic of

Repos / Data Links

github.com

Page Count

10 pages
