PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment
By: Dewen Zhang, Tahir Hussain, Wangpeng An, and more
Potential Business Impact:
Helps computers understand body poses from pictures.
Human pose estimation traditionally relies on architectures that encode keypoint priors, limiting their generalization to novel poses or unseen keypoints. Recent language-guided approaches like LocLLM reformulate keypoint localization as a vision-language task, enabling zero-shot generalization through textual descriptions. However, LocLLM's linear projector fails to capture complex spatial-textual interactions critical for high-precision localization. To address this, we propose PoseLLM, the first Large Language Model (LLM)-based pose estimation framework that replaces the linear projector with a nonlinear MLP vision-language connector. This lightweight two-layer MLP with GELU activation enables hierarchical cross-modal feature transformation, enhancing the fusion of visual patches and textual keypoint descriptions. Trained exclusively on COCO data, PoseLLM achieves 77.8 AP on the COCO validation set, outperforming LocLLM by +0.4 AP, while maintaining strong zero-shot generalization on Human-Art and MPII. Our work demonstrates that a simple yet powerful nonlinear connector significantly boosts localization accuracy without sacrificing generalization, advancing the state-of-the-art in language-guided pose estimation. Code is available at https://github.com/Ody-trek/PoseLLM.
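To make the architectural change concrete, below is a minimal sketch of the kind of nonlinear vision-language connector the abstract describes: a lightweight two-layer MLP with GELU activation that projects visual patch features into the LLM embedding space, in place of a single linear projector. The class name, feature dimensions, and tensor shapes are illustrative assumptions, not values taken from the paper or its released code.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Two-layer MLP vision-language connector with GELU activation.

    Maps visual patch features into the LLM's embedding space, replacing
    a single linear projector. Dimensions are illustrative assumptions.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # lift vision features to LLM width
            nn.GELU(),                       # nonlinearity enabling richer cross-modal mapping
            nn.Linear(llm_dim, llm_dim),     # second projection in LLM embedding space
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)     # (batch, num_patches, llm_dim)


# Usage sketch: project vision-encoder patch tokens before concatenating
# them with textual keypoint-description tokens for the LLM.
connector = MLPConnector()
patches = torch.randn(2, 256, 1024)   # dummy visual patch features
llm_tokens = connector(patches)       # shape: (2, 256, 4096)
```

Compared with a single linear layer, the extra projection and GELU let the connector learn a nonlinear mapping between visual patches and the LLM's token space, which is the hierarchical cross-modal transformation the abstract credits for the accuracy gain.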
Similar Papers
HMVLM: Human Motion-Vision-Language Model via MoE LoRA
CV and Pattern Recognition
Teaches computers to understand and create human movement.
Large Language Model Guided Progressive Feature Alignment for Multimodal UAV Object Detection
CV and Pattern Recognition
Helps drones see objects better using text.
LGM-Pose: A Lightweight Global Modeling Network for Real-time Human Pose Estimation
CV and Pattern Recognition
Makes computers see people's bodies faster.