Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space
By: Kaiwen Wang, Kaili Zheng, Yiming Shi and more
Potential Business Impact:
Makes 3D people in pictures stand correctly.
Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric: each human is processed individually, without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a novel relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code and data will be released publicly.
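To make the idea of jointly placing a crowd under a MAP objective concrete, here is a minimal toy sketch. It is not the authors' DTO formulation, which the abstract does not spell out. It assumes a pinhole camera, a Gaussian prior on standing height, and a monocular depth map that is correct only up to one unknown global scale shared by everyone in the image. The function name `joint_map_depths`, its parameters, and all default values (1.7 m mean height, 0.1 m height spread, 0.3 m depth noise) are hypothetical choices made for illustration.

```python
def joint_map_depths(pixel_heights, rel_depths, focal_px,
                     mean_height_m=1.7, height_std_m=0.1, depth_std_m=0.3):
    """Toy joint MAP estimate: one shared depth scale plus a depth per person.

    Assumed model (illustrative only, not taken from the paper):
      height prior:  z_i ~ N(f * mu_H / h_i, (f * sigma_H / h_i)^2)
      depth cue:     z_i ~ N(s * d_i, sigma_d^2), with s shared by all persons
    """
    # Pinhole model z = f * H / h turns the height prior into a depth prior.
    prior_means = [focal_px * mean_height_m / h for h in pixel_heights]
    prior_vars = [(focal_px * height_std_m / h) ** 2 for h in pixel_heights]
    depth_var = depth_std_m ** 2

    # Minimizing the quadratic objective over each z_i leaves a weighted
    # least-squares problem in the single shared scale s.
    weights = [1.0 / (depth_var + v) for v in prior_vars]
    num = sum(w * d * m for w, d, m in zip(weights, rel_depths, prior_means))
    den = sum(w * d * d for w, d in zip(weights, rel_depths))
    scale = num / den

    # Given s, each z_i is the precision-weighted mean of its two Gaussian terms.
    depths = [
        (scale * d / depth_var + m / v) / (1.0 / depth_var + 1.0 / v)
        for d, m, v in zip(rel_depths, prior_means, prior_vars)
    ]
    return scale, depths


# Two people: one spans 500 px, the other 250 px, in an image with f = 1000 px.
scale, depths = joint_map_depths([500, 250], [1.0, 2.0], focal_px=1000)
```

In this example the relative depths already agree with the height prior, so the person who appears half as tall is placed twice as far from the camera. The shared scale is what couples the individuals: a single-person pipeline would have no such term, which is the inconsistency the abstract describes.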
Similar Papers
MetricHMR: Metric Human Mesh Recovery from Monocular Images
CV and Pattern Recognition
Makes 3D body models from one picture.
PressTrack-HMR: Pressure-Based Top-Down Multi-Person Global Human Mesh Recovery
CV and Pattern Recognition
Lets mats track many people's movements without cameras.
Asset-Driven Sematic Reconstruction of Dynamic Scene with Multi-Human-Object Interactions
CV and Pattern Recognition
Makes 3D models of moving people and things.