Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation
By: Luca Cazzola, Ahed Alboody
Potential Business Impact:
Creates realistic human movements from text descriptions.
The acquisition cost of large, annotated motion datasets remains a critical bottleneck for skeleton-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives and dataset structures, which emphasize general artistic motion, differ fundamentally from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, leaving generalist T2M models ill-equipped to generate motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions than the generalist baseline, providing a robust data augmentation source that delivers an improvement of +23.1 accuracy points. Animated illustrations and supplementary materials are available at https://lucazzola.github.io/publications/kinemic.
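To make the kinetic mining idea concrete, the sketch below shows one plausible way to establish the CLIP-based correspondences the abstract describes: embed the sparse HAR action labels and the T2M captions with a CLIP text encoder, then retrieve the most semantically similar T2M samples for each target action. This is a minimal illustration under assumed names (`har_labels`, `t2m_captions`, `top_k`, and the `openai/clip-vit-base-patch32` checkpoint), not the paper's actual implementation.

```python
# Hypothetical sketch of CLIP-based kinetic mining: match sparse HAR labels
# to T2M source captions by cosine similarity of CLIP text embeddings.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

MODEL_ID = "openai/clip-vit-base-patch32"  # assumed CLIP variant
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID)
text_encoder = CLIPTextModelWithProjection.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(texts):
    """Return L2-normalized CLIP text embeddings, shape (len(texts), d)."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    emb = text_encoder(**inputs).text_embeds
    return emb / emb.norm(dim=-1, keepdim=True)

def mine_correspondences(har_labels, t2m_captions, top_k=32):
    """For each HAR label, return indices of the top_k most similar T2M captions."""
    sim = embed(har_labels) @ embed(t2m_captions).T   # cosine similarity matrix
    return sim.topk(top_k, dim=-1).indices            # (num_labels, top_k)

# Illustrative usage: mine HumanML3D-style captions for two NTU-style actions.
har_labels = ["a person drinks water", "a person jumps up"]
t2m_captions = [
    "a person raises a cup to their mouth",
    "someone leaps into the air",
    "a figure waves both arms",
]
matches = mine_correspondences(har_labels, t2m_captions, top_k=2)
print(matches)  # caption indices most aligned with each action label
```

The mined caption-motion pairs could then serve as the soft supervision signal for fine-tuning the T2M backbone on the target action classes, as described in the abstract.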
Similar Papers
Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
CV and Pattern Recognition
Checks if fake human videos look real.
A Renaissance of Explicit Motion Information Mining from Transformers for Action Recognition
CV and Pattern Recognition
Helps computers understand fast movements in videos.
Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
CV and Pattern Recognition
Makes videos move exactly how you want.