Beyond Discrete Categories: Multi-Task Valence-Arousal Modeling for Pet Vocalization Analysis
By: Junyao Huang, Rumin Situ
Potential Business Impact:
Helps people understand how pets feel from the sounds they make.
Traditional pet emotion recognition from vocalizations relies on discrete classification, which handles ambiguous cases poorly and cannot capture variations in intensity. We propose a continuous Valence-Arousal (VA) model that represents emotions in a two-dimensional space. Our method uses an automatic VA label generation algorithm, enabling large-scale annotation of 42,553 pet vocalization samples. A multi-task learning framework jointly trains VA regression with auxiliary tasks (emotion category, body size, gender), improving shared feature learning and, in turn, VA prediction. Our Audio Transformer model achieves a validation Pearson correlation of r = 0.9024 for Valence and r = 0.7155 for Arousal, effectively resolving confusion between discrete categories such as "territorial" and "happy." This work introduces the first continuous VA framework for pet vocalization analysis, offering a more expressive representation for human-pet interaction, veterinary diagnostics, and behavioral training. The approach shows strong potential for deployment in consumer products such as AI pet emotion translators.
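To make the multi-task setup concrete, here is a minimal PyTorch sketch: a shared transformer encoder over log-mel spectrogram frames feeds a tanh-bounded VA regression head plus cross-entropy auxiliary heads. All layer sizes, class counts, and the auxiliary loss weight are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a multi-task valence-arousal model (assumed configuration).
import torch
import torch.nn as nn

class MultiTaskVAModel(nn.Module):
    def __init__(self, n_mels=64, d_model=256, n_heads=4, n_layers=4,
                 n_emotions=8, n_sizes=3):
        super().__init__()
        # Project log-mel frames into the transformer dimension.
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Main task: continuous valence and arousal, bounded to [-1, 1].
        self.va_head = nn.Sequential(nn.Linear(d_model, 2), nn.Tanh())
        # Auxiliary tasks regularize and sharpen the shared representation.
        self.emotion_head = nn.Linear(d_model, n_emotions)
        self.size_head = nn.Linear(d_model, n_sizes)
        self.gender_head = nn.Linear(d_model, 2)

    def forward(self, x):  # x: (batch, time, n_mels) log-mel spectrogram
        h = self.encoder(self.proj(x)).mean(dim=1)  # mean-pool over time
        return {
            "va": self.va_head(h),
            "emotion": self.emotion_head(h),
            "size": self.size_head(h),
            "gender": self.gender_head(h),
        }

def multitask_loss(out, va, emotion, size, gender, aux_weight=0.3):
    # VA regression (MSE) plus weighted auxiliary cross-entropy terms.
    loss = nn.functional.mse_loss(out["va"], va)
    for head, target in (("emotion", emotion), ("size", size),
                         ("gender", gender)):
        loss = loss + aux_weight * nn.functional.cross_entropy(out[head], target)
    return loss
```

In this sketch, aux_weight controls how strongly the emotion, size, and gender heads shape the shared encoder relative to the primary VA objective; the jointly learned representation is what lets the model separate acoustically similar categories along the continuous valence and arousal axes.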
Similar Papers
MMVA: Multimodal Matching Based on Valence and Arousal across Images, Music, and Musical Captions
Sound
Matches pictures and music to feelings.
Interactive Multimodal Fusion with Temporal Modeling
CV and Pattern Recognition
Lets computers guess your feelings from faces and voices.