PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music
By: Hayeon Bang, Eunjin Choi, Seungheon Doh, and more
Potential Business Impact:
Helps computers understand the moods and styles of piano music.
Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.
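The abstract does not spell out the architecture, but the core idea of a multimodal joint embedding, projecting audio, symbolic (MIDI), and text features into one shared space and aligning them contrastively so that text-to-music retrieval becomes a nearest-neighbor search, can be illustrated with a minimal sketch. The encoder shapes, feature dimensions, and pairwise InfoNCE pairing below are illustrative assumptions, not the paper's actual design.

# Minimal sketch of a tri-modal joint embedding (audio, symbolic, text),
# trained with pairwise contrastive (InfoNCE-style) losses.
# Feature dimensions and the loss pairing scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps pre-extracted modality features into a shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def info_nce(z_a, z_b, temperature: float = 0.07):
    """Symmetric contrastive loss over a batch of aligned embedding pairs."""
    logits = z_a @ z_b.T / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Hypothetical dimensions for pre-extracted audio, MIDI, and text features.
audio_head, midi_head, text_head = ProjectionHead(768), ProjectionHead(512), ProjectionHead(768)

# One training step on a batch of aligned (audio, MIDI, text) triplets.
batch = 8
audio_feat = torch.randn(batch, 768)
midi_feat = torch.randn(batch, 512)
text_feat = torch.randn(batch, 768)
z_audio, z_midi, z_text = audio_head(audio_feat), midi_head(midi_feat), text_head(text_feat)

# Bind all three modalities by summing pairwise contrastive losses.
loss = (info_nce(z_audio, z_text) +
        info_nce(z_midi, z_text) +
        info_nce(z_audio, z_midi))
loss.backward()

# At inference, text-to-music retrieval reduces to ranking the stored music
# embeddings by cosine similarity with the query text embedding.

In this kind of setup, the choice of which modality pairs to contrast, and how to weight them, is exactly the sort of design decision the paper reports investigating for small, homogeneous piano datasets.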
Similar Papers
EBind: a practical approach to space binding
Machine Learning (CS)
Makes AI understand images, sound, and words faster.
PianoVAM: A Multimodal Piano Performance Dataset
Sound
Helps computers learn to play piano by watching.
Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation
Sound
Records piano playing with sound, video, and finger movements.