Training a Perceptual Model for Evaluating Auditory Similarity in Music Adversarial Attack
By: Yuxuan Liu, Rui Sang, Peihong Zhang and more
Potential Business Impact:
Makes music AI understand songs like people do.
Music Information Retrieval (MIR) systems are highly vulnerable to adversarial attacks that are often imperceptible to humans, primarily because model feature spaces are misaligned with human auditory perception. Existing defenses and perceptual metrics frequently fail to capture these auditory nuances, a limitation supported by our initial listening tests showing low correlation between common metrics and human judgments. To bridge this gap, we introduce the Perceptually-Aligned MERT Transformer (PAMT), a novel framework for learning robust, perceptually-aligned music representations. Our core innovation is a psychoacoustically-conditioned sequential contrastive transformer, a lightweight projection head built atop a frozen MERT encoder. PAMT achieves a Spearman correlation coefficient of 0.65 with subjective scores, outperforming existing perceptual metrics. Our approach also yields an average improvement of 9.15% in robust accuracy on challenging MIR tasks, including Cover Song Identification and Music Genre Classification, under diverse perceptual adversarial attacks. This work pioneers architecturally-integrated psychoacoustic conditioning, yielding representations significantly more aligned with human perception and robust against music adversarial attacks.
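To make the described setup concrete, below is a minimal sketch (not the authors' code) of a lightweight transformer projection head on top of a frozen MERT-style encoder, trained with an InfoNCE-style contrastive loss. The class names, dimensions, and the additive psychoacoustic-conditioning step are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch: lightweight contrastive projection head over a frozen encoder.
# All names, shapes, and the conditioning mechanism are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, enc_dim=768, proj_dim=128, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=enc_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(enc_dim, proj_dim)

    def forward(self, frame_embeddings, psychoacoustic_feats=None):
        # Hypothetical conditioning: add psychoacoustic features (e.g. loudness
        # or masking curves) already projected to the encoder dimension.
        x = frame_embeddings
        if psychoacoustic_feats is not None:
            x = x + psychoacoustic_feats
        x = self.transformer(x)            # (B, T, enc_dim)
        x = x.mean(dim=1)                  # temporal pooling -> (B, enc_dim)
        return F.normalize(self.proj(x), dim=-1)

def info_nce(anchor, positive, temperature=0.07):
    """Symmetric InfoNCE over a batch of (anchor, positive) pairs."""
    logits = anchor @ positive.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Usage: stand-in tensors in place of frozen MERT frame embeddings (B x T x 768).
frozen_embeddings = torch.randn(4, 250, 768)
perturbed_embeddings = frozen_embeddings + 0.01 * torch.randn_like(frozen_embeddings)
head = ProjectionHead()
loss = info_nce(head(frozen_embeddings), head(perturbed_embeddings))
loss.backward()
```

Because the encoder stays frozen, only the small projection head is optimized, which keeps training lightweight while the contrastive objective pulls perceptually similar clips together in the learned space.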
Similar Papers
Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks
Sound
Tests if computers truly hear music.
The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs
Artificial Intelligence
Tests AI's ability to understand music.
Real-world Music Plagiarism Detection With Music Segment Transcription System
Artificial Intelligence
Finds copied music, even if it sounds different.