FusID: Modality-Fused Semantic IDs for Generative Music Recommendation
By: Haven Kim, Yupeng Hou, Julian McAuley
Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities, which reduces efficiency, and (2) failure to capture inter-modal interactions, which limits the quality of item representations. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion, which learns unified representations by jointly encoding information across modalities; (ii) representation learning, which pulls the embeddings of frequently co-occurring items closer while maintaining distinctiveness and preventing feature redundancy; and (iii) product quantization, which converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflicts. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts (each token sequence maps to exactly one song), mitigates codebook underutilization, and outperforms baselines on MRR and Recall@k (k = 1, 5, 10, 20).
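To make the quantization step concrete, below is a minimal sketch (not the authors' code) of how product quantization can turn a fused continuous item embedding into a multi-token semantic ID: the embedding is split into subspaces, each subspace gets its own k-means codebook, and an item's ID is the tuple of nearest-centroid indices. The function names, the 128-dimensional embeddings, and the choice of 4 subspaces with 256-entry codebooks are illustrative assumptions, not FusID's actual configuration.

```python
# Sketch: product quantization of fused item embeddings into semantic IDs.
# All hyperparameters below (subspaces, codebook size, dims) are assumptions.
import numpy as np

def train_codebooks(fused_embs, num_subspaces=4, codebook_size=256, iters=20, seed=0):
    """Learn one k-means codebook per subspace of the fused embeddings."""
    rng = np.random.default_rng(seed)
    n, d = fused_embs.shape
    assert d % num_subspaces == 0, "embedding dim must split evenly into subspaces"
    sub_dim = d // num_subspaces
    codebooks = []
    for m in range(num_subspaces):
        sub = fused_embs[:, m * sub_dim:(m + 1) * sub_dim]
        # Initialize centroids from randomly chosen items, then run plain k-means.
        centroids = sub[rng.choice(n, size=codebook_size, replace=False)].copy()
        for _ in range(iters):
            # Squared Euclidean distances between all items and all centroids.
            dists = ((sub ** 2).sum(1, keepdims=True)
                     - 2.0 * sub @ centroids.T
                     + (centroids ** 2).sum(1))
            assign = dists.argmin(axis=1)
            for c in range(codebook_size):
                members = sub[assign == c]
                if len(members) > 0:
                    centroids[c] = members.mean(axis=0)
        codebooks.append(centroids)
    return codebooks

def to_semantic_id(fused_emb, codebooks):
    """Map one fused embedding to its discrete token sequence (semantic ID)."""
    sub_dim = len(fused_emb) // len(codebooks)
    tokens = []
    for m, centroids in enumerate(codebooks):
        sub = fused_emb[m * sub_dim:(m + 1) * sub_dim]
        dists = ((centroids - sub) ** 2).sum(axis=1)
        tokens.append(int(dists.argmin()))
    return tuple(tokens)  # e.g. (17, 203, 4, 91): one token per subspace

# Toy usage: 10k items with 128-d fused embeddings.
items = np.random.randn(10_000, 128).astype(np.float32)
books = train_codebooks(items)
print(to_semantic_id(items[0], books))
```

Because each subspace contributes an independent token, the ID space grows multiplicatively (here 256^4 possible sequences), which is what makes collision-free, one-to-one song-to-ID mappings feasible even for large catalogs.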
Similar Papers
Semantic IDs for Music Recommendation
Information Retrieval
Makes music apps suggest better songs, faster.
Semantic IDs for Joint Generative Search and Recommendation
Information Retrieval
Helps computers recommend and find things better.
From ID-based to ID-free: Rethinking ID Effectiveness in Multimodal Collaborative Filtering Recommendation
Information Retrieval
Recommends better by ignoring user IDs.