MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling
By: Bin Wu, Feifan Yang, Zhangming Chan and more
Potential Business Impact:
Helps online stores show you better ads.
Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features and therefore suffer from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage, the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in the Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data are available at https://taobao-mm.github.io.
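The GSU idea stated in the abstract, retrieving behaviors by plain cosine similarity over multimodal embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `gsu_top_k`, the toy 2-dimensional embeddings, and the choice of k are all assumptions made for illustration, and a production system would use batched, vectorized computation over far longer sequences.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gsu_top_k(
    target: Sequence[float],
    behavior_embeddings: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Hypothetical GSU step: score every past behavior against the target
    item and keep the k most similar, as (behavior index, score) pairs."""
    scored = [
        (i, cosine_similarity(target, emb))
        for i, emb in enumerate(behavior_embeddings)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep sequence order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data (assumed): one target item and four past behaviors.
target_item = [1.0, 0.0]
behaviors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = gsu_top_k(target_item, behaviors, k=2)
print([index for index, _ in top])  # prints [0, 2]
```

In this sketch the retrieved subsequence (here, behaviors 0 and 2) is what would be handed to the ESU for the richer sequence modeling and ID-multimodal fusion that the abstract describes; that second stage is not shown.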
Similar Papers
MISS: Multi-Modal Tree Indexing and Searching with Lifelong Sequential Behavior for Retrieval Recommendation
Information Retrieval
Finds better videos by using past actions.
Multimodal Foundation Model-Driven User Interest Modeling and Behavior Analysis on Short Video Platforms
Information Retrieval
Shows you videos you'll actually like.
MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization
CV and Pattern Recognition
Creates images that perfectly match feelings.