
MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling

Published: December 8, 2025 | arXiv ID: 2512.07216v1

By: Bin Wu, Feifan Yang, Zhangming Chan and more

BigTech Affiliations: Alibaba

Potential Business Impact:

Helps online stores show you better ads.

Business Areas:

Semantic Search, Internet Services

Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features and therefore suffer from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage, the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in the Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data are available at https://taobao-mm.github.io.
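The GSU idea described in the abstract, retrieving the behaviors most similar to a target item by cosine similarity over multimodal embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `gsu_top_k`, the toy two-dimensional embeddings, and the choice of `k` are all assumptions made for illustration, and a production system over 100K-length sequences would use batched, vectorized retrieval rather than pure Python loops.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gsu_top_k(
    target: Sequence[float],
    behaviors: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Hypothetical GSU-style retrieval: score every behavior embedding
    against the target embedding and keep the k highest-scoring ones.

    Returns (behavior_index, similarity) pairs, best first.
    """
    scored = [
        (index, cosine_similarity(target, embedding))
        for index, embedding in enumerate(behaviors)
    ]
    # Python's sort is stable, so ties keep their original sequence order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy example: embeddings are made up, not taken from the released dataset.
target_embedding = [1.0, 0.0]
behavior_embeddings = [
    [1.0, 0.0],   # same direction as the target
    [0.0, 1.0],   # orthogonal to the target
    [2.0, 0.1],   # nearly the same direction
    [-1.0, 0.0],  # opposite direction
]
retrieved = gsu_top_k(target_embedding, behavior_embeddings, k=2)
retrieved_indices = [index for index, _ in retrieved]
```

In a two-stage setup of this kind, the retrieved subsequence (here `retrieved_indices`) would then be handed to the ESU for finer-grained sequence modeling; that stage is not sketched here because the abstract does not specify its architecture.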

MISS: Multi-Modal Tree Indexing and Searching with Lifelong Sequential Behavior for Retrieval Recommendation

Information Retrieval

Finds better videos by using past actions.

20 Aug 2025 1

90%

Multimodal Foundation Model-Driven User Interest Modeling and Behavior Analysis on Short Video Platforms

Information Retrieval

Shows you videos you'll actually like.

5 Sep 2025 0

89%

MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

CV and Pattern Recognition

Creates images that perfectly match feelings.

26 Nov 2025 0


Country of Origin

🇨🇳 China

Page Count

10 pages
