Learning Unified User Quantized Tokenizers for User Representation
By: Chuan He, Yang Chen, Wuliang Huang and more
Potential Business Impact:
Helps apps know you better with less data.
Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U^2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, a causal Q-Former projects domain-specific features into a shared causal representation space to preserve inter-modality dependencies; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results showcase U^2QT's advantages across diverse downstream tasks, outperforming task-specific baselines in future behavior prediction and recommendation tasks while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.
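The second stage of the abstract rests on residual quantization: an embedding is turned into a short sequence of codebook indices, with each level encoding what the previous level left unexplained. The following is a minimal sketch of that general idea only, not the authors' implementation. It assumes one shared level followed by one source-specific level, uses random rather than learned codebooks, and every name in it (`residual_quantize`, `shared_codebook`, `source_codebooks`, `source_a`, `source_b`) is hypothetical. The abstract does not say how the shared and source-specific codebooks are actually combined or trained.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sqdist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantize(x, codebooks):
    """Quantize x level by level.

    Each level picks the codeword nearest to the current residual and
    subtracts it, so later levels encode what earlier levels missed.
    Returns (token ids, reconstruction as the sum of chosen codewords).
    """
    residual = list(x)
    recon = [0.0] * len(x)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, recon


def random_codebook(rng, size, dim, scale):
    """Stand-in for a learned codebook: `size` random codewords of length `dim`."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(size)]


rng = random.Random(0)
DIM, SIZE = 8, 16

# Assumption: one codebook shared by all sources, plus one per source.
shared_codebook = random_codebook(rng, SIZE, DIM, 1.0)
source_codebooks = {
    "source_a": random_codebook(rng, SIZE, DIM, 0.5),
    "source_b": random_codebook(rng, SIZE, DIM, 0.5),
}

# A stand-in for one embedding coming out of the first stage.
embedding = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

tokens, recon = residual_quantize(
    embedding, [shared_codebook, source_codebooks["source_a"]]
)
print("tokens:", tokens)
```

In this sketch an 8-dimensional float vector is stored as two small integers, which illustrates where the storage savings come from. In a real RQ-VAE the codebooks are learned jointly with an encoder and decoder, which this example omits.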
Similar Papers
A Hierarchical Quantized Tokenization Framework for Task-Adaptive Graph Representation Learning
Information Retrieval
Helps computers understand and learn from complex data.
MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation
Information Retrieval
Helps online stores recommend better, even for new items.
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Machine Learning (CS)
Makes smart phone AI run much faster and smaller.