Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion
By: Hai-Dang Kieu, Min Xu, Thanh Trung Huynh and more
Potential Business Impact:
Improves online shopping suggestions using pictures and words.
Recent advances in multimodal recommendation (MMR) have shown that incorporating rich content sources such as images and text can lead to significant gains in representation quality. However, existing methods often rely on coarse visual features and uncontrolled fusion, leading to redundant or misaligned representations. As a result, visual encoders often fail to capture salient, item-relevant semantics, limiting their contribution to multimodal fusion. From an information-theoretic perspective, effective fusion should balance the unique, shared, and redundant information across modalities, preserving complementary cues while avoiding correlation bias. This paper presents VLIF, a vision-language and information-theoretic fusion framework that enhances multimodal recommendation through two key components. (i) A VLM-based visual enrichment module generates fine-grained, title-guided descriptions to transform product images into semantically aligned representations. (ii) An information-aware fusion module, inspired by Partial Information Decomposition (PID), disentangles redundant and synergistic signals across modalities for controlled integration. Experiments on three Amazon datasets demonstrate that VLIF consistently outperforms recent multimodal baselines and substantially strengthens the contribution of visual features.
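For readers unfamiliar with PID, the standard two-source decomposition can be written as below. This is the textbook identity, not necessarily the exact objective used in VLIF, which the abstract does not spell out. The symbols are assumptions made for illustration: X_v and X_t stand for the visual and textual representations of an item, Y for the recommendation target, R for redundant information, U_v and U_t for the information unique to each modality, and S for synergistic information available only from both modalities together.

```latex
% Standard two-source Partial Information Decomposition (illustrative notation,
% not taken from the paper): X_v = visual, X_t = textual, Y = target.
\begin{align}
  I(X_v, X_t; Y) &= R + U_v + U_t + S, \\
  I(X_v; Y)      &= R + U_v, \\
  I(X_t; Y)      &= R + U_t.
\end{align}
```

Read this way, a fusion module that separates R from S can avoid double-counting what both modalities already agree on while keeping the signal that only emerges when image and text are combined.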
Similar Papers
Enhancing Multimodal Recommendations with Vision-Language Models and Information-Aware Fusion
Information Retrieval
Helps online stores show you better stuff.
Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation
Information Retrieval
Makes movie suggestions better using pictures and words.
Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
Information Retrieval
Finds better products you'll like to buy.