Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion

Published: December 26, 2025 | arXiv ID: 2512.21863v1

By: Huatuan Sun, Yunshan Ma, Changguang Wu and more

Frozen Large Video Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration lacks systematic empirical evaluation: practitioners typically deploy LVLMs as fixed black-box feature extractors without systematically comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual semantics crucial for recommendation; (2) ID embeddings capture irreplaceable collaborative signals, rendering fusion strictly superior to replacement; and (3) the effectiveness of intermediate decoder features varies significantly across layers. Guided by these insights, we propose the Dual Feature Fusion (DFF) Framework, a lightweight and plug-and-play approach that adaptively fuses multi-layer representations from frozen LVLMs with item ID embeddings. DFF achieves state-of-the-art performance on two real-world micro-video recommendation benchmarks, consistently outperforming strong baselines and providing a principled approach to integrating off-the-shelf large vision-language models into micro-video recommender systems.
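
The abstract does not say how DFF performs its adaptive fusion, so the following is only a minimal sketch of the general idea it describes: pooling hidden states from several layers of a frozen LVLM and fusing the result with an item ID embedding instead of replacing it. The softmax layer weighting, the linear projection, the sigmoid gate, and every name here (`fuse_item_features`, `layer_logits`, `proj`, `gate_w`, `gate_b`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_item_features(layer_feats, id_emb, layer_logits, proj, gate_w, gate_b):
    """Fuse multi-layer frozen-LVLM features with item ID embeddings.

    Hypothetical shapes (assumptions for this sketch):
      layer_feats : (n_items, n_layers, d_lvlm)  precomputed hidden states
      id_emb      : (n_items, d)                 item ID embeddings
      layer_logits: (n_layers,)                  per-layer importance scores
      proj        : (d_lvlm, d)                  projection into the ID space
      gate_w      : (2 * d, d), gate_b: (d,)     gate parameters
    """
    # Weight the layers, since their usefulness is reported to vary by layer.
    alpha = softmax(layer_logits)                          # (n_layers,)
    content = np.einsum("l,nld->nd", alpha, layer_feats)   # (n_items, d_lvlm)
    content = content @ proj                               # (n_items, d)

    # Assumed fusion rule: an element-wise gate mixes collaborative (ID)
    # and content signals, so neither one fully replaces the other.
    gate_in = np.concatenate([id_emb, content], axis=1)    # (n_items, 2 * d)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ gate_w + gate_b)))
    return gate * id_emb + (1.0 - gate) * content
```

In a setup like this, `layer_logits`, `proj`, and the gate parameters would be trained together with the recommender while the LVLM stays frozen, so `layer_feats` can be extracted once and cached.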

Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation

Information Retrieval

Makes movie suggestions better using pictures and words.

6 Aug 2025 0

89%

Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning

CV and Pattern Recognition

Helps computers understand pictures and words better.

25 Aug 2025 0

89%

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Information Retrieval

Helps video apps understand what you *really* like.

13 Aug 2025 2
