See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval
By: Mingyu Jeon, Sungjin Han, Jinkwon Hwang, and more
Recent advances in Multimodal Large Language Models (MLLMs) have improved image recognition and reasoning, but video-related tasks remain challenging due to the memory constraints of dense frame processing. Existing Video Moment Retrieval (VMR) methods rely on sparse frame sampling, risking information loss, especially in lengthy videos. We propose SMORE (See MORE, store less), a framework that improves memory efficiency while maintaining high information resolution. SMORE (1) uses query-guided captions to encode semantics aligned with user intent, (2) applies query-aware importance modulation to highlight relevant segments, and (3) adaptively compresses frames to preserve key content while reducing redundancy. This enables efficient video understanding without exceeding memory budgets. Experiments show that SMORE achieves state-of-the-art performance on the QVHighlights, Charades-STA, and ActivityNet-Captions benchmarks.
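To make the query-aware selection and compression idea concrete, below is a minimal sketch, assuming precomputed frame and query embeddings. The function names (`query_aware_importance`, `adaptive_compress`), the cosine-similarity scoring, and the mean-pooled summary of dropped frames are illustrative assumptions, not the paper's actual mechanism; the abstract does not specify how SMORE computes importance or performs compression.

```python
import numpy as np

def query_aware_importance(frame_feats: np.ndarray, query_feat: np.ndarray) -> np.ndarray:
    """Score each frame by cosine similarity to the query embedding (hypothetical scoring)."""
    frames = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    return frames @ query  # shape: (num_frames,)

def adaptive_compress(frame_feats: np.ndarray, scores: np.ndarray, budget: int) -> np.ndarray:
    """Keep the `budget` highest-scoring frames; pool the rest into one summary vector."""
    order = np.argsort(-scores)
    keep, drop = order[:budget], order[budget:]
    kept = frame_feats[np.sort(keep)]  # preserve temporal order of the kept frames
    if len(drop) == 0:
        return kept
    summary = frame_feats[drop].mean(axis=0, keepdims=True)  # coarse residual context
    return np.concatenate([kept, summary], axis=0)

# Toy usage: 128 frames with 512-d features, a memory budget of 16 frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(128, 512))
query = rng.normal(size=512)
scores = query_aware_importance(frames, query)
compressed = adaptive_compress(frames, scores, budget=16)
print(compressed.shape)  # (17, 512): 16 kept frames + 1 summary vector
```

The point of the sketch is the shape of the pipeline: relevance to the user query modulates which frames survive, and the remainder is compressed rather than discarded outright, so the representation stays within a fixed memory budget.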
Similar Papers
Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning
CV and Pattern Recognition
Teaches robots to learn from seeing and understanding.
VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
CV and Pattern Recognition
Lets computers watch and remember long videos.
SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
CV and Pattern Recognition
Finds exact video moments using words and sound.