F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

Published: August 25, 2025 | arXiv ID: 2508.17714v1

By: Hanbo Bi, Zhiqiang Yuan, Zexi Jia and more

Potential Business Impact:

Finds important parts in long chats.

Business Areas:

Visual Search, Internet Services

Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, such methods often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, which requires models to locate query-relevant fragments, comprising both utterances and images, in multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each dialogue naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we evaluate existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. Although optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards that promote semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling, which ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
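The difficulty-aware curriculum sampling described in the abstract can be illustrated with a minimal sketch. The abstract only states that instances are ranked by model-predicted difficulty and that harder samples are introduced gradually. Everything else here is an assumption made for illustration: the function name `curriculum_stages`, the use of a plain float as the difficulty score, the fixed number of stages, and the choice to make each stage a cumulative pool (easy samples stay available as harder ones are added). This is not the paper's implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    samples: Sequence[T],
    difficulty: Sequence[float],
    num_stages: int = 3,
) -> List[List[T]]:
    """Split samples into cumulative training pools, easiest first.

    Hypothetical helper: `difficulty[i]` is assumed to be a
    model-predicted difficulty score for `samples[i]` (higher = harder).
    Stage k contains the easiest ceil(k / num_stages) share of the data.
    """
    if len(samples) != len(difficulty):
        raise ValueError("samples and difficulty must have the same length")
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")

    # Rank instances from easiest to hardest by their difficulty score.
    order = sorted(range(len(samples)), key=lambda i: difficulty[i])
    ranked = [samples[i] for i in order]

    stages: List[List[T]] = []
    for stage in range(1, num_stages + 1):
        # Integer ceiling of len(ranked) * stage / num_stages.
        cutoff = (len(ranked) * stage + num_stages - 1) // num_stages
        stages.append(ranked[:cutoff])
    return stages


if __name__ == "__main__":
    dialogues = ["a", "b", "c", "d", "e"]
    scores = [0.9, 0.1, 0.5, 0.3, 0.7]
    for k, pool in enumerate(curriculum_stages(dialogues, scores), start=1):
        print(f"stage {k}: {pool}")
```

A training loop would then draw batches from the first pool, move to the second, and so on. Whether the actual method uses cumulative pools, disjoint buckets, or a continuous sampling schedule is not specified in the abstract.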

A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback

CV and Pattern Recognition

Improves image search by learning from results.

21 Nov 2025 1

89%

You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

CV and Pattern Recognition

Helps computers pick right picture from many.

16 Oct 2025 2

89%

Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

CV and Pattern Recognition

Helps AI understand pictures and details better.

11 Dec 2025 1

Page Count

19 pages
