Role-SynthCLIP: A Role Play Driven Diverse Synthetic Data Approach
By: Yuanxiang Huangfu, Chaochao Wang, Weilei Wang
Potential Business Impact:
Helps AI understand pictures better by using varied descriptions.
The effectiveness of Contrastive Language-Image Pre-training (CLIP) models critically depends on the semantic diversity and quality of their training data. However, existing synthetic data generation methods focus primarily on increasing data volume, an emphasis that often yields limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. This mechanism enhances the semantic diversity and fine-grained image-text alignment of synthetic pairs, improving caption expressiveness and accuracy while keeping the total number of image-text pairs unchanged. Experimental results demonstrate the effectiveness and efficiency of our method: a CLIP-B/16 model trained on only 1 million Role-SynthCLIP pairs achieves a Recall@1 of 64.1% on the MS COCO validation set, surpassing the best existing synthetic data baseline (trained on 5M pairs) by 2.8 percentage points. The code and trained models are released at https://github.com/huangfu170/Role-SynthCLIP.
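To make the role-playing idea concrete, below is a minimal Python sketch of how multi-perspective prompts could drive an MLLM to produce several distinct captions per image. The role wordings beyond the two examples named in the abstract, the `query_mllm` wrapper, and all identifiers are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Minimal sketch of multi-perspective role-play caption synthesis.
# Assumption: `query_mllm(image_path, prompt)` wraps whatever multimodal LLM
# client is available; here it is stubbed so the sketch runs standalone.

from typing import Callable, Dict

# Role prompts; the two role names come from the abstract, the wording is illustrative.
ROLE_PROMPTS: Dict[str, str] = {
    "compositional analyst": (
        "You are a compositional analyst. Describe the objects in the image "
        "and their spatial relationships in one caption."
    ),
    "interpreter of image context": (
        "You are an interpreter of image context. Describe the likely setting, "
        "activity, and intent conveyed by the image in one caption."
    ),
}


def synthesize_captions(image_path: str,
                        query_mllm: Callable[[str, str], str]) -> Dict[str, str]:
    """Generate one caption per role for a single image.

    Each role contributes a caption from its own viewpoint, so the same image
    yields several semantically distinct image-text pairs without adding images.
    """
    return {role: query_mllm(image_path, prompt)
            for role, prompt in ROLE_PROMPTS.items()}


if __name__ == "__main__":
    # Stub MLLM that echoes the prompt prefix, so the sketch runs without a model.
    captions = synthesize_captions(
        "example.jpg",
        lambda img, prompt: f"[caption conditioned on: {prompt[:40]}...]",
    )
    for role, caption in captions.items():
        print(f"{role}: {caption}")
```

In this reading, diversity comes from varying the prompt persona rather than the number of images, which matches the paper's claim of improving semantic coverage while keeping the total number of image-text pairs fixed.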
Similar Papers
Contrastive vision-language learning with paraphrasing and negation
CV and Pattern Recognition
Teaches computers to understand words that change meaning.
ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
CV and Pattern Recognition
Helps computers understand pictures containing many objects.
uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data
CV and Pattern Recognition
Helps computers understand pictures in many languages.