Score: 1

SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment

Published: November 4, 2025 | arXiv ID: 2511.03019v1

By: Wenbo Lu

Potential Business Impact:

Teaches computers to understand pictures and words together.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but such gains are largely driven by scaling up training data. Yet existing methods treat image-text pairs as isolated training examples; this neglects the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss that aligns modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, demonstrating the value of relational supervision for cross-modal alignment.
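To make the idea concrete, below is a minimal sketch of what a structure-aware contrastive objective could look like: a standard CLIP-style InfoNCE term plus a term that treats in-batch graph neighbors (e.g., co-purchased products) as additional soft positives. This is an illustration under stated assumptions, not the paper's exact formulation; the neighbor matrix `adj`, the weight `lambda_struct`, and the function name are hypothetical.

```python
import torch
import torch.nn.functional as F

def structural_contrastive_loss(img_emb, txt_emb, adj,
                                temperature=0.07, lambda_struct=1.0):
    """Sketch of a structure-aware contrastive loss (assumptions, not SLIP's exact loss).

    img_emb, txt_emb: [B, D] image and text embeddings for a batch of entities.
    adj: hypothetical [B, B] 0/1 matrix marking which in-batch entities are
         graph neighbors (e.g., co-purchased products).
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # [B, B] cross-modal similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    # Standard cross-modal alignment (image-to-text and text-to-image InfoNCE).
    loss_clip = 0.5 * (F.cross_entropy(logits, targets) +
                       F.cross_entropy(logits.t(), targets))

    # Structural term: graph neighbors count as additional soft positives.
    pos_mask = adj.float() + torch.eye(adj.size(0), device=adj.device)
    pos_mask = pos_mask / pos_mask.sum(dim=1, keepdim=True)  # row-normalize
    log_prob = F.log_softmax(logits, dim=1)
    loss_struct = -(pos_mask * log_prob).sum(dim=1).mean()

    return loss_clip + lambda_struct * loss_struct
```

In this sketch, setting `lambda_struct=0` recovers a plain CLIP-style objective, while larger values push embeddings of graph-neighboring entities closer together across modalities.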

Country of Origin
🇺🇸 United States

Page Count
12 pages

Category
Computer Science:
Computer Vision and Pattern Recognition