
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs

Published: April 2, 2025 | arXiv ID: 2504.01916v1

By: Mothilal Asokan, Kebin Wu, Fatima Albreiki

Potential Business Impact:

Lets computers understand longer descriptions of pictures.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

As a pioneering vision-language model, CLIP (Contrastive Language-Image Pre-training) has achieved significant success across various domains and a wide range of downstream vision-language tasks. However, the text encoders in popular CLIP models are limited to processing only 77 text tokens, which constrains their ability to effectively handle longer, detail-rich captions. Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. FineLIP enhances cross-modal text-image mapping by incorporating Fine-grained alignment with Longer text input within the CLIP-style framework. FineLIP first extends the positional embeddings to handle longer text, followed by the dynamic aggregation of local image and text tokens. The aggregated results are then used to enforce fine-grained token-to-token cross-modal alignment. We validate our model on datasets with long, detailed captions across two tasks: zero-shot cross-modal retrieval and text-to-image generation. Quantitative and qualitative experimental results demonstrate the effectiveness of FineLIP, outperforming existing state-of-the-art approaches. Furthermore, comprehensive ablation studies validate the benefits of key design elements within FineLIP.
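
The abstract says that the positional embeddings are extended beyond the 77-token limit but does not say how. As an illustration only, the sketch below stretches a learned positional-embedding table by linear interpolation, which is one common way to do this and is an assumption here, not necessarily the method used in FineLIP. The function name `stretch_positional_embeddings`, the target length of 248, and the embedding width of 512 are all hypothetical; only the source length of 77 comes from the abstract.

```python
import numpy as np


def stretch_positional_embeddings(pos_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Linearly interpolate a (old_len, dim) table to shape (new_len, dim).

    Assumption: plain linear interpolation along the position axis.
    The abstract does not specify the extension scheme FineLIP uses.
    """
    old_len, dim = pos_emb.shape
    # Fractional source position for each target position; endpoints are aligned.
    src = np.linspace(0.0, old_len - 1, num=new_len)
    old_idx = np.arange(old_len)
    # Interpolate every embedding dimension independently.
    columns = [np.interp(src, old_idx, pos_emb[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)


# Stand-in for a pretrained table: 77 positions (from the abstract),
# hypothetical width 512, random values in place of learned weights.
rng = np.random.default_rng(0)
clip_pos = rng.normal(size=(77, 512))

# Hypothetical target length of 248 positions.
long_pos = stretch_positional_embeddings(clip_pos, 248)
print(long_pos.shape)
```

With this scheme the first and last rows of the stretched table match the original ones, so short inputs start from familiar embeddings while later positions are filled in smoothly. In practice the stretched table would still need fine-tuning on long captions.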

PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

CV and Pattern Recognition

Helps computers understand images and long text better.

6 Nov 2025 2

91%

FG-CLIP: Fine-Grained Visual and Textual Alignment

CV and Pattern Recognition

Helps computers understand tiny details in pictures.

8 May 2025 3

91%

CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

CV and Pattern Recognition

Makes computers understand pictures and words better.

4 Aug 2025 0


Country of Origin

🇦🇪 United Arab Emirates

Page Count

15 pages
