Language-Image Alignment with Fixed Text Encoders
By: Jingfeng Yang, Ziyang Wu, Yue Zhao, and more
Potential Business Impact:
Teaches computers to understand pictures using words.
Currently, the dominant approach to establishing language-image alignment is to pre-train text and image encoders jointly through contrastive learning, as in CLIP and its variants. In this work, we question whether such costly joint training is necessary. In particular, we investigate whether a pre-trained, fixed large language model (LLM) offers a good enough text encoder to guide visual representation learning. That is, we propose to learn Language-Image alignment with a Fixed Text encoder (LIFT) from an LLM by training only the image encoder. Somewhat surprisingly, through comprehensive benchmarking and ablation studies, we find that this greatly simplified framework is highly effective: LIFT outperforms CLIP in most scenarios involving compositional understanding and long captions, while achieving considerable gains in computational efficiency. Our work takes a first step toward systematically exploring how text embeddings from LLMs can guide visual learning, and it suggests an alternative design choice for learning language-aligned visual representations.
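The abstract describes the core recipe at a high level: freeze an LLM's text embeddings and train only the image encoder (plus, presumably, a projection into the text embedding space) to align with them. Below is a minimal sketch of that setup, assuming a CLIP-style symmetric contrastive loss as a stand-in objective, since the abstract does not specify the exact loss; `image_encoder`, `proj_head`, and `frozen_llm_encoder` are hypothetical names, not the paper's API.

```python
import torch
import torch.nn.functional as F

def lift_training_step(image_encoder, proj_head, frozen_llm_encoder,
                       images, captions, optimizer, temperature=0.07):
    """One alignment step: only image_encoder and proj_head receive gradients."""
    # Text targets come from the fixed LLM; no gradients flow through it.
    with torch.no_grad():
        text_emb = F.normalize(frozen_llm_encoder(captions), dim=-1)  # (B, D)

    # Project image features into the LLM's embedding space.
    img_emb = F.normalize(proj_head(image_encoder(images)), dim=-1)   # (B, D)

    # Pairwise cosine similarities; matched pairs sit on the diagonal.
    logits = img_emb @ text_emb.T / temperature                       # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric contrastive loss (image-to-text and text-to-image).
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

    optimizer.zero_grad()
    loss.backward()   # updates the image encoder and projection head only
    optimizer.step()
    return loss.item()
```

Because the text embeddings are fixed, they can in principle be precomputed once per caption, which is where the computational savings over joint CLIP-style training would come from.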
Similar Papers
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
CV and Pattern Recognition
Makes AI draw better pictures from words.
Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays
CV and Pattern Recognition
Helps doctors understand X-rays by reading reports.
PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
CV and Pattern Recognition
Helps computers understand images and long text better.