GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models
By: Bozhou Li, Sihan Yang, Yushuo Guan and more
Potential Business Impact:
Makes AI pictures match words better.
The text encoder is a critical component of text-to-image and text-to-video diffusion models, fundamentally determining the semantic fidelity of the generated content. However, its development has been hindered by two major challenges: the lack of an efficient evaluation framework that reliably predicts downstream generation performance, and the difficulty of effectively adapting pretrained language models for visual synthesis. To address these issues, we introduce GRAN-TED, a paradigm to Generate Robust, Aligned, and Nuanced Text Embeddings for Diffusion models. Our contribution is twofold. First, we propose TED-6K, a novel text-only benchmark that enables efficient and robust assessment of an encoder's representational quality without requiring costly end-to-end model training. We demonstrate that performance on TED-6K, standardized via a lightweight, unified adapter, strongly correlates with an encoder's effectiveness in downstream generation tasks. Second, guided by this validated framework, we develop a superior text encoder using a novel two-stage training paradigm. This process involves an initial fine-tuning stage on a Multimodal Large Language Model for better visual representation, followed by a layer-wise weighting method to extract more nuanced and potent text features. Our experiments show that the resulting GRAN-TED encoder not only achieves state-of-the-art performance on TED-6K but also leads to demonstrable performance gains in text-to-image and text-to-video generation. Our code is available at the following link: https://anonymous.4open.science/r/GRAN-TED-4FCC/.
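As an illustration of the layer-wise weighting idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the weighting takes the form of one learnable scalar per layer, normalized with a softmax and used to mix the per-layer hidden states; the abstract does not specify the exact formulation. The names `layerwise_weighted_embedding` and `layer_logits` are hypothetical, and random arrays stand in for real encoder outputs.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def layerwise_weighted_embedding(hidden_states, layer_logits):
    """Mix per-layer hidden states into one text embedding.

    hidden_states: array of shape (num_layers, seq_len, dim)
    layer_logits:  array of shape (num_layers,), assumed to be learnable
                   parameters in a real training setup (hypothetical here)
    Returns an array of shape (seq_len, dim).
    """
    weights = softmax(layer_logits)  # non-negative, sums to 1
    # Contract the layer axis: sum_l weights[l] * hidden_states[l]
    return np.tensordot(weights, hidden_states, axes=1)

# Stand-in data: 4 layers, 3 tokens, 8-dimensional features.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(4, 3, 8))

# With all-zero logits the weights are uniform, so the result is the layer mean.
layer_logits = np.zeros(4)
mixed = layerwise_weighted_embedding(hidden_states, layer_logits)
print(mixed.shape)
```

In this sketch, training would adjust `layer_logits` so that layers carrying more useful features for generation receive larger weights, rather than relying only on the final layer's output.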