Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation Using a Single Prompt
By: Shangxun Li, Youngjung Uh
Text-to-image diffusion models excel at generating high-quality images from natural language descriptions, but they often fail to preserve subject consistency across multiple outputs, which limits their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, both of which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage: embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses this entanglement from a geometric perspective, refining the text embeddings to suppress unwanted semantics. Extensive experiments demonstrate that our approach significantly improves both subject consistency and text alignment over existing baselines.
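The abstract does not specify the exact refinement operation, but one way to picture "suppressing unwanted semantics geometrically" is an orthogonal projection: remove from each frame's token embeddings the component that points toward the other frames' semantics. The PyTorch sketch below illustrates this idea only; the function name, tensor shapes, and the alpha parameter are assumptions for illustration, not the authors' method.

import torch

def suppress_cross_frame_semantics(frame_embs: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # Hypothetical sketch: for each frame's prompt embedding, subtract the
    # component lying along the mean embedding direction of the *other*
    # frames, so one scene's semantics do not leak into another.
    #
    # frame_embs: (num_frames, num_tokens, dim) text-encoder token embeddings
    # alpha: suppression strength (1.0 = full orthogonal projection)
    refined = frame_embs.clone()
    num_frames = frame_embs.shape[0]
    for i in range(num_frames):
        # Mean direction of all other frames' token embeddings.
        others = torch.cat([frame_embs[j] for j in range(num_frames) if j != i])
        direction = others.mean(dim=0)
        direction = direction / direction.norm().clamp_min(1e-8)
        # Project frame i's tokens onto that direction and remove it.
        coeff = refined[i] @ direction               # (num_tokens,)
        refined[i] = refined[i] - alpha * coeff.unsqueeze(-1) * direction
    return refined

With alpha = 1.0 this is a plain orthogonal projection that zeroes out the shared direction entirely; smaller values only attenuate it, trading stronger disentanglement against preserving semantics the frames legitimately share (such as the subject itself).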