Training-Free Disentangled Text-Guided Image Editing via Sparse Latent Constraints
By: Mutiara Shabrina, Nova Kurnia Putri, Jefri Satria Ferdiansyah, and more
Potential Business Impact:
Changes pictures without messing up faces.
Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute (e.g., adding bangs) unintentionally alters other semantic properties such as identity or appearance. The Predict, Prevent, and Evaluate (PPE) framework addresses this issue by leveraging pre-trained vision-language models for disentangled editing. In this work, we analyze the PPE framework, focusing on its architectural components, including BERT-based attribute prediction and StyleGAN2-based image generation on the CelebA-HQ dataset. Through empirical analysis, we identify a limitation in the original regularization strategy, where latent updates remain dense and prone to semantic leakage. To mitigate this issue, we introduce a sparsity-based constraint using L1 regularization on latent space manipulation. Experimental results demonstrate that the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.
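To make the proposed constraint concrete, the sketch below shows one plausible way to add an L1 penalty to an optimization-based latent edit, as the abstract describes. This is a minimal illustration, not the authors' released code: `generator` stands in for a pre-trained StyleGAN2 generator, and `edit_loss` is a placeholder for the PPE text-guided editing objective; both names, and all hyperparameter values, are assumptions.

```python
import torch

def edit_latent(w_init, edit_loss, l1_weight=0.01, steps=200, lr=0.05):
    """Optimize a residual latent update `delta` with an L1 penalty.

    Hypothetical sketch: `edit_loss` is assumed to be a callable that maps a
    latent code to the text-guided editing objective (e.g., a CLIP-based loss
    evaluated on the generated image). The L1 term pushes most entries of
    `delta` toward zero, so the edit moves the latent code along only a few
    directions instead of producing a dense update.
    """
    delta = torch.zeros_like(w_init, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        w = w_init + delta
        # Editing objective plus sparsity penalty on the latent update.
        loss = edit_loss(w) + l1_weight * delta.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_init + delta.detach()
```

Under these assumptions, the sparsity term is the mechanism that keeps the edit focused: dimensions of the latent code not needed for the target attribute stay at their original values, which is how the L1 constraint is claimed to reduce semantic leakage into non-target attributes such as identity.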
Similar Papers
Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models
CV and Pattern Recognition
Helps computers understand pictures and words better.
SAEdit: Token-level control for continuous image editing via Sparse AutoEncoder
Graphics
Changes pictures precisely, like magic.
Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
CV and Pattern Recognition
Keeps characters the same in generated stories.