Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models

Published: October 16, 2025 | arXiv ID: 2510.14526v1

By: Yunze Tong, Didi Zhu, Zijing Hu and more

Potential Business Impact:

Makes AI pictures match words better.

Business Areas:

Visual Search Internet Services

In text-to-image generation, different initial noises induce distinct denoising paths with a pretrained Stable Diffusion (SD) model. While this variation yields diverse images, some of them may fail to align well with the prompt. Existing methods alleviate this issue either by altering the denoising dynamics or by drawing multiple noises and conducting post-selection. In this paper, we attribute the misalignment to a training-inference mismatch: during training, prompt-conditioned noises lie in a prompt-specific subset of the latent space, whereas at inference the noise is drawn from a prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector that applies text-conditioned refinement to the initial noise before denoising. Conditioned on the prompt embedding, it maps the noise to a prompt-aware counterpart that better matches the distribution observed during SD training, without modifying the SD model. Our framework consists of three steps: we first sample noises and obtain token-level feedback on their corresponding images from a vision-language model (VLM), then distill these signals into a reward model, and finally optimize the noise projector via a quasi-direct preference optimization. Our design has two benefits: (i) it requires no reference images or handcrafted priors, and (ii) it incurs little inference cost, replacing multi-sample selection with a single forward pass. Extensive experiments further show that our prompt-aware noise projection improves text-image alignment across diverse prompts.
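
To make the idea in the abstract concrete, here is a minimal sketch of a text-conditioned noise projector and a pairwise preference loss. Everything in it is an assumption for illustration: the class name NoiseProjector, the two-layer residual design, the flattened 16-dimensional noise, and the 8-dimensional prompt embedding are hypothetical and are not taken from the paper. The loss is the standard DPO objective, used here as a stand-in, since the abstract does not spell out its "quasi-direct preference optimization" variant.

```python
import numpy as np


class NoiseProjector:
    """Toy text-conditioned noise projector (hypothetical, not the paper's architecture)."""

    def __init__(self, noise_dim, text_dim, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # Input layer sees the noise and the prompt embedding side by side.
        self.w_in = rng.normal(0.0, 0.1, size=(noise_dim + text_dim, hidden_dim))
        # Zero-initialised output layer: the projector starts as the identity map,
        # so an untrained projector leaves the Gaussian noise untouched.
        self.w_out = np.zeros((hidden_dim, noise_dim))

    def __call__(self, noise, prompt_emb):
        x = np.concatenate([noise, prompt_emb], axis=-1)
        hidden = np.tanh(x @ self.w_in)
        # Residual refinement: prompt-aware noise = original noise + learned offset.
        return noise + hidden @ self.w_out


def preference_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), as a stand-in for the paper's variant."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # logaddexp(0, -m) equals log(1 + exp(-m)) and is numerically stable.
    return np.logaddexp(0.0, -margin)


# One forward pass per prompt replaces drawing and ranking many candidate noises.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 16))       # batch of prompt-agnostic Gaussian noises
prompt_emb = rng.standard_normal((4, 8))   # placeholder prompt embeddings
projector = NoiseProjector(noise_dim=16, text_dim=8)
refined = projector(noise, prompt_emb)     # would be passed to the frozen SD model
```

In a real setup the refined noise would replace the usual Gaussian sample as the starting latent of the unchanged SD denoiser, and the "win" and "lose" terms would come from pairs of noises ranked by the reward model distilled from VLM feedback.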

Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models

CV and Pattern Recognition

Finds bad AI pictures while they're still being made.

9 Dec 2025 0

89%

Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

CV and Pattern Recognition

Makes AI art faster and cheaper.

28 Aug 2025 0

88%

ANPrompt: Anti-noise Prompt Tuning for Vision-Language Models

CV and Pattern Recognition

Makes AI models better at understanding images and text.

6 Aug 2025 1

Country of Origin

🇨🇳 China

Page Count

11 pages
