
CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

Published: December 27, 2025 | arXiv ID: 2512.22681v1

By: ZhenQi Chen, TsaiChing Ni, YuanFu Yang

Potential Business Impact:

Makes AI pictures match words better.

Business Areas:
Semantic Search, Internet Services

Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
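
The abstract does not give SpecFusion's exact formulation, so the following is only a minimal sketch of the general idea of merging two states in the frequency domain: low-frequency (coarse structure) coefficients come from one grid and the remaining high-frequency (detail) coefficients come from another. The names `dft2`, `is_low_frequency`, `spectral_merge`, and the `cutoff` parameter are hypothetical and chosen for illustration. The square hard cutoff is an assumption, not the authors' method. A naive pure-Python DFT is used so the example has no dependencies; a real pipeline would likely use an FFT library on latent tensors.

```python
import cmath


def dft2(grid, inverse=False):
    """Naive 2D discrete Fourier transform of a grid given as a list of rows."""
    h, w = len(grid), len(grid[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = sign * 2 * cmath.pi * (u * y / h + v * x / w)
                    total += grid[y][x] * cmath.exp(1j * angle)
            # The inverse transform carries the 1 / (h * w) normalization.
            out[u][v] = total / (h * w) if inverse else total
    return out


def is_low_frequency(u, v, h, w, cutoff):
    """True if coefficient (u, v) lies inside the square low-frequency band."""
    # DFT indices wrap around: u and h - u have the same frequency magnitude.
    return min(u, h - u) <= cutoff and min(v, w - v) <= cutoff


def spectral_merge(coarse, detail, cutoff=1):
    """Take low frequencies from `coarse` and all other frequencies from `detail`.

    Hypothetical illustration only: a hard square mask is assumed here.
    """
    h, w = len(coarse), len(coarse[0])
    spec_coarse, spec_detail = dft2(coarse), dft2(detail)
    merged = [
        [
            spec_coarse[u][v] if is_low_frequency(u, v, h, w, cutoff) else spec_detail[u][v]
            for v in range(w)
        ]
        for u in range(h)
    ]
    # The mask is symmetric, so real inputs give a real result up to rounding.
    return [[value.real for value in row] for row in dft2(merged, inverse=True)]


# Toy inputs: a flat 4x4 grid (pure structure) and a checkerboard (pure detail).
flat = [[5.0] * 4 for _ in range(4)]
checker = [[(-1.0) ** (x + y) for x in range(4)] for y in range(4)]
merged_grid = spectral_merge(flat, checker, cutoff=1)
```

In this toy case the flat grid contributes only its mean, and the checkerboard contributes only its highest-frequency component, so `merged_grid` is the checkerboard pattern shifted to oscillate around 5. Merging a grid with itself returns the grid unchanged, which is a quick sanity check for any mask choice.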

Country of Origin
🇹🇼 Taiwan, Province of China

Page Count
20 pages

Category
Computer Science:
Computer Vision and Pattern Recognition