CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation
By: ZhenQi Chen, TsaiChing Ni, YuanFu Yang
Potential Business Impact:
Makes AI pictures match words better.
Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
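The abstract describes SpecFusion only at a high level, so the following is a minimal sketch of one way to merge two states in the spectral domain, not the authors' implementation. It assumes a hard radial low-pass mask over the 2D Fourier spectrum: frequencies below the cutoff are taken from a "structure" state and the remainder from a "detail" state. The names `low_pass_mask`, `spectral_merge`, `detail`, `structure`, and `cutoff` are hypothetical, and the actual method may use a different transform, a soft mask, or a different choice of which state supplies which band.

```python
import numpy as np


def low_pass_mask(height, width, cutoff):
    """Boolean mask that is True where the normalized frequency radius <= cutoff.

    Frequencies follow np.fft.fftfreq conventions (cycles per sample,
    in the range [-0.5, 0.5)), so the layout matches np.fft.fft2 output.
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return radius <= cutoff


def spectral_merge(detail, structure, cutoff=0.1):
    """Combine two arrays of equal shape in the frequency domain.

    Assumption (not from the paper): low frequencies come from `structure`
    (coarse layout) and high frequencies come from `detail` (fine texture).
    The transform is applied over the last two axes.
    """
    mask = low_pass_mask(detail.shape[-2], detail.shape[-1], cutoff)
    detail_spec = np.fft.fft2(detail)
    structure_spec = np.fft.fft2(structure)
    merged_spec = np.where(mask, structure_spec, detail_spec)
    # The mask is symmetric, so real inputs yield a real result up to
    # floating-point error; drop the negligible imaginary part.
    return np.fft.ifft2(merged_spec).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    detail = rng.standard_normal((4, 64, 64))      # e.g. a late latent state
    structure = rng.standard_normal((4, 64, 64))   # e.g. an earlier, coarser state
    merged = spectral_merge(detail, structure, cutoff=0.1)
    print(merged.shape)
```

In this sketch `cutoff` controls how much of the coarse layout is injected: a value of 0 replaces only the mean (DC component), while larger values hand progressively finer structure over to the `structure` input.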
Similar Papers
FUSE: Unifying Spectral and Semantic Cues for Robust AI-Generated Image Detection
CV and Pattern Recognition
Finds fake pictures made by computers.
High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance
CV and Pattern Recognition
Makes pictures match words perfectly.
Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach
CV and Pattern Recognition
Combines pictures using words to make better images.