PixelDiT: Pixel Diffusion Transformers for Image Generation
By: Yongsheng Yu, Wei Xiong, Weili Nie, and more
Potential Business Impact:
Makes AI create clearer, more detailed pictures.
Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline in which the pretrained autoencoder introduces lossy reconstruction, accumulating errors and hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the autoencoder and learns the diffusion process directly in pixel space. PixelDiT adopts a fully transformer-based architecture with a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine detail. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256×256, surpassing existing pixel-space generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at 1024×1024 resolution in pixel space, where it achieves 0.74 on GenEval and 83.5 on DPG-Bench, approaching the best latent diffusion models.
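The dual-level design described in the abstract lends itself to a compact sketch: a coarse transformer runs over patch tokens to capture global semantics, and a second, lighter transformer runs over the pixels inside each patch, conditioned on that patch's token, to refine texture. The PyTorch sketch below is a minimal illustration of that structure only; all module names, dimensions, and the way the patch token is injected into the pixel tokens are assumptions, and timestep/text conditioning is omitted, so this is not the paper's actual implementation.

```python
# Minimal sketch of a dual-level (patch + pixel) diffusion transformer.
# All names and hyperparameters here are hypothetical illustrations.
import torch
import torch.nn as nn


def make_encoder(dim: int, depth: int, heads: int) -> nn.TransformerEncoder:
    """Small pre-norm transformer encoder used at both levels."""
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=4 * dim,
        batch_first=True, norm_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)


class DualLevelPixelDiTSketch(nn.Module):
    def __init__(self, patch: int = 16, patch_dim: int = 384, pixel_dim: int = 64):
        super().__init__()
        self.patch = patch
        # Patch level: one token per 16x16 patch, attending globally.
        self.patch_embed = nn.Conv2d(3, patch_dim, patch, stride=patch)
        self.patch_dit = make_encoder(patch_dim, depth=4, heads=6)
        # Pixel level: one token per pixel, attending only within its patch.
        self.pixel_embed = nn.Linear(3, pixel_dim)
        self.inject = nn.Linear(patch_dim, pixel_dim)  # patch token -> pixel tokens
        self.pixel_dit = make_encoder(pixel_dim, depth=2, heads=4)
        self.to_rgb = nn.Linear(pixel_dim, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) noisy image; returns a same-shape prediction.
        B, _, H, W = x.shape
        P = self.patch
        n = (H // P) * (W // P)
        # Patch level: global semantics over coarse tokens.
        g = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, n, patch_dim)
        g = self.patch_dit(g)
        # Pixel level: group pixels by patch -> (B*n, P*P, 3).
        pix = x.unfold(2, P, P).unfold(3, P, P)                # (B, 3, H/P, W/P, P, P)
        pix = pix.permute(0, 2, 3, 4, 5, 1).reshape(B * n, P * P, 3)
        # Broadcast each patch token into its P*P pixel tokens, then refine.
        t = self.pixel_embed(pix) + self.inject(g).reshape(B * n, 1, -1)
        t = self.pixel_dit(t)
        out = self.to_rgb(t)
        # Fold pixel tokens back into image layout.
        out = out.reshape(B, H // P, W // P, P, P, 3)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, H, W)


if __name__ == "__main__":
    model = DualLevelPixelDiTSketch()
    x = torch.randn(2, 3, 256, 256)
    print(model(x).shape)  # torch.Size([2, 3, 256, 256])
```

The point of restricting pixel-level attention to each P×P patch is cost: sequence length stays at P² per patch instead of H·W for the whole image, which is what makes transformer diffusion in raw pixel space tractable at 256×256 and, per the abstract, up to 1024×1024.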
Similar Papers
DiP: Taming Diffusion Models in Pixel Space
CV and Pattern Recognition
Creates detailed pictures much faster.
Diffusion Transformers with Representation Autoencoders
CV and Pattern Recognition
Makes AI create better, clearer pictures faster.
ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
CV and Pattern Recognition
Makes AI create clearer, bigger pictures without errors.