
Yuan-TecSwin: A Text-Conditioned Diffusion Model with Swin-Transformer Blocks

Published: December 18, 2025 | arXiv ID: 2512.16586v1

By: Shaohua Wu, Tong Yu, Shenling Wang, and more

Diffusion models have shown remarkable capacity for image synthesis, typically built on a U-shaped architecture with convolutional neural network (CNN) basic blocks. However, the locality of the convolution operation may limit a model's ability to capture long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model built on Swin-transformer blocks. Swin-transformer blocks replace the CNN blocks in the encoder and decoder, improving non-local modeling in feature extraction and image restoration. Text-image alignment is improved through a well-chosen text encoder, effective use of its text embeddings, and careful design of how the text condition is incorporated. By searching for adapted time steps across the different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves a state-of-the-art FID score of 1.37 on the ImageNet generation benchmark, without using additional models at different denoising stages. In a side-by-side comparison, human interviewees found it difficult to distinguish the model-generated images from human-painted ones.
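The abstract leaves the architecture at a high level, so the following is a minimal PyTorch sketch of the two ideas it names: windowed (Swin-style) self-attention blocks in place of CNN blocks inside a U-shaped denoiser, and a text condition injected by letting image tokens cross-attend to text-encoder embeddings. Everything here, the module names, dimensions, single down/up stage, and the cross-attention conditioning scheme, is an illustrative assumption rather than a detail taken from the paper.

```python
import torch
import torch.nn as nn

class WindowAttentionBlock(nn.Module):
    """Self-attention over non-overlapping spatial windows (Swin-style;
    the shifted-window variant is omitted here for brevity)."""
    def __init__(self, dim, window=4, heads=4):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, 4 * dim),
                                 nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        w = self.window
        # Partition the feature map into windows of w*w tokens each.
        t = x.reshape(B, C, H // w, w, W // w, w)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)
        h = self.norm(t)
        t = t + self.attn(h, h, h, need_weights=False)[0]
        t = t + self.mlp(t)
        # Reverse the window partition back to (B, C, H, W).
        t = t.reshape(B, H // w, W // w, w, w, C)
        return t.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

class TextCrossAttention(nn.Module):
    """Incorporate the text condition: image tokens attend to text tokens."""
    def __init__(self, dim, text_dim, heads=4):
        super().__init__()
        self.proj = nn.Linear(text_dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, text):                  # text: (B, T, text_dim)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)         # (B, H*W, C)
        kv = self.proj(text)
        q = q + self.attn(self.norm(q), kv, kv, need_weights=False)[0]
        return q.transpose(1, 2).reshape(B, C, H, W)

class TinySwinUNet(nn.Module):
    """One down/up stage with a skip connection, standing in for the full
    U-shaped encoder/decoder; each stage is a Swin-style block plus
    text cross-attention rather than a CNN block."""
    def __init__(self, ch=64, text_dim=512, steps=1000):
        super().__init__()
        self.inp = nn.Conv2d(3, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, ch, 2, stride=2)
        self.enc = WindowAttentionBlock(ch)
        self.cond = TextCrossAttention(ch, text_dim)
        self.up = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.out = nn.Conv2d(ch, 3, 3, padding=1)
        self.t_embed = nn.Embedding(steps, ch)   # diffusion timestep embedding

    def forward(self, x, t, text):
        h = self.inp(x)
        skip = h
        h = self.down(h) + self.t_embed(t)[:, :, None, None]
        h = self.enc(h)
        h = self.cond(h, text)
        h = self.up(h) + skip
        return self.out(h)                       # predicted noise

# Smoke test: random tensors stand in for the noisy image, the diffusion
# timestep, and a sequence of text-encoder embeddings.
x = torch.randn(2, 3, 32, 32)
t = torch.randint(0, 1000, (2,))
text = torch.randn(2, 8, 512)
print(TinySwinUNet()(x, t, text).shape)          # torch.Size([2, 3, 32, 32])
```

The sketch uses plain non-overlapping windows; a full Swin block alternates shifted windows so information can flow between neighboring windows, which is part of what gives the architecture the non-local reach the abstract emphasizes.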

Category
Computer Science:
Computer Vision and Pattern Recognition