Inference-Time Scaling of Diffusion Models for Infrared Data Generation
By: Kai A. Horstmann, Maxim Clouser, Kia Khezeli
Potential Business Impact:
Makes AI create better "night vision" pictures.
Infrared imagery enables temperature-based scene understanding using passive sensors, particularly under conditions of low visibility where traditional RGB imaging fails. Yet, developing downstream vision models for infrared applications is hindered by the scarcity of high-quality annotated data, due to the specialized expertise required for infrared annotation. While synthetic infrared image generation has the potential to accelerate model development by providing large-scale, diverse training data, training foundation-level generative diffusion models in the infrared domain has remained elusive due to limited datasets. In light of such data constraints, we explore an inference-time scaling approach using a domain-adapted CLIP-based verifier for enhanced infrared image generation quality. We adapt FLUX.1-dev, a state-of-the-art text-to-image diffusion model, to the infrared domain by finetuning it on a small sample of infrared images using parameter-efficient techniques. The trained verifier is then employed during inference to guide the diffusion sampling process toward higher quality infrared generations that better align with input text prompts. Empirically, we find that our approach leads to consistent improvements in generation quality, reducing FID scores on the KAIST Multispectral Pedestrian Detection Benchmark dataset by 10% compared to unguided baseline samples. Our results suggest that inference-time guidance offers a promising direction for bridging the domain gap in low-data infrared settings.
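The abstract does not say which search strategy is used at inference time. As a rough illustration of the general idea only, the sketch below uses best-of-N search: draw several candidate generations from different noise seeds, score each with a verifier, and keep the highest-scoring one. Everything in it is a hypothetical stand-in: `toy_generate` replaces the finetuned FLUX.1-dev sampler, `toy_verifier` replaces the domain-adapted CLIP-based verifier (here a cosine similarity to a fixed vector), and `best_of_n` is an illustrative name, not the authors' method.

```python
import math
import random
from typing import Callable, List, Tuple

Sample = List[float]


def best_of_n(
    generate: Callable[[str, int], Sample],
    verifier: Callable[[str, Sample], float],
    prompt: str,
    n_candidates: int,
    seed: int = 0,
) -> Tuple[Sample, float]:
    """Draw n_candidates samples and return the one the verifier scores highest."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    rng = random.Random(seed)
    best_sample: Sample = []
    best_score = float("-inf")
    for _ in range(n_candidates):
        noise_seed = rng.randrange(2**32)  # one initial-noise seed per candidate
        sample = generate(prompt, noise_seed)
        score = verifier(prompt, sample)
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score


def toy_generate(prompt: str, noise_seed: int) -> Sample:
    """Hypothetical stand-in for a diffusion sampler: returns a random 8-d vector."""
    rng = random.Random(noise_seed)
    return [rng.gauss(0.0, 1.0) for _ in range(8)]


def toy_verifier(prompt: str, sample: Sample) -> float:
    """Hypothetical stand-in for a CLIP-style score: cosine similarity to a fixed vector."""
    target = [1.0] * len(sample)  # placeholder for a text embedding of the prompt
    dot = sum(a * b for a, b in zip(sample, target))
    norm = math.sqrt(sum(a * a for a in sample)) * math.sqrt(sum(b * b for b in target))
    return dot / norm if norm > 0.0 else 0.0


if __name__ == "__main__":
    prompt = "a pedestrian crossing a street at night, infrared"
    for n in (1, 4, 16):
        _, score = best_of_n(toy_generate, toy_verifier, prompt, n, seed=0)
        print(f"N={n:2d}  best verifier score={score:.3f}")
```

With a fixed `seed`, the candidates for a smaller N are a prefix of those for a larger N, so the best verifier score never decreases as N grows. That is the basic trade this kind of approach makes: more compute at inference time in exchange for samples the verifier rates higher. Whether a higher verifier score translates into better FID depends on how well the verifier reflects the target domain.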