Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

Published: December 15, 2025 | arXiv ID: 2512.13869v1

By: Wenda Li, Meng Wu, Sungmin Eum and more

Training object detectors demands extensive, task-specific annotations, a requirement that becomes impractical in UAV-based human detection because target distributions shift constantly and labeled images are scarce. As a remedy, synthetic simulators are used to generate annotated data at low annotation cost. However, the domain gap between synthetic and real images prevents models trained on synthetic data from transferring effectively to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework that transforms synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style discrepancies from local content discrepancies and bridges them with three modules: (1) Global Style Transfer: a diffusion model aligns the color, illumination, and texture statistics of synthetic images with the real style, using only a small real reference set; (2) Local Refinement: a super-resolution diffusion model adds fine-grained, photorealistic detail to small objects such as human instances while preserving shape and boundary integrity; (3) Hallucination Removal: a module filters out human instances whose visual attributes do not align with real-world data, bringing human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves detection accuracy over the non-transformed baselines. Specifically, it achieves an improvement of up to +14.1 mAP50 on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.
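As a rough illustration of how three such stages might compose while leaving the synthetic labels intact, the sketch below chains simple stand-ins for each module. It is not the authors' implementation: the diffusion models are replaced by trivial statistics-based placeholders (global mean/std matching, an identity-by-default crop refiner, and a threshold filter on crop brightness), and every name here (`Sample`, `global_style_align`, `local_refine`, `remove_hallucinations`, `cfha_like_pipeline`, `inst_mean`, `tol`) is hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


@dataclass
class Sample:
    image: Image
    boxes: List[Box]             # synthetic labels carried through the stages


def pixels(img: Image) -> List[float]:
    """Flatten an image into a single list of pixel values."""
    return [p for row in img for p in row]


def crop(img: Image, box: Box) -> Image:
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in img[y0:y1]]


def global_style_align(img: Image, ref_mean: float, ref_std: float) -> Image:
    """Stage 1 stand-in (assumption): match global mean/std to reference stats."""
    flat = pixels(img)
    mu = mean(flat)
    sigma = pstdev(flat) or 1.0  # avoid division by zero on constant images
    return [[(p - mu) / sigma * ref_std + ref_mean for p in row] for row in img]


def local_refine(img: Image, boxes: List[Box],
                 refine: Callable[[Image], Image]) -> Image:
    """Stage 2 stand-in (assumption): refine each labeled crop and paste it back.

    `refine` must return a patch of the same size, so box coordinates stay valid.
    """
    out = [row[:] for row in img]
    for box in boxes:
        x0, y0, x1, _ = box
        patch = refine(crop(img, box))
        for dy, row in enumerate(patch):
            out[y0 + dy][x0:x1] = row
    return out


def remove_hallucinations(sample: Sample, inst_mean: float, tol: float) -> Sample:
    """Stage 3 stand-in (assumption): keep instances whose mean crop brightness
    lies within `tol` of a reference value taken from real instances."""
    kept = [b for b in sample.boxes
            if abs(mean(pixels(crop(sample.image, b))) - inst_mean) <= tol]
    return Sample(sample.image, kept)


def cfha_like_pipeline(sample: Sample, ref_mean: float, ref_std: float,
                       inst_mean: float, tol: float,
                       refine: Callable[[Image], Image] = lambda p: p) -> Sample:
    """Chain the three stand-in stages: global, then local, then filtering."""
    styled = global_style_align(sample.image, ref_mean, ref_std)
    refined = local_refine(styled, sample.boxes, refine)
    return remove_hallucinations(Sample(refined, sample.boxes), inst_mean, tol)
```

The point of the sketch is the data flow: the image is transformed coarse-to-fine, box coordinates are never altered, and the last stage only drops instances rather than editing them.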

HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models

CV and Pattern Recognition

Makes fake pictures better for training AI.

16 Nov 2025 1

87%

HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation

CV and Pattern Recognition

Makes AI draw pictures exactly where you say.

10 May 2025 4

87%

From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model

CV and Pattern Recognition

Makes AI create detailed pictures much faster.

12 Nov 2025 1
