Context-Aware Semantic Segmentation via Stage-Wise Attention
By: Antoine Carreaud, Elias Naha, Arthur Chansel and more
Potential Business Impact:
Maps tiny details in satellite pictures better.
Semantic segmentation of ultra-high-resolution (UHR) images is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high-resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond architecture, we propose a SimMIM-style pretraining scheme. We mask 75% of the high-resolution image tokens and the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual encoder with a small decoder to reconstruct the original UHR image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by 0.9% under the official evaluation protocol. All code is available at: https://huggingface.co/collections/heig-vd-geo/caswit.
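The masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The 16x16 token grids, the `scale` factor of 4 between the context view and the UHR patch, and the helper names `random_token_mask` and `center_region_mask` are all assumptions chosen for illustration. Only the 75% ratio and the idea of masking the center region of the low-resolution view come from the text above.

```python
import random


def random_token_mask(grid_h, grid_w, mask_ratio=0.75, seed=0):
    """Return a grid_h x grid_w grid of booleans; True marks a masked token."""
    n_tokens = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_tokens))
    rng = random.Random(seed)
    # Draw the masked token indices without replacement.
    masked_ids = set(rng.sample(range(n_tokens), n_masked))
    return [[(r * grid_w + c) in masked_ids for c in range(grid_w)]
            for r in range(grid_h)]


def center_region_mask(grid_h, grid_w, scale):
    """Mask the central block of the context grid.

    Assumption: the context view is centered on the UHR patch and covers a
    neighborhood `scale` times wider, so the patch footprint is the central
    (grid_h // scale) x (grid_w // scale) block of context tokens.
    """
    box_h, box_w = grid_h // scale, grid_w // scale
    top, left = (grid_h - box_h) // 2, (grid_w - box_w) // 2
    return [[top <= r < top + box_h and left <= c < left + box_w
             for c in range(grid_w)]
            for r in range(grid_h)]


# Hypothetical sizes: 16x16 tokens per branch, context 4x wider than the patch.
hr_mask = random_token_mask(16, 16, mask_ratio=0.75)
ctx_mask = center_region_mask(16, 16, scale=4)

hr_masked = sum(map(sum, hr_mask))    # masked high-resolution tokens
ctx_masked = sum(map(sum, ctx_mask))  # masked context tokens
print(hr_masked, ctx_masked)          # prints "192 16"
```

Under these assumed sizes, 192 of the 256 high-resolution tokens are hidden, along with the 16 context tokens that overlap the patch, so the patch content cannot be read directly from the low-resolution view during reconstruction.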
Similar Papers
Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
CV and Pattern Recognition
Sees tiny details and big picture in photos.
Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
CV and Pattern Recognition
Helps doctors describe medical pictures faster.
Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications
CV and Pattern Recognition
Helps computers understand pictures like people do.