Score: 2

Context-Aware Semantic Segmentation via Stage-Wise Attention

Published: January 16, 2026 | arXiv ID: 2601.11310v1

By: Antoine Carreaud , Elias Naha , Arthur Chansel and more

Potential Business Impact:

Maps tiny details in satellite pictures better.

Business Areas:

Semantic Search Internet Services

Semantic ultra high resolution image (UHR) segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond architecture, we propose a SimMIM-style pretraining. We mask 75% of the high-resolution image tokens and the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual-encoder with small decoder to reconstruct the UHR initial image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by +0.9% under the official evaluation protocol. All codes are provided on: https://huggingface.co/collections/heig-vd-geo/caswit.

Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens

CV and Pattern Recognition

**Sees tiny details and big picture in photos.**

9 Jan 2026 0

88%

Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning

CV and Pattern Recognition

Helps doctors describe medical pictures faster.

13 Nov 2025 2

88%

Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications

CV and Pattern Recognition

Helps computers understand pictures like people do.

25 Mar 2025 1

View PDF Login to Bookmark

Repos / Data Links

github.com

Page Count

18 pages

Context-Aware Semantic Segmentation via Stage-Wise Attention

Maps tiny details in satellite pictures better.

Technical Abstract

Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens

Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning

Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications