Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens
By: Yohann Perron, Vladyslav Sydorov, Christophe Pottier, et al.
Potential Business Impact:
**Sees tiny details and the big picture in photos.**
Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local detail and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% additional parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks (Archaeoscape, URUR, and Gleason) and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.
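The abstract only sketches the mechanism, so below is a minimal PyTorch illustration of what a relay-token bridge between the two branches could look like. Everything here is an assumption for illustration, not the authors' released implementation: the class name `RelayTokenBridge`, the cross-attention read/write wiring, and the token count are all hypothetical; see the linked code for the real method.

```python
import torch
import torch.nn as nn


class RelayTokenBridge(nn.Module):
    """Hypothetical sketch: a few learnable relay tokens aggregate
    features from a source branch and propagate them into a target
    branch, as the abstract describes at a high level."""

    def __init__(self, dim: int, num_relay_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Learnable relay tokens, shared across images (assumption).
        self.relay = nn.Parameter(torch.zeros(1, num_relay_tokens, dim))
        nn.init.trunc_normal_(self.relay, std=0.02)
        # "Read": relay tokens attend over the source branch to aggregate it.
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # "Write": target tokens attend over the relay tokens to receive it.
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_src = nn.LayerNorm(dim)
        self.norm_tgt = nn.LayerNorm(dim)

    def forward(self, src_tokens: torch.Tensor, tgt_tokens: torch.Tensor) -> torch.Tensor:
        """src_tokens: (B, N_src, D), e.g. the global low-res branch.
        tgt_tokens: (B, N_tgt, D), e.g. the local high-res branch.
        Returns tgt_tokens enriched with aggregated source context."""
        batch = src_tokens.size(0)
        relay = self.relay.expand(batch, -1, -1)
        src = self.norm_src(src_tokens)
        # Aggregate: relay tokens summarize the source branch.
        relay, _ = self.read(relay, src, src)
        # Propagate: each target token pulls from the few relay tokens,
        # which is cheap because num_relay_tokens << N_src.
        update, _ = self.write(self.norm_tgt(tgt_tokens), relay, relay)
        return tgt_tokens + update


if __name__ == "__main__":
    bridge = RelayTokenBridge(dim=256)
    global_feats = torch.randn(2, 1024, 256)  # low-res, large-crop branch
    local_feats = torch.randn(2, 4096, 256)   # high-res, small-crop branch
    out = bridge(global_feats, local_feats)   # -> (2, 4096, 256)
```

A symmetric bridge running in the opposite direction would let the global branch benefit from local detail as well, matching the abstract's description of propagating features between the two branches rather than in one direction only.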
Similar Papers
UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers
CV and Pattern Recognition
Makes AI create much bigger, clearer pictures.
Differentiable Hierarchical Visual Tokenization
CV and Pattern Recognition
Makes computer vision understand pictures better.
Terrain-Enhanced Resolution-aware Refinement Attention for Off-Road Segmentation
CV and Pattern Recognition
Makes robots see better in messy places.