CoCAViT: Compact Vision Transformer with Robust Global Coordination

Published: August 7, 2025 | arXiv ID: 2508.05307v1

By: Xuyang Wang, Lingjuan Miao, Zhiqiang Zhou

Potential Business Impact:

Makes small, fast computer vision models generalize better to new domains.

In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged whose performance on in-domain benchmarks is comparable to that of larger models. However, we observe that for smaller models the performance drop on out-of-distribution (OOD) data is disproportionately large, indicating a deficiency in the generalization ability of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, and correct them to retain robustness in smaller models. To restore the global receptive field lost by pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real-time visual representation. Extensive experiments empirically validate our design. At a resolution of 224×224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains over competing models on multiple OOD benchmarks. It also attains 52.2 mAP on COCO object detection and 51.3 mIoU on ADE20K semantic segmentation, while maintaining low latency.
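The abstract does not spell out how CoCA is computed. A minimal PyTorch sketch of the general idea might look like the following: a small set of learnable coordinator tokens first pools global context from the patch tokens via cross attention, then the patch tokens attend back to the updated coordinators, restoring a global receptive field on top of local window attention at low cost. The class name, the number of coordinators, and the two-step update rule here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CoordinatorPatchCrossAttention(nn.Module):
    """Illustrative sketch of coordinator-patch cross attention.

    A few learnable "coordinator" tokens gather global context from all
    patch tokens, then broadcast it back. Details are assumptions, not
    the CoCAViT implementation.
    """

    def __init__(self, dim: int, num_coordinators: int = 4, num_heads: int = 4):
        super().__init__()
        self.coordinators = nn.Parameter(torch.zeros(1, num_coordinators, dim))
        nn.init.trunc_normal_(self.coordinators, std=0.02)
        # Coordinators <- patches: aggregate global context.
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Patches <- coordinators: broadcast global context back.
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_patches = nn.LayerNorm(dim)
        self.norm_coords = nn.LayerNorm(dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, C) patch tokens from a window-attention stage.
        B = patches.size(0)
        coords = self.coordinators.expand(B, -1, -1)
        p = self.norm_patches(patches)
        # Coordinator tokens attend over all patches (queries = coordinators).
        gathered, _ = self.gather(coords, p, p)
        coords = coords + gathered
        c = self.norm_coords(coords)
        # Patch tokens attend over the few updated coordinators, so the
        # attention cost is O(N * num_coordinators) rather than O(N^2).
        updated, _ = self.broadcast(patches, c, c)
        return patches + updated


# Usage: 196 patch tokens (a 14x14 grid) with embedding dim 256.
x = torch.randn(2, 196, 256)
block = CoordinatorPatchCrossAttention(dim=256)
print(block(x).shape)  # torch.Size([2, 196, 256])
```

Because the patch tokens only ever attend to a handful of coordinator tokens rather than to each other, the global exchange stays linear in the number of patches, which is consistent with the paper's claim of minimal computational overhead.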

Page Count
11 pages

Category
Computer Science:
Computer Vision and Pattern Recognition