CascadedViT: Cascaded Chunk-FeedForward and Cascaded Group Attention Vision Transformer

Published: November 18, 2025 | arXiv ID: 2511.14111v1

By: Srivathsan Sivakumar, Faisal Z. Qureshi

Potential Business Impact:

Enables accurate image recognition at lower compute and energy cost, supporting deployment on battery-powered devices such as phones and drones.

Business Areas:
Image Recognition, Data and Analytics, Software

Vision Transformers (ViTs) have demonstrated remarkable performance across a range of computer vision tasks; however, their high computational, memory, and energy demands hinder deployment on resource-constrained platforms. In this paper, we propose \emph{Cascaded-ViT (CViT)}, a lightweight and compute-efficient vision transformer architecture featuring a novel feedforward network design called \emph{Cascaded-Chunk Feed Forward Network (CCFFN)}. By splitting input features, CCFFN improves parameter and FLOP efficiency without sacrificing accuracy. Experiments on ImageNet-1K show that our \emph{CViT-XL} model achieves 75.5\% Top-1 accuracy while reducing FLOPs by 15\% and energy consumption by 3.3\% compared to EfficientViT-M5. Across various model sizes, the CViT family consistently exhibits the lowest energy consumption, making it suitable for deployment on battery-constrained devices such as mobile phones and drones. Furthermore, when evaluated using a new metric called \emph{Accuracy-Per-FLOP (APF)}, which quantifies compute efficiency relative to accuracy, CViT models consistently achieve top-ranking efficiency. In particular, CViT-L is 2.2\% more accurate than EfficientViT-M2 while having comparable APF scores.
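The abstract describes CCFFN only at a high level: the input features are split into chunks, and the chunks are processed in a cascade. A minimal NumPy sketch of that idea is below; the chunk count, dimensions, activation, and the exact cascade rule (feeding each chunk's output into the next chunk's input) are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_chunks = 8, 4            # feature dim and number of chunks (hypothetical)
chunk = d // n_chunks

# One small weight matrix per chunk; a full FFN would typically use two
# layers with a nonlinearity in between.
weights = [rng.standard_normal((chunk, chunk)) * 0.1 for _ in range(n_chunks)]

def cascaded_chunk_ffn(x):
    """Split x into chunks; each chunk's output cascades into the next."""
    chunks = np.split(x, n_chunks)
    outs, carry = [], np.zeros(chunk)
    for w, c in zip(weights, chunks):
        h = np.maximum(0.0, (c + carry) @ w)  # ReLU stand-in for the real activation
        outs.append(h)
        carry = h                             # cascade into the next chunk
    return np.concatenate(outs)

y = cascaded_chunk_ffn(rng.standard_normal(d))
print(y.shape)  # (8,)
```

Because each chunk uses a weight matrix of size (d/n)×(d/n) instead of one d×d matrix, the per-layer parameter and FLOP counts shrink, which is consistent with the efficiency claims in the abstract.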
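The abstract defines Accuracy-Per-FLOP (APF) only as "compute efficiency relative to accuracy". A natural reading is accuracy divided by FLOPs, sketched below; the paper's exact normalization may differ, and the FLOP figure used is a made-up placeholder, not a measurement from the paper.

```python
def apf(top1_acc: float, gflops: float) -> float:
    """Accuracy-Per-FLOP under the assumed definition: accuracy / FLOPs."""
    return top1_acc / gflops

# Hypothetical budget: 75.5% Top-1 at an assumed 0.5 GFLOPs.
print(apf(75.5, 0.5))  # 151.0
```

Under this definition, a model ranks higher when it delivers more accuracy per unit of compute, which matches how the abstract uses the metric to compare CViT against EfficientViT variants.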

Country of Origin
🇨🇦 Canada

Page Count
15 pages

Category
Computer Science:
Computer Vision and Pattern Recognition