TT-Edge: A Hardware-Software Co-Design for Energy-Efficient Tensor-Train Decomposition on Edge AI
By: Hyunseok Kwak, Kyeongwon Lee, Kyeongpil Min, and more
Potential Business Impact:
Makes AI models smaller and faster on phones.
The growing demands of distributed learning on resource-constrained edge devices underscore the importance of efficient on-device model compression. Tensor-Train Decomposition (TTD) offers high compression ratios with minimal accuracy loss, yet its repeated singular value decompositions (SVDs) and matrix multiplications can impose significant latency and energy costs on low-power processors. In this work, we present TT-Edge, a hardware-software co-designed framework aimed at overcoming these challenges. By splitting SVD into two phases, bidiagonalization and diagonalization, TT-Edge offloads the most compute-intensive tasks to a specialized TTD Engine. This engine integrates tightly with an existing GEMM accelerator, thereby curtailing the frequent matrix-vector transfers that often undermine system performance and energy efficiency. Implemented on a RISC-V-based edge AI processor, TT-Edge achieves a 1.7x speedup over a GEMM-only baseline when compressing a ResNet-32 model via TTD, while reducing overall energy usage by 40.2 percent. These gains come with only a 4 percent increase in total power and minimal hardware overhead, enabled by a lightweight design that reuses GEMM resources and employs a shared floating-point unit. Our experimental results on both FPGA prototypes and post-synthesis power analysis at 45 nm demonstrate that TT-Edge effectively addresses the latency and energy bottlenecks of TTD-based compression in edge environments.
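To make the compression pipeline described in the abstract concrete, the sketch below shows a standard TT-SVD procedure in NumPy: a weight tensor is turned into a chain of small TT cores by repeated truncated SVDs, which are the operations whose latency and energy the abstract identifies as the bottleneck. This is an illustrative software-level sketch only; the function name `tt_svd`, the rank cap, and the test shapes are assumptions, and TT-Edge's hardware split of SVD into bidiagonalization and diagonalization phases is not modeled here.

```python
# Minimal software-level sketch of TT-SVD (illustrative; not the TT-Edge hardware flow).
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a tensor into TT cores via repeated truncated SVDs (assumed helper)."""
    dims = tensor.shape
    cores = []
    r_prev = 1
    mat = np.asarray(tensor, dtype=float)
    for k in range(len(dims) - 1):
        # Unfold so rows combine the previous rank with the current mode.
        mat = mat.reshape(r_prev * dims[k], -1)
        # Truncated SVD of the unfolding: the repeated, compute-intensive step
        # that a TTD accelerator would offload.
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        # Fold the singular values into the remainder for the next core.
        mat = np.diag(s[:r]) @ vt[:r]
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

# Usage example: compress a small 4-way tensor and check the reconstruction error.
w = np.random.randn(4, 8, 8, 4)
cores = tt_svd(w, max_rank=6)
approx = cores[0]
for core in cores[1:]:
    approx = np.tensordot(approx, core, axes=([-1], [0]))
approx = approx.reshape(w.shape)
print("relative error:", np.linalg.norm(approx - w) / np.linalg.norm(w))
```

Tighter rank caps yield smaller cores (higher compression) at the cost of a larger reconstruction error, which is the accuracy/compression trade-off the abstract refers to.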
Similar Papers
Comprehensive Design Space Exploration for Tensorized Neural Network Hardware Accelerators
Hardware Architecture
Makes AI run much faster on small devices.
Efficient Edge Test-Time Adaptation via Latent Feature Coordinate Correction
Machine Learning (CS)
Makes smart devices learn faster with less power.