Computing-In-Memory Aware Model Adaption For Edge Devices
By: Ming-Han Lin, Tian-Sheuan Chang
Potential Business Impact:
Makes AI chips faster and smaller.
Computing-in-Memory (CIM) macros have gained popularity for deep learning acceleration due to their highly parallel computation and low power consumption. However, limited macro size and ADC precision introduce throughput and accuracy bottlenecks. This paper proposes a two-stage CIM-aware model adaptation process. The first stage compresses the model and reallocates resources based on layer importance and macro size constraints, reducing model weight loading latency while improving resource utilization and maintaining accuracy. The second stage performs quantization-aware training, incorporating partial sum quantization and ADC precision to mitigate quantization errors in inference. The proposed approach enhances CIM array utilization to 90%, enables concurrent activation of up to 256 word lines, and achieves up to 93% compression, all while preserving accuracy comparable to previous methods.
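The abstract does not give implementation details, but the second stage can be illustrated with a rough PyTorch-style sketch: a linear layer whose inputs are split into groups matching the macro's word-line count, with each group's partial sum quantized to the ADC's precision before accumulation, trained through a straight-through estimator. The function names (`quantize_ste`, `cim_linear`), the 256-row group size, and the 8-bit ADC default below are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def quantize_ste(x, scale, n_bits):
    """Uniform quantizer with a straight-through estimator (STE) for QAT."""
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    x_q = q * scale
    # Forward uses the quantized value; backward passes gradients straight through.
    return x + (x_q - x).detach()

def cim_linear(x, weight, macro_rows=256, adc_bits=8):
    """
    Simulate a CIM matrix-vector multiply: input channels are split into groups
    of `macro_rows` word lines, each group's analog partial sum is digitized by
    an ADC of `adc_bits` precision, and the quantized partial sums are summed.
    x: (batch, in_features), weight: (out_features, in_features)
    """
    out = torch.zeros(x.shape[0], weight.shape[0], device=x.device)
    for start in range(0, x.shape[1], macro_rows):
        end = min(start + macro_rows, x.shape[1])
        partial = x[:, start:end] @ weight[:, start:end].t()
        # Per-tensor scale for simplicity; a real macro would calibrate per column.
        scale = partial.abs().max().clamp(min=1e-8) / (2 ** (adc_bits - 1) - 1)
        out = out + quantize_ste(partial, scale, adc_bits)
    return out
```

Training with this forward path exposes the network to the same partial-sum quantization error it will see on the CIM hardware, which is the general idea behind the paper's ADC-aware quantization-aware training stage.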
Similar Papers
Computing-In-Memory Dataflow for Minimal Buffer Traffic
Hardware Architecture
Makes AI chips faster and more power-efficient.
A digital SRAM-based compute-in-memory macro for weight-stationary dynamic matrix multiplication in Transformer attention score computation
Hardware Architecture
Makes AI faster and more power-efficient.