Weight Transformations in Bit-Sliced Crossbar Arrays for Fault Tolerant Computing-in-Memory: Design Techniques and Evaluation Framework
By: Akul Malhotra, Sumeet Kumar Gupta
Potential Business Impact:
Fixes computer chips that make AI mistakes.
The deployment of deep neural networks (DNNs) on compute-in-memory (CiM) accelerators offers significant energy savings and speed-up by reducing data movement during inference. However, the reliability of CiM-based systems is challenged by stuck-at faults (SAFs) in memory cells, which corrupt stored weights and degrade accuracy. While closest value mapping (CVM) has been shown to partially mitigate these effects for multi-bit DNNs deployed on bit-sliced crossbars, its fault tolerance is often insufficient under high SAF rates or for complex tasks. In this work, we propose two training-free weight transformation techniques, sign-flip and bit-flip, that enhance SAF tolerance in multi-bit DNNs deployed on bit-sliced crossbar arrays. Sign-flip operates at the weight-column level by selecting between a weight and its negation, whereas bit-flip provides finer granularity by selectively inverting individual bit slices. Both methods expand the search space for fault-aware mappings, operate synergistically with CVM, and require no retraining or additional memory. To enable scalability, we introduce a look-up-table (LUT)-based framework that accelerates the computation of optimal transformations and supports rapid evaluation across models and fault rates. Extensive experiments on ResNet-18, ResNet-50, and ViT models with CIFAR-100 and ImageNet demonstrate that the proposed techniques recover most of the accuracy lost under SAF injection. Hardware analysis shows that sign-flip incurs negligible energy, latency, and area cost, while bit-flip provides higher fault resilience with modest overheads. These results establish sign-flip and bit-flip as practical and scalable SAF-mitigation strategies for CiM-based DNN accelerators.
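To make the idea in the abstract concrete, here is a minimal toy sketch of how sign-flip and bit-flip can widen the set of values that CVM is able to reach under stuck-at faults. It is not the authors' implementation or their LUT-based framework. Everything specific is an assumption made for illustration: 4-bit two's-complement weights, one bit per cell, a per-weight fault list with `-1` for a healthy cell, brute-force search, absolute error as the cost, and all function names (`closest_value`, `column_error`, `best_sign_flip`, `best_bit_flip`).

```python
N_BITS = 4     # toy precision; one bit per cell is an assumption of this sketch
NO_FAULT = -1  # fault entry: -1 = healthy, 0 = stuck-at-0, 1 = stuck-at-1


def decode(pattern):
    """Interpret an N_BITS-wide bit pattern as a two's-complement integer."""
    return pattern - (1 << N_BITS) if pattern >> (N_BITS - 1) else pattern


def closest_value(target, cell_faults, flip_mask=0):
    """CVM: closest signed value reachable given the stuck cells of one weight.

    cell_faults[k] is the fault state of the cell holding bit k (LSB first).
    flip_mask marks bit slices that are stored inverted and re-inverted on read.
    """
    best = None
    for stored in range(1 << N_BITS):
        # Skip stored patterns that disagree with a stuck cell.
        if any(f != NO_FAULT and (stored >> k) & 1 != f
               for k, f in enumerate(cell_faults)):
            continue
        value = decode(stored ^ flip_mask)  # value seen after undoing the flips
        if best is None or abs(value - target) < abs(best - target):
            best = value
    return best


def column_error(weights, faults, sign=1, flip_mask=0):
    """Total absolute weight error of one column after CVM under a transformation."""
    return sum(abs(sign * closest_value(sign * w, f, flip_mask) - w)
               for w, f in zip(weights, faults))


def best_sign_flip(weights, faults):
    """Pick +1 (store w) or -1 (store -w, negate the column output)."""
    return min((1, -1), key=lambda s: column_error(weights, faults, sign=s))


def best_bit_flip(weights, faults):
    """Pick the per-slice inversion mask with the lowest column error."""
    return min(range(1 << N_BITS),
               key=lambda m: column_error(weights, faults, flip_mask=m))


if __name__ == "__main__":
    weights = [3, -2, 5]
    # The MSB cell is stuck at 1 for the first weight and at 0 for the second.
    faults = [[-1, -1, -1, 1], [-1, -1, -1, 0], [-1, -1, -1, -1]]
    print("CVM only:", column_error(weights, faults))
    s = best_sign_flip(weights, faults)
    print("sign-flip:", s, column_error(weights, faults, sign=s))
    m = best_bit_flip(weights, faults)
    print("bit-flip mask:", bin(m), column_error(weights, faults, flip_mask=m))
```

In this toy column the two stuck MSB cells force the wrong sign on two weights, so CVM alone cannot reach the targets, while either negating the column or inverting the MSB slice makes them reachable again. The exhaustive search over all `2**N_BITS` masks is only workable at this toy scale; the abstract's LUT-based framework addresses that cost in the real setting.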
Similar Papers
SafeCiM: Investigating Resilience of Hybrid Floating-Point Compute-in-Memory Deep Learning Accelerators
Hardware Architecture
Makes AI chips more reliable against errors.
Has the Two-Decade-Old Prophecy Come True? Artificial Bad Intelligence Triggered by Merely a Single-Bit Flip in Large Language Models
Cryptography and Security
Makes AI say wrong or bad things by changing one tiny part.
Row-Column Hybrid Grouping for Fault-Resilient Multi-Bit Weight Representation on IMC Arrays
Hardware Architecture
Fixes computer errors, speeds up learning, saves power.