Analysis of Single Event Induced Bit Faults in a Deep Neural Network Accelerator Pipeline
By: Naïn Jonckers, Toon Vinck, Peter Karsmakers and more
Potential Business Impact:
Protects AI chips from radiation damage.
In recent years, the increased interest in Artificial Intelligence (AI), and more specifically Deep Neural Networks (DNNs), and the growth in their application domains have led to extensive use of domain-specific DNN accelerator processors to improve the computational efficiency of DNN inference. However, like any digital circuit, these processors are prone to faults induced by radiation particles such as heavy ions and protons, which makes their use in harsh radiation environments a challenge. This work presents an in-depth analysis of the impact of such faults on the computational pipeline of a Systolic-Array-based Deep Neural Network accelerator (SA-DNN accelerator) by means of Register Transfer Level (RTL) Fault Injection (FI) simulation, chosen to improve the observability of each hardware block. From this analysis, we present the sensitivity to single-bit faults of register groups in the pipeline for three different DNN workloads utilising two datasets, namely MNIST and CIFAR-10. These sensitivity figures are presented in terms of Fault Propagation Probability ($P(f_{\text{non-crit}})$) and False Classification Probability ($P(f_{\text{crit}})$), which respectively give the probability that an injected fault causes a non-critical error (a numerical offset in the output) or a critical error (a classification fault). From these results, we devise a fault mitigation strategy that hardens the SA-DNN accelerator efficiently in terms of both area and power overhead.
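To make the two metrics concrete, the following is a minimal software sketch of a single-bit fault injection campaign. It is not the authors' RTL setup: the toy integer layer, the ReLU-plus-shift requantisation, the 32-bit accumulator width, and all function names (`flip_bit`, `infer`, `campaign`) are assumptions made for illustration. The sketch also assumes that each injection falls into exactly one of three outcomes (masked, non-critical, critical), which may differ from how the paper counts them.

```python
import random

def to_signed(raw, width):
    """Interpret the low `width` bits of `raw` as a two's-complement integer."""
    raw &= (1 << width) - 1
    return raw - (1 << width) if raw >> (width - 1) else raw

def flip_bit(value, bit, width=32):
    """Flip one bit of a signed register of the given width."""
    return to_signed((value & ((1 << width) - 1)) ^ (1 << bit), width)

def infer(weights, x, fault=None, width=32, shift=8):
    """Toy integer layer with one accumulator register per class.

    fault = (row, step, bit) flips `bit` in the accumulator of class `row`
    right after MAC number `step`; fault=None is the fault-free (golden) run.
    The output stage (ReLU, then right shift) is a hypothetical requantiser
    that lets some faults be masked.
    """
    outputs = []
    for r, w_row in enumerate(weights):
        acc = 0
        for k, (w, xi) in enumerate(zip(w_row, x)):
            acc = to_signed(acc + w * xi, width)
            if fault is not None and fault[0] == r and fault[1] == k:
                acc = flip_bit(acc, fault[2], width)
        outputs.append(max(acc, 0) >> shift)
    return outputs

def classify(outputs):
    """Index of the largest output (first index wins on ties)."""
    return max(range(len(outputs)), key=outputs.__getitem__)

def campaign(weights, inputs, n_injections, rng, width=32):
    """Estimate outcome probabilities from random single-bit injections."""
    counts = {"masked": 0, "non_crit": 0, "crit": 0}
    for _ in range(n_injections):
        x = rng.choice(inputs)
        fault = (rng.randrange(len(weights)),
                 rng.randrange(len(x)),
                 rng.randrange(width))
        golden = infer(weights, x, width=width)
        faulty = infer(weights, x, fault, width=width)
        if faulty == golden:
            counts["masked"] += 1        # fault never reached the output
        elif classify(faulty) == classify(golden):
            counts["non_crit"] += 1      # numerical offset, same class
        else:
            counts["crit"] += 1          # predicted class changed
    return {k: v / n_injections for k, v in counts.items()}

if __name__ == "__main__":
    rng = random.Random(0)
    # Random stand-ins for trained weights and input samples.
    weights = [[rng.randint(-128, 127) for _ in range(16)] for _ in range(4)]
    inputs = [[rng.randint(0, 255) for _ in range(16)] for _ in range(8)]
    print(campaign(weights, inputs, 1000, rng))
```

In this sketch, `non_crit` and `crit` play the roles of $P(f_{non-crit})$ and $P(f_{crit})$ for a single register group (the accumulators). Repeating the campaign with the fault site restricted to other registers would give a per-group comparison of the kind the abstract describes.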