SparsePixels: Efficient Convolution for Sparse Data on FPGAs
By: Ho Fung Tsoi, Dylan Rankin, Vladimir Loncar, and more
Potential Business Impact:
Makes AI see important parts of pictures faster.
Inference of standard CNNs on FPGAs often incurs high latency and a long initiation interval due to the deep nested loops required to densely convolve every input pixel regardless of its feature value, especially when the image size is large. However, some image data are spatially sparse: semantic information may occupy only a small fraction of the input pixels, so most computation is wasted on empty regions. In this work, we introduce SparsePixels, a framework for efficient convolution on spatially sparse image data on FPGAs, targeting fast inference applications in constrained environments with latency requirements of microseconds or below. Our approach implements a special class of CNNs that selectively retain and compute on the small subset of pixels that are active while ignoring the rest. We show that, for example, on a neutrino physics dataset for identifying neutrino interactions in LArTPC images, which have around 4k input pixels but are naturally very sparse, a standard CNN with a compact size of 4k parameters incurs an inference latency of 48.665 μs on an FPGA, whereas a sparse CNN of the same base architecture, computing on less than 1% of the input pixels, achieves a 73× inference speedup to 0.665 μs, with resource utilization well within on-chip budgets, trading only a percent-level performance loss. Speedups of at least one order of magnitude with comparable performance are also demonstrated on similar datasets with sparse image patterns. This work aims to benefit future algorithm development for fast and efficient data readout in modern experiments, such as the trigger and data acquisition systems at the CERN Large Hadron Collider. For easy adoption, we have developed a library to support building sparse CNNs with quantization-aware training, as well as an HLS implementation for FPGA deployment.
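To make the core idea concrete, here is a minimal sketch of a sparse convolution that computes outputs only at active pixel locations, skipping empty regions entirely. This is an illustrative NumPy reference model, not the paper's library or HLS implementation; the function name `sparse_conv2d` and its coordinate-list interface are assumptions for this example.

```python
import numpy as np

def sparse_conv2d(coords, feats, kernel):
    """Sparse 2D cross-correlation restricted to active pixels.

    coords : (N, 2) int array of (row, col) positions of active pixels
    feats  : (N, C_in) feature vectors at those pixels
    kernel : (K, K, C_in, C_out) weights, K odd

    Outputs are produced only at the N active locations, and each output
    sums contributions only from active neighbors, so the cost scales
    with the number of active pixels rather than the full image area.
    """
    K = kernel.shape[0]
    r = K // 2
    # Hash map from coordinate to row index for O(1) neighbor lookup.
    index = {(int(y), int(x)): i for i, (y, x) in enumerate(coords)}
    out = np.zeros((len(coords), kernel.shape[3]))
    for i, (y, x) in enumerate(coords):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                j = index.get((int(y) + dy, int(x) + dx))
                if j is not None:  # contribute only where a neighbor is active
                    out[i] += feats[j] @ kernel[dy + r, dx + r]
    return out
```

Because outputs are emitted only at input-active sites, this sketch preserves the sparsity pattern across layers (in the spirit of submanifold sparse convolutions), which is what keeps the per-layer work bounded by the number of active pixels.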
Similar Papers
Accelerating Sparse Convolutions in Voxel-Based Point Cloud Networks
Distributed, Parallel, and Cluster Computing
Makes self-driving cars see better and faster.
LogicSparse: Enabling Engine-Free Unstructured Sparsity for Quantised Deep-learning Accelerators
Hardware Architecture
Makes smart devices run faster using less power.
A Resource-Driven Approach for Implementing CNNs on FPGAs Using Adaptive IPs
Hardware Architecture
Makes AI run faster on small chips.