Integrating Pruning with Quantization for Efficient Deep Neural Network Compression
By: Sara Makenali, Babak Rokh, Ali Azarpeyvand
Potential Business Impact:
Makes deep learning models smaller and faster so they can run on low-power devices.
Deep Neural Networks (DNNs) have achieved significant advances in a wide range of applications. However, their deployment on resource-constrained devices remains challenging because their many layers and parameters impose considerable computational and memory demands. To address this, pruning and quantization are two widely used compression techniques, though most studies apply them individually to reduce model size and improve processing speed. Combining the two can yield even greater compression, but integrating them effectively to exploit their complementary advantages is difficult, primarily because of their potential impact on model accuracy and the complexity of jointly optimizing both processes. In this paper, we propose two approaches that integrate similarity-based filter pruning with Adaptive Power-of-Two (APoT) quantization to achieve higher compression efficiency while preserving model accuracy. In the first approach, pruning and quantization are applied simultaneously during training. In the second, pruning is performed first to remove less important parameters, and the pruned model is then quantized using low-bit representations. Experimental results demonstrate that both approaches achieve effective model compression with minimal accuracy degradation, making them well suited for deployment on devices with limited computational resources.
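To make the two building blocks concrete, below is a minimal PyTorch sketch, assuming cosine similarity as the filter-similarity measure and a single power-of-two level per weight. The function names (`similarity_prune_mask`, `pot_quantize`) and all hyperparameters are illustrative, not the authors' implementation; in particular, `pot_quantize` is a single-term simplification of APoT, which combines multiple power-of-two terms.

```python
import torch
import torch.nn.functional as F


def similarity_prune_mask(conv_weight: torch.Tensor, prune_ratio: float = 0.3) -> torch.Tensor:
    """Return a boolean keep-mask over output filters, dropping the most
    redundant ones by pairwise cosine similarity (an illustrative stand-in
    for the paper's similarity-based filter pruning)."""
    n = conv_weight.shape[0]
    flat = conv_weight.reshape(n, -1)
    # Pairwise cosine similarity between all filters: an (n, n) matrix.
    sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)
    sim.fill_diagonal_(-1.0)  # a filter is not redundant with itself
    # Score each filter by its highest similarity to any other filter;
    # the most redundant filters are pruned first.
    redundancy = sim.max(dim=1).values
    k = int(prune_ratio * n)
    keep = torch.ones(n, dtype=torch.bool)
    if k > 0:
        keep[torch.topk(redundancy, k).indices] = False
    return keep


def pot_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round each weight to the nearest signed power-of-two level
    representable in `bits` bits (a simplified stand-in for APoT)."""
    sign = w.sign()
    mag = w.abs().clamp(min=1e-8)
    max_exp = mag.max().log2().floor().item()
    min_exp = max_exp - (2 ** (bits - 1) - 1) + 1  # lowest usable exponent
    exp = mag.log2().round().clamp(min_exp, max_exp)
    return sign * torch.pow(2.0, exp)


# Second approach from the abstract: prune first, then quantize the survivors.
conv = torch.nn.Conv2d(3, 16, kernel_size=3)
keep = similarity_prune_mask(conv.weight.data, prune_ratio=0.25)
with torch.no_grad():
    conv.weight.data[~keep] = 0.0  # remove redundant filters
    conv.weight.data[keep] = pot_quantize(conv.weight.data[keep], bits=4)
```

For the first approach, which applies both operations during training, one would typically apply the mask and the quantizer in the forward pass and back-propagate through the rounding with a straight-through estimator; that detail is omitted here for brevity.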
Similar Papers
Compressing CNN models for resource-constrained systems by channel and layer pruning
Machine Learning (CS)
Makes smart computer programs smaller and faster.
CoDeQ: End-to-End Joint Model Compression with Dead-Zone Quantizer for High-Sparsity and Low-Precision Networks
Machine Learning (CS)
Shrinks computer programs without losing quality.
SQS: Bayesian DNN Compression through Sparse Quantized Sub-distributions
Machine Learning (CS)
Makes AI smaller and faster for phones.