CNN Models Acceleration Using Filter Pruning and Sparse Tensor Core

Hong-Xuan Wei, Pangfeng Liu, Ding-Yong Hong, Jan-Jan Wu, An-Tai Chen

Abstract


Convolutional neural networks (CNNs) are a state-of-the-art technique in machine learning and have achieved high accuracy in many computer vision applications. The number of parameters in CNN models is growing rapidly in pursuit of higher accuracy, which demands more computation time and memory space for both training and inference. As a result, reducing model size and improving inference speed have become critical issues for CNNs. This paper focuses on filter pruning and optimizations specific to the NVIDIA sparse tensor core. Filter pruning is a model compression technique that evaluates the importance of the filters in a CNN model and removes the less critical ones. The NVIDIA sparse tensor core is specialized hardware for CNN computation introduced with the NVIDIA Ampere GPU architecture; it can speed up a matrix multiplication when the matrix follows the 2:4 sparsity pattern, that is, at most two nonzero values in every group of four consecutive elements.
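To make the 2:4 constraint concrete, the following is a minimal PyTorch sketch (not from the paper) that prunes a weight tensor to the 2:4 pattern by keeping the two largest-magnitude values in every group of four consecutive elements; the function name prune_2_4 and the magnitude-based selection are illustrative assumptions.

import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    # Zero the two smallest-magnitude values in every group of four
    # consecutive elements, yielding the 2:4 pattern that the sparse
    # tensor core exploits. Assumes the element count is divisible by 4.
    groups = weight.reshape(-1, 4)
    _, drop = torch.topk(groups.abs(), k=2, dim=1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop, 0.0)  # zero out the two smallest per group
    return (groups * mask).reshape(weight.shape)

# Example: every group of four entries keeps at most two nonzeros.
w = torch.randn(4, 8)
print(prune_2_4(w))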

This paper proposes hybrid pruning to prune CNN models. Hybrid pruning combines filter pruning and 2:4 pruning. We first apply filter pruning to remove redundant filters and reduce the model size. Next, we use 2:4 pruning to prune the model into the 2:4 pattern so that the sparse tensor core hardware can be used for speedup. For this hybrid pruning scenario, we also propose two hybrid metrics to determine a filter's importance during filter pruning. By considering both pruning steps, the hybrid ranking metrics preserve the filters that are essential to each step and achieve higher accuracy than traditional filter pruning. We test our hybrid pruning algorithm on the MNIST, SVHN, and CIFAR-10 datasets using AlexNet. Our experiments show that our hybrid metrics achieve better accuracy than the classic L1-norm metric and the output L1-norm metric. When we prune away 40% of the filters in the model, our methods achieve 2.8% to 3.3%, 2.9% to 3.5%, and 2.5% to 2.7% higher accuracy than the classic L1-norm and output L1-norm metrics on the three datasets, respectively. We also evaluate the inference speed of the model produced by hybrid pruning, comparing it with the models produced by filter pruning alone and by 2:4 pruning alone. We find that the hybrid pruning model runs up to 1.3x faster than the traditional filter pruning model while maintaining similar accuracy.
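The abstract does not define the hybrid metrics themselves, so the sketch below (our assumption, not the paper's code) only illustrates the two-step pipeline: rank filters with the classic L1-norm baseline mentioned above, remove the lowest-ranked 40%, then apply the 2:4 pattern to the surviving weights, reusing prune_2_4 from the earlier sketch. The helper names and the nn.Conv2d rebuild are hypothetical.

import torch
import torch.nn as nn

def filter_l1_scores(conv: nn.Conv2d) -> torch.Tensor:
    # Classic L1-norm ranking: score each output filter by the L1 norm
    # of its weights (a baseline metric named in the abstract).
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def hybrid_prune(conv: nn.Conv2d, prune_ratio: float = 0.4) -> nn.Conv2d:
    # Step 1: filter pruning -- keep the highest-scoring filters.
    scores = filter_l1_scores(conv)
    n_keep = int(conv.out_channels * (1.0 - prune_ratio))
    keep = torch.topk(scores, n_keep).indices.sort().values

    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
        # Step 2: 2:4 pruning of the surviving weights, viewed as an
        # (n_keep, in_channels * kh * kw) matrix; prune_2_4 is the
        # helper from the earlier sketch (row length must divide by 4).
        flat = pruned.weight.reshape(n_keep, -1)
        pruned.weight.copy_(prune_2_4(flat).reshape(pruned.weight.shape))
    return pruned

A hybrid metric would replace filter_l1_scores with a ranking that also accounts for the subsequent 2:4 step, so that filters essential to both pruning steps are preserved, as the abstract describes.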


Keywords


Model Compression; Filter Pruning; CNN; Machine Learning; Sparse Tensor Core
