C Parallel Programming GPU System Programming

Optimization of the Convolution Layer in an MXNet Neural Network for GPU

Posted by Susie on December 15, 2017

Converted convolution into matrix multiplication by unrolling the input features and filters
Implemented tiling for memory reuse and double buffering to reduce synchronization overhead, using CUDA
Classified 10,000 images in 60 ms, an 80x speedup over the baseline
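
To make the unrolling step concrete, here is a minimal CPU sketch in plain C, not the project's CUDA kernels. It assumes a single input sample stored as x[C][H][W], filters stored as k[M][C][K][K], stride 1 and no padding. The function names (unroll_input, matmul, conv_direct, unroll_matches_direct) are hypothetical and chosen for illustration only. The idea is that each K x K input patch becomes one column of an unrolled matrix, so the whole layer reduces to one (M) x (C*K*K) by (C*K*K) x (H_out*W_out) matrix product. The shared-memory tiling and double buffering mentioned above are GPU-specific and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Unroll x[C][H][W] into a (C*K*K) x (H_out*W_out) row-major matrix.
 * Assumes stride 1 and no padding (illustrative assumption). */
void unroll_input(const float *x, float *x_unroll, int C, int H, int W, int K)
{
    assert(K >= 1 && K <= H && K <= W);
    const int H_out = H - K + 1, W_out = W - K + 1;
    for (int c = 0; c < C; ++c)
        for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q) {
                /* One row per (channel, filter row, filter column). */
                const size_t row = (size_t)(c * K + p) * K + q;
                for (int h = 0; h < H_out; ++h)
                    for (int w = 0; w < W_out; ++w)
                        x_unroll[row * (size_t)(H_out * W_out) + h * W_out + w] =
                            x[((size_t)c * H + (h + p)) * W + (w + q)];
            }
}

/* Plain row-major matrix product: c (m x p) = a (m x n) * b (n x p). */
void matmul(const float *a, const float *b, float *c, int m, int n, int p)
{
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < p; ++j) {
            float acc = 0.0f;
            for (int t = 0; t < n; ++t)
                acc += a[(size_t)i * n + t] * b[(size_t)t * p + j];
            c[(size_t)i * p + j] = acc;
        }
}

/* Reference: direct convolution (cross-correlation form, as in CNN layers). */
void conv_direct(const float *x, const float *k, float *y,
                 int M, int C, int H, int W, int K)
{
    const int H_out = H - K + 1, W_out = W - K + 1;
    for (int m = 0; m < M; ++m)
        for (int h = 0; h < H_out; ++h)
            for (int w = 0; w < W_out; ++w) {
                float acc = 0.0f;
                for (int c = 0; c < C; ++c)
                    for (int p = 0; p < K; ++p)
                        for (int q = 0; q < K; ++q)
                            acc += x[((size_t)c * H + (h + p)) * W + (w + q)] *
                                   k[(((size_t)m * C + c) * K + p) * K + q];
                y[((size_t)m * H_out + h) * W_out + w] = acc;
            }
}

/* Returns 1 if the unroll + matmul path matches direct convolution
 * on a tiny example with small integer values, 0 otherwise. */
int unroll_matches_direct(void)
{
    enum { M = 2, C = 2, H = 4, W = 4, K = 3, H_OUT = 2, W_OUT = 2 };
    float x[C * H * W], k[M * C * K * K];
    float xu[C * K * K * H_OUT * W_OUT];
    float y_direct[M * H_OUT * W_OUT], y_unrolled[M * H_OUT * W_OUT];

    for (int i = 0; i < C * H * W; ++i) x[i] = (float)((i % 5) - 2);
    for (int i = 0; i < M * C * K * K; ++i) k[i] = (float)((i % 3) - 1);

    conv_direct(x, k, y_direct, M, C, H, W, K);
    unroll_input(x, xu, C, H, W, K);
    /* The filter tensor k[M][C][K][K] is already an M x (C*K*K) matrix. */
    matmul(k, xu, y_unrolled, M, C * K * K, H_OUT * W_OUT);

    for (int i = 0; i < M * H_OUT * W_OUT; ++i)
        if (y_direct[i] != y_unrolled[i]) return 0;
    return 1;
}
```

Calling unroll_matches_direct() from a main function is enough to check that both paths agree on the small example. The unrolled matrix duplicates overlapping input pixels, trading extra memory for a computation that maps onto a standard matrix multiply, which is the part a tiled GPU kernel would then accelerate.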