GitHub Source
- Converted convolution into matrix multiplication by unrolling input features and filters
- Implemented tiling for memory reuse and double buffering to reduce synchronization overhead in CUDA
- Classified 10,000 images in 60 ms, an 80x speedup over the baseline
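
The unrolling and tiling steps above can be sketched on the CPU as follows. This is a minimal illustration, not the project's CUDA kernels: the function names (`unroll_input`, `tiled_matmul`, `conv_as_matmul`), the tile size of 16, and the stride-1, no-padding layout are all assumptions made for the example. Double buffering is specific to the GPU implementation and is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: unroll a C x H x W input (row-major) into a
// (C*K*K) x (H_out*W_out) matrix, assuming stride 1 and no padding.
std::vector<float> unroll_input(const std::vector<float>& x,
                                int C, int H, int W, int K) {
    assert(x.size() == static_cast<std::size_t>(C) * H * W);
    const int H_out = H - K + 1, W_out = W - K + 1;
    const std::size_t n_cols = static_cast<std::size_t>(H_out) * W_out;
    std::vector<float> cols(static_cast<std::size_t>(C) * K * K * n_cols);
    for (int c = 0; c < C; ++c)
        for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q) {
                // One row per (channel, filter row, filter column) triple.
                const std::size_t row = (static_cast<std::size_t>(c) * K + p) * K + q;
                for (int h = 0; h < H_out; ++h)
                    for (int w = 0; w < W_out; ++w)
                        cols[row * n_cols + static_cast<std::size_t>(h) * W_out + w] =
                            x[(static_cast<std::size_t>(c) * H + h + p) * W + (w + q)];
            }
    return cols;
}

// Hypothetical helper: blocked product out[M x N] = a[M x Kd] * b[Kd x N].
// Each TILE x TILE block is reused while it is hot in cache, which mirrors
// the reuse a GPU kernel gets from staging tiles in fast on-chip memory.
std::vector<float> tiled_matmul(const std::vector<float>& a,
                                const std::vector<float>& b,
                                int M, int Kd, int N) {
    constexpr int TILE = 16;  // assumed tile size
    std::vector<float> out(static_cast<std::size_t>(M) * N, 0.0f);
    for (int i0 = 0; i0 < M; i0 += TILE)
        for (int k0 = 0; k0 < Kd; k0 += TILE)
            for (int j0 = 0; j0 < N; j0 += TILE)
                for (int i = i0; i < std::min(i0 + TILE, M); ++i)
                    for (int k = k0; k < std::min(k0 + TILE, Kd); ++k) {
                        const float a_ik = a[static_cast<std::size_t>(i) * Kd + k];
                        for (int j = j0; j < std::min(j0 + TILE, N); ++j)
                            out[static_cast<std::size_t>(i) * N + j] +=
                                a_ik * b[static_cast<std::size_t>(k) * N + j];
                    }
    return out;
}

// Filters stored as M x C x K x K (row-major) are already an M x (C*K*K)
// matrix, so the convolution (cross-correlation, as in CNN layers) reduces
// to one matrix product against the unrolled input.
std::vector<float> conv_as_matmul(const std::vector<float>& x,
                                  const std::vector<float>& filters,
                                  int C, int H, int W, int K, int M) {
    const int H_out = H - K + 1, W_out = W - K + 1;
    return tiled_matmul(filters, unroll_input(x, C, H, W, K),
                        M, C * K * K, H_out * W_out);
}
```

For example, calling `conv_as_matmul` on a single-channel 3x3 input with one 2x2 filter returns a flat vector of four values, one per output position. The unrolled matrix duplicates overlapping input patches, trading extra memory for a single large, regular matrix product.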