High Performance GPU Computing

Implemented a baseline many-core parallel algorithm for triangle counting using general matrix multiplication (GEMM)
Optimized the algorithm by applying GPU-specialized sparse matrix techniques, reducing runtime by a factor of 2-3
Further optimized the algorithm with thread coarsening and privatization, yielding an additional 25% speedup
Counted the triangles in a graph with 200K nodes in under 50 s