Sparse matrix vector multiplications on graphic processors
M. Wafai. University of Stuttgart, Nobelstr. 19, 70569, Stuttgart, (November 2009)
Modern computer architecture is moving towards multi-core systems. Intel processors, such as the Xeon, now come with two or even four cores. Graphics Processing Units (GPUs) are highly parallel multi-core processors with tremendous performance, designed specifically for 3D and real-time graphics. Since NVIDIA introduced the Compute Unified Device Architecture (CUDA) API, the GPU has become an attractive choice for general-purpose parallel computing on many complex numerical problems.
Sparse Matrix-Vector (SpMV) multiplication is one of the most important kernels in scientific computing. Its sparsity, irregularity and indirect addressing make it challenging to map onto multi-core systems.
The objective of this work is to analyze the execution speed of SpMV multiplication on NVIDIA GPUs (Tesla C1060). An algorithm based on a tailored version of ELLPACK, called Aligned-ELLPACK-R, has been developed, along with further algorithms based on other storage formats. These implementations are written in CUDA. Finally, their performance is compared with different implementations of SpMV on an Intel Xeon E5560 processor using the Jagged Diagonal (JAD), ELLPACK and ELLPACK-R storage formats.
The results show that the JAD storage format is superior for the matrices used to test SpMV on conventional superscalar processors. SpMV on the Tesla C1060 based on Aligned-ELLPACK-R outperforms the fastest CPU implementation by a factor of 13. It also outperforms the ELLPACK-based CUDA library by a factor of 2.3.
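To make the ELLPACK-R format named above concrete, here is a minimal CUDA sketch of an SpMV kernel for it. It shows plain ELLPACK-R only, not the Aligned-ELLPACK-R variant developed in this work, whose alignment details are not given in the abstract. It assumes the usual ELLPACK-R layout: values and column indices padded to the longest row and stored column-major, plus an array of true row lengths. All names (`spmv_ellpack_r`, `spmv_example`, `val`, `col`, `rl`) and the 3x3 test matrix are hypothetical and chosen for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per matrix row. val and col are stored column-major
// (entry k of row r sits at index k * n + r), so neighbouring threads
// read neighbouring addresses. rl[r] is the true length of row r, which
// lets each thread skip the padding. x is read through col[], which is
// the indirect addressing that makes SpMV irregular.
__global__ void spmv_ellpack_r(int n, const float* val, const int* col,
                               const int* rl, const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    float sum = 0.0f;
    for (int k = 0; k < rl[row]; ++k) {
        int idx = k * n + row;
        sum += val[idx] * x[col[idx]];
    }
    y[row] = sum;
}

// Hypothetical host helper: multiplies the illustrative matrix
//   [1 0 2]
//   [0 3 0]
//   [4 5 6]
// by x = (1, 2, 3). Error checking is omitted to keep the sketch short.
std::vector<float> spmv_example()
{
    const int n = 3;        // number of rows
    const int max_len = 3;  // longest row, sets the padded width

    // Column-major padded storage: first the k = 0 entry of every row,
    // then the k = 1 entries, and so on. Padding uses value 0, column 0.
    const float h_val[n * max_len] = {1, 3, 4,  2, 0, 5,  0, 0, 6};
    const int   h_col[n * max_len] = {0, 1, 0,  2, 0, 1,  0, 0, 2};
    const int   h_rl[n]            = {2, 1, 3};
    const float h_x[n]             = {1, 2, 3};
    std::vector<float> h_y(n, 0.0f);

    float *d_val, *d_x, *d_y;
    int *d_col, *d_rl;
    cudaMalloc(&d_val, sizeof(h_val));
    cudaMalloc(&d_col, sizeof(h_col));
    cudaMalloc(&d_rl,  sizeof(h_rl));
    cudaMalloc(&d_x,   sizeof(h_x));
    cudaMalloc(&d_y,   n * sizeof(float));

    cudaMemcpy(d_val, h_val, sizeof(h_val), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, h_col, sizeof(h_col), cudaMemcpyHostToDevice);
    cudaMemcpy(d_rl,  h_rl,  sizeof(h_rl),  cudaMemcpyHostToDevice);
    cudaMemcpy(d_x,   h_x,   sizeof(h_x),   cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    spmv_ellpack_r<<<blocks, threads>>>(n, d_val, d_col, d_rl, d_x, d_y);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_val); cudaFree(d_col); cudaFree(d_rl);
    cudaFree(d_x);   cudaFree(d_y);
    return h_y;
}
```

Calling `spmv_example()` from a `main` function on a machine with a CUDA-capable GPU should return (7, 6, 32) for this matrix. The row-length array is what separates ELLPACK-R from plain ELLPACK: each thread stops at the end of its own row instead of iterating over the padded width.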