TY - GEN
T1 - DGEMM using tensor cores, and its accurate and reproducible versions
AU - Mukunoki, Daichi
AU - Ozaki, Katsuhisa
AU - Ogita, Takeshi
AU - Imamura, Toshiyuki
N1 - Funding Information:
This research was partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286 and by MEXT as an “Exploratory Issue on Post-K computer” (Development of verified numerical computations and super high-performance computing environment for extreme researches). We thank Takeshi Terao (Shibaura Institute of Technology) for his helpful suggestion regarding the idea of blocking toward the inner product. This research used the computational resources of Cygnus (for the Tesla V100) provided by the Multidisciplinary Cooperative Research Program at the Center for Computational Sciences, University of Tsukuba.
Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - This paper proposes a method for implementing dense matrix multiplication on FP64 (DGEMM) and FP32 (SGEMM) using Tensor Cores on NVIDIA’s graphics processing units (GPUs). Tensor Cores are special processing units that perform 4×4 matrix multiplications on FP16 inputs with FP32 precision and return the result in FP32. The proposed method adopts the Ozaki scheme, an accurate matrix multiplication algorithm based on error-free transformation for matrix multiplication. The proposed method has three prominent advantages: first, it can be built upon the cublasGemmEx routine, which uses Tensor Core operations; second, it can achieve higher accuracy than standard DGEMM, up to the correctly rounded result; third, it ensures bit-level reproducibility even for different numbers of cores and threads. The achievable performance of the method depends on the absolute-value range of the elements of the input matrices. For example, when the matrices were initialized with random numbers over a dynamic range of 1E+9, our DGEMM-equivalent implementation achieved up to approximately 980 GFlops of FP64 operation on a Titan RTX GPU (which delivers 130 TFlops on Tensor Cores), whereas cublasDgemm achieves only 539 GFlops on the FP64 floating-point units. Our results reveal the possibility of utilizing hardware with limited FP32/FP64 resources but fast low-precision processing units (such as AI-oriented processors) for general-purpose workloads.
KW - Accuracy
KW - FP16
KW - GEMM
KW - Half-precision
KW - Linear algebra
KW - Low-precision
KW - Matrix multiplication
KW - Reproducibility
KW - Tensor cores
UR - http://www.scopus.com/inward/record.url?scp=85087032468&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85087032468&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-50743-5_12
DO - 10.1007/978-3-030-50743-5_12
M3 - Conference contribution
AN - SCOPUS:85087032468
SN - 9783030507428
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 230
EP - 248
BT - High Performance Computing - 35th International Conference, ISC High Performance 2020, Proceedings
A2 - Sadayappan, Ponnuswamy
A2 - Chamberlain, Bradford L.
A2 - Juckeland, Guido
A2 - Ltaief, Hatem
PB - Springer
T2 - 35th International Conference on High Performance Computing, ISC High Performance 2020
Y2 - 22 June 2020 through 25 June 2020
ER -