Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers

Haidar, Azzam and Tomov, Stanimire and Dongarra, Jack and Higham, Nicholas J. Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers. In: International Conference on Supercomputing, New York, NY, USA, 2018. (In Press)

Full text: International ACM Conference.pdf - Accepted Version (1MB)

Abstract

Low-precision floating-point arithmetic is a powerful tool for accelerating scientific computing applications, especially those in artificial intelligence. Here, we present an investigation showing that other high-performance computing (HPC) applications can also harness this power. Specifically, we consider the general HPC problem of solving $Ax = b$, where $A$ is a large dense matrix and a double-precision (FP64) solution is needed for accuracy. Our approach is based on mixed-precision (FP16 $\rightarrow$ FP64) iterative refinement, and we generalize and extend prior advances into a framework, for which we develop architecture-specific algorithms and highly tuned implementations. These new methods show that using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to a 4x speedup. This is due both to the performance boost that the FP16-TC provide and to their improved accuracy over classical FP16 arithmetic, which comes from the GEMM accumulation being performed in FP32.
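To make the FP16 $\rightarrow$ FP64 refinement idea in the abstract concrete, below is a minimal NumPy/SciPy sketch of the classical mixed-precision iterative-refinement loop: factorize once at low precision, then repeatedly compute the residual in FP64 and solve for a correction with the cheap low-precision factors. This is an illustration only, not the paper's implementation: the paper runs on GPU Tensor Cores with tuned kernels, whereas here FP16 is emulated by rounding the matrix to float16 and factorizing in float32 (loosely mimicking the FP32 accumulation of the Tensor Core GEMM). The function name mixed_precision_solve, the tolerance, and the test matrix are all illustrative choices.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_solve(A, b, tol=1e-12, max_iter=50):
        """Solve Ax = b to FP64 accuracy from a low-precision LU factorization."""
        A = np.asarray(A, dtype=np.float64)
        b = np.asarray(b, dtype=np.float64)
        # "FP16" factorization, emulated: round A to float16, factor in float32.
        A_lo = A.astype(np.float16).astype(np.float32)
        lu, piv = lu_factor(A_lo)
        # Initial low-precision solve, promoted to FP64.
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(max_iter):
            r = b - A @ x  # residual computed in full FP64 precision
            if (np.linalg.norm(r, np.inf)
                    <= tol * np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)):
                break
            # Correction equation A d = r, solved with the reused cheap factors.
            d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
            x += d
        return x

    # Usage on a well-conditioned random test matrix (illustrative).
    rng = np.random.default_rng(0)
    n = 200
    A = rng.standard_normal((n, n)) + n * np.eye(n)
    b = rng.standard_normal(n)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(A @ x - b, np.inf))  # residual near FP64 roundoff

The design point the abstract makes is visible in the loop: the expensive $O(n^3)$ factorization happens once at low precision, while each refinement step costs only an $O(n^2)$ residual and triangular solve, so the FP64 accuracy is recovered almost for free when the matrix is reasonably conditioned.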

Item Type: Conference or Workshop Item (Paper)
Subjects: MSC 2010, the AMS's Mathematics Subject Classification > 15 Linear and multilinear algebra; matrix theory
MSC 2010, the AMS's Mathematics Subject Classification > 65 Numerical analysis
Depositing User: Nick Higham
Date Deposited: 02 Oct 2018 08:39
Last Modified: 02 Oct 2018 08:39
URI: https://eprints.maths.manchester.ac.uk/id/eprint/2664
