Publications
Designing vector-friendly compact BLAS and LAPACK kernels
Kim, Kyungjoo K.; Costa, Timothy B.; Deveci, Mehmet D.; Bradley, Andrew M.; Hammond, Simon D.; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran R.
Many applications, such as PDE based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14x, 45x, and 27x speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and IntelĀ® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.