Publications

Results 151–175 of 315

Multithreaded sparse matrix-matrix multiplication for many-core and GPU architectures

Parallel Computing

Deveci, Mehmet D.; Trott, Christian R.; Rajamanickam, Sivasankaran R.

Sparse matrix-matrix multiplication is a key kernel with applications in several domains, such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high-performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance differences between these data structures. Furthermore, we develop a meta-algorithm, KKSPGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
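
As a rough illustration of the accumulator idea (a hedged sketch in plain C++, not KKSPGEMM or the Kokkos Kernels code), the following computes one row of C = A*B in CSR form with a dense accumulator; a hashmap-based accumulator would replace the dense array with a small hash table sized to the row's expected nonzero count.

```cpp
// Minimal sketch: one row of C = A * B in CSR format with a dense accumulator.
#include <cstdio>
#include <vector>

struct Csr {                       // compressed sparse row storage
    int nrows, ncols;
    std::vector<int>    rowptr;    // size nrows + 1
    std::vector<int>    colidx;    // column index per nonzero
    std::vector<double> values;    // value per nonzero
};

// Accumulate row i of C = A * B into (cols, vals) using a dense accumulator.
void spgemm_row_dense(const Csr& A, const Csr& B, int i,
                      std::vector<int>& cols, std::vector<double>& vals) {
    std::vector<double> acc(B.ncols, 0.0);   // dense accumulator, one slot per column of B
    std::vector<char>   used(B.ncols, 0);    // marks which columns were touched
    cols.clear(); vals.clear();
    for (int pa = A.rowptr[i]; pa < A.rowptr[i + 1]; ++pa) {
        int    k = A.colidx[pa];
        double a = A.values[pa];
        for (int pb = B.rowptr[k]; pb < B.rowptr[k + 1]; ++pb) {
            int j = B.colidx[pb];
            if (!used[j]) { used[j] = 1; cols.push_back(j); }
            acc[j] += a * B.values[pb];      // scatter-add into the accumulator
        }
    }
    for (int j : cols) vals.push_back(acc[j]);  // gather the compressed row back out
}

int main() {
    // 2x2 example: A = B = [[1, 2], [0, 3]] stored in CSR.
    Csr A{2, 2, {0, 2, 3}, {0, 1, 1}, {1.0, 2.0, 3.0}};
    std::vector<int> cols; std::vector<double> vals;
    spgemm_row_dense(A, A, 0, cols, vals);   // row 0 of A*A is [1, 8]
    for (size_t t = 0; t < cols.size(); ++t)
        std::printf("C(0,%d) = %g\n", cols[t], vals[t]);
}
```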

More Details

Experimental design of work chunking for graph algorithms on high bandwidth memory architectures

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018

Slota, George M.; Rajamanickam, Sivasankaran R.

High Bandwidth Memory (HBM) is an additional memory layer between DDR and cache; it currently exists in the form of Multi-Channel DRAM (MCDRAM) on the Intel Knights Landing (KNL) manycore architecture. Its purpose is to increase available memory bandwidth to maximize processor throughput. This work explores optimizing the label propagation community detection algorithm on the KNL, as this algorithm and its variants find broad usage in community detection. This algorithm's processing pattern also represents a broader class of vertex-centric programs. As HBM becomes more common in new HPC systems, it is important to determine how best to exploit this memory layer for memory-starved graph and combinatorial algorithms. This work experimentally examines breaking up the algorithmic work into HBM-resident chunks, along with a parametric study of associated variations and optimizations. In general, we find that our chunking methodology does not harm solution quality and can improve time to solution for label propagation. We believe these results would likely generalize to other vertex-centric algorithms as well.
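
As a hedged, single-threaded sketch of the chunking idea (not the paper's implementation, and with no actual MCDRAM placement), the loop below processes vertices in fixed-size chunks during a synchronous label propagation sweep; on KNL, the per-chunk working set is what would be staged into HBM, for example via memkind allocation.

```cpp
// Minimal sketch of chunked label propagation on a CSR graph.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Graph {
    int nverts;
    std::vector<int> offsets;   // CSR row pointers, size nverts + 1
    std::vector<int> adjacency; // neighbor lists
};

// One synchronous sweep of label propagation, processed chunk by chunk.
void label_prop_sweep(const Graph& g, std::vector<int>& labels, int chunk_size) {
    std::vector<int> next = labels;
    for (int begin = 0; begin < g.nverts; begin += chunk_size) {
        int end = std::min(begin + chunk_size, g.nverts);
        // In an HBM-aware version, the slice [begin, end) of labels/adjacency
        // would be staged into fast memory here.
        for (int v = begin; v < end; ++v) {
            std::vector<int> seen;                 // neighbor labels of v
            for (int e = g.offsets[v]; e < g.offsets[v + 1]; ++e)
                seen.push_back(labels[g.adjacency[e]]);
            if (seen.empty()) continue;
            std::sort(seen.begin(), seen.end());
            int best = seen[0], best_count = 1, run = 1;  // most frequent neighbor label
            for (size_t i = 1; i < seen.size(); ++i) {
                run = (seen[i] == seen[i - 1]) ? run + 1 : 1;
                if (run > best_count) { best_count = run; best = seen[i]; }
            }
            next[v] = best;
        }
    }
    labels = next;
}

int main() {
    // Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
    Graph g{6, {0, 2, 4, 7, 10, 12, 14},
            {1, 2, 0, 2, 0, 1, 3, 2, 4, 5, 3, 5, 3, 4}};
    std::vector<int> labels = {0, 1, 2, 3, 4, 5};  // each vertex starts in its own community
    for (int it = 0; it < 5; ++it) label_prop_sweep(g, labels, /*chunk_size=*/2);
    for (int v = 0; v < g.nverts; ++v) std::printf("vertex %d -> label %d\n", v, labels[v]);
}
```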

More Details

Tacho: Memory-scalable task parallel sparse Cholesky factorization

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Kim, Kyungjoo K.; Edwards, H.C.; Rajamanickam, Sivasankaran R.

We present a memory-scalable, parallel, sparse multifrontal solver for symmetric positive-definite systems arising in scientific and engineering applications. Factorizing sparse matrices requires memory for both the computed factors and the temporary workspaces used to compute each frontal matrix, a data structure commonly used within multifrontal methods. To factorize multiple frontal matrices in parallel, the conventional approach is to allocate a uniform workspace for each hardware thread. In the manycore era, this results in memory usage that grows in proportion to the number of hardware threads. We remedy this problem by using dynamic task parallelism with a scalable memory pool. Tasks are spawned while traversing the assembly tree and executed once their dependences are satisfied. Temporary workspace for the frontal matrices in each task is allocated from a custom memory pool; if the requested memory is not available, the task is respawned, yielding the hardware thread so that other tasks can execute. The respawned task runs after higher-priority tasks have completed. This approach provides robust parallel performance within a bounded memory space. Experimental results demonstrate the merits of our implementation on Intel multicore and manycore architectures.
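
The respawn mechanism can be illustrated with a small, single-threaded sketch (assumed names, not Tacho's API): tasks draw workspace from a bounded pool, and a task that cannot obtain its workspace is pushed back onto the queue to run in a later scheduling round instead of blocking or over-allocating.

```cpp
// Toy sketch of the respawn-on-allocation-failure idea with a bounded pool.
#include <cstdio>
#include <deque>

class BoundedPool {                 // fixed-capacity workspace pool
    size_t capacity_, in_use_ = 0;
public:
    explicit BoundedPool(size_t capacity) : capacity_(capacity) {}
    bool try_acquire(size_t bytes) {
        if (in_use_ + bytes > capacity_) return false;   // would exceed the bound
        in_use_ += bytes;
        return true;
    }
    void release(size_t bytes) { in_use_ -= bytes; }
};

struct Task {
    int    id;
    size_t workspace_bytes;         // frontal-matrix workspace this task needs
};

int main() {
    BoundedPool pool(1000);                       // total workspace bound: 1000 bytes
    std::deque<Task> queue = {{0, 600}, {1, 600}, {2, 300}};
    std::deque<Task> running;
    while (!queue.empty() || !running.empty()) {
        // Try to launch everything currently queued (one scheduling round).
        size_t n = queue.size();
        for (size_t i = 0; i < n; ++i) {
            Task t = queue.front(); queue.pop_front();
            if (pool.try_acquire(t.workspace_bytes)) {
                std::printf("task %d started\n", t.id);
                running.push_back(t);
            } else {
                std::printf("task %d respawned (pool full)\n", t.id);
                queue.push_back(t);               // respawn: retry in a later round
            }
        }
        // Tasks launched this round finish and return their workspace.
        for (const Task& t : running) {
            std::printf("task %d finished\n", t.id);
            pool.release(t.workspace_bytes);
        }
        running.clear();
    }
}
```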

More Details

A distributed-memory hierarchical solver for general sparse linear systems

Parallel Computing

Chen, Chao; Pouransari, Hadi; Rajamanickam, Sivasankaran R.; Boman, Erik G.; Darve, Eric

We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We present various numerical results to demonstrate the versatility and scalability of the parallel algorithm.
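
The low-rank compression that the solver applies to fill-in blocks can be sketched with a truncated SVD (a hedged example using Eigen, not the paper's code); the truncation tolerance plays the role of the accuracy knob that moves the method between a direct solver and a preconditioner.

```cpp
// Hedged sketch: compress a dense block to low rank by truncating its SVD.
#include <Eigen/Dense>
#include <cstdio>

// Returns the rank kept; on return U is m x r and V is n x r so that
// block is approximately U * V.transpose() to the requested relative tolerance.
int compress_block(const Eigen::MatrixXd& block, double rel_tol,
                   Eigen::MatrixXd& U, Eigen::MatrixXd& V) {
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(block,
        Eigen::ComputeThinU | Eigen::ComputeThinV);
    const Eigen::VectorXd& s = svd.singularValues();
    int r = 0;
    while (r < s.size() && s(r) > rel_tol * s(0)) ++r;   // keep dominant singular values
    U = svd.matrixU().leftCols(r) * s.head(r).asDiagonal();
    V = svd.matrixV().leftCols(r);
    return r;
}

int main() {
    // A 100 x 100 block that is exactly rank 2 plus tiny noise.
    Eigen::MatrixXd low = Eigen::MatrixXd::Random(100, 2) * Eigen::MatrixXd::Random(2, 100);
    Eigen::MatrixXd block = low + 1e-10 * Eigen::MatrixXd::Random(100, 100);
    Eigen::MatrixXd U, V;
    int r = compress_block(block, 1e-6, U, V);
    double err = (block - U * V.transpose()).norm() / block.norm();
    std::printf("kept rank %d, relative error %.2e\n", r, err);
}
```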

More Details

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments

Deveci, Mehmet D.; Hammond, Simon D.; Wolf, Michael W.; Rajamanickam, Sivasankaran R.

Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So-called multi-level memories offer differing characteristics for each memory component, including variation in bandwidth, latency, and capacity. This paper investigates the performance of sparse matrix-matrix multiplication kernels on two leading high-performance computing architectures: Intel's Knights Landing processor and NVIDIA's Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploit the multiple memory spaces in each hardware platform. We evaluate the performance of these methods against standard algorithms that rely on the auto-caching mechanisms. Our results show that standard algorithms that exploit cache reuse performed as well as multi-memory-aware algorithms on architectures such as KNLs, where the memory subsystems have similar latencies. However, for architectures such as GPUs, where memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require larger capacities than the fastest memory of each compute node without depending on the software-managed cache mechanisms.
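
A hedged sketch of the chunking idea (assumed sizes and helper names, not the paper's implementation): rows of the output matrix are grouped into chunks whose estimated working set fits a fast-memory budget, so that each chunk's accumulators and partial result can live in MCDRAM or GPU memory while the full matrices stay in the larger, slower space.

```cpp
// Toy sketch: partition output rows into chunks under a fast-memory budget.
#include <cstdio>
#include <vector>

// row_bytes[i] is an estimate of the memory needed to form row i of C
// (e.g. accumulator plus output nonzeros). Returns the first row of each chunk.
std::vector<int> chunk_rows(const std::vector<size_t>& row_bytes, size_t budget) {
    std::vector<int> chunk_starts = {0};
    size_t used = 0;
    for (size_t i = 0; i < row_bytes.size(); ++i) {
        if (used + row_bytes[i] > budget && used > 0) {  // close the current chunk
            chunk_starts.push_back(static_cast<int>(i));
            used = 0;
        }
        used += row_bytes[i];       // a row larger than the budget gets a chunk of its own
    }
    chunk_starts.push_back(static_cast<int>(row_bytes.size()));
    return chunk_starts;
}

int main() {
    std::vector<size_t> row_bytes = {400, 300, 500, 100, 900, 200};
    std::vector<int> starts = chunk_rows(row_bytes, /*budget=*/1000);
    for (size_t c = 0; c + 1 < starts.size(); ++c)
        std::printf("chunk %zu: rows [%d, %d)\n", c, starts[c], starts[c + 1]);
    // Each chunk of C would then be computed with its accumulators and partial
    // output placed in the fast memory space before being flushed to DDR.
}
```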

More Details