Publications Search

Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael W.; Deveci, Mehmet D.; Berry, Jonathan W.; Hammond, Simon D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI DOI

Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael W.; Deveci, Mehmet D.; Berry, Jonathan W.; Hammond, Simon D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI DOI

Performance-portable sparse matrix-matrix multiplication for many-core architectures

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Deveci, Mehmet D.; Trott, Christian R.; Rajamanickam, Sivasankaran R.

We consider the problem of writing performance portablesparse matrix-sparse matrix multiplication (SPGEMM) kernelfor many-core architectures. We approach the SPGEMMkernel from the perspectives of algorithm design and implementation, and its practical usage. First, we design ahierarchical, memory-efficient SPGEMM algorithm. We thendesign and implement thread scalable data structures thatenable us to develop a portable SPGEMM implementation. We show that the method achieves performance portabilityon massively threaded architectures, namely Intel's KnightsLanding processors (KNLs) and NVIDIA's Graphic ProcessingUnits (GPUs), by comparing its performance to specializedimplementations. Second, we study an important aspectof SPGEMM's usage in practice by reusing the structure ofinput matrices, and show speedups up to 3× compared to thebest specialized implementation on KNLs. We demonstratethat the portable method outperforms 4 native methods on2 different GPU architectures (up to 17× speedup), and it ishighly thread scalable on KNLs, in which it obtains 101× speedup on 256 threads.

More Details

TYPE Conference Poster YEAR 2017

Scopus OSTI DOI

Partitioning Trillion-Edge Graphs in Minutes

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017

Slota, George M.; Rajamanickam, Sivasankaran R.; Devine, Karen D.; Madduri, Kamesh

We introduce XtraPuLP, a new distributed-memory graph partitioner designed to process trillion-edge graphs. XtraPuLP is based on the scalable label propagation community detection technique, which has been demonstrated as a viable means to produce high quality partitions with minimal computation time. On a collection of large sparse graphs, we show that XtraPuLP partitioning quality is comparable to state-of-the-art partitioning methods. We also demonstrate that XtraPuLP can produce partitions of real-world graphs with billion+ vertices in minutes. Further, we show that using XtraPuLP partitions for distributed-memory graph analytics leads to significant end-to-end execution time reduction.

More Details

TYPE Conference Poster YEAR 2017

Scopus OSTI DOI

Order or shuffle: Empirically evaluating vertex order impact on parallel graph computations

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Slota, George M.; Rajamanickam, Sivasankaran R.; Madduri, Kamesh

The in-memory graph layout affects performance of distributed-memory graph computations. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, and reordering vertex and edge identifiers. In this work, we consider one-dimensional graph layouts, where disjoint sets of vertices and their adjacencies are partitioned among processors. Using the PuLP graph partitioning method and a breadth-first search (BFS)-based vertex ordering strategy, we empirically evaluate the impact of this graph layout on a collection of five distributed-memory graph computations. Our evaluation considers several objective metrics in addition to execution time, and we observe a considerable performance improvement over randomization.

More Details

TYPE Conference Poster YEAR 2017

Scopus OSTI DOI

Enabling Low Mach Fluid Simulations Using Trilinos

Hu, Jonathan J.; Devine, Karen D.; Hoemmen, Mark F.; Lin, Paul L.; Rajamanickam, Sivasankaran R.; Roberts, Nathan V.; Siefert, Christopher S.; Trott, Christian R.; Prokopenko, Andrey P.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

A Hierarchical Low-rank Solver for Sparse Linear Systems

Boman, Erik G.; Chen, Chao C.; Darve, Eric D.; Pouransari, Hadi P.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

ShyLU: A Collection of Node-Scalable Sparse Linear Solvers

Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Kim, Kyungjoo K.; Boman, Erik G.; Deveci, Mehmet D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

KokkosKernels: Compact Layouts for Batched Blas and Sparse Matrix-Matrix multiply

Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Kim, Kyungjoo K.; Deveci, Mehmet D.; Trott, Christian R.; Hammond, Simon D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Hierarchical Matrices and Low-Rank Methods for Extreme-Scale Solvers

Boman, Erik G.; Chen, Chao C.; Darve, Eric D.; Rajamanickam, Sivasankaran R.; Tuminaro, Raymond S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Embedded ensemble propagation for improving performance, portability, and scalability of uncertainty quantification on emerging computational architectures

SIAM Journal on Scientific Computing

Phipps, Eric T.; D'Elia, Marta D.; Edwards, Harold C.; Hoemmen, M.; Hu, J.; Rajamanickam, Sivasankaran R.

Quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar, and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability, and scalability for the approach applied to the simulation of partial differential equations on a variety of multicore and manycore architectures, including up to 16,384 cores on a Cray XK7 (Titan).

More Details

TYPE Journal Article YEAR 2017

Scopus OSTI DOI

Sparse Matrix-matrix multiplication for modern manycore architecture

Deveci, Mehmet D.; Rajamanickam, Sivasankaran R.; Boman, Erik G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

FastILU: Finegrained ASynchronous iterative ILU

Boman, Erik G.; Patel, Aftab P.; Chow, Edmond C.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Performance Portable Sparse Matrix-Matrix Multiplication on Intel Knights Landing and NVIDIA GPUs

Rajamanickam, Sivasankaran R.; Deveci, Mehmet D.; Trott, Christian R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Distributing linear systems for parallel computation

Devine, Karen D.; Boman, Erik G.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Kokkos Task API: A Use Case in Tacho

Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Edwards, Harold C.; Olivier, Stephen L.; Stelle, George

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Scalable Incomplete Factorization Utilizing Combinatorial Methods to Reduce Overheads

Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Trilinos NGP Planning

Rajamanickam, Sivasankaran R.; Devine, Karen D.; Hu, Jonathan J.; Hoemmen, Mark F.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Sparse Matrix-matrix multiplication for modern manycore architecture

Deveci, Mehmet D.; Rajamanickam, Sivasankaran R.; Boman, Erik G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

KokkosKernels Introduction: Design API and Performance

Deveci, Mehmet D.; Rajamanickam, Sivasankaran R.; Kim, Kyungjoo K.; Bradley, Andrew M.; Trott, Christian R.; Hoemmen, Mark F.; Boman, Erik G.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Tacho: Two-level Task Parallel Cholesky Factorization

Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Edwards, Harold C.; Dohrmann, Clark R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Hierarchical Task-Data Parallelism using Kokkos and Qthreads

Edwards, Harold C.; Olivier, Stephen L.; Berry, Jonathan W.; Mackey, Greg; Rajamanickam, Sivasankaran R.; Wolf, Michael W.; Kim, Kyungjoo K.; Stelle, George

This report describes a new capability for hierarchical task-data parallelism using Sandia's Kokkos and Qthreads, and evaluation of this capability with sparse matrix Cholesky factor- ization and social network triangle enumeration mini-applications. Hierarchical task-data parallelism consists of a collection of tasks with executes-after dependences where each task contains data parallel operations performed on a team of hardware threads. The collection of tasks and dependences form a directed acyclic graph of tasks - a task DAG . Major chal- lenges of this research and development effort include: portability and performance across multicore CPU; manycore Intel Xeon Phi, and NVIDIA GPU architectures; scalability with respect to hardware concurrency and size of the task DAG; and usability of the application programmer interface (API).

More Details

TYPE SAND Report YEAR 2016

OSTI DOI

Parallel Graph Coloring for Manycore Architectures

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Deveci, Mehmet D.; Boman, Erik G.; Devine, Karen D.; Rajamanickam, Sivasankaran R.

Graph algorithms are challenging to parallelize on manycore architectures due to complex data dependencies and irregular memory access. We consider the well studied problem of coloring the vertices of a graph. In many applications it is important to compute a coloring with few colors in near-lineartime. In parallel, the optimistic (speculative) coloring method by Gebremedhin and Manne is the preferred approach but it needs to be modified for manycore architectures. We discuss a range of implementation issues for this vertex-based optimistic approach. We also propose a novel edge-based optimistic approach that has more parallelism and is better suited to GPUs. We study the performance empirically on two architectures(Xeon Phi and GPU) and across many data sets (from finite element problems to social networks). Our implementation uses the Kokkos library, so it is portable across platforms. We show that on GPUs, we significantly reduce the number of colors (geometric mean 4X, but up to 48X) as compared to the widely used cuSPARSE library. In addition, our edge-based algorithm is 1.5 times faster on average than cuSPARSE, where it hasspeedups up to 139X on a circuit problem. We also show the effect of the coloring on a conjugate gradient solver using multi-colored Symmetric Gauss-Seidel method as preconditioner, the higher coloring quality found by the proposed methods reduces the overall solve time up to 33% compared to cuSPARSE.

More Details

TYPE Conference Poster YEAR 2016

Scopus OSTI

Basker: A threaded sparse LU factorization utilizing hierarchical parallelism and data layouts

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Booth, Joshua D.; Rajamanickam, Sivasankaran R.; Thornquist, Heidi K.

Scalable sparse LU factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse LU factorization. As architectures evolve, there exists a need for algorithms that are hierarchical in nature to match the hierarchy in thread teams, individual threads, and vector level parallelism. Basker is designed to map well to this hierarchy in architectures. There is also a need for data layouts to match multiple levels of hierarchy in memory. Basker uses a two-dimensional hierarchical structure of sparse matrices that maps to the hierarchy in the memory architectures and to the hierarchy in parallelism. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulations. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to KLU. Basker outperforms Intel MKL Pardiso (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.

More Details

TYPE Conference Poster YEAR 2016

Scopus OSTI

A comparison of high-level programming choices for incomplete sparse factorization across different architectures

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Booth, Joshua D.; Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.

All many-core systems require fine-grained shared memory parallelism, however the most efficient way to extract such parallelism is far from trivial. Fine-grained parallel algorithms face various performance trade-offs related to tasking, accesses to global data-structures, and use of shared cache. While programming models provide high level abstractions, such as data and task parallelism, algorithmic choices still remain open on how to best implement irregular algorithms, such as sparse factorizations, while taking into account the trade-offs mentioned above. In this paper, we compare these performance trade-offs for task and data parallelism on different hardware architectures such as Intel Sandy Bridge, Intel Xeon Phi, and IBM Power8. We do this by comparing the scaling of a new task-parallel incomplete sparse Cholesky factorization called Tacho and a new data-parallel incomplete sparse LU factorization called Basker. Both solvers utilize Kokkos programming model and were developed within the ShyLU package of Trilinos. Using these two codes we demonstrate how high-level programming changes affect performance and overhead costs on multiple multi/many-core systems. We find that Kokkos is able to provide comparable performance with both parallel-for and task/futures on traditional x86 multicores. However, the choice of which high-level abstraction to use on many-core systems depends on both the architectures and input matrices.

More Details

TYPE Conference Poster YEAR 2016

Scopus OSTI DOI

Sparse Matrix-Matrix Multiplication for Modern Architectures

Deveci, Mehmet D.; Boman, Erik G.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Robust Solvers for Circuit Simulation on Modern Architectures

Rajamanickam, Sivasankaran R.; Booth, Joshua D.; Thornquist, Heidi K.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Partitioning and Task Placement with Zoltan2

Deveci, Mehmet D.; Devine, Karen D.; Boman, Erik G.; Leung, Vitus J.; Rajamanickam, Sivasankaran R.; Taylor, Mark A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Task and Data Parallelism Based Direct Solvers and Preconditioners in Manycore Architecture: Efforts in Trilinos/ShyLU

Booth, Joshua D.; Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Boman, Erik G.; Kim, Kyungjoo K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Parallel Preconditioners and Solvers for Modern Architectures

Boman, Erik G.; Deveci, Mehmet D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Embedded Ensemble Propagation for Improving Performance Portability and Scalability of Uncertainty Quantification on Emerging Computational Architectures

Phipps, Eric T.; D'Elia, Marta D.; Edwards, Harold C.; Hoemmen, Mark F.; Hu, Jonathan J.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Performance Portability for Linear Algebra with Kokkos

Trott, Christian R.; Edwards, Harold C.; Ellingwood, Nathan D.; Hammond, Simon D.; Deveci, Mehmet D.; Boman, Erik G.; Bradley, Andrew M.; Hoemmen, Mark F.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Multi-Jagged: A Scalable Parallel Spatial Partitioning Algorithm

IEEE Transactions on Parallel and Distributed Systems

Deveci, Mehmet; Rajamanickam, Sivasankaran R.; Devine, Karen D.; Catalyurek, Umit V.

Geometric partitioning is fast and effective for load-balancing dynamic applications, particularly those requiring geometric locality of data (particle methods, crash simulations). We present, to our knowledge, the first parallel implementation of a multidimensional-jagged geometric partitioner. In contrast to the traditional recursive coordinate bisection algorithm (RCB), which recursively bisects subdomains perpendicular to their longest dimension until the desired number of parts is obtained, our algorithm does recursive multi-section with a given number of parts in each dimension. By computing multiple cut lines concurrently and intelligently deciding when to migrate data while computing the partition, we minimize data movement compared to efficient implementations of recursive bisection. We demonstrate the algorithm's scalability and quality relative to the RCB implementation in Zoltan on both real and synthetic datasets. Our experiments show that the proposed algorithm performs and scales better than RCB in terms of run-time without degrading the load balance. Our implementation partitions 24 billion points into 65,536 parts within a few seconds and exhibits near perfect weak scaling up to 6K cores.

More Details

TYPE Journal Article YEAR 2016

Scopus OSTI DOI

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

Booth, Joshua D.; Rajamanickam, Sivasankaran R.; Thornquist, Heidi K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI DOI

Parallel Graph Coloring for Manycore Architectures

Deveci, Mehmet D.; Boman, Erik G.; Devine, Karen D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI DOI

Dungeon Session Application: BDDC Solver Library

Hammond, Simon D.; Dohrmann, Clark R.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Stelle, George; Edwards, Harold C.; Olivier, Stephen L.

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-byblocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms i.e., Kokkos. A performance evaluation is presented on both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Choleskyby- blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.

More Details

TYPE Other Report YEAR 2016

OSTI DOI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran R.; Yamazaki, I.Y.; Boman, Erik G.; Prokopenko, Andrey V.; Heroux, Michael A.; Dongarra, J.D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Communication-Avoiding Preconditioners for s-step Krylov Methods

Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Basker: A Scalable Sparse Direct Linear Solver for Many-Core Architectures

Booth, Joshua D.; Rajamanickam, Sivasankaran R.; Boman, Erik G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

Booth, Joshua D.; Rajamanickam, Sivasankaran R.; Thornquist, Heidi K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI DOI

Task-parallel Sparse Incomplete Cholesky Factorization using Kokkos Portable APIs

Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Edwards, Harold C.; Olivier, Stephen L.; Stelle, George

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

ShyLU and Thread Scalable Subdomain Solvers

Rajamanickam, Sivasankaran R.; Boman, Erik G.; Bradley, Andrew M.; Booth, Joshua D.; Deveci, Mehmet D.; Kim, Kyungjoo K.; Dohrmann, Clark R.; Thornquist, Heidi K.; Chow, Edmond C.; Patel, Aftab P.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

ShyLU: On node Solvers and Kokkos-Kernels

Rajamanickam, Sivasankaran R.; Boman, Erik G.; Bradley, Andrew M.; Booth, Joshua D.; Kim, Kyungjoo K.; Deveci, Mehmet D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran R.; Yamazaki, Ichitaro Y.; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stan T.; Dongarra, Jack D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Architecture-aware Task Placement

Deveci, Mehmet D.; Devine, Karen D.; Leung, Vitus J.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

WebGraphAnalysisontheBlueWaters Supercomputer

Slota, George M.; Rajamanickam, Sivasankaran R.; Madduri, Kamesh M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Irregular Graph Algorithms on Parallel Processing Systems

Slota, George M.; Rajamanickam, Sivasankaran R.; Madduri, Kamesh M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

High-Performance Graph Analytics on Manycore Processors

Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015

Slota, George M.; Rajamanickam, Sivasankaran R.; Madduri, Kamesh

The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performance on each of these architectures, while writing portable code at the same time. We focus on this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performance-portable graph algorithm design on manicure architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manicure library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25× faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI DOI

High-Performance Computing for Extreme-Scale Data Analytics

Boman, Erik G.; Madduri, Kamesh M.; Rajamanickam, Sivasankaran R.; Wolf, Michael W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Distributing Linear Systems for Parallel Computation

Devine, Karen D.; Boman, Erik G.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Supercomputing for Web Graph Analytics

Slota, George M.; Rajamanickam, Sivasankaran R.; Madduri, Kamesh M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran R.; Yamazaki, Ichitaro Y.; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stanimire T.; Dongarra, Jack D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Exploring Embedded Uncertainty Quantification Methods on Next-Generation Computer Architectures

Phipps, Eric T.; D'Elia, Marta D.; Hu, Jonathan J.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Embedded Sampling?Based Uncertainty Quantification Approaches for Emerging Computer Architectures

D'Elia, Marta D.; Phipps, Eric T.; Edwards, Harold C.; Hu, Jonathan J.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

The Zoltan2 Toolkit: Partitioning Task Placement Coloring and Ordering

Devine, Karen D.; Boman, Erik G.; Rajamanickam, Sivasankaran R.; Leung, Vitus J.; Riesen, Lee A.; Deveci, Mehmet D.; Catalyurek, Umit V.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Parallel Graph Coloring

Boman, Erik G.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

PuLP: Scalable multi-objective multi-constraint partitioning for small-world networks

Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

Slota, George M.; Madduri, Kamesh; Rajamanickam, Sivasankaran R.

We present PuLP, a parallel and memory-efficient graph partitioning method specifically designed to partition low-diameter networks with skewed degree distributions. Graph partitioning is an important Big Data problem because it impacts the execution time and energy efficiency of graph analytics on distributed-memory platforms. Partitioning determines the in-memory layout of a graph, which affects locality, intertask load balance, communication time, and overall memory utilization of graph analytics. A novel feature of our method PuLP (Partitioning using Label Propagation) is that it optimizes for multiple objective metrics simultaneously, while satisfying multiple partitioning constraints. Using our method, we are able to partition a web crawl with billions of edges on a single compute server in under a minute. For a collection of test graphs, we show that PuLP uses 8-39× less memory than state-of-the-art partitioners and is up to 14.5× faster, on average, than alternate approaches (with 16-way parallelism). We also achieve better partitioning quality results for the multi-objective scenario.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI

Distributed Graph Layout for Scalable Small-world Network Analysis

Rajamanickam, Sivasankaran R.; Slota, George M.; Madduri, Kamesh M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

A hybrid approach for parallel transistor-level full-chip circuit simulation

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Thornquist, Heidi K.; Rajamanickam, Sivasankaran R.

The computer-aided design (CAD) applications that are fundamental to the electronic design automation industry need to harness the available hardware resources to be able to perform full-chip simulation for modern technology nodes (45nm and below). We will present a hybrid (MPI+threads) approach for parallel transistor-level transient circuit simulation that achieves scalable performance for some challenging large-scale integrated circuits. This approach focuses on the computationally expensive part of the simulator: the linear system solve. Hybrid versions of two iterative linear solver strategies are presented, one takes advantage of block triangular form structure while the other uses a Schur complement technique. Results indicate up to a 27x improvement in total simulation time on 256 cores.

More Details

TYPE Conference YEAR 2015

Scopus OSTI DOI

Towards extreme-scale simulations for low mach fluids with second-generation trilinos

Parallel Processing Letters

Lin, Paul L.; Bettencourt, Matthew T.; Domino, Stefan P.; Fisher, Travis C.; Hoemmen, Mark F.; Hu, Jonathan J.; Phipps, Eric T.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran R.; Siefert, Christopher S.; Kennon, Stephen

Trilinos is an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. While Trilinos was originally designed for scalable solutions of large problems, the fidelity needed by many simulations is significantly greater than what one could have envisioned two decades ago. When problem sizes exceed a billion elements even scalable applications and solver stacks require a complete revision. The second-generation Trilinos employs C++ templates in order to solve arbitrarily large problems. We present a case study of the integration of Trilinos with a low Mach fluids engineering application (SIERRA low Mach module/Nalu). Through the use of improved algorithms and better software engineering practices, we demonstrate good weak scaling for up to a nine billion element large eddy simulation (LES) problem on unstructured meshes with a 27 billion row matrix on 524,288 cores of an IBM Blue Gene/Q platform.

More Details

TYPE Journal Article YEAR 2014

Scopus OSTI DOI

2D Partitioning for Scalable Matrix Computations on Scale-Free Graphs

Boman, Erik G.; Devine, Karen D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks

Slota, George M.; Madduri, Kamesh M.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

OSTI DOI

War Stories : Graph Algorithms in GPUs

Rajamanickam, Sivasankaran R.; Slota, George M.; Madduri, Kamesh M.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Using architecture information and real-time resource state to reduce power consumption and communication costs in parallel applications

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Pedretti, Kevin P.; Rajamanickam, Sivasankaran R.; Bunde, David P.; Deveci, Mehmet D.; Catalyurek, Umit V.

As computer systems grow in both size and complexity, the need for applications and run-time systems to adjust to their dynamic environment also grows. The goal of the RAAMP LDRD was to combine static architecture information and real-time system state with algorithms to conserve power, reduce communication costs, and avoid network contention. We devel- oped new data collection and aggregation tools to extract static hardware information (e.g., node/core hierarchy, network routing) as well as real-time performance data (e.g., CPU uti- lization, power consumption, memory bandwidth saturation, percentage of used bandwidth, number of network stalls). We created application interfaces that allowed this data to be used easily by algorithms. Finally, we demonstrated the benefit of integrating system and application information for two use cases. The first used real-time power consumption and memory bandwidth saturation data to throttle concurrency to save power without increasing application execution time. The second used static or real-time network traffic information to reduce or avoid network congestion by remapping MPI tasks to allocated processors. Results from our work are summarized in this report; more details are available in our publications [2, 6, 14, 16, 22, 29, 38, 44, 51, 54].

More Details

TYPE SAND Report YEAR 2014

OSTI DOI

Kokkos a Manycore DevicePerformance Portability Libraryfor C++ HPC Applications

Rajamanickam, Sivasankaran R.; Edwards, Harold C.; Trott, Christian R.; Sunderland, Daniel S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Computer Science Research Institute (CSRI) Summer Proceedings 2013

Rajamanickam, Sivasankaran R.; Parks, Michael L.; Collis, Samuel S.

The Computer Science Research Institute (CSRI) brings university faculty and students to Sandia National Laboratories for focused collaborative research on computer science, computational science, and mathematics problems that are critical to the mission of the laboratories, the Department of Energy, and the United States. The CSRI provides a mechanism by which university researchers learn about and impact national— and global—scale problems while simultaneously bringing new ideas from the academic research community to bear on these important problems. A key component of CSRI programs over the last decade has been an active and productive summer program where students from around the country conduct internships at CSRI. Each student is paired with a Sandia staff member who serves as technical advisor and mentor. The goals of the summer program are to expose the students to research in mathematical and computer sciences at Sandia and to conduct a meaningful and impactful summer research project with their Sandia mentor. Every effort is made to align summer projects with the student's research objectives and all work is coordinated with the ongoing research activities of the Sandia mentor in alignment with Sandia technical thrusts. For the 2013 CSRI Proceedings, research articles have been organized into the following broad technical focus areas — Computational Mathematics and Algorithms, Combinatorial Algorithms and Visualization, Advanced Architectures and Systems Software, Computational Applications — which are well aligned with Sandia's strategic thrusts in computer and information sciences.

More Details

TYPE Other Report YEAR 2014

OSTI DOI

FASTMath Partitioning and Task Placement

Devine, Karen D.; Diamond, Gerrett D.; Ibanez, Dan I.; Leung, Vitus J.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran R.; Shephard, Mark S.; smith, cameron s.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Zoltan Three-Slide Overview for ATPESC 2014

Devine, Karen D.; Rajamanickam, Sivasankaran R.; Prokopenko, Andrey V.; Boman, Erik G.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Building Blocks for Graph Based Network Analysis

Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Towards Extreme-scale Simulations with Next-Generation Trilinos: a low Mach application case study

Lin, Paul L.; Siefert, Christopher S.; Cyr, Eric C.; Bettencourt, Matthew T.; Domino, Stefan P.; Fisher, Travis C.; Hoemmen, Mark F.; Hu, Jonathan J.; Phipps, Eric T.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI DOI

Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on Distributed GPUs

Boman, Erik G.; Heroux, Michael A.; Hoemmen, Mark F.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Improving Parallel Performance of Coarse Grids in an Algebraic Multigrid

Prokopenko, Andrey V.; Hu, Jonathan J.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Reducing Coarse Grids Contention in a Parallel Algebraic Multigrid

Prokopenko, Andrey V.; Hu, Jonathan J.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Parallel Strongly Connected Components in Shared Memory Architectures

Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran R.; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stanimire

Krylov subspace projection methods are widely used iterative methods for solving large-scale linear systems of equations. Researchers have demonstrated that communication avoiding (CA) techniques can improve Krylov methods' performance on modern computers, where communication is becoming increasingly expensive compared to arithmetic operations. In this paper, we extend these studies by two major contributions. First, we present our implementation of a CA variant of the Generalized Minimum Residual (GMRES) method, called CAGMRES, for solving no symmetric linear systems of equations on a hybrid CPU/GPU cluster. Our performance results on up to 120 GPUs show that CA-GMRES gives a speedup of up to 2.5x in total solution time over standard GMRES on a hybrid cluster with twelve Intel Xeon CPUs and three Nvidia Fermi GPUs on each node. We then outline a domain decomposition framework to introduce a family of preconditioners that are suitable for CA Krylov methods. Our preconditioners do not incur any additional communication and allow the easy reuse of existing algorithms and software for the sub domain solves. Experimental results on the hybrid CPU/GPU cluster demonstrate that CA-GMRES with preconditioning achieve a speedup of up to 7.4x over CAGMRES without preconditioning, and speedup of up to 1.7x over GMRES with preconditioning in total solution time. These results confirm the potential of our framework to develop a practical and effective preconditioned CA Krylov method.

More Details

TYPE Conference YEAR 2014

Scopus OSTI

Exploiting geometric partitioning in task mapping for parallel computers

Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS

Deveci, Mehmet; Rajamanickam, Sivasankaran R.; Leung, Vitus J.; Pedretti, Kevin P.; Olivier, Stephen L.; Bunde, David P.; Catalyurek, Umit V.; Devine, Karen D.

We present a new method for mapping applications' MPI tasks to cores of a parallel computer such that communication and execution time are reduced. We consider the case of sparse node allocation within a parallel machine, where the nodes assigned to a job are not necessarily located within a contiguous block nor within close proximity to each other in the network. The goal is to assign tasks to cores so that interdependent tasks are performed by 'nearby' cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We show that, for the structured finite difference mini-app Mini Ghost, our mapping method reduced execution time 34% on average on 65,536 cores of a Cray XE6. In a molecular dynamics mini-app, Mini MD, our mapping method reduced communication time by 26% on average on 6144 cores. We also compare our mapping with graph-based mappings from the LibTopoMap library and show that our mappings reduced the communication time on average by 15% in MiniGhost and 10% in MiniMD. © 2014 IEEE.

More Details

TYPE Conference YEAR 2014

Scopus OSTI

BFS and coloring-based parallel algorithms for strongly connected components and related problems

Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS

Slota, George M.; Rajamanickam, Sivasankaran R.; Madduri, Kamesh

Finding the strongly connected components (SCCs) of a directed graph is a fundamental graph-theoretic problem. Tarjan's algorithm is an efficient serial algorithm to find SCCs, but relies on the hard-to-parallelize depth-first search (DFS). We observe that implementations of several parallel SCC detection algorithms show poor parallel performance on modern multicore platforms and large-scale networks. This paper introduces the Multistep method, a new approach that avoids work inefficiencies seen in prior SCC approaches. It does not rely on DFS, but instead uses a combination of breadth-first search (BFS) and a parallel graph coloring routine. We show that the Multistep method scales well on several real-world graphs, with performance fairly independent of topological properties such as the size of the largest SCC and the total number of SCCs. On a 16-core Intel Xeon platform, our algorithm achieves a 20X speedup over the serial approach on a 2 billion edge graph, fully decomposing it in under two seconds. For our collection of test networks, we observe that the Multistep method is 1.92X faster (mean speedup) than the state-of-the-art Hong et al. SCC method. In addition, we modify the Multistep method to find connected and weakly connected components, as well as introduce a novel algorithm for determining articulation vertices of biconnected components. These approaches all utilize the same underlying BFS and coloring routines. © 2014 IEEE.

More Details

TYPE Conference YEAR 2014

Scopus OSTI

Enabling extreme-scale simulations with next-generation Trilinos for Sierra low Mach fluid application code

Lin, Paul L.; Rajamanickam, Sivasankaran R.; Siefert, Christopher S.; Bettencourt, Matthew T.; Cyr, Eric C.; Domino, Stefan P.; Fisher, Travis C.; Hoemmen, Mark F.; Hu, Jonathan J.; Phipps, Eric T.; Prokopenko, Andrey V.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Scalable Matrix Computations on Large Scale-Free Graphs Using 2D Graph Partitioning

Boman, Erik G.; Devine, Karen D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI DOI

Using 2D Matrix Distributions in Trilinos

Devine, Karen D.; Boman, Erik G.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Exploiting Geometric Partitioning in Task Mapping for Parallel Computers

Rajamanickam, Sivasankaran R.; Leung, Vitus J.; Pedretti, Kevin P.; Olivier, Stephen L.; Devine, Karen D.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

The Zoltan Toolkits: Parallel Partitioning Load Balancing Coloring and Ordering

Devine, Karen D.; Boman, Erik G.; Rajamanickam, Sivasankaran R.; Leung, Vitus J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Computing Strongly Connected Components in Modern Architectures

Rajamanickam, Sivasankaran R.; Slota, George M.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Scalable Matrix Computations on Large Scale-Free Graphs Using 2D Graph Partitioning

Boman, Erik G.; Devine, Karen D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI DOI

Multi-jagged: A Scalable Multi-section based Spatial Partitioning Algorithm

Rajamanickam, Sivasankaran R.; Devine, Karen D.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Combinatorial Scientific Computing for Exascale Systems and Applications

Devine, Karen D.; Rajamanickam, Sivasankaran R.; Boman, Erik G.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Scalable Matrix Computations on Large Scale-Free Graphs Using 2D Graph Partitioning

Boman, Erik G.; Devine, Karen D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI DOI

Multithreaded Sparse Kernels for Solution of Sparse Linear Systems

Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Experiences with Xeon Phi

Hammond, Simon D.; Rajamanickam, Sivasankaran R.; Ang, James A.; Barrett, Richard F.; Doerfler, Douglas W.; Heroux, Michael A.; Laros, James H.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Neuron Simulation and Analysis with Xyce

Schiek, Richard S.; Mei, Ting M.; Rajamanickam, Sivasankaran R.; Keiter, Eric R.; Warrender, Christina E.; Aimone, James B.; Thornquist, Heidi K.; Russo, Thomas V.; Verley, Jason V.; Crossno, Patricia J.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Multi-jagged: A Scalable Multi-section based Spatial Partitioning Algorithm

Rajamanickam, Sivasankaran R.; Devine, Karen D.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Multithreaded algorithms for maxmum matching in bipartite graphs

Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012

Azad, Ariful; Halappanavar, Mahantesh; Rajamanickam, Sivasankaran R.; Boman, Erik G.; Khan, Arif; Pothen, Alex

We design, implement, and evaluate algorithms for computing a matching of maximum cardinality in a bipartite graph on multicore and massively multithreaded computers. As computers with larger numbers of slower cores dominate the commodity processor market, the design of multithreaded algorithms to solve large matching problems becomes a necessity. Recent work on serial algorithms for the matching problem has shown that their performance is sensitive to the order in which the vertices are processed for matching. In a multithreaded environment, imposing a serial order in which vertices are considered for matching would lead to loss of concurrency and performance. But this raises the question: Would parallel matching algorithms on multithreaded machines improve performance over a serial algorithm? We answer this question in the affirmative. We report efficient multithreaded implementations of three classes of algorithms based on their manner of searching for augmenting paths: breadth-first-search, depth-first-search, and a combination of both. The Karp-Sipser initialization algorithm is used to make the parallel algorithms practical. We report extensive results and insights using three shared-memory platforms (a 48-core AMD Opteron, a 32-coreIntel Nehalem, and a 128-processor Cray XMT) on a representative set of real-world and synthetic graphs. To the best of our knowledge, this is the first study of augmentation-based parallel algorithms for bipartite cardinality matching that demonstrates good speedups on multithreaded shared memory multiprocessors. © 2012 IEEE.

More Details

TYPE Conference YEAR 2012

Scopus OSTI

ShyLU: A hybrid-hybrid solver for multicore platforms

Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012

Rajamanickam, Sivasankaran R.; Boman, Erik G.; Heroux, Michael A.

With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a "hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative part is based on approximate Schur complements where we compute the approximate Schur complement using a value-based dropping strategy or structure-based probing strategy. Second, the solver uses two levels of parallelism via hybrid programming (MPI+threads). ShyLU is useful both in shared-memory environments and on large parallel computers with distributed memory. In the latter case, it should be used as a sub domain solver. We argue that with the increasing complexity of compute nodes, it is important to exploit multiple levels of parallelism even within a single compute node. We show the robustness of ShyLU against other algebraic preconditioners. ShyLU scales well up to 384 cores for a given problem size. We also study the MPI-only performance of ShyLU against a hybrid implementation and conclude that on present multicore nodes MPI-only implementation is better. However, for future multicore machines (96 or more cores) hybrid/ hierarchical algorithms and implementations are important for sustained performance. © 2012 IEEE.

More Details

TYPE Conference YEAR 2012

Scopus OSTI