TYPE Conference Presentation YEAR 2022

Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

Proceedings - IEEE International Conference on Cluster Computing, ICCC

International Journal of High Performance Computing Applications

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Kolla, Hemanth; Mayo, Jackson R.; Teranishi, Keita; Armstrong, Robert C.

Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.

TYPE Conference Paper YEAR 2020

Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Paul, Sri R.; Hayashi, Akihiro; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Grossman, Max; Sarkar, Vivek

Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors where our runtime will manage all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead, by offloading communication to dedicated communication workers and also recover from fail-stop errors transparently, thereby enhancing productivity.

TYPE Conference Presentation YEAR 2020

Hayashi, Akihiro; Paul, Sri R.; Whitlock, Matthew J.; Miles, Jefferyt; Teranishi, Keita; Sarkar, Vivek

Abstract not provided.

TYPE Conference Poster YEAR 2020

Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software

Myers, Jeremy M.; Dunlavy, Daniel M.; Teranishi, Keita; Hollman, David S.

Abstract not provided.

TYPE Conference Poster YEAR 2020

SparTen: Leveraging Kokkos for On-node Parallelism in a Second-Order Method for Fitting Canonical Polyadic Tensor Models to Poisson Data

Teranishi, Keita; Dunlavy, Daniel M.; Myers, Jeremy M.; Barrett, Richard F.

Abstract not provided.

TYPE Conference Poster YEAR 2020

CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows

ACM Transactions on Parallel Computing

Fenix: A Portable Flexible Fault Tolerance Programming Framework for MPI Applications

Teranishi, Keita; Gamell, Marc; Van Der Wijngaart, Rob; Parashar, Manish

Abstract not provided.

TYPE Conference Poster YEAR 2018

Proceedings of the International Conference on Parallel Processing Workshops

Gamell, Marc; Katz, Daniel S.; Teranishi, Keita; Heroux, Michael A.; Van Der Wijngaart, Rob F.; Mattson, Timothy G.; Parashar, Manish

Exascale systems promise the potential for computation at unprecedented scales and resolutions, but achieving exascale by the end of this decade presents significant challenges. A key challenge is due to the very large number of cores and components and the resulting mean time between failures (MTBF) in the order of hours or minutes. Since the typical run times of target scientific applications are longer than this MTBF, fault tolerance techniques will be essential. An important class of failures that must be addressed is process or node failures. While checkpoint/restart (C/R) is currently the most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be unfeasible at exascale when the time to checkpoint exceeds the expected MTBF. This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerate process failures in MPI applications at exascale. The discussed approach leverages User Level Failure Mitigation (ULFM), which is being proposed as an MPI extension to allow applications to create policies for tolerating process failures. Specifically, this paper demonstrates how different implementations of application-driven in-memory checkpoint storage and recovery compare in terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability of the Fenix online global recovery framework on a production system (the Titan Cray XK7 at ORNL) and demonstrate the ability of Fenix to tolerate dynamically injected failures using the execution of four benchmarks and mini-applications with different behaviors.

TYPE Conference Poster YEAR 2016

Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience

International Journal of High Performance Computing Applications

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Fault tolerance in an inner-outer solver: A GVR-enabled case study

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Zheng, Ziming; Chien, Andrew A.; Teranishi, Keita

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.

TYPE Journal Article YEAR 2015

Failure Masking and Local Recovery for Stencil-based Applications at Extreme Scales

Gamell, Marc; Teranishi, Keita; Heroux, Michael A.; Mayo, Jackson R.; Kolla, Hemanth; Chen, Jacqueline H.; Parashar, Manish

Abstract not provided.

TYPE Conference Poster YEAR 2014

Clay, Robert L.; Mayo, Jackson R.; Teranishi, Keita; Slattengren, Nicole L.

Abstract not provided.

TYPE Conference YEAR 2013
