Publications

Results 101–123 of 123

Versioned distributed arrays for resilience in scientific applications: Global View Resilience

Procedia Computer Science

Chien, A.; Balaji, P.; Beckman, P.; Dun, N.; Fang, A.; Fujita, H.; Iskra, K.; Rubenstein, Z.; Zheng, Z.; Schreiber, R.; Hammond, J.; Dinan, J.; Laguna, I.; Richards, D.; Dubey, A.; Van Straalen, B.; Hoemmen, M.; Heroux, Michael A.; Teranishi, Keita T.; Siegel, A.

Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, requiring no software architecture changes. We also measure the overhead of adding GVR versioning and show that generally overheads <2% are achieved. We conclude that GVR's interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems.


Fault tolerance in an inner-outer solver: A GVR-enabled case study

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Zheng, Ziming; Chien, Andrew A.; Teranishi, Keita T.

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.


Extreme-scale viability of collective communication for resilient task scheduling and work stealing

Proceedings of the International Conference on Dependable Systems and Networks

Wilke, Jeremiah J.; Bennett, Janine C.; Kolla, Hemanth K.; Teranishi, Keita T.; Slattengren, Nicole S.; Floren, John F.

Extreme-scale computing will bring significant changes to high performance computing system architectures. In particular, the increased number of system components is creating a need for software to demonstrate 'pervasive parallelism' and resiliency. Asynchronous, many-task programming models show promise in addressing both the scalability and resiliency challenges; however, they introduce an enormously challenging distributed, resilient consistency problem. In this work, we explore the viability of resilient collective communication in task scheduling and work stealing and, through simulation with SST/macro, the performance of these collectives on speculative extreme-scale architectures.


Toward local failure local recovery resilience model using MPI-ULFM

ACM International Conference Proceeding Series

Teranishi, Keita T.; Heroux, Michael A.

The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that provides application developers with the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate the resilient version of MiniFE that achieves a scalable recovery from process failures.


Report for the ASC CSSE L2 Milestone (4873) - Demonstration of Local Failure Local Recovery Resilient Programming Model

Heroux, Michael A.; Teranishi, Keita T.

Recovery from process loss during the execution of a distributed memory parallel application is presently achieved by restarting the program, typically from a checkpoint file. Future computer system trends indicate that the size of data to checkpoint, the lack of improvement in parallel file system performance and the increase in process failure rates will lead to situations where checkpoint restart becomes infeasible. In this report we describe and prototype the use of a new application level resilient computing model that manages persistent storage of local state for each process such that, if a process fails, recovery can be performed locally without requiring access to a global checkpoint file. LFLR provides application developers with an ability to recover locally and continue application execution when a process is lost. This report discusses what features are required from the hardware, OS and runtime layers, and what approaches application developers might use in the design of future codes, including a demonstration of LFLR-enabled MiniFE code from the Mantevo mini-application suite.


An evaluation of lazy fault detection based on Adaptive Redundant Multithreading

2014 IEEE High Performance Extreme Computing Conference, HPEC 2014

Hukerikar, Saurabh H.; Teranishi, Keita T.; Diniz, Pedro C.; Lucas, Robert F.

The challenge of resilience for High Performance Computing applications is significant for future extreme scale systems. These systems will experience unprecedented rates of faults and errors as they will be constructed from massive numbers of components that are inherently less reliable than those available today. While the use of redundant computing can provide detection and possible correction of errors, its system-wide use in future extreme-scale HPC systems will incur considerable overheads to application performance. In this paper, we present a framework that provides application level fault detection based on redundant multithreading. In previous work, we demonstrated an adaptive approach based on a language level directive. The computation contained in the programmer directive is executed by duplicate threads. In concert with a runtime system, the redundant multithreading is enabled opportunistically to provide fault detection at more reasonable overheads to application performance. The lazy fault detection approach presented in this work seeks to further optimize the use of redundancy by prioritizing the application's primary computation over the fault detection. Our approach relaxes the requirement that the redundant threads synchronize and compare results immediately. We show that lazy error detection is feasible and yields lower time to solution over adaptive RMT for a range of scientific computational kernels. We also explore a thread-to-core assignment strategy that seeks to reduce the interference between the redundant threads.
