DHARMA: Distributed asyncHronous Adaptive Resilient Management of Applications
Abstract not provided.
Abstract not provided.
Procedia Computer Science
Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, requiring no software architecture changes. We also measure the overhead of adding GVR versioning and show that generally overheads <2% are achieved. We conclude that GVR's interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems.
Abstract not provided.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates.We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.
Abstract not provided.
Proceedings of the International Conference on Dependable Systems and Networks
Extreme-scale computing will bring significant changes to high performance computing system architectures. In particular, the increased number of system components is creating a need for software to demonstrate 'pervasive parallelism' and resiliency. Asynchronous, many-task programming models show promise in addressing both the scalability and resiliency challenges, however, they introduce an enormously challenging distributed, resilient consistency problem. In this work, we explore the viability of resilient collective communication in task scheduling and work stealing and, through simulation with SST/macro, the performance of these collectives on speculative extreme-scale architectures.
ACM International Conference Proceeding Series
The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that provides application developers with the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate the resilient version of MiniFE that achieves a scalable recovery from process failures.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Recovery from process loss during the execution of a distributed memory parallel application is presently achieved by restarting the program, typically from a checkpoint file. Future computer system trends indicate that the size of data to checkpoint, the lack of improvement in parallel file system performance and the increase in process failure rates will lead to situations where checkpoint restart becomes infeasible. In this report we describe and prototype the use of a new application level resilient computing model that manages persistent storage of local state for each process such that, if a process fails, recovery can be performed locally without requiring access to a global checkpoint file. LFLR provides application developers with an ability to recover locally and continue application execution when a process is lost. This report discusses what features are required from the hardware, OS and runtime layers, and what approaches application developers might use in the design of future codes, including a demonstration of LFLR-enabled MiniFE code from the Matenvo mini-application suite.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
2014 IEEE High Performance Extreme Computing Conference, HPEC 2014
The challenge of resilience for High Performance Computing applications is significant for future extreme scale systems. These systems will experience unprecedented rates of faults and errors as they will be constructed from massive numbers of components that are inherently less reliable than those available today. While the use of redundant computing can provide detection and possible correction of errors, its system-wide use in future extreme-scale HPC systems will incur considerable overheads to application performance. In this paper, we present a framework that provides application level fault detection based on redundant multithreading. In previous work, we demonstrated an adaptive approach based on a language level directive. The computation contained in the programmer directive is executed by duplicate threads. In concert with a runtime system, the redundant multithreading is enabled opportunistically to provide fault detection at more reasonable overheads to application performance. The lazy fault detection approach presented in this work seeks to further optimize the use of redundancy by prioritizing the application's primary computation over the fault detection. Our approach relaxes the requirement that the redundant threads synchronize and compare results immediately. We show that lazy error detection is feasible and yields lower time to solution over adaptive RMT for a range of scientific computational kernels. We also explore a thread-to-core assignment strategy that seeks to reduce the interference between the redundant threads.
Abstract not provided.