Publications

Results 1–25 of 144
Skip to search filters

Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Paul, Sri R.; Hayashi, Akihiro; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita T.; Mayo, Jackson M.; Grossman, Max; Sarkar, Vivek

Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors where our runtime will manage all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead, by offloading communication to dedicated communication workers and also recover from fail-stop errors transparently, thereby enhancing productivity.

More Details

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Kolla, Hemanth K.; Mayo, Jackson M.; Teranishi, Keita T.; Armstrong, Robert C.

Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.

More Details

Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Gupta, Nikunj; Mayo, Jackson M.; Lemoine, Adrian S.; Kaiser, Hartmut

Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.

More Details

Physics-Based Checksums for Silent-Error Detection in PDE Solvers

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Salloum, Maher S.; Mayo, Jackson M.; Armstrong, Robert C.

We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to “algorithm-based fault tolerance” checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.

More Details

Log-Correlated Large-Deviation Statistics Governing Huygens Fronts in Turbulence

Journal of Statistical Physics

Mayo, Jackson M.; Kerstein, Alan R.

Analyses have disagreed on whether the velocity uT of bulk advancement of a Huygens front in turbulence vanishes or remains finite in the limit of vanishing local front propagation speed u. Here, a connection to the large-deviation statistics of log-correlated random processes enables a definitive determination of the correct small-u asymptotics. This result reconciles several theoretical and phenomenological perspectives with the conclusion that uT remains finite for vanishing u, which implies a propagation anomaly akin to the energy-dissipation anomaly in the limit of vanishing viscosity. Various leading-order structural properties such as a novel u dependence of a bulk length scale associated with front geometry are predicted in this limit. The analysis involves a formal analogy to random advection of diffusive scalars.

More Details

Enabling Resilience in Asynchronous Many-Task Programming Models

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Paul, Sri R.; Hayashi, Akihiro; Slattengren, Nicole S.; Kolla, Hemanth K.; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita T.; Mayo, Jackson M.; Sarkar, Vivek

Resilience is an imminent issue for next-generation platforms due to projected increases in soft/transient failures as part of the inherent trade-offs among performance, energy, and costs in system design. In this paper, we introduce a comprehensive approach to enabling application-level resilience in Asynchronous Many-Task (AMT) programming models with a focus on remedying Silent Data Corruption (SDC) that can often go undetected by the hardware and OS. Our approach makes it possible for the application programmer to declaratively express resilience attributes with minimal code changes, and to delegate the complexity of efficiently supporting resilience to our runtime system. We have created a prototype implementation of our approach as an extension to the Habanero C/C++ library (HClib), where different resilience techniques including task replay, task replication, algorithm-based fault tolerance (ABFT), and checkpointing are available. Our experimental results show that task replay incurs lower overhead than task replication when an appropriate error checking function is provided. Further, task replay matches the low overhead of ABFT. Our results also demonstrate the ability to combine different resilience schemes. To evaluate the effectiveness of our resilience mechanisms in the presence of errors, we injected synthetic errors at different error rates (1.0%, and 10.0%) and found modest increase in execution times. In summary, the results show that our approach supports efficient and scalable recovery, and that our approach can be used to influence the design of future AMT programming models and runtime systems that aim to integrate first-class support for user-level resilience.

More Details

Digital/Analog Cosimulation using CocoTB and Xyce

Smith, Andrew M.; Mayo, Jackson M.; Armstrong, Robert C.; Schiek, Richard S.; Sholander, Peter E.; Mei, Ting M.

In this article, we describe a prototype cosimulation framework using Xyce, GHDL and CocoTB that can be used to analyze digital hardware designs in out-of-nominal environments. We demonstrate current software methods and inspire future work via analysis of an open-source encryption core design. Note that this article is meant as a proof-of-concept to motivate integration of general cosimulation techniques with Xyce, an open-source circuit simulator. ------------------------------------------------

More Details

ASC CSSE Level 2 Milestone #6362: Resilient Asynchronous Many Task Programming Model

Teranishi, Keita T.; Kolla, Hemanth K.; Slattengren, Nicole S.; Whitlock, Matthew J.; Mayo, Jackson M.; Clay, Robert L.; Paul, Sri R.; Hayashi, Akihiro H.; Sarkar, Vivek S.

This report is an outcome of the ASC CSSE Level 2 Milestone 6362: Analysis of Re- silient Asynchronous Many-Task (AMT) Programming Model. It comprises a summary and in-depth analysis of resilience schemes adapted to the AMT programming model. Herein, performance trade-offs of a resilient-AMT prograrnming model are assessed through two ap- proaches: (1) an analytical model realized by discrete event simulations and (2) empirical evaluation of benchmark programs representing regular and irregular workloads of explicit partial differential equation solvers. As part of this effort, an AMT execution simulator and a prototype resilient-AMT programming framework have been developed. The former permits us to hypothesize the performance behavior of a resilient-AMT model, and has undergone a verification and validation (V&V) process. The latter allows empirical evaluation of the perfor- mance of resilience schemes under emulated program failures and enabled the aforementioned V&V process. The outcome indicates that (1) resilience techniques implemented within an AMT framework allow efficient and scalable recovery under frequent failures, that (2) the abstraction of task and data instances in the AMT programming model enables readily us- able Application Program Interfaces (APIs) for resilience, and that (3) this abstraction enables predicting the performance of resilient-AMT applications with a simple simulation infrastruc- ture. This outcome will provide guidance for the design of the AMT programming model and runtime systems, user-level resilience support, and application development for ASC's next generation platforms (NGPs).

More Details
Results 1–25 of 144
Results 1–25 of 144