Publications

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Exploring the effect of noise on the performance benefit of non-blocking MPI_Allreduce

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

System Software Resilience

Ferreira, Kurt; Levy, Scott L.; Widener, Patrick W.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Understanding the Effects of Communication and Coordination on Checkpointing at Scale

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI DOI

Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

FlipSphere: A Software-based DRAM Error Detection and Correction Library for HPC

Levy, Scott L.; Ferreira, Kurt; Widener, Patrick W.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Using simulation to evaluate the performance of resilience strategies and process failures

Fault-tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging. Current systems are smaller and have significantly different architectural features (e.g., interconnect, persistent storage) than we expect to see in next-generation systems. To overcome these challenges, we propose the use of simulation. Simulation has been shown to be an effective tool for investigating performance characteristics of applications on future systems. In this work, we: identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and evaluate the predictive performance of our modified simulator. We also describe how we were able to optimize the simulator for large temporal and spatial scales-allowing the simulator to run 4x faster and use over 100x less memory.

More Details

TYPE SAND Report YEAR 2014

OSTI DOI

Asking the right questions: Benchmarking fault-tolerant extreme-scale systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick W.; Ferreira, Kurt; Levy, Scott; Bridges, Patrick G.; Arnold, Dorian; Brightwell, Ronald B.

Much recent research has explored fault-tolerance mechanisms intended for current and future extreme-scale systems. Evaluations of the suitability of checkpoint-based solutions have typically been carried out using relatively uncomplicated computational kernels designed to measure floating point performance. More recent investigations have added scaled-down "proxy" applications to more closely match the composition and behavior of deployed ones. However, the information obtained from these studies (whether floating point performance or application runtime) is not necessarily of the most value in evaluating resilience strategies. We observe that even when using a more sophisticated metric, the information available from evaluating uncoordinated checkpointing using both microbenchmarks and proxy applications does not agree. This implies that not only might researchers be asking the wrong questions, but that the answers to the right ones might be unexpected and potentially misleading. We seek to open a discussion on whether benchmarks designed to provide predictable performance evaluations of HPC hardware and toolchains are providing the right feedback for the evaluation of fault-tolerance in these applications, and more generally on how benchmarking of resilience mechanisms ought to be approached in the exascale design space. © 2014 Springer-Verlag Berlin Heidelberg.

More Details

TYPE Conference YEAR 2014

Scopus OSTI

Extra Bits on SRAM and DRAM Errors - More Data from the Field

Ferreira, Kurt; Stearley, Jon S.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Understanding the Effects of Communication on Uncoordinated Checkpointing at Scale

Ferreira, Kurt; Widener, Patrick W.; Levy, Scott L.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

HPC Event Log Analysis: Method feasibility for event correlation and prediction

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Checkpoint Compression: Its Limitations and Comparisons with other Optimizations

Levy, Scott L.; Ferreira, Kurt

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Predicting the Impact of Failure Avoidance on Checkpoint/Restart in Extreme-Scale Systems

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Predicting Coordinated and Uncoordinated Checkpoint/Restart Protocol Performance at Extreme Scales

Levy, Scott L.; Ferreira, Kurt; Widener, Patrick W.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI DOI

Event Correlation and Failure Prediction in HPC clusters

Grant, Ryan E.; Ferreira, Kurt

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Evaluating Energy Savings for Checkpoint/Restart

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Energy Consumption of Resilience Mechanisms in Large Scale Systems

Mills, Bryan M.; Ferreira, Kurt; Grant, Ryan E.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

A Holistic Approach to Modeling and Simulation for Resilience and Power Configuration

Ferreira, Kurt; Levy, Scott L.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Investigating an API for resilient exascale computing

Stearley, Jon S.; Vandyke, John P.; Ferreira, Kurt; Laros, James H.

Increased HPC capability comes with increased complexity, part counts, and fault occurrences. In- creasing the resilience of systems and applications to faults is a critical requirement facing the viability of exascale systems, as the overhead of traditional checkpoint/restart is projected to outweigh its bene ts due to fault rates outpacing I/O bandwidths. As faults occur and propagate throughout hardware and software layers, pervasive noti cation and handling mechanisms are necessary. This report describes an initial investigation of fault types and programming interfaces to mitigate them. Proof-of-concept APIs are presented for the frequent and important cases of memory errors and node failures, and a strategy proposed for lesystem failures. These involve changes to the operating system, runtime, I/O library, and application layers. While a single API for fault handling among hardware and OS and application system-wide remains elusive, the e ort increased our understanding of both the mountainous challenges and the promising trailheads. 3

More Details

TYPE SAND Report YEAR 2013

OSTI DOI

Checkpoint Compression: Its Limits and a Comparison with other Optimizations

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

A Simulation Infrastructure for Examining the Performance of Resilience Strategies at Scale

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Shadow Computing: An Energy-Aware Resiliency Scheme for High Performance Computing

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

A simulation infrastructure for examining the performance of resilience strategies at scale

Fault-tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While there are a number of studies investigating the performance of various resilience mechanisms, these are typically limited to scales orders of magnitude smaller than expected for next-generation systems and simple benchmark problems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating appli- cation performance of OS noise can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we compare the failure-free performance of this simulator against an analytic model to validate its performance and demonstrate its ability to simulate the performance of two popular rollback recovery methods on traces from real

More Details

TYPE SAND Report YEAR 2013

OSTI DOI

Checkpoint Compression - An Application Transparent Performance Optimization

Ferreira, Kurt; Pedretti, Kevin P.; Levy, Scott L.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Protect Yourself: Why Your OS Must Protect Against DRAM Failures

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

A Comparison of Compression and Increment-based Checkpoint Optimizations

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Accelerating Incremental Checkpointing for Extreme-Scale Computing

Proposed for publication in Future Generation Computer Systems.

Levy, Scott L.; Ferreira, Kurt

Abstract not provided.

More Details

TYPE Journal Article YEAR 2013

OSTI

Using Unreliable Virtual Hardware to Inject Errors in Extreme-Scale Systems

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

An examination of content similarity within the memory of HPC applications

Ferreira, Kurt; Thompson, Aidan P.; Trott, Christian R.; Levy, Scott L.

Abstract not provided.

More Details

TYPE SAND Report YEAR 2013

OSTI DOI

A GPU-based Checkpoint Compression Study Size Does Matter -- More Than Speed Anyway

Ferreira, Kurt; Thompson, Aidan P.; Trott, Christian R.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Evaluating the Feasibility of Using Memory Content Similarity to Improve System Resilience

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Addressing Message-log Scalability for Extreme-scale Systems

Topp, Bryan E.; Ferreira, Kurt

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

An Operating System Resilient to DRAM Failures

Ferreira, Kurt; Pedretti, Kevin P.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

The case for extensible operating systems for exascale

Ferreira, Kurt; Fiala, David F.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Exploiting Content Similarity to Improve Memory Performance in Exascale Systems

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-based Fault Tolerance

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

The Viability of Using Compression to Decrease Message Log Sizes

Hoemmen, Mark F.; Ferreira, Kurt; Heroux, Michael A.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Cooperative application/OS DRAM fault recovery

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.

More Details

TYPE SAND Report YEAR 2012

OSTI DOI

Evaluating operating system vulnerability to memory errors

Ferreira, Kurt; Pedretti, Kevin P.; Brightwell, Ronald B.

Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.

More Details

TYPE SAND Report YEAR 2012

OSTI DOI

Checkpoint Compression for Improved Checkpoint/Restart

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Alleviating Scalability Issues of Checkpointing Protocols