Publications Search

Sandia Advanced Architecture Testbeds

Ang, James A.; Laros, James H.; Kelly, Suzanne M.; Pedretti, Kevin P.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Early Experiences with Intel MIC Architecture

Ang, James A.; Hammond, Simon D.; Barrett, Richard F.; Levenhagen, Michael J.; Rodrigues, Arun; Pedretti, Kevin P.; Laros, James H.; Kelly, Suzanne M.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Evaluating operating system vulnerability to memory errors

Ferreira, Kurt; Pedretti, Kevin P.; Brightwell, Ronald B.

Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.

TYPE SAND Report YEAR 2012

OSTI DOI

Energy Based Performance Tuning for Large Scale High Performance Computing Systems

Laros, James H.; Pedretti, Kevin P.; Kelly, Suzanne M.; Vaughan, Courtenay T.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Exascale Design Space Exploration and Co-design

Proposed for publication in Future Generation Computer Systems.

Barrett, Richard F.; Trucano, Timothy G.; Doerfler, Douglas W.; Dosanjh, Sudip S.; Hammond, Simon D.; Hemmert, Karl S.; Heroux, Michael A.; Lin, Paul L.; Pedretti, Kevin P.; Rodrigues, Arun

Abstract not provided.

TYPE Journal Article YEAR 2012

OSTI

Evaluating Operating System Vulnerability to Memory Errors

Ferreira, Kurt; Pedretti, Kevin P.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

SST + gem5 = A Scalable Simulation Infrastructure for High Performance Computing

Hsieh, Mingyu H.; Levenhagen, Michael J.; Pedretti, Kevin P.; Rodrigues, Arun

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Enhancements to Red Storm and Catamount to Increase Power Efficiency During Application Execution

Laros, James H.; Pedretti, Kevin P.; Kelly, Suzanne M.; Vaughan, Courtenay T.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels, allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467: Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

TYPE SAND Report YEAR 2012

OSTI DOI

Towards High Performance Computing Application Energy Efficiency

Laros, James H.; Pedretti, Kevin P.; Kelly, Suzanne M.; Vaughan, Courtenay T.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

VM-based Emulation of Future Generation HPC Systems

International Journal of High Performance Computing

Pedretti, Kevin P.

Abstract not provided.

TYPE Journal Article YEAR 2012

OSTI DOI

Investigating the impact of the Cielo Cray XE6 architecture on scientific application codes

IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

Vaughan, Courtenay T.; Rajan, Mahesh R.; Barrett, Richard F.; Doerfler, Douglas W.; Pedretti, Kevin P.

Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus enabling a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo. © 2011 IEEE.

TYPE Conference YEAR 2011

Scopus OSTI

Energy Based Performance Tuning for Large Scale High Performance Computing Systems

Laros, James H.; Pedretti, Kevin P.; Kelly, Suzanne M.; Vaughan, Courtenay T.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Application-Driven Analysis of Two Generations of Capability Computing Platforms: The Transition to Multicore Processors

Concurrency and Computation: Practice and Experience

Rajan, Mahesh R.; Vaughan, Courtenay T.; Doerfler, Douglas W.; Barrett, Richard F.; Lin, Paul L.; Pedretti, Kevin P.; Hemmert, Karl S.

Abstract not provided.

TYPE Journal Article YEAR 2011

OSTI

Keeping checkpoint/restart viable for exascale systems

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.

TYPE SAND Report YEAR 2011

OSTI DOI

Sierra Structural Dynamics Multi-threaded evaluations

Reese, Garth M.; Dohrmann, Clark R.; Williams, Alan B.; Pedretti, Kevin P.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Kitten: A Lightweight Operating System for Ultrascale Supercomputers

Pedretti, Kevin P.

Abstract not provided.

TYPE Presentation YEAR 2011

OSTI

VM-based slack emulation of large-scale systems

Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2011

Bridges, Patrick G.; Arnold, Dorian; Pedretti, Kevin P.

This paper describes the design of a system to enable large-scale testing of new software stacks and prospective high-end computing architectures. The proposed architecture combines system virtualization, time dilation, architectural simulation, and slack simulation to provide scalable emulation of hypothetical systems. We also describe virtualization-based full-system measurement and monitoring tools to aid in using the proposed system for co-design of high-performance computing system software and architectural features for future systems. Finally, we provide a description of the implementation strategy and status of the proposed system. © 2011 ACM.

TYPE Conference YEAR 2011

Scopus OSTI

An Intra-Node Implementation of OpenSHMEM Using Virtual Address Space Mapping

Brightwell, Ronald B.; Pedretti, Kevin P.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

The Impact of Injection Bandwidth Performance on Application Scalability

Pedretti, Kevin P.; Brightwell, Ronald B.; Doerfler, Douglas W.; Hemmert, Karl S.; Laros, James H.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Minimal-overhead virtualization of a large scale supercomputer

ACM SIGPLAN Notices

Lange, John R.; Pedretti, Kevin P.; Dinda, Peter; Bae, Chang; Bridges, Patrick G.; Soltero, Philip; Merritt, Alexander

Virtualization has the potential to dramatically increase the usability and reliability of high performance computing (HPC) systems. However, this potential will remain unrealized unless overheads can be minimized. This is particularly challenging on large scale machines that run carefully crafted HPC OSes supporting tightly coupled, parallel applications. In this paper, we show how careful use of hardware and VMM features enables the virtualization of a large-scale HPC system, specifically a Cray XT4 machine, with ≤5% overhead on key HPC applications, microbenchmarks, and guests at scales of up to 4096 nodes. We describe three techniques essential for achieving such low overhead: passthrough I/O, workload-sensitive selection of paging mechanisms, and carefully controlled preemption. These techniques are forms of symbiotic virtualization, an approach on which we elaborate. Copyright © 2011 ACM.

TYPE Conference YEAR 2011

Scopus OSTI

Application Driven Analysis of Two Generations of Capability Computing Platforms: Purple and Cielo

Rajan, Mahesh R.; Vaughan, Courtenay T.; Barrett, Richard F.; Doerfler, Douglas W.; Lin, Paul L.; Pedretti, Kevin P.; Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

From Red Storm to Cielo: Performance Analysis of ASC Simulation Programs Across an Evolution of Multicore Architectures

Parallel Processing Letters

Barrett, Richard F.; Vaughan, Courtenay T.; Rajan, Mahesh R.; Doerfler, Douglas W.; Pedretti, Kevin P.

Abstract not provided.

TYPE Journal Article YEAR 2011

OSTI

Enhanced Support for PGAS Communication in Portals

Barrett, Brian B.; Brightwell, Ronald B.; Hemmert, Karl S.; Pedretti, Kevin P.; Wheeler, Kyle B.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Combining HPC and Virtual Machines to understand Internet-scale phenomena

Minnich, Ronald G.; Rudish, Don W.; Floren, John F.; Sweeney, Andrew J.; Fritz, David J.; Vanderveen, Keith V.; Pedretti, Kevin P.; Watts, Kristopher K.; Deccio, Casey T.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Copy of Combining HPC and Virtual Machines to understand Internet-scale phenomena

Minnich, Ronald G.; Rudish, Don W.; Floren, John F.; Sweeney, Andrew J.; Fritz, David J.; Vanderveen, Keith V.; Pedretti, Kevin P.; Watts, Kristopher K.; Deccio, Casey T.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Copy of Application-driven Analysis of Two Generations of Capability Computing Platforms: Purple and Cielo

Rajan, Mahesh R.; Vaughan, Courtenay T.; Doerfler, Douglas W.; Lin, Paul L.; Pedretti, Kevin P.; Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Investigating the Impact of the Cielo Cray XT6 Architecture on Scientific Application Codes

Vaughan, Courtenay T.; Rajan, Mahesh R.; Barrett, Richard F.; Doerfler, Douglas W.; Pedretti, Kevin P.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

rMPI : increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, and finally outline the significant increase in an application's time to solution at extreme scale as well as show the scenarios in which redundant computation makes sense.

TYPE SAND Report YEAR 2011

OSTI DOI

A Comparison of the Performance Characteristics of Capability and Capacity Class HPC Systems

Doerfler, Douglas W.; Rajan, Mahesh R.; Epperson, Marcus E.; Vaughan, Courtenay T.; Pedretti, Kevin P.; Barrett, Richard F.; Barrett, Brian B.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Evaluating the Viability of Process Replication Reliability for Exascale Systems

Ferreira, Kurt; Stearley, Jon S.; Laros, James H.; Oldfield, Ron A.; Pedretti, Kevin P.; Brightwell, Ronald B.; Bridges, Patrick G.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Energy Based Performance Tuning for Large Scale High Performance Computing Systems

Pedretti, Kevin P.; Kelly, Suzanne M.; Vaughan, Courtenay T.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Application-driven analysis of two generations of capability computing platforms :

Rajan, Mahesh R.; Vaughan, Courtenay T.; Doerfler, Douglas W.; Lin, Paul L.; Pedretti, Kevin P.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Application-driven Analysis of Two Generations of Capability Computing Platforms: Purple and Cielo

Rajan, Mahesh R.; Vaughan, Courtenay T.; Doerfler, Douglas W.; Lin, Paul L.; Pedretti, Kevin P.; Barrett, Richard F.; Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Redundant computing for exascale systems

Ferreira, Kurt; Stearley, Jon S.; Oldfield, Ron A.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing work that was lost. Redundant computing is a method that allows an application to continue working even when failures occur. Instead of each failure causing an application interrupt, multiple failures can be absorbed by the application until redundancy is exhausted. In this paper we present a method to analyze the benefits of redundant computing, present simulation results of the cost, and compare it to other proposed methods for fault resilience.

TYPE SAND Report YEAR 2010

OSTI DOI

Investigating the impact of the Cielo Cray XE6 architecture on scientific application codes

Vaughan, Courtenay T.; Rajan, Mahesh R.; Barrett, Richard F.; Doerfler, Douglas W.; Pedretti, Kevin P.

Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus enabling a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo.

TYPE Conference YEAR 2010

OSTI

Opportunities for Leveraging OS Virtualization in High-End Supercomputing

Pedretti, Kevin P.; Bridges, Patrick G.

Abstract not provided.

TYPE Presentation YEAR 2010

OSTI

A Scalable Virtualization Environment for HPC

Pedretti, Kevin P.

Abstract not provided.

TYPE Presentation YEAR 2010

OSTI

Opportunities for leveraging OS virtualization in high-end supercomputing

Pedretti, Kevin P.; Bridges, Patrick G.

This paper examines potential motivations for incorporating virtualization support in the system software stacks of high-end capability supercomputers. We advocate that this will increase the flexibility of these platforms significantly and enable new capabilities that are not possible with current fixed software stacks. Our results indicate that compute, virtual memory, and I/O virtualization overheads are low and can be further mitigated by utilizing well-known techniques such as large paging and VMM bypass. Furthermore, since the addition of virtualization support does not affect the performance of applications using the traditional native environment, there is essentially no disadvantage to its addition.

TYPE Conference YEAR 2010

OSTI

LDRD final report : a lightweight operating system for multi-core capability class supercomputers

Pedretti, Kevin P.; Levenhagen, Michael J.; Ferreira, Kurt; Brightwell, Ronald B.; Kelly, Suzanne M.; Bridges, Patrick G.

The two primary objectives of this LDRD project were to create a lightweight kernel (LWK) operating system (OS) designed to take maximum advantage of multi-core processors, and to leverage the virtualization capabilities in modern multi-core processors to create a more flexible and adaptable LWK environment. The most significant technical accomplishments of this project were the development of the Kitten lightweight kernel, the co-development of the SMARTMAP intra-node memory mapping technique, and the development and demonstration of a scalable virtualization environment for HPC. Each of these topics is presented in this report by the inclusion of a published or submitted research paper. The results of this project are being leveraged by several ongoing and new research projects.

TYPE SAND Report YEAR 2010

OSTI DOI

LDRD final report : managing shared memory data distribution in hybrid HPC applications

Pedretti, Kevin P.

MPI is the dominant programming model for distributed memory parallel computers, and is often used as the intra-node programming model on multi-core compute nodes. However, application developers are increasingly turning to hybrid models that use threading within a node and MPI between nodes. In contrast to MPI, most current threaded models do not require application developers to deal explicitly with data locality. With increasing core counts and deeper NUMA hierarchies seen in the upcoming LANL/SNL 'Cielo' capability supercomputer, data distribution poses an upper boundary on intra-node scalability within threaded applications. Data locality therefore has to be identified at runtime using static memory allocation policies such as first-touch or next-touch, or specified by the application user at launch time. We evaluate several existing techniques for managing data distribution using micro-benchmarks on an AMD 'Magny-Cours' system with 24 cores among 4 NUMA domains and argue for the adoption of a dynamic runtime system implemented at the kernel level, employing a novel page table replication scheme to gather per-NUMA domain memory access traces.

TYPE SAND Report YEAR 2010

OSTI DOI

Palacios and kitten: New high performance operating systems for scalable virtualized and native supercomputing

Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010

Lange, John; Pedretti, Kevin P.; Hudson, Trammell; Dinda, Peter; Cui, Zheng; Xia, Lei; Bridges, Patrick; Gocke, Andy; Jaconette, Steven; Levenhagen, Michael J.; Brightwell, Ronald B.

Palacios is a new open-source VMM under development at Northwestern University and the University of New Mexico that enables applications executing in a virtualized environment to achieve scalable high performance on large machines. Palacios functions as a modularized extension to Kitten, a high performance operating system being developed at Sandia National Laboratories to support large-scale supercomputing applications. Together, Palacios and Kitten provide a thin layer over the hardware to support full-featured virtualized environments alongside Kitten's lightweight native environment. Palacios supports existing, unmodified applications and operating systems by using the hardware virtualization technologies in recent AMD and Intel processors. Additionally, Palacios leverages Kitten's simple memory management scheme to enable low-overhead pass-through of native devices to a virtualized environment. We describe the design, implementation, and integration of Palacios and Kitten. Our benchmarks show that Palacios provides near native (within 5%), scalable performance for virtualized environments running important parallel applications. This new architecture provides an incremental path for applications to use supercomputers, running specialized lightweight host operating systems, that is not significantly performance-compromised. © 2010 IEEE.

TYPE Conference YEAR 2010

Scopus OSTI

Virtualizing a large scale supercomputer with minimal overhead

Pedretti, Kevin P.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

rMPI : increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Laros, James H.; Pedretti, Kevin P.; Stearley, Jon S.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

System Software Research for Extreme-Scale Computing

Oldfield, Ron A.; Brightwell, Ronald B.; Pedretti, Kevin P.; Riesen, Rolf; Ferreira, Kurt; Kelly, Suzanne M.; Laros, James H.

Abstract not provided.

TYPE Presentation YEAR 2010

OSTI

Increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

Petaflops systems will have tens to hundreds of thousands of compute nodes, which increases the likelihood of faults. Applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 30,000 nodes will likely spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost. We created a library that performs redundant computations on additional nodes allocated to the application. An active node and its redundant partner form a node bundle which will only fail, and cause an application restart, when both nodes in the bundle fail. The goal of this library is to learn whether this can be done entirely at the user level, what requirements this library places on a Reliability, Availability, and Serviceability (RAS) system, and what its impact on performance and run time is. We find that our redundant MPI layer library imposes a relatively modest performance penalty for applications, but that it greatly reduces the number of application interrupts. This reduction in interrupts leads to huge savings in restart and rework time. For large-scale applications the savings compensate for the performance loss and the additional nodes required for redundant computations.

TYPE SAND Report YEAR 2009

OSTI DOI

Investigating methods of supporting dynamically linked executables on high performance computing platforms

Laros, James H.; Kelly, Suzanne M.; Levenhagen, Michael J.; Pedretti, Kevin P.

Shared libraries have become ubiquitous and are used to achieve great resource efficiencies on many platforms. The same properties that enable efficiencies on time-shared computers and convenience on small clusters prove to be great obstacles to scalability on large clusters and High Performance Computing platforms. In addition, Light Weight operating systems such as Catamount have historically not supported the use of shared libraries specifically because they hinder scalability. In this report we will outline the methods of supporting shared libraries on High Performance Computing platforms using Light Weight kernels that we investigated. The considerations necessary to evaluate utility in this area are many and sometimes conflicting. While our initial path forward has been determined based on this evaluation, we consider this effort ongoing and remain prepared to re-evaluate any technology that might provide a scalable solution. This report is an evaluation of a range of possible methods of supporting dynamically linked executables on capability class High Performance Computing platforms. Efforts are ongoing and extensive testing at scale is necessary to evaluate performance. While performance is a critical driving factor, supporting whatever method is used in a production environment is an equally important and challenging task.

TYPE SAND Report YEAR 2009

OSTI DOI

Palacios and Kitten : high performance operating systems for scalable virtualized and native supercomputing

Pedretti, Kevin P.; Levenhagen, Michael J.; Brightwell, Ronald B.

Palacios and Kitten are new open source tools that enable applications, whether ported or not, to achieve scalable high performance on large machines. They provide a thin layer over the hardware to support both full-featured virtualized environments and native code bases. Kitten is an OS under development at Sandia that implements a lightweight kernel architecture to provide predictable behavior and increased flexibility on large machines, while also providing Linux binary compatibility. Palacios is a VMM that is under development at Northwestern University and the University of New Mexico. Palacios, which can be embedded into Kitten and other OSes, supports existing, unmodified applications and operating systems by using virtualization that leverages hardware technologies. We describe the design and implementation of both Kitten and Palacios. Our benchmarks show that they provide near native, scalable performance. Palacios and Kitten provide an incremental path to using supercomputer resources that is not performance-compromised.

TYPE SAND Report YEAR 2009

OSTI DOI

HPC application fault-tolerance using transparent redundant computation

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Brightwell, Ronald B.; Laros, James H.; Pedretti, Kevin P.

As the core count of HPC machines continues to grow, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

TYPE Conference YEAR 2009

OSTI