Publications Search

Proceedings of the 2009 Workshop on Resiliency in High Performance, Resilience'09, Co-located with the 2009 International Symposium on High Performance Distributed Computing Conference, HPDC'09

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, Philippe; Roe, Diana C.; Thompson, David; Wong, Matthew H.

The ability to predict impending failures (hardware or software) on large-scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms, could drastically increase the scalability of applications and efficiency of platforms. In this paper we present our findings and methodologies employed to date in our search for reliable, advance indicators of failures on a 288-node, 4608-core, Opteron-based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that for a particular error case, statistical analysis using OVIS would enable advance warning of cluster problems on timescales that would enable application and system administrator response in advance of errors, subsequent system error log reporting, and job failures. This is significant as the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are. Copyright 2009 ACM.

TYPE Conference YEAR 2009

Scopus OSTI

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

IPDPS 2009 - Proceedings of the 2009 IEEE International Parallel and Distributed Processing Symposium

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, Philippe; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Using the cloud computing paradigm, a host of companies promise to make huge compute resources available to users on a pay-as-you-go basis. These resources can be configured on the fly to provide the hardware and operating system of choice to the customer on a large scale. While the current target market for these resources in the commercial space is web development/hosting, this model has the lure of savings of ownership, operation, and maintenance costs, and thus sounds like an attractive solution for people who currently invest millions to hundreds of millions of dollars annually on High Performance Computing (HPC) platforms in order to support large-scale scientific simulation codes. Given the current interconnect bandwidth and topologies utilized in these commercial offerings, however, the only current viable market in HPC would be small-memory-footprint embarrassingly parallel or loosely coupled applications, which inherently require little to no inter-processor communication. While providing the appropriate resources (bandwidth, latency, memory, etc.) for the HPC community would increase the potential to enable HPC in cloud environments, this would not address the need for scalability and reliability, crucial to HPC applications. Providing for these needs is particularly difficult in commercial cloud offerings where the number of virtual resources can far outstrip the number of physical resources, the resources are shared among many users, and the resources may be heterogeneous. Advanced resource monitoring, analysis, and configuration tools can help address these issues, since they bring the ability to dynamically provide and respond to information about the platform and application state and would enable more appropriate, efficient, and flexible use of the resources key to enabling HPC. Additionally, such tools could be of benefit to non-HPC cloud providers, users, and applications by providing more efficient resource utilization in general. © 2009 IEEE.

TYPE Conference YEAR 2009

Scopus OSTI

Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Interactive Data Fusion Capabilities for Large-Scale Compute Cluster Architects and Administrators

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Resource Health Characterizations for Interactive and Autonomous Proactive System Administration and Scheduling Decisions

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Quantifying failure prediction in large scale HPC systems: A case study

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Scalable Information Fusion for Fault Tolerance in Large-Scale HPC

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Quantifying Failure Prediction in Large Scale HPC Systems: A Case Study

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing Environments

Brandt, James M.; Wong, Matthew H.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing

Adalsteinsson, Helgi A.; Brandt, James M.; Gentile, Ann C.; Debusschere, Bert D.; Mayo, Jackson M.; Pebay, Philippe P.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

OVIS 2.0 user's guide

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 2.0. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 2.0 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It calculates and reports model values and outliers relative to those models. Additionally, it provides an interactive 3D physical view in which the cluster elements can be colored by raw element values (e.g., temperatures, memory errors) or by the comparison of those values to a given model. The analysis tools and the visual display allow the user to easily determine abnormal or outlier behaviors. The OVIS project envisions the OVIS tool, when applied to compute cluster monitoring, to be used in conjunction with the scheduler or resource manager in order to enable intelligent resource utilization. 
For example, nodes that are deemed less healthy, that is, nodes that exhibit outlier behavior in some variable, or set of variables, that has been shown to be correlated with future failure, can be discovered and assigned to shorter duration or less important jobs. Further, applications with fault-tolerant capabilities can invoke those mechanisms on demand, based upon notification of a node exhibiting impending failure conditions, rather than performing such mechanisms (e.g., checkpointing) at regular intervals unnecessarily.

TYPE SAND Report YEAR 2009

OSTI DOI

Methodologies for advance warning of compute cluster problems via statistical analysis : a case study

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Resilience Activities at Sandia National Laboratories

Stearley, Jon S.; Brandt, James M.

Abstract not provided.

TYPE Presentation YEAR 2008

OSTI

Ovis-2: A robust distributed architecture for scalable RAS

IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM

Brandt, James M.; Debusschere, Bert D.; Gentile, Ann C.; Mayo, J.R.; Pébay, P.P.; Thompson, D.; Wong, Matthew H.

Resource utilization in High Performance Compute clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable fault-tolerant RAS framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We describe some of the sophisticated statistical analysis, 3-D visualization, and use cases for these. Using this framework and associated tools allows the engineer to explore the behaviors and complex interactions of low level system elements while simultaneously giving the system administrator their desired level of detail with respect to ongoing system and component health. ©2008 IEEE.

TYPE Conference YEAR 2008

Scopus OSTI

Using probabilistic characterization to reduce runtime faults in HPC systems

Proceedings CCGRID 2008 - 8th IEEE International Symposium on Cluster Computing and the Grid

Brandt, James M.; Debusschere, Bert D.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, Philippe; Thompson, David; Wong, Matthew H.

The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the Mean Time To Interrupt vs. number of processor elements involved is quite similar over a large number of platforms. In this paper we present a system that uses hardware level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close to optimal set of processing elements given the available pool and the reliability requirements of the application. © 2008 IEEE.

TYPE Conference YEAR 2008

Scopus OSTI