Research

Formal methods have come into wide use because of their effectiveness in verifying "safety and security" requirements of digital systems, a class of requirements for which testing is mostly ineffective. Formal methods are routinely used in industry to design and verify high-consequence digital systems. This report outlines our work in assessing the capabilities of commercial and open source formal tools and the ways in which they can be leveraged in digital design workflows.

TYPE SAND Report YEAR 2014

Modeling Failures in Large-Scale Computer Systems

Mayo, Jackson M.; Brandt, James M.; Gentile, Ann C.; Wong, Matthew H.

Abstract not provided.

TYPE Presentation YEAR 2012

Framework for Enabling System Understanding

Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2011

Framework for Enabling System Understanding

Brandt, James M.; Chen, Frank X.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2011

Baler: Deterministic, lossless log message clustering tool

Computer Science - Research and Development

Taerat, Narate; Brandt, Jim; Gentile, Ann C.; Wong, Matthew H.; Leangsuksun, Chokchai

The rate of failures in HPC systems continues to increase as the number of components comprising the systems increases. System logs are a valuable source of information for analyzing system failures and their root causes. However, system log files are usually too large and complex to analyze manually. Some existing log clustering tools seek to help analysts explore these logs; however, they fail to satisfy our needs with respect to scalability, usability, and quality of results. Thus, we have developed a log clustering tool to better address these needs. In this paper we present our novel approach and initial experimental results. © Springer-Verlag 2011.

TYPE Conference YEAR 2011

Cleansed Glory Dataset

Gentile, Ann C.; Brandt, James M.; Wong, Matthew H.

Abstract not provided.

TYPE Presentation YEAR 2011

Scalable HPC monitoring and analysis for understanding and automated response

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OVIS 3.2 user's guide

Brandt, James M.; Gentile, Ann C.; Houf, Catherine A.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 3.2. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect meaningful conditions and anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 3.2 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It provides an interactive 3-D physical view in which the cluster elements can be colored by raw or derived element values (e.g., temperatures, memory errors). The visual display allows the user to easily determine abnormal or outlier behaviors. Additionally, it provides search capabilities for certain scheduler logs. The OVIS capabilities were designed to be highly interactive - for example, the job search may drive an analysis which in turn may drive the user generation of a derived value which would then be examined on the physical display. The OVIS project envisions the capabilities of its tools applied to compute cluster monitoring. 
In the future, integration with the scheduler or resource manager will be included in a release to enable intelligent resource utilization. For example, nodes that are deemed less healthy (i.e., nodes that exhibit outlier behavior with respect to some set of variables shown to be correlated with future failure) can be discovered and assigned to shorter duration or less important jobs. Further, HPC applications with fault-tolerant capabilities would respond to changes in resource health and other OVIS notifications as needed, rather than undertaking preventative measures (e.g., checkpointing) unnecessarily at regular intervals.

TYPE SAND Report YEAR 2010

Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example

Proceedings of the International Conference on Dependable Systems and Networks

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, Philippe; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Effective failure prediction and mitigation strategies in high-performance computing systems could provide huge gains in resilience of tightly coupled large-scale scientific codes. These gains would come from prediction-directed process migration and resource servicing, intelligent resource allocation, and checkpointing driven by failure predictors rather than at regular intervals based on nominal mean time to failure. Given probabilistic associations of outlier behavior in hardware-related metrics with eventual failure in hardware, system software, and/or applications, this paper explores approaches for quantifying the effects of prediction and mitigation strategies and demonstrates these using actual production system data. We describe context-relevant methodologies for determining the accuracy and cost-benefit of predictors. © 2010 IEEE.
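As a rough illustration of what "accuracy and cost-benefit of predictors" can mean in practice, the sketch below scores a hypothetical failure predictor from counts of correct and incorrect predictions. It is not the methodology of the paper, which the abstract does not spell out. The function names, the cost model (a fixed saving per correctly predicted failure and a fixed cost per mitigation action), and the numbers are all assumptions made for this example.

```python
def predictor_scores(tp, fp, fn):
    """Return (precision, recall) for a failure predictor.

    tp: failures that were predicted (true positives)
    fp: predictions with no subsequent failure (false positives)
    fn: failures that were not predicted (false negatives)
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def net_benefit(tp, fp, saved_per_hit, cost_per_action):
    """Hypothetical cost model: every prediction triggers one mitigation
    action (e.g., a checkpoint or migration), and only correct predictions
    save work that would otherwise have been lost."""
    return tp * saved_per_hit - (tp + fp) * cost_per_action


# Illustrative counts and costs (node-hours), not measured data.
precision, recall = predictor_scores(tp=8, fp=2, fn=2)
benefit = net_benefit(tp=8, fp=2, saved_per_hit=10.0, cost_per_action=1.0)
print(precision, recall, benefit)
```

Under this simple model, a predictor is worth acting on only when the saving from its correct predictions exceeds the cost of acting on all of its predictions, including the false alarms.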

TYPE Conference YEAR 2010

Understanding large scale HPC systems through scalable monitoring and analysis

Brandt, James M.; Gentile, Ann C.; Roe, Diana C.; Pebay, Philippe P.; Wong, Matthew H.

As HPC systems grow in size and complexity, diagnosing problems and understanding system behavior, including failure modes, become increasingly difficult and time consuming. At Sandia National Laboratories we have developed a tool, OVIS, to facilitate large scale HPC system understanding. OVIS incorporates an intuitive graphical user interface, an extensive and extendable data analysis suite, and a 3-D visualization engine that allows visual inspection of both raw and derived data on a geometrically correct representation of an HPC system. This talk will cover system instrumentation, data collection (including log files and the complications of meaningful parsing), analysis, visualization of both raw and derived information, and how data can be combined to increase system understanding and efficiency.

TYPE Conference YEAR 2010

The OVIS analysis architecture

Brandt, James M.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

This report summarizes the current statistical analysis capability of OVIS and how it works in conjunction with the OVIS data readers and interpolators. It also documents how to extend these capabilities. OVIS is a tool for parallel statistical analysis of sensor data to improve system reliability. Parallelism is achieved using a distributed data model: many sensors on similar components (metaphorically sheep) insert measurements into a series of databases on computers reserved for analyzing the measurements (metaphorically shepherds). Each shepherd node then processes the sheep data stored locally and the results are aggregated across all shepherds. OVIS uses the Visualization Toolkit (VTK) statistics algorithm class hierarchy to analyze each process's data but avoids VTK's model aggregation stage, which uses the Message Passing Interface (MPI), because if a single process in an MPI job fails, the entire job fails. Instead, OVIS uses asynchronous database replication to aggregate statistical models. OVIS has several additional features beyond those present in VTK that, first, accommodate its particular data format and, second, improve the memory use and speed of the statistical analyses. First, because many statistical algorithms are multivariate in nature and sensor data is typically univariate, interpolation of data is required to provide simultaneous observations of metrics. Note that in this report, we refer to a single value obtained from a sensor as a measurement, while a collection of multiple sensor values simultaneously present in the system is an observation. A base class for interpolation is provided that abstracts the operation of converting multiple sensor measurements into simultaneous observations. A concrete implementation is provided that performs piecewise constant temporal interpolation of multiple metrics across a single component.
Second, because calculations may summarize data too large to fit in memory, OVIS analyzes batches of observations at a time and aggregates these intermediate intra-process models as it goes before storing the final model for inter-process aggregation via database replication. This reduces the memory footprint of the analysis, interpolation, and the database client and server query processing. It also interleaves processing with the disk I/O required to fetch data from the database, which further improves speed. This report documents how OVIS performs analyses and how to create additional analysis components that fetch measurements from the database, perform interpolation, or perform operations on streamed observations (such as model updates or assessments). The rest of this section outlines the OVIS analysis algorithm and is followed by sections specific to each subtask. Note that we are limiting our discussion for now to the creation of a model from a set of measurements, and not including the assessment of observations using a model. The same framework can be used for assessment but that use case is not detailed in this report.
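The two ideas in this abstract, turning per-sensor measurements into simultaneous observations by piecewise constant interpolation and building a model batch by batch before merging partial models, can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not OVIS or VTK code: the names `piecewise_constant`, `observations`, and `Moments` are hypothetical, the sample data is invented, and the model here is just a running mean and variance combined with the standard pairwise update formulas.

```python
from bisect import bisect_right
from dataclasses import dataclass


def piecewise_constant(samples, t):
    """Return the most recent value at or before time t, or None.

    samples is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i else None


def observations(metrics, times):
    """Combine per-metric measurements into simultaneous observations."""
    rows = []
    for t in times:
        row = {name: piecewise_constant(s, t) for name, s in metrics.items()}
        if None not in row.values():  # skip times before a metric's first sample
            rows.append(row)
    return rows


@dataclass
class Moments:
    """Count, mean, and sum of squared deviations for one metric."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x):
        # Single-pass (Welford) update with one new value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Pairwise combination of two partial models into a new one.
        if other.n == 0:
            return Moments(self.n, self.mean, self.m2)
        if self.n == 0:
            return Moments(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)


# Invented measurements: two metrics sampled at different times.
metrics = {
    "temp": [(0, 40.0), (10, 42.0), (20, 47.0)],
    "fan": [(5, 3000.0), (15, 3200.0)],
}
obs = observations(metrics, times=[5, 10, 15, 20])

# Process the observations in two batches, then merge the partial models.
batches = [obs[:2], obs[2:]]
partials = []
for batch in batches:
    model = Moments()
    for row in batch:
        model.update(row["temp"])
    partials.append(model)
merged = partials[0].merge(partials[1])
print(obs[0], merged.n, merged.mean)
```

Because `merge` needs only the summary of each batch, no batch has to stay in memory once it has been processed, and the same operation could combine models produced by separate processes.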

TYPE SAND Report YEAR 2010

Copy of Copy of Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

Scalable modeling and analysis for resilience

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Wong, Matthew H.; De Sapio, Vincent D.; Roe, Diana C.

Abstract not provided.

TYPE Conference YEAR 2010

Are there observable precursors to HPC platform failures?

Brandt, James M.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

Are there observable precursors to HPC platform resource failures?

Brandt, James M.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

A framework for graph-based synthesis, analysis, and visualization of HPC cluster job data

De Sapio, Vincent D.; Brandt, James M.; Gentile, Ann C.; Kegelmeyer, William P.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

Scalable Information Fusion for Fault Tolerance in Large-Scale HPC

Brandt, James M.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

Combining Virtualization Resource Characterization and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

Methodologies for advance warning of compute cluster problems via statistical analysis: A case study

Proceedings of the 2009 Workshop on Resiliency in High Performance, Resilience'09, Co-located with the 2009 International Symposium on High Performance Distributed Computing Conference, HPDC'09

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, Philippe; Roe, Diana C.; Thompson, David; Wong, Matthew H.

The ability to predict impending failures (hardware or software) on large-scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms, could drastically increase the scalability of applications and efficiency of platforms. In this paper we present our findings and the methodologies employed to date in our search for reliable, advance indicators of failures on a 288-node, 4608-core, Opteron-based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that, for a particular error case, statistical analysis using OVIS would enable advance warning of cluster problems on timescales that would enable application and system administrator response in advance of errors, subsequent system error log reporting, and job failures. This is significant as the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are. Copyright 2009 ACM.

TYPE Conference YEAR 2009

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

IPDPS 2009 - Proceedings of the 2009 IEEE International Parallel and Distributed Processing Symposium

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, Philippe; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Using the cloud computing paradigm, a host of companies promise to make huge compute resources available to users on a pay-as-you-go basis. These resources can be configured on the fly to provide the hardware and operating system of choice to the customer on a large scale. While the current target market for these resources in the commercial space is web development/hosting, this model has the lure of savings of ownership, operation, and maintenance costs, and thus sounds like an attractive solution for people who currently invest millions to hundreds of millions of dollars annually on High Performance Computing (HPC) platforms in order to support large-scale scientific simulation codes. Given the current interconnect bandwidth and topologies utilized in these commercial offerings, however, the only current viable market in HPC would be small-memory-footprint embarrassingly parallel or loosely coupled applications, which inherently require little to no inter-processor communication. While providing the appropriate resources (bandwidth, latency, memory, etc.) for the HPC community would increase the potential to enable HPC in cloud environments, this would not address the need for scalability and reliability, crucial to HPC applications. Providing for these needs is particularly difficult in commercial cloud offerings where the number of virtual resources can far outstrip the number of physical resources, the resources are shared among many users, and the resources may be heterogeneous. Advanced resource monitoring, analysis, and configuration tools can help address these issues, since they bring the ability to dynamically provide and respond to information about the platform and application state and would enable more appropriate, efficient, and flexible use of the resources key to enabling HPC. Additionally such tools could be of benefit to non-HPC cloud providers, users, and applications by providing more efficient resource utilization in general. © 2009 IEEE.

TYPE Conference YEAR 2009

Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009
