Publications Search

Formal methods have come into wide use because of their effectiveness in verifying "safety and security" requirements of digital systems, a set of requirements for which testing is mostly ineffective. Formal methods are routinely used in the design and verification of high-consequence digital systems in industry. This report outlines our work in assessing the capabilities of commercial and open source formal tools and the ways in which they can be leveraged in digital design workflows.

TYPE SAND Report YEAR 2014

DOI OSTI

Modeling Failures in Large-Scale Computer Systems

Thompson, David; Mayo, Jackson R.; Brandt, James M.; Gentile, Ann C.; Wong, Matthew H.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Framework for Enabling System Understanding

Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Framework for Enabling System Understanding

Brandt, James M.; Chen, Frank X.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Baler: Deterministic, lossless log message clustering tool

Computer Science - Research and Development

Taerat, Narate; Brandt, Jim; Gentile, Ann C.; Wong, Matthew H.; Leangsuksun, Chokchai

The rate of failures in HPC systems continues to increase as the number of components comprising the systems increases. System logs are one of the valuable information sources that can be used to analyze system failures and their root causes. However, system log files are usually too large and complex to analyze manually. Some existing log clustering tools seek to help analysts explore these logs; however, they fail to satisfy our needs with respect to scalability, usability, and quality of results. Thus, we have developed a log clustering tool to better address these needs. In this paper we present our novel approach and initial experimental results. © Springer-Verlag 2011.

TYPE Conference YEAR 2011

Scopus OSTI

Cleansed Glory Dataset

Gentile, Ann C.; Brandt, James M.; Wong, Matthew H.

Abstract not provided.

TYPE Presentation YEAR 2011

OSTI

Scalable HPC monitoring and analysis for understanding and automated response

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

OVIS 3.2 user's guide

Brandt, James M.; Gentile, Ann C.; Houf, Catherine A.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 3.2. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect meaningful conditions and anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 3.2 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It provides an interactive 3-D physical view in which the cluster elements can be colored by raw or derived element values (e.g., temperatures, memory errors). The visual display allows the user to easily determine abnormal or outlier behaviors. Additionally, it provides search capabilities for certain scheduler logs. The OVIS capabilities were designed to be highly interactive - for example, the job search may drive an analysis which in turn may drive the user generation of a derived value which would then be examined on the physical display. The OVIS project envisions the capabilities of its tools applied to compute cluster monitoring. 
In the future, integration with the scheduler or resource manager will be included in a release to enable intelligent resource utilization. For example, nodes that are deemed less healthy (i.e., nodes that exhibit outlier behavior with respect to some set of variables shown to be correlated with future failure) can be discovered and assigned to shorter duration or less important jobs. Further, HPC applications with fault-tolerant capabilities would respond to changes in resource health and other OVIS notifications as needed, rather than undertaking preventative measures (e.g. checkpointing) at regular intervals unnecessarily.

TYPE SAND Report YEAR 2010

DOI OSTI

Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example

Proceedings of the International Conference on Dependable Systems and Networks

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pébay, Philippe; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Effective failure prediction and mitigation strategies in high-performance computing systems could provide huge gains in resilience of tightly coupled large-scale scientific codes. These gains would come from prediction-directed process migration and resource servicing, intelligent resource allocation, and checkpointing driven by failure predictors rather than at regular intervals based on nominal mean time to failure. Given probabilistic associations of outlier behavior in hardware-related metrics with eventual failure in hardware, system software, and/or applications, this paper explores approaches for quantifying the effects of prediction and mitigation strategies and demonstrates these using actual production system data. We describe context-relevant methodologies for determining the accuracy and cost-benefit of predictors. © 2010 IEEE.

TYPE Conference YEAR 2010

Scopus OSTI

Understanding large scale HPC systems through scalable monitoring and analysis

Brandt, James M.; Gentile, Ann C.; Roe, Diana C.; Pebay, Philippe P.; Wong, Matthew H.

As HPC systems grow in size and complexity, diagnosing problems and understanding system behavior, including failure modes, becomes increasingly difficult and time consuming. At Sandia National Laboratories we have developed a tool, OVIS, to facilitate large scale HPC system understanding. OVIS incorporates an intuitive graphical user interface, an extensive and extendable data analysis suite, and a 3-D visualization engine that allows visual inspection of both raw and derived data on a geometrically correct representation of an HPC system. This talk will cover system instrumentation, data collection (including log files and the complications of meaningful parsing), analysis, visualization of both raw and derived information, and how data can be combined to increase system understanding and efficiency.

TYPE Conference YEAR 2010

OSTI

The OVIS analysis architecture

Brandt, James M.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

This report summarizes the current statistical analysis capability of OVIS and how it works in conjunction with the OVIS data readers and interpolators. It also documents how to extend these capabilities. OVIS is a tool for parallel statistical analysis of sensor data to improve system reliability. Parallelism is achieved using a distributed data model: many sensors on similar components (metaphorically sheep) insert measurements into a series of databases on computers reserved for analyzing the measurements (metaphorically shepherds). Each shepherd node then processes the sheep data stored locally and the results are aggregated across all shepherds. OVIS uses the Visualization Tool Kit (VTK) statistics algorithm class hierarchy to perform analysis of each process's data but avoids VTK's model aggregation stage which uses the Message Passing Interface (MPI); this is because if a single process in an MPI job fails, the entire job will fail. Instead, OVIS uses asynchronous database replication to aggregate statistical models. OVIS has several additional features beyond those present in VTK that, first, accommodate its particular data format and, second, improve the memory and speed of the statistical analyses. First, because many statistical algorithms are multivariate in nature and sensor data is typically univariate, interpolation of data is required to provide simultaneous observations of metrics. Note that in this report, we will refer to a single value obtained from a sensor as a measurement while a collection of multiple sensor values simultaneously present in the system is an observation. A base class for interpolation is provided that abstracts the operation of converting multiple sensor measurements into simultaneous observations. A concrete implementation is provided that performs piecewise constant temporal interpolation of multiple metrics across a single component. 
Second, because calculations may summarize data too large to fit in memory, OVIS analyzes batches of observations at a time and aggregates these intermediate intra-process models as it goes before storing the final model for inter-process aggregation via database replication. This reduces the memory footprint of the analysis, interpolation, and the database client and server query processing. It also interleaves processing with the disk I/O required to fetch data from the database, which further improves speed. This report documents how OVIS performs analyses and how to create additional analysis components that fetch measurements from the database, perform interpolation, or perform operations on streamed observations (such as model updates or assessments). The rest of this section outlines the OVIS analysis algorithm and is followed by sections specific to each subtask. Note that we are limiting our discussion for now to the creation of a model from a set of measurements, and not including the assessment of observations using a model. The same framework can be used for assessment but that use case is not detailed in this report.
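
The batch-wise aggregation described in this abstract can be illustrated with a minimal sketch. It is not OVIS or VTK code: the `Moments` class and the function names are hypothetical, and the model is limited to count, mean, and variance. Each batch yields a small intermediate model (Welford's update), and models are merged with the standard pairwise update formula, so raw observations never need to be held in memory together.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Moments:
    """Hypothetical model: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 when fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


def model_from_batch(batch: Iterable[float]) -> Moments:
    """Build a model from one batch of observations (Welford's update)."""
    m = Moments()
    for x in batch:
        m.n += 1
        delta = x - m.mean
        m.mean += delta / m.n
        m.m2 += delta * (x - m.mean)
    return m


def merge(a: Moments, b: Moments) -> Moments:
    """Combine two models without revisiting the raw data."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def model_from_stream(batches: Iterable[Iterable[float]]) -> Moments:
    """Aggregate intermediate per-batch models into one final model."""
    total = Moments()
    for batch in batches:
        total = merge(total, model_from_batch(batch))
    return total
```

The same `merge` function could combine the final models of several analysis processes, which is the role the abstract assigns to database replication; how OVIS actually transports and combines models is not shown here.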

TYPE SAND Report YEAR 2010

DOI OSTI

Are there observable precursors to HPC platform resource failures?

Brandt, James M.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Are there observable precursors to HPC platform failures?

Brandt, James M.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Scalable modeling and analysis for resilience

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Wong, Matthew H.; De Sapio, Vincent; Roe, Diana C.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Copy of Copy of Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Copy of Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Copy of Combining Virtualization Resource Characterization and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

A framework for graph-based synthesis, analysis, and visualization of HPC cluster job data

De Sapio, Vincent; Brandt, James M.; Gentile, Ann C.; Kegelmeyer, William P.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Scalable Information Fusion for Fault Tolerance in Large-Scale HPC

Brandt, James M.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Combining Virtualization Resource Characterization and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Quantifying Failure Prediction in Large Scale HPC Systems: A Case Study

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Scalable Information Fusion for Fault Tolerance in Large-Scale HPC

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Quantifying failure prediction in large scale HPC systems: A case study

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Resource Health Characterizations for Interactive and Autonomous Proactive System Administration and Scheduling Decisions

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Interactive Data Fusion Capabilities for Large-Scale Compute Cluster Architects and Administrators

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Combining System Characterization and Novel Execution Models to Achieve Scalable Robust Computing

Adalsteinsson, Helgi; Brandt, James M.; Gentile, Ann C.; Debusschere, Bert; Mayo, Jackson R.; Pebay, Philippe P.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing Environments

Brandt, James M.; Wong, Matthew H.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Copy of Methodologies for Advance Warning of Compute Cluster Problems via Statistical Analysis: A Case Study (Conference Presentation)

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

OVIS 2.0 user's guide

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 2.0. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 2.0 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It calculates and reports model values and outliers relative to those models. Additionally, it provides an interactive 3D physical view in which the cluster elements can be colored by raw element values (e.g., temperatures, memory errors) or by the comparison of those values to a given model. The analysis tools and the visual display allow the user to easily determine abnormal or outlier behaviors. The OVIS project envisions the OVIS tool, when applied to compute cluster monitoring, to be used in conjunction with the scheduler or resource manager in order to enable intelligent resource utilization. 
For example, nodes that are deemed less healthy, that is, nodes that exhibit outlier behavior in some variable, or set of variables, that has been shown to be correlated with future failure, can be discovered and assigned to shorter duration or less important jobs. Further, applications with fault-tolerant capabilities can invoke those mechanisms on demand, based upon notification of a node exhibiting impending failure conditions, rather than performing such mechanisms (e.g., checkpointing) at regular intervals unnecessarily.

TYPE SAND Report YEAR 2009

DOI OSTI

Methodologies for advance warning of compute cluster problems via statistical analysis: a case study

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing Environments

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2008

OSTI

OVIS-2: A Robust Distributed Architecture for Scalable RAS

Brandt, James M.; Gentile, Ann C.; Wong, Matthew H.; Thompson, David; Pebay, Philippe P.; Debusschere, Bert; Mayo, Jackson R.

Abstract not provided.

TYPE Conference YEAR 2008

OSTI

OVIS-2: A Robust Distributed Architecture for Scalable RAS

Wong, Matthew H.; Thompson, David; Pebay, Philippe P.; Mayo, Jackson R.; Gentile, Ann C.; Debusschere, Bert; Brandt, James M.

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Brandt, James M.; Gentile, Ann C.; Pebay, Philippe P.; Thompson, David; Wong, Matthew H.; Debusschere, Bert; Mayo, Jackson R.

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

OVIS reliably monitors computers using novel parallel calculations

Brandt, James M.; Gentile, Ann C.; Pebay, Philippe P.; Thompson, David; Wong, Matthew H.; Jolly, James

Abstract not provided.

TYPE Presentation YEAR 2007

OSTI

Monitoring computational clusters with OVIS

Pebay, Philippe P.; Brandt, James M.; Gentile, Ann C.; Wong, Matthew H.

Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes.
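
The contrast between a fixed manufacturer limit and an ensemble comparison can be shown with a small sketch. It is not the OVIS method: the function name, the node names, and the three-standard-deviation cutoff are assumptions chosen for illustration. A node is flagged when its reading deviates strongly from the rest of the cluster, even if it is still below an absolute limit.

```python
from statistics import mean, stdev
from typing import Dict, List


def ensemble_outliers(readings: Dict[str, float], k: float = 3.0) -> List[str]:
    """Return names of nodes whose reading lies more than k sample
    standard deviations from the ensemble mean (hypothetical criterion)."""
    values = list(readings.values())
    if len(values) < 2:
        return []  # no ensemble to compare against
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all nodes identical, nothing stands out
    return sorted(name for name, v in readings.items() if abs(v - mu) > k * sigma)


# Example: 19 nodes at 40.0 degrees and one at 70.0 degrees.
# A hypothetical fixed limit of 85.0 would not flag any node,
# while the ensemble comparison singles out the warm one.
temps = {f"node{i:02d}": 40.0 for i in range(19)}
temps["node19"] = 70.0
flagged = ensemble_outliers(temps)
```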

TYPE SAND Report YEAR 2006

DOI OSTI

OVIS: A Tool for Intelligent Real-time Monitoring of Computational Clusters

Gentile, Ann C.; Wong, Matthew H.; Brandt, James M.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

OVIS: A Tool for Intelligent Real-time Monitoring of Computational Clusters

Gentile, Ann C.; Wong, Matthew H.; Brandt, James M.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI
