Publications Search

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 3.2. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect meaningful conditions and anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 3.2 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It provides an interactive 3-D physical view in which the cluster elements can be colored by raw or derived element values (e.g., temperatures, memory errors). The visual display allows the user to easily determine abnormal or outlier behaviors. Additionally, it provides search capabilities for certain scheduler logs. The OVIS capabilities were designed to be highly interactive - for example, the job search may drive an analysis which in turn may drive the user generation of a derived value which would then be examined on the physical display. The OVIS project envisions the capabilities of its tools applied to compute cluster monitoring. In the future, integration with the scheduler or resource manager will be included in a release to enable intelligent resource utilization. For example, nodes that are deemed less healthy (i.e., nodes that exhibit outlier behavior with respect to some set of variables shown to be correlated with future failure) can be discovered and assigned to shorter duration or less important jobs. Further, HPC applications with fault-tolerant capabilities would respond to changes in resource health and other OVIS notifications as needed, rather than undertaking preventative measures (e.g. checkpointing) at regular intervals unnecessarily.

More Details

TYPE SAND Report YEAR 2010

OSTI DOI

The scalable, parallel statistics toolkit of Titan

Pebay, Philippe P.; Bennett, Janine C.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Understanding large scale HPC systems through scalable monitoring and analysis

Brandt, James M.; Gentile, Ann C.; Roe, Diana C.; Pebay, Philippe P.; Wong, Matthew H.

As HPC systems grow in size and complexity, diagnosing problems and understanding system behavior, including failure modes, becomes increasingly difficult and time consuming. At Sandia National Laboratories we have developed a tool, OVIS, to facilitate large scale HPC system understanding. OVIS incorporates an intuitive graphical user interface, an extensive and extendable data analysis suite, and a 3-D visualization engine that allows visual inspection of both raw and derived data on a geometrically correct representation of a HPC system. This talk will cover system instrumentation, data collection (including log files and the complications of meaningful parsing), analysis, visualization of both raw and derived information, and how data can be combined to increase system understanding and efficiency.

More Details

TYPE Conference YEAR 2010

OSTI

Determining the Bayesian optimal sampling strategy in a hierarchical system

Boggs, Paul T.; Pebay, Philippe P.; Ringland, James T.

Consider a classic hierarchy tree as a basic model of a 'system-of-systems' network, where each node represents a component system (which may itself consist of a set of sub-systems). For this general composite system, we present a technique for computing the optimal testing strategy, which is based on Bayesian decision analysis. In previous work, we developed a Bayesian approach for computing the distribution of the reliability of a system-of-systems structure that uses test data and prior information. This allows for the determination of both an estimate of the reliability and a quantification of confidence in the estimate. Improving the accuracy of the reliability estimate and increasing the corresponding confidence require the collection of additional data. However, testing all possible sub-systems may not be cost-effective, feasible, or even necessary to achieve an improvement in the reliability estimate. To address this sampling issue, we formulate a Bayesian methodology that systematically determines the optimal sampling strategy under specified constraints and costs that will maximally improve the reliability estimate of the composite system, e.g., by reducing the variance of the reliability distribution. This methodology involves calculating the 'Bayes risk of a decision rule' for each available sampling strategy, where risk quantifies the relative effect that each sampling strategy could have on the reliability estimate. A general numerical algorithm is developed and tested using an example multicomponent system. The results show that the procedure scales linearly with the number of components available for testing.

More Details

TYPE SAND Report YEAR 2010

OSTI DOI

Computing contingency statistics in parallel

Pebay, Philippe P.; Bennett, Janine C.

Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and {chi}{sup 2} independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel.We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.

More Details

TYPE Conference YEAR 2010

OSTI

The OVIS analysis architecture

Brandt, James M.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

This report summarizes the current statistical analysis capability of OVIS and how it works in conjunction with the OVIS data readers and interpolators. It also documents how to extend these capabilities. OVIS is a tool for parallel statistical analysis of sensor data to improve system reliability. Parallelism is achieved using a distributed data model: many sensors on similar components (metaphorically sheep) insert measurements into a series of databases on computers reserved for analyzing the measurements (metaphorically shepherds). Each shepherd node then processes the sheep data stored locally and the results are aggregated across all shepherds. OVIS uses the Visualization Tool Kit (VTK) statistics algorithm class hierarchy to perform analysis of each process's data but avoids VTK's model aggregation stage which uses the Message Passing Interface (MPI); this is because if a single process in an MPI job fails, the entire job will fail. Instead, OVIS uses asynchronous database replication to aggregate statistical models. OVIS has several additional features beyond those present in VTK that, first, accommodate its particular data format and, second, improve the memory and speed of the statistical analyses. First, because many statistical algorithms are multivariate in nature and sensor data is typically univariate, interpolation of data is required to provide simultaneous observations of metrics. Note that in this report, we will refer to a single value obtained from a sensor as a measurement while a collection of multiple sensor values simultaneously present in the system is an observation. A base class for interpolation is provided that abstracts the operation of converting multiple sensor measurements into simultaneous observations. A concrete implementation is provided that performs piecewise constant temporal interpolation of multiple metrics across a single component. Secondly, because calculations may summarize data too large to fit in memory OVIS analyses batches of observations at a time and aggregates these intermediate intra-process models as it goes before storing the final model for inter-process aggregation via database replication. This reduces the memory footprint of the analysis, interpolation, and the database client and server query processing. This also interleaves processing with the disk I/O required to fetch data from the database - also improving speed. This report documents how OVIS performs analyses and how to create additional analysis components that fetch measurements from the database, perform interpolation, or perform operations on streamed observations (such as model updates or assessments). The rest of this section outlines the OVIS analysis algorithm and is followed by sections specific to each subtask. Note that we are limiting our discussion for now to the creation of a model from a set of measurements, and not including the assessment of observations using a model. The same framework can be used for assessment but that use case is not detailed in this report.

More Details

TYPE SAND Report YEAR 2010

OSTI DOI

Computing contingency statistics in parallel : design trade-offs and limiting cases

Bennett, Janine C.; Pebay, Philippe P.

Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and {chi}{sup 2} independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics (which we discussed in [1]) where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.

More Details

TYPE Conference YEAR 2010

OSTI

Copy of Copy of Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Scalable modeling and analysis for resilience

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Wong, Matthew H.; De Sapio, Vincent D.; Roe, Diana C.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Are there observable precursors to HPC platform failures?

Brandt, James M.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Are there observable precursors to HPC platform resource failures?

Brandt, James M.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

A framework for graph-based synthesis, analysis, and visualization of HPC cluster job data

De Sapio, Vincent D.; Brandt, James M.; Gentile, Ann C.; Kegelmeyer, William P.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Topological feature-based statistical analysis of petascale data

Pebay, Philippe P.; Bennett, Janine C.; Mascarenhas, Ajith A.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Scalable Information Fusion for Fault Tolerance in Large-Scale HPC

Brandt, James M.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Combining Virtualization Resource Characterization and Resource Management to Enable Efficient High Performance Compute Platforms Through Intelligent Dynamic Resource Allocation

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Using Cloud Constructs and Predictive Analysis to Enable Pre-Failure Process Migration in HPC Systems

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

More Details

TYPE Conference YEAR 2009

OSTI

Scalable k-means statistics with Titan

Pebay, Philippe P.

This report summarizes existing statistical engines in VTK/Titan and presents both the serial and parallel k-means statistics engines. It is a sequel to [PT08], [BPRT09], and [PT09] which studied the parallel descriptive, correlative, multi-correlative, principal component analysis, and contingency engines. The ease of use of the new parallel k-means engine is illustrated by the means of C++ code snippets and algorithm verification is provided. This report justifies the design of the statistics engines with parallel scalability in mind, and provides scalability and speed-up analysis results for the k-means engine.

More Details

TYPE SAND Report YEAR 2009

OSTI DOI

ParaView Tutorial: Statistics

Pebay, Philippe P.; Bennett, Janine C.; Roe, Diana C.; Fabian, Nathan D.

Abstract not provided.

More Details

TYPE Conference YEAR 2009

OSTI

Topological and Statistical Methods for Segmentation and Feature Detection

Grout, Ray G.; Chen, Jacqueline H.; Mascarenhas, Ajith A.; Yoo, Chunsang N.; Yu, Hongfeng Y.; Pebay, Philippe P.

Abstract not provided.

More Details

TYPE Presentation YEAR 2009

OSTI

Data Fusion and Statistical Analysis: Piercing the Darkness of the Black Box

Brandt, James M.; Chen, Frank X.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

More Details

TYPE Conference YEAR 2009

OSTI

Parallel contingency statistics with Titan

Pebay, Philippe P.

This report summarizes existing statistical engines in VTK/Titan and presents the recently parallelized contingency statistics engine. It is a sequel to [PT08] and [BPRT09] which studied the parallel descriptive, correlative, multi-correlative, and principal component analysis engines. The ease of use of this new parallel engines is illustrated by the means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; however, the very nature of contingency tables prevent this new engine from exhibiting optimal parallel speed-up as the aforementioned engines do. This report therefore discusses the design trade-offs we made and study performance with up to 200 processors.

More Details

TYPE SAND Report YEAR 2009

OSTI DOI