Publications

Results 26–40 of 40

OVIS 2.0 user's guide

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 2.0. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 2.0 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It calculates and reports model values and outliers relative to those models. Additionally, it provides an interactive 3D physical view in which the cluster elements can be colored by raw element values (e.g., temperatures, memory errors) or by the comparison of those values to a given model. The analysis tools and the visual display allow the user to easily determine abnormal or outlier behaviors. The OVIS project envisions the OVIS tool, when applied to compute cluster monitoring, to be used in conjunction with the scheduler or resource manager in order to enable intelligent resource utilization. 
For example, nodes that are deemed less healthy, that is, nodes that exhibit outlier behavior in some variable, or set of variables, that has been shown to be correlated with future failure, can be discovered and assigned to shorter-duration or less important jobs. Further, applications with fault-tolerant capabilities can invoke those mechanisms on demand, based upon notification of a node exhibiting impending failure conditions, rather than performing such mechanisms (e.g., checkpointing) unnecessarily at regular intervals.
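The idea of reporting outliers relative to a model fitted to the whole cluster can be illustrated with a minimal sketch. This is not OVIS code: the function name `flag_outliers`, the node names, the temperature values, and the choice of a simple mean and standard deviation model with a threshold of 2.0 are all assumptions made for illustration.

```python
from statistics import mean, stdev


def flag_outliers(readings, threshold=2.0):
    """Return the names of nodes whose reading lies more than `threshold`
    sample standard deviations from the mean of all readings.

    `readings` maps a (hypothetical) node name to one numeric value,
    e.g. a temperature. The mean/stdev model is an illustrative choice,
    not the statistical model used by OVIS.
    """
    values = list(readings.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs >= 2 values
    if sigma == 0:
        return []  # all nodes identical, so nothing stands out
    return sorted(
        name for name, value in readings.items()
        if abs(value - mu) / sigma > threshold
    )


# Hypothetical temperatures (degrees C) for eight nodes of one cluster.
temps = {
    "n01": 41.0, "n02": 42.5, "n03": 40.8, "n04": 41.7,
    "n05": 42.1, "n06": 41.3, "n07": 40.9, "n08": 58.0,
}
print(flag_outliers(temps))
```

A scheduler could treat the returned nodes as candidates for shorter or less important jobs. Note that a single extreme value inflates the standard deviation it is judged against, so a robust model (for example, one based on the median) may be preferable in practice.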


Ovis-2: A robust distributed architecture for scalable RAS

IPDPS Miami 2008 - Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Program and CD-ROM

Brandt, James M.; Debusschere, Bert D.; Gentile, Ann C.; Mayo, J.R.; Pébay, P.P.; Thompson, D.; Wong, Matthew H.

Resource utilization in High Performance Compute clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable fault-tolerant RAS framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We describe some of the sophisticated statistical analysis, 3-D visualization, and use cases for these. Using this framework and associated tools allows the engineer to explore the behaviors and complex interactions of low level system elements while simultaneously giving the system administrator their desired level of detail with respect to ongoing system and component health. ©2008 IEEE.


Using probabilistic characterization to reduce runtime faults in HPC systems

Proceedings CCGRID 2008 - 8th IEEE International Symposium on Cluster Computing and the Grid

Brandt, James M.; Debusschere, Bert D.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, Philippe; Thompson, David; Wong, Matthew H.

The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the relationship between Mean Time To Interrupt and the number of processor elements involved is quite similar across a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close-to-optimal set of processing elements given the available pool and the reliability requirements of the application. © 2008 IEEE.
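Selecting elements by their position in the distribution of similar elements might look like the following sketch. It is an assumption-laden illustration, not the system described in the paper: the function name `select_nodes`, the use of distance from the ensemble median as the ranking criterion, and the sample error counts are all hypothetical.

```python
from statistics import median


def select_nodes(readings, k):
    """Return the `k` node names whose readings are closest to the
    median of all readings, i.e. the most "typical" nodes.

    `readings` maps a (hypothetical) node name to one monitored value,
    such as a count of correctable memory errors. Ties are broken by
    node name so the result is deterministic.
    """
    center = median(readings.values())
    ranked = sorted(
        readings,
        key=lambda name: (abs(readings[name] - center), name),
    )
    return ranked[:k]


# Hypothetical correctable-error counts for five nodes.
errors = {"n01": 0, "n02": 1, "n03": 2, "n04": 9, "n05": 1}
print(select_nodes(errors, 3))
```

A resource manager could request a larger `k` for jobs with loose reliability requirements and a smaller `k` for jobs that need only the most typical nodes in the available pool.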


Monitoring computational clusters with OVIS

Pebay, Philippe P.; Brandt, James M.; Gentile, Ann C.; Wong, Matthew H.

Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes.
