Publications / Conference

OVIS-2: A robust distributed architecture for scalable RAS

Brandt, James M.; Debusschere, Bert D.; Gentile, Ann C.; Mayo, J.R.; Pébay, P.P.; Thompson, D.; Wong, Matthew H.

Resource utilization in high-performance compute clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable, fault-tolerant reliability, availability, and serviceability (RAS) framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We also describe some of its sophisticated statistical analysis and 3-D visualization capabilities, along with use cases for them. Using this framework and its associated tools allows the engineer to explore the behaviors and complex interactions of low-level system elements, while simultaneously giving the system administrator the desired level of detail about ongoing system and component health. ©2008 IEEE.