OVIS reliably monitors computers using novel parallel calculations
Abstract not provided.
Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes.
20th International Parallel and Distributed Processing Symposium, IPDPS 2006
Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes. Additionally, OVIS incorporates a novel Bayesian inference scheme to dynamically infer models for the normal behavior of a system and to determine bounds on the probability of values evinced in the system. Individual node values that are unlikely given the current applicable model are flagged as aberrant. This can be a much earlier indicator of problems than waiting for the crossing of some threshold that is necessarily set high to preclude too many false alarms. We present OVIS and discuss its applications in cluster configuration and environmental tuning and to abnormality and problem discovery in our production clusters. © 2006 IEEE.
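The ensemble idea in the abstract above can be illustrated with a minimal sketch. This is not the Bayesian inference scheme used by OVIS, whose details are not given here. It is a deliberately simpler stand-in that fits a normal model to the whole ensemble using the median and the median absolute deviation, then flags nodes whose readings have a small two-sided tail probability under that model. The function name `flag_aberrant`, the `min_prob` cutoff, the node names, and the temperature readings are all hypothetical and chosen only for illustration.

```python
import math
import statistics


def flag_aberrant(readings, min_prob=0.01):
    """Flag nodes whose reading is unlikely relative to the ensemble.

    Illustrative stand-in only (NOT the OVIS Bayesian scheme): a normal
    model is fit with robust estimates (median and scaled MAD), and a node
    is flagged when the two-sided tail probability of its reading falls
    below `min_prob`. Returns {node: tail_probability}.
    """
    values = list(readings.values())
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    scale = 1.4826 * mad  # scaled MAD approximates sigma for normal data
    if scale == 0:
        return {}  # no spread in the ensemble, so nothing can be ranked

    flagged = {}
    for node, value in readings.items():
        z = abs(value - center) / scale
        tail_prob = math.erfc(z / math.sqrt(2))  # P(|Z| >= z), Z ~ N(0, 1)
        if tail_prob < min_prob:
            flagged[node] = tail_prob
    return flagged


# Hypothetical CPU temperatures in degrees Celsius for eight similar nodes.
readings = {
    "n01": 41.0, "n02": 42.5, "n03": 40.8, "n04": 41.7,
    "n05": 42.1, "n06": 41.3, "n07": 55.0, "n08": 41.9,
}
print(sorted(flag_aberrant(readings)))
```

With these made-up readings only `n07` is flagged, even though 55 °C would typically sit well below a manufacturer-specified extreme limit. That is the point of the comparative approach: a node is judged against its statistically similar peers, not against a single high threshold.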
Current copper backplane technology has reached the technical limits of clock speed and width for systems requiring multiple boards. Bus technologies such as VME and PCI will face severe limitations as the bus speed approaches 100 MHz. At this speed, the physical length limit of an unterminated bus is barely three inches. Terminating the bus enables much higher clock rates but at drastically higher power cost. Sandia has developed high-bandwidth parallel optical interconnects that can provide over 40 Gbps throughput between circuit boards in a system. Based on Sandia's unique VCSEL (Vertical Cavity Surface Emitting Laser) technology, these devices are compatible with CMOS (Complementary Metal Oxide Semiconductor) chips and have single-channel bandwidth in excess of 20 GHz. In this project, we are researching the use of this interconnect scheme as the physical layer of a larger backplane based on ATM (Asynchronous Transfer Mode). This technology has several advantages, including small board space, lower power, and non-contact communication. It is also easily expandable to meet future bandwidth requirements in excess of 160 Gbps, sometimes referred to as UTOPIA 6. ATM over an optical backplane will enable automatic switching of wide, high-speed circuits between boards in a system. In the first year we developed integrated VCSELs and receivers and identified a fiber-ribbon-based interconnect scheme and a high-level architecture. In the second year, we implemented the physical layer in the form of a PCI computer peripheral card. A description of future work, including supercomputer networking deployment and protocol processing, is included.
This document highlights the activities of the DISCOM² (Distance Computing and Communication) team at the 1999 Supercomputing conference in Portland, Oregon. This conference is sponsored by the IEEE and ACM. Sandia, Lawrence Livermore, and Los Alamos National Laboratories have participated in this conference for eleven years. For the last four years the three laboratories have come together at the conference under the rubric of the DOE's Accelerated Strategic Computing Initiative (ASCI). Communication support for the ASCI exhibit is provided by the ASCI DISCOM² project. The DISCOM² communication team uses this forum to demonstrate and focus communication and networking developments within the community. At SC 99, DISCOM² built a prototype of the next-generation ASCI network, demonstrated remote clustering techniques, demonstrated the capabilities of emerging terabit router products, demonstrated the latest technologies for delivering visualization data to scientific users, and demonstrated the latest in encryption methods, including IP VPN technologies and ATM encryption research. The authors also coordinated the other production networking activities within the booth and between their demonstration partners on the exhibit floor. This paper documents those accomplishments, discusses the details of their implementation, and describes how these demonstrations support Sandia's overall strategies in ASCI networking.