One of the authors previously conjectured that the wrinkling of propagating fronts by weak random advection increases the bulk propagation rate (turbulent burning velocity) in proportion to the 4/3 power of the advection strength. An exact derivation of this scaling is reported. The analysis shows that the coefficient of this scaling is equal to the energy density of a lower-dimensional Burgers fluid with a white-in-time forcing whose spatial structure is expressed in terms of the spatial autocorrelation of the flow that advects the front. The replica method of field theory has been used to derive an upper bound on the coefficient as a function of the spatial autocorrelation. High precision numerics show that the bound is usefully sharp. Implications for strongly advected fronts (e.g., turbulent flames) are noted.
Proceedings of the 2009 Workshop on Resiliency in High Performance, Resilience'09, Co-located with the 2009 International Symposium on High Performance Distributed Computing Conference, HPDC'09
The ability to predict impending failures (hardware or software) on large scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms could drastically increase the scalability of applications and efficiency of platforms. In this paper we present our findings and methodologies employed to date in our search for reliable, advance indicators of failures on a 288 node, 4608 core, Opteron based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that for a particular error case, statistical analysis using OVIS would enable advanced warning of cluster problems on timescales that would enable application and system administrator response in advance of errors, subsequent system error log reporting, and job failures. This is significant as the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are. Copyright 2009 ACM.
The goal of this research was to combine theoretical and computational approaches to better understand the potential emergent behaviors of large-scale cyber systems, such as networks of {approx} 10{sup 6} computers. The scale and sophistication of modern computer software, hardware, and deployed networked systems have significantly exceeded the computational research community's ability to understand, model, and predict current and future behaviors. This predictive understanding, however, is critical to the development of new approaches for proactively designing new systems or enhancing existing systems with robustness to current and future cyber threats, including distributed malware such as botnets. We have developed preliminary theoretical and modeling capabilities that can ultimately answer questions such as: How would we reboot the Internet if it were taken down? Can we change network protocols to make them more secure without disrupting existing Internet connectivity and traffic flow? We have begun to address these issues by developing new capabilities for understanding and modeling Internet systems at scale. Specifically, we have addressed the need for scalable network simulation by carrying out emulations of a network with {approx} 10{sup 6} virtualized operating system instances on a high-performance computing cluster - a 'virtual Internet'. We have also explored mappings between previously studied emergent behaviors of complex systems and their potential cyber counterparts. Our results provide foundational capabilities for further research toward understanding the effects of complexity in cyber systems, to allow anticipating and thwarting hackers.
This document describes how to obtain, install, use, and enjoy a better life with OVIS version 2.0. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 2.0 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It calculates and reports model values and outliers relative to those models. Additionally, it provides an interactive 3D physical view in which the cluster elements can be colored by raw element values (e.g., temperatures, memory errors) or by the comparison of those values to a given model. The analysis tools and the visual display allow the user to easily determine abnormal or outlier behaviors. The OVIS project envisions the OVIS tool, when applied to compute cluster monitoring, to be used in conjunction with the scheduler or resource manager in order to enable intelligent resource utilization. For example, nodes that are deemed less healthy, that is, nodes that exhibit outlier behavior in some variable, or set of variables, that has shown to be correlated with future failure, can be discovered and assigned to shorter duration or less important jobs. Further, applications with fault-tolerant capabilities can invoke those mechanisms on demand, based upon notification of a node exhibiting impending failure conditions, rather than performing such mechanisms (e.g. checkpointing) at regular intervals unnecessarily.