Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors where our runtime will manage all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead, by offloading communication to dedicated communication workers and also recover from fail-stop errors transparently, thereby enhancing productivity.
Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.
Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Gupta, Nikunj; Mayo, Jackson M.; Lemoine, Adrian S.; Kaiser, Hartmut
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.
We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to “algorithm-based fault tolerance” checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
Analyses have disagreed on whether the velocity uT of bulk advancement of a Huygens front in turbulence vanishes or remains finite in the limit of vanishing local front propagation speed u. Here, a connection to the large-deviation statistics of log-correlated random processes enables a definitive determination of the correct small-u asymptotics. This result reconciles several theoretical and phenomenological perspectives with the conclusion that uT remains finite for vanishing u, which implies a propagation anomaly akin to the energy-dissipation anomaly in the limit of vanishing viscosity. Various leading-order structural properties such as a novel u dependence of a bulk length scale associated with front geometry are predicted in this limit. The analysis involves a formal analogy to random advection of diffusive scalars.
Resilience is an imminent issue for next-generation platforms due to projected increases in soft/transient failures as part of the inherent trade-offs among performance, energy, and costs in system design. In this paper, we introduce a comprehensive approach to enabling application-level resilience in Asynchronous Many-Task (AMT) programming models with a focus on remedying Silent Data Corruption (SDC) that can often go undetected by the hardware and OS. Our approach makes it possible for the application programmer to declaratively express resilience attributes with minimal code changes, and to delegate the complexity of efficiently supporting resilience to our runtime system. We have created a prototype implementation of our approach as an extension to the Habanero C/C++ library (HClib), where different resilience techniques including task replay, task replication, algorithm-based fault tolerance (ABFT), and checkpointing are available. Our experimental results show that task replay incurs lower overhead than task replication when an appropriate error checking function is provided. Further, task replay matches the low overhead of ABFT. Our results also demonstrate the ability to combine different resilience schemes. To evaluate the effectiveness of our resilience mechanisms in the presence of errors, we injected synthetic errors at different error rates (1.0%, and 10.0%) and found modest increase in execution times. In summary, the results show that our approach supports efficient and scalable recovery, and that our approach can be used to influence the design of future AMT programming models and runtime systems that aim to integrate first-class support for user-level resilience.
In this article, we describe a prototype cosimulation framework using Xyce, GHDL and CocoTB that can be used to analyze digital hardware designs in out-of-nominal environments. We demonstrate current software methods and inspire future work via analysis of an open-source encryption core design. Note that this article is meant as a proof-of-concept to motivate integration of general cosimulation techniques with Xyce, an open-source circuit simulator. ------------------------------------------------
This report is an outcome of the ASC CSSE Level 2 Milestone 6362: Analysis of Re- silient Asynchronous Many-Task (AMT) Programming Model. It comprises a summary and in-depth analysis of resilience schemes adapted to the AMT programming model. Herein, performance trade-offs of a resilient-AMT prograrnming model are assessed through two ap- proaches: (1) an analytical model realized by discrete event simulations and (2) empirical evaluation of benchmark programs representing regular and irregular workloads of explicit partial differential equation solvers. As part of this effort, an AMT execution simulator and a prototype resilient-AMT programming framework have been developed. The former permits us to hypothesize the performance behavior of a resilient-AMT model, and has undergone a verification and validation (V&V) process. The latter allows empirical evaluation of the perfor- mance of resilience schemes under emulated program failures and enabled the aforementioned V&V process. The outcome indicates that (1) resilience techniques implemented within an AMT framework allow efficient and scalable recovery under frequent failures, that (2) the abstraction of task and data instances in the AMT programming model enables readily us- able Application Program Interfaces (APIs) for resilience, and that (3) this abstraction enables predicting the performance of resilient-AMT applications with a simple simulation infrastruc- ture. This outcome will provide guidance for the design of the AMT programming model and runtime systems, user-level resilience support, and application development for ASC's next generation platforms (NGPs).
In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.
Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner further scalability is enabled, not only due to the intrinsic lower costs of recovering locally, but also due to derived effects when using some application types. In this paper we model one such effect, namely multiple failure masking, that manifests when running Stencil parallel computations on an environment when failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled to enable several analyses of the probability of different levels of failure masking under certain Stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale which manifestation is more evident and beneficial as the machine size or the failure rate increase.
Verifying that hardware design implementations adhere to specifications is a time intensive and sometimes intractable problem due to the massive size of the system's state space. Formal methods techniques can be used to prove certain tractable specification properties; however, they are expensive, and often require subject matter experts to develop and solve. Nonetheless, hardware verification is a critical process to ensure security and safety properties are met, and encapsulates problems associated with trust and reliability. For complex designs where coverage of the entire state space is unattainable, prioritizing regions most vulnerable to security or reliability threats would allow efficient allocation of valuable verification resources. Stackelberg security games model interactions between a defender, whose goal is to assign resources to protect a set of targets, and an attacker, who aims to inflict maximum damage on the targets after first observing the defender's strategy. In equilibrium, the defender has an optimal security deployment strategy, given the attacker's best response. We apply this Stackelberg security framework to synthesized hardware implementations using the design's network structure and logic to inform defender valuations and verification costs. The defender's strategy in equilibrium is thus interpreted as a prioritization of the allocation of verification resources in the presence of an adversary. We demonstrate this technique on several open-source synthesized hardware designs.
We present a characterization of short-term stability of Kauffman's NK (random) Boolean networks under arbitrary distributions of transfer functions. Given such a Boolean network where each transfer function is drawn from the same distribution, we present a formula that determines whether short-term chaos (damage spreading) will happen. Our main technical tool which enables the formal proof of this formula is the Fourier analysis of Boolean functions, which describes such functions as multilinear polynomials over the inputs. Numerical simulations on mixtures of threshold functions and nested canalyzing functions demonstrate the formula's correctness.
We present algorithmic techniques for parallel PDE solvers that leverage numerical smoothness properties of physics simulation to detect and correct silent data corruption within local computations. We initially model such silent hardware errors (which are of concern for extreme scale) via injected DRAM bit flips. Our mitigation approach generalizes previously developed "robust stencils" and uses modified linear algebra operations that spatially interpolate to replace large outlier values. Prototype implementations for 1D hyperbolic and 3D elliptic solvers, tested on up to 2048 cores, show that this error mitigation enables tolerating orders of magnitude higher bit-flip rates. The runtime overhead of the approach generally decreases with greater solver scale and complexity, becoming no more than a few percent in some cases. A key advantage is that silent data corruption can be handled transparently with data in cache, reducing the cost of false-positive detections compared to rollback approaches.
Digital systems in an out-of-nominal environment (e.g., one causing hardware bit flips) may not be expected to function correctly in all respects but may be required to fail safely. We present an approach for understanding and verifying a system’s out-of-nominal behavior as an abstraction of nominal behavior that preserves designated critical safety requirements. Because abstraction and refinement are already widely used for improved tractability in formal design and proof techniques, this additional way of viewing an abstraction can potentially verify a system’s out-of-nominal safety with little additional work. We illustrate the approach with a simple model of a turnstile controller with possible logic faults (formalized in the temporal logic of actions and NuSMV), noting how design choices can be guided by the desired out-of-nominal abstraction. Principles of robustness in complex systems (specifically, Boolean networks) are found to be compatible with the formal abstraction approach. This work indicates a direction for broader use of formal methods in safety-critical systems.
We present a general technique to solve Partial Differential Equations, called robust stencils, which make them tolerant to soft faults, i.e. bit flips arising in memory or CPU calculations. We show how it can be applied to a two-dimensional Lax-Wendroff solver. The resulting 2D robust stencils are derived using an orthogonal application of their 1D counterparts. Combinations of 3 to 5 base stencils can then be created. We describe how these are then implemented in a parallel advection solver. Various robust stencil combinations are explored, representing tradeoff between performance and robustness. The results indicate that the 3-stencil robust combinations are slightly faster on large parallel workloads than Triple Modular Redundancy (TMR). They also have one third of the memory footprint. We expect the improvement to be significant if suitable optimizations are performed. Because faults are avoided each time new points are computed, the proposed stencils are also comparably robust to faults as TMR for a large range of error rates. The technique can be generalized to 3D (or higher dimensions) with similar benefits.
This work outlines an equation-based formulation of a digital control program and transducer interacting with a continuous physical process, and an approach using the Coq theorem prover for verifying the performance of the combined hybrid system. Considering thermal dynamics with linear dissipation for simplicity, we focus on a generalizable, physically consistent description of the interaction of the real-valued temperature and the digital program acting as a thermostat. Of interest in this work is the discovery and formal proof of bounds on the temperature, the degree of variation, and other performance characteristics. Our approach explicitly addresses the need to mathematically represent the decision problem inherent in an analog-to-digital converter, which for rare values can take an arbitrarily long time to produce a digital answer (the so-called Buridan's Principle); this constraint ineluctably manifests itself in the verification of thermostat performance. Furthermore, the temporal causality constraints in the thermal physics must be made explicit to obtain a consistent model for analysis. We discuss the significance of these findings toward the verification of digital control for more complex physical variables and fields.
Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.
Application resilience is a key challenge that must be ad-dressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically re-duce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the over-heads due to failures. Specifically we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.
Current programming languages and programming models make it easy to create software and hardware systems that fulfill an intended function but also leave such systems open to unintended function and vulnerabilities. Software engineering and code hygiene may make systems incrementally safer, but do not produce the wholesale change necessary for secure systems from the outset. Yet there exists an approach with impressive results: We cite recent examples showing that formal methods, coupled with formally informed digital design, have produced objectively more robust code even beyond the properties directly proven. Though discovery of zero-day vulnerabilities is almost always a surprise and powerful tools like semantic fuzzers can cover a larger search space of vulnerabilities than a developer can conceive of, formal models seem to produce robustness of a higher qualitative order than traditionally developed digital systems. Because the claim is necessarily a qualitative one, we illustrate similar results with an idealized programming language in the form of Boolean networks where we have control of parameters related to stability and adaptability. We argue that verifiability with formal methods is an instance of broader design constraints that promote robustness. We draw analogies to real-world programming models and languages that can be mathematically reasoned about in contrast to ones that are essentially undecidable.
Several tensor eigenpair definitions have been put forth in the past decade, but these can all be unified under generalized tensor eigenpair framework, introduced by Chang, Pearson, and Zhang [J. Math. Anal. Appl., 350 (2009), pp. 416-422]. Given mth-order, n-dimensional realvalued symmetric tensors A and B, the goal is to find λ ε ℝ and x ε ℝn, x ≠= 0 such that Axm-1 = λBxm-1. Different choices for B yield different versions of the tensor eigenvalue problem. We present our generalized eigenproblem adaptive power (GEAP) method for solving the problem, which is an extension of the shifted symmetric higher-order power method (SS-HOPM) for finding Z-eigenpairs. A major drawback of SS-HOPM is that its performance depended on choosing an appropriate shift, but our GEAP method also includes an adaptive method for choosing the shift automatically.
Formal methods describe a class of system analysis techniques that seek to prove specific properties about analyzed designs, or locate flaws compromising those properties. As an analysis capability,these techniques are the subject of increased interest from both internal and external customers of Sandia National Laboratories. Given this lab's other areas of expertise, Sandia is uniquely positioned to advance the state-of-the-art with respect to several research and application areas within formal methods. This research project was a one-year effort funded by Sandia's CyberSecurity S&T Investment Area in its Laboratory Directed Research & Development program to investigate the opportunities for formal methods to impact Sandia's present mission areas, more fully understand the needs of the research community in the area of formal methods and where Sandia can contribute, and clarify from those potential research paths those that would best advance the mission-area interests of Sandia. The accomplishments from this project reinforce the utility of formal methods in Sandia, particularly in areas relevant to Cyber Security, and set the stage for continued Sandia investments to ensure this capabilityis utilized and advanced within this laboratory to serve the national interest.
This report presents the answers that an informal and unfunded group at SNL provided for questions concerning computer security posed by Jim Gosler, Sandia Fellow (00002). The primary purpose of this report is to record our current answers; hopefully those answers will turn out to be answers indeed. The group was formed in November 2010. In November 2010 Jim Gosler, Sandia Fellow, asked several of us several pointed questions about computer security metrics. Never mind that some of the best minds in the field have been trying to crack this nut without success for decades. Jim asked Campbell to lead an informal and unfunded group to answer the questions. With time Jim invited several more Sandians to join in. We met a number of times both with Jim and without him. At Jim's direction we contacted a number of people outside Sandia who Jim thought could help. For example, we interacted with IBM's T.J. Watson Research Center and held a one-day, videoconference workshop with them on the questions.
The goal of this research was to explore first principles associated with mixing of diverse implementations in a redundant fashion to increase the security and/or reliability of information systems. Inspired by basic results in computer science on the undecidable behavior of programs and by previous work on fault tolerance in hardware and software, we have investigated the problem and solution space for addressing potentially unknown and unknowable vulnerabilities via ensembles of implementations. We have obtained theoretical results on the degree of security and reliability benefits from particular diverse system designs, and mapped promising approaches for generating and measuring diversity. We have also empirically studied some vulnerabilities in common implementations of the Linux operating system and demonstrated the potential for diversity to mitigate these vulnerabilities. Our results provide foundational insights for further research on diversity and redundancy approaches for information systems.
This document describes how to obtain, install, use, and enjoy a better life with OVIS version 3.2. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect meaningful conditions and anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 3.2 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It provides an interactive 3-D physical view in which the cluster elements can be colored by raw or derived element values (e.g., temperatures, memory errors). The visual display allows the user to easily determine abnormal or outlier behaviors. Additionally, it provides search capabilities for certain scheduler logs. The OVIS capabilities were designed to be highly interactive - for example, the job search may drive an analysis which in turn may drive the user generation of a derived value which would then be examined on the physical display. The OVIS project envisions the capabilities of its tools applied to compute cluster monitoring. In the future, integration with the scheduler or resource manager will be included in a release to enable intelligent resource utilization. For example, nodes that are deemed less healthy (i.e., nodes that exhibit outlier behavior with respect to some set of variables shown to be correlated with future failure) can be discovered and assigned to shorter duration or less important jobs. Further, HPC applications with fault-tolerant capabilities would respond to changes in resource health and other OVIS notifications as needed, rather than undertaking preventative measures (e.g. checkpointing) at regular intervals unnecessarily.
The goal of this research was to investigate the potential for employing dynamic, decentralized software architectures to achieve reliability in future high-performance computing platforms. These architectures, inspired by peer-to-peer networks such as botnets that already scale to millions of unreliable nodes, hold promise for enabling scientific applications to run usefully on next-generation exascale platforms ({approx} 10{sup 18} operations per second). Traditional parallel programming techniques suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. Our studies suggest that new architectures, in which failures are treated as ubiquitous and their effects are considered as simply another controllable source of error in a scientific computation, can remove such obstacles to exascale computing for certain applications. We have developed a simulation framework, as well as a preliminary implementation in a large-scale emulation environment, for exploration of these 'fault-oblivious computing' approaches. High-performance computing (HPC) faces a fundamental problem of increasing total component failure rates due to increasing system sizes, which threaten to degrade system reliability to an unusable level by the time the exascale range is reached ({approx} 10{sup 18} operations per second, requiring of order millions of processors). As computer scientists seek a way to scale system software for next-generation exascale machines, it is worth considering peer-to-peer (P2P) architectures that are already capable of supporting 10{sup 6}-10{sup 7} unreliable nodes. Exascale platforms will require a different way of looking at systems and software because the machine will likely not be available in its entirety for a meaningful execution time. Realistic estimates of failure rates range from a few times per day to more than once per hour for these platforms. P2P architectures give us a starting point for crafting applications and system software for exascale. In the context of the Internet, P2P applications (e.g., file sharing, botnets) have already solved this problem for 10{sup 6}-10{sup 7} nodes. Usually based on a fractal distributed hash table structure, these systems have proven robust in practice to constant and unpredictable outages, failures, and even subversion. For example, a recent estimate of botnet turnover (i.e., the number of machines leaving and joining) is about 11% per week. Nonetheless, P2P networks remain effective despite these failures: The Conficker botnet has grown to {approx} 5 x 10{sup 6} peers. Unlike today's system software and applications, those for next-generation exascale machines cannot assume a static structure and, to be scalable over millions of nodes, must be decentralized. P2P architectures achieve both, and provide a promising model for 'fault-oblivious computing'. This project aimed to study the dynamics of P2P networks in the context of a design for exascale systems and applications. Having no single point of failure, the most successful P2P architectures are adaptive and self-organizing. While there has been some previous work applying P2P to message passing, little attention has been previously paid to the tightly coupled exascale domain. Typically, the per-node footprint of P2P systems is small, making them ideal for HPC use. The implementation on each peer node cooperates en masse to 'heal' disruptions rather than relying on a controlling 'master' node. Understanding this cooperative behavior from a complex systems viewpoint is essential to predicting useful environments for the inextricably unreliable exascale platforms of the future. We sought to obtain theoretical insight into the stability and large-scale behavior of candidate architectures, and to work toward leveraging Sandia's Emulytics platform to test promising candidates in a realistic (ultimately {ge} 10{sup 7} nodes) setting. Our primary example applications are drawn from linear algebra: a Jacobi relaxation solver for the heat equation, and the closely related technique of value iteration in optimization. We aimed to apply P2P concepts in designing implementations capable of surviving an unreliable machine of 10{sup 6} nodes.
Improved resource utilization and fault tolerance of large-scale HPC systems can be achieved through fine grained, intelligent, and dynamic resource (re)allocation. We explore components and enabling technologies applicable to creating a system to provide this capability: specifically 1) Scalable fine-grained monitoring and analysis to inform resource allocation decisions, 2) Virtualization to enable dynamic reconfiguration, 3) Resource management for the combined physical and virtual resources and 4) Orchestration of the allocation, evaluation, and balancing of resources in a dynamic environment. We discuss both general and HPC-centric issues that impact the design of such a system. Finally, we present our prototype system, giving both design details and examples of its application in real-world scenarios.
Recent work on eigenvalues and eigenvectors for tensors of order m >= 3 has been motivated by applications in blind source separation, magnetic resonance imaging, molecular conformation, and more. In this paper, we consider methods for computing real symmetric-tensor eigenpairs of the form Ax{sup m-1} = lambda x subject to ||x||=1, which is closely related to optimal rank-1 approximation of a symmetric tensor. Our contribution is a shifted symmetric higher-order power method (SS-HOPM), which we show is guaranteed to converge to a tensor eigenpair. SS-HOPM can be viewed as a generalization of the power iteration method for matrices or of the symmetric higher-order power method. Additionally, using fixed point analysis, we can characterize exactly which eigenpairs can and cannot be found by the method. Numerical examples are presented, including examples from an extension of the method to finding complex eigenpairs.
This report summarizes the current statistical analysis capability of OVIS and how it works in conjunction with the OVIS data readers and interpolators. It also documents how to extend these capabilities. OVIS is a tool for parallel statistical analysis of sensor data to improve system reliability. Parallelism is achieved using a distributed data model: many sensors on similar components (metaphorically sheep) insert measurements into a series of databases on computers reserved for analyzing the measurements (metaphorically shepherds). Each shepherd node then processes the sheep data stored locally and the results are aggregated across all shepherds. OVIS uses the Visualization Tool Kit (VTK) statistics algorithm class hierarchy to perform analysis of each process's data but avoids VTK's model aggregation stage which uses the Message Passing Interface (MPI); this is because if a single process in an MPI job fails, the entire job will fail. Instead, OVIS uses asynchronous database replication to aggregate statistical models. OVIS has several additional features beyond those present in VTK that, first, accommodate its particular data format and, second, improve the memory and speed of the statistical analyses. First, because many statistical algorithms are multivariate in nature and sensor data is typically univariate, interpolation of data is required to provide simultaneous observations of metrics. Note that in this report, we will refer to a single value obtained from a sensor as a measurement while a collection of multiple sensor values simultaneously present in the system is an observation. A base class for interpolation is provided that abstracts the operation of converting multiple sensor measurements into simultaneous observations. A concrete implementation is provided that performs piecewise constant temporal interpolation of multiple metrics across a single component. Secondly, because calculations may summarize data too large to fit in memory OVIS analyses batches of observations at a time and aggregates these intermediate intra-process models as it goes before storing the final model for inter-process aggregation via database replication. This reduces the memory footprint of the analysis, interpolation, and the database client and server query processing. This also interleaves processing with the disk I/O required to fetch data from the database - also improving speed. This report documents how OVIS performs analyses and how to create additional analysis components that fetch measurements from the database, perform interpolation, or perform operations on streamed observations (such as model updates or assessments). The rest of this section outlines the OVIS analysis algorithm and is followed by sections specific to each subtask. Note that we are limiting our discussion for now to the creation of a model from a set of measurements, and not including the assessment of observations using a model. The same framework can be used for assessment but that use case is not detailed in this report.
One of the authors previously conjectured that the wrinkling of propagating fronts by weak random advection increases the bulk propagation rate (turbulent burning velocity) in proportion to the 4/3 power of the advection strength. An exact derivation of this scaling is reported. The analysis shows that the coefficient of this scaling is equal to the energy density of a lower-dimensional Burgers fluid with a white-in-time forcing whose spatial structure is expressed in terms of the spatial autocorrelation of the flow that advects the front. The replica method of field theory has been used to derive an upper bound on the coefficient as a function of the spatial autocorrelation. High precision numerics show that the bound is usefully sharp. Implications for strongly advected fronts (e.g., turbulent flames) are noted.
Proceedings of the 2009 Workshop on Resiliency in High Performance, Resilience'09, Co-located with the 2009 International Symposium on High Performance Distributed Computing Conference, HPDC'09
The ability to predict impending failures (hardware or software) on large scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms could drastically increase the scalability of applications and efficiency of platforms. In this paper we present our findings and methodologies employed to date in our search for reliable, advance indicators of failures on a 288 node, 4608 core, Opteron based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that for a particular error case, statistical analysis using OVIS would enable advanced warning of cluster problems on timescales that would enable application and system administrator response in advance of errors, subsequent system error log reporting, and job failures. This is significant as the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are. Copyright 2009 ACM.
The goal of this research was to combine theoretical and computational approaches to better understand the potential emergent behaviors of large-scale cyber systems, such as networks of {approx} 10{sup 6} computers. The scale and sophistication of modern computer software, hardware, and deployed networked systems have significantly exceeded the computational research community's ability to understand, model, and predict current and future behaviors. This predictive understanding, however, is critical to the development of new approaches for proactively designing new systems or enhancing existing systems with robustness to current and future cyber threats, including distributed malware such as botnets. We have developed preliminary theoretical and modeling capabilities that can ultimately answer questions such as: How would we reboot the Internet if it were taken down? Can we change network protocols to make them more secure without disrupting existing Internet connectivity and traffic flow? We have begun to address these issues by developing new capabilities for understanding and modeling Internet systems at scale. Specifically, we have addressed the need for scalable network simulation by carrying out emulations of a network with {approx} 10{sup 6} virtualized operating system instances on a high-performance computing cluster - a 'virtual Internet'. We have also explored mappings between previously studied emergent behaviors of complex systems and their potential cyber counterparts. Our results provide foundational capabilities for further research toward understanding the effects of complexity in cyber systems, to allow anticipating and thwarting hackers.
This document describes how to obtain, install, use, and enjoy a better life with OVIS version 2.0. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 2.0 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It calculates and reports model values and outliers relative to those models. Additionally, it provides an interactive 3D physical view in which the cluster elements can be colored by raw element values (e.g., temperatures, memory errors) or by the comparison of those values to a given model. The analysis tools and the visual display allow the user to easily determine abnormal or outlier behaviors. The OVIS project envisions the OVIS tool, when applied to compute cluster monitoring, to be used in conjunction with the scheduler or resource manager in order to enable intelligent resource utilization. For example, nodes that are deemed less healthy, that is, nodes that exhibit outlier behavior in some variable, or set of variables, that has shown to be correlated with future failure, can be discovered and assigned to shorter duration or less important jobs. Further, applications with fault-tolerant capabilities can invoke those mechanisms on demand, based upon notification of a node exhibiting impending failure conditions, rather than performing such mechanisms (e.g. checkpointing) at regular intervals unnecessarily.
This is meant as a place to put commentary on the whitepaper and is meant to be pretty much ad-hoc. Because the whitepaper describes a potential program in DOE ASCR and because it concerns many researchers in the field, these notes are meant to be extendable by anyone willing to put in the effort. Of course criticisms of the contents of the notes themselves are also welcome.
The goal of this research was to examine foundational methods, both computational and theoretical, that can improve the veracity of entity-based complex system models and increase confidence in their predictions for emergent behavior. The strategy was to seek insight and guidance from simplified yet realistic models, such as cellular automata and Boolean networks, whose properties can be generalized to production entity-based simulations. We have explored the usefulness of renormalization-group methods for finding reduced models of such idealized complex systems. We have prototyped representative models that are both tractable and relevant to Sandia mission applications, and quantified the effect of computational renormalization on the predictive accuracy of these models, finding good predictivity from renormalized versions of cellular automata and Boolean networks. Furthermore, we have theoretically analyzed the robustness properties of certain Boolean networks, relevant for characterizing organic behavior, and obtained precise mathematical constraints on systems that are robust to failures. In combination, our results provide important guidance for more rigorous construction of entity-based models, which currently are often devised in an ad-hoc manner. Our results can also help in designing complex systems with the goal of predictable behavior, e.g., for cybersecurity.
One of the authors previously conjectured that the wrinkling of propagating fronts by weak random advection increases the bulk propagation rate (turbulent burning velocity) in proportion to the 4/3 power of the advection strength. An exact derivation of this scaling is reported. The analysis shows that the coefficient of this scaling is equal to the energy density of a lower-dimensional Burgers fluid with a white-in-time forcing whose spatial structure is expressed in terms of the spatial autocorrelation of the flow that advects the front. The replica method of field theory has been used to derive an upper bound on the coefficient as a function of the spatial auto-correlation. High precision numerics show that the bound is usefully sharp. Implications for strongly advected fronts (e.g., turbulent flames) are noted.