Publications

Results 51–67 of 67

Fault oblivious high performance computing with dynamic task replication and substitution

Computer Science - Research and Development

Vorobeychik, Yevgeniy; Mayo, Jackson M.; Armstrong, Robert C.; Minnich, Ronald G.; Rudish, Don W.

Traditional parallel programming techniques will suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. To address this challenge, we introduce and simulate a novel software architecture that combines a task dependency graph with a substitution graph. The role of the dependency graph is to limit communication and checkpointing and enhance fault tolerance by allowing graph neighbors to exchange data, while the substitution graph promotes fault oblivious computing by allowing a failed task to be substituted on-the-fly by another task, incurring a quantifiable error. We present optimization formulations for trading off substitution errors and other factors such as available system capacity and low-overlap task partitioning among processors, and demonstrate that these can be approximately solved in real time after some simplifications. Simulation studies of our proposed approach indicate that a substitution network adds considerable resilience and simple enhancements can limit the aggregate substitution errors. © Springer-Verlag 2011.


The theory of diversity and redundancy in information system security : LDRD final report

Mayo, Jackson M.; Armstrong, Robert C.; Allan, Benjamin A.; Walker, Andrea M.

The goal of this research was to explore first principles associated with mixing of diverse implementations in a redundant fashion to increase the security and/or reliability of information systems. Inspired by basic results in computer science on the undecidable behavior of programs and by previous work on fault tolerance in hardware and software, we have investigated the problem and solution space for addressing potentially unknown and unknowable vulnerabilities via ensembles of implementations. We have obtained theoretical results on the degree of security and reliability benefits from particular diverse system designs, and mapped promising approaches for generating and measuring diversity. We have also empirically studied some vulnerabilities in common implementations of the Linux operating system and demonstrated the potential for diversity to mitigate these vulnerabilities. Our results provide foundational insights for further research on diversity and redundancy approaches for information systems.


Peer-to-peer architectures for exascale computing : LDRD final report

Mayo, Jackson M.; Vorobeychik, Yevgeniy V.; Armstrong, Robert C.; Minnich, Ronald G.; Rudish, Don W.

The goal of this research was to investigate the potential for employing dynamic, decentralized software architectures to achieve reliability in future high-performance computing platforms. These architectures, inspired by peer-to-peer networks such as botnets that already scale to millions of unreliable nodes, hold promise for enabling scientific applications to run usefully on next-generation exascale platforms (approximately 10^18 operations per second). Traditional parallel programming techniques suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. Our studies suggest that new architectures, in which failures are treated as ubiquitous and their effects are considered as simply another controllable source of error in a scientific computation, can remove such obstacles to exascale computing for certain applications. We have developed a simulation framework, as well as a preliminary implementation in a large-scale emulation environment, for exploration of these 'fault-oblivious computing' approaches.
High-performance computing (HPC) faces a fundamental problem of increasing total component failure rates due to increasing system sizes, which threaten to degrade system reliability to an unusable level by the time the exascale range is reached (approximately 10^18 operations per second, requiring of order millions of processors). As computer scientists seek a way to scale system software for next-generation exascale machines, it is worth considering peer-to-peer (P2P) architectures that are already capable of supporting 10^6 to 10^7 unreliable nodes. Exascale platforms will require a different way of looking at systems and software because the machine will likely not be available in its entirety for a meaningful execution time. Realistic estimates of failure rates range from a few times per day to more than once per hour for these platforms.
P2P architectures give us a starting point for crafting applications and system software for exascale. In the context of the Internet, P2P applications (e.g., file sharing, botnets) have already solved this problem for 10^6 to 10^7 nodes. Usually based on a fractal distributed hash table structure, these systems have proven robust in practice to constant and unpredictable outages, failures, and even subversion. For example, a recent estimate of botnet turnover (i.e., the number of machines leaving and joining) is about 11% per week. Nonetheless, P2P networks remain effective despite these failures: the Conficker botnet has grown to approximately 5 × 10^6 peers. Unlike today's system software and applications, those for next-generation exascale machines cannot assume a static structure and, to be scalable over millions of nodes, must be decentralized. P2P architectures achieve both, and provide a promising model for 'fault-oblivious computing'. This project aimed to study the dynamics of P2P networks in the context of a design for exascale systems and applications. Having no single point of failure, the most successful P2P architectures are adaptive and self-organizing. While there has been some previous work applying P2P to message passing, little attention has been paid to the tightly coupled exascale domain. Typically, the per-node footprint of P2P systems is small, making them ideal for HPC use. The implementation on each peer node cooperates en masse to 'heal' disruptions rather than relying on a controlling 'master' node. Understanding this cooperative behavior from a complex systems viewpoint is essential to predicting useful environments for the inextricably unreliable exascale platforms of the future.
We sought to obtain theoretical insight into the stability and large-scale behavior of candidate architectures, and to work toward leveraging Sandia's Emulytics platform to test promising candidates in a realistic (ultimately ≥ 10^7 nodes) setting. Our primary example applications are drawn from linear algebra: a Jacobi relaxation solver for the heat equation, and the closely related technique of value iteration in optimization. We aimed to apply P2P concepts in designing implementations capable of surviving an unreliable machine of 10^6 nodes.


Approaches for scalable modeling and emulation of cyber systems : LDRD final report

Mayo, Jackson M.; Minnich, Ronald G.; Rudish, Don W.; Armstrong, Robert C.

The goal of this research was to combine theoretical and computational approaches to better understand the potential emergent behaviors of large-scale cyber systems, such as networks of approximately 10^6 computers. The scale and sophistication of modern computer software, hardware, and deployed networked systems have significantly exceeded the computational research community's ability to understand, model, and predict current and future behaviors. This predictive understanding, however, is critical to the development of new approaches for proactively designing new systems or enhancing existing systems with robustness to current and future cyber threats, including distributed malware such as botnets. We have developed preliminary theoretical and modeling capabilities that can ultimately answer questions such as: How would we reboot the Internet if it were taken down? Can we change network protocols to make them more secure without disrupting existing Internet connectivity and traffic flow? We have begun to address these issues by developing new capabilities for understanding and modeling Internet systems at scale. Specifically, we have addressed the need for scalable network simulation by carrying out emulations of a network with approximately 10^6 virtualized operating system instances on a high-performance computing cluster - a 'virtual Internet'. We have also explored mappings between previously studied emergent behaviors of complex systems and their potential cyber counterparts. Our results provide foundational capabilities for further research toward understanding the effects of complexity in cyber systems, to allow anticipating and thwarting hackers.


Copy of Using Emulation and Simulation to Understand the Large-Scale Behavior of the Internet

Adalsteinsson, Helgi A.; Armstrong, Robert C.; Chiang, Ken C.; Gentile, Ann C.; Lloyd, Levi L.; Minnich, Ronald G.; Vanderveen, Keith V.; Vanrandwyk, Jamie V.; Rudish, Don W.

We report on the work done in the late-start LDRD 'Using Emulation and Simulation to Understand the Large-Scale Behavior of the Internet'. We describe the creation of a research platform that emulates many thousands of machines to be used for the study of large-scale Internet behavior. We describe a proof-of-concept simple attack we performed in this environment. We describe the successful capture of a Storm bot and, from the study of the bot and further literature search, establish large-scale aspects we seek to understand via emulation of Storm on our research platform in possible follow-on work. Finally, we discuss possible future work.


Mathematical approaches for complexity/predictivity trade-offs in complex system models : LDRD final report

Mayo, Jackson M.; Armstrong, Robert C.; Vanderveen, Keith V.

The goal of this research was to examine foundational methods, both computational and theoretical, that can improve the veracity of entity-based complex system models and increase confidence in their predictions for emergent behavior. The strategy was to seek insight and guidance from simplified yet realistic models, such as cellular automata and Boolean networks, whose properties can be generalized to production entity-based simulations. We have explored the usefulness of renormalization-group methods for finding reduced models of such idealized complex systems. We have prototyped representative models that are both tractable and relevant to Sandia mission applications, and quantified the effect of computational renormalization on the predictive accuracy of these models, finding good predictivity from renormalized versions of cellular automata and Boolean networks. Furthermore, we have theoretically analyzed the robustness properties of certain Boolean networks, relevant for characterizing organic behavior, and obtained precise mathematical constraints on systems that are robust to failures. In combination, our results provide important guidance for more rigorous construction of entity-based models, which currently are often devised in an ad-hoc manner. Our results can also help in designing complex systems with the goal of predictable behavior, e.g., for cybersecurity.


On the role of self-similarity in component-based software

Armstrong, Robert C.

This is a speculative work meant to stimulate discussion about the role of subsumability in self-similar software structures for computational simulations. As in natural phenomena, self-similar features in framework structures allow the size and complexity of code to grow without bound and still maintain apparent coherence. As in crystal growth, the coherence may be maintained by the application of a repeated pattern, or patterns may, as in fluid mechanical turbulence, be scaled by size and nested. Examples of these kinds of patterns applied to component systems in particular will be given. Conclusions and questions for discussion will be drawn regarding the applicability of these ideas to component architectures, complexity, and scientific computing.


Ccaffeine framework : composing and debugging applications interactively and running them statically

Allan, Benjamin A.; Armstrong, Robert C.

Ccaffeine is a Common Component Architecture (CCA) framework devoted to high-performance computing. In this note we give an overview of the system features of Ccaffeine and CCA that support component-based HPC application development. Object-oriented, single-threaded and lightweight, Ccaffeine is designed to get completely out of the way of the running application after it has been composed from components. Ccaffeine is one of the few frameworks, CCA or otherwise, that can compose and run applications on a parallel machine interactively and then automatically generate a static, possibly self-tuning, executable for production runs. Users can experiment with and debug applications interactively, improving their productivity. When the application is ready, a script is automatically generated, parsed and turned into a static executable for production runs. Within this static executable, dynamic replacement of components can be performed by self-tuning applications.
