Results of Software Threading Experiments in ASC Codes
The two primary objectives of this LDRD project were to create a lightweight kernel (LWK) operating system (OS) designed to take maximum advantage of multi-core processors, and to leverage the virtualization capabilities in modern multi-core processors to create a more flexible and adaptable LWK environment. The most significant technical accomplishments of this project were the development of the Kitten lightweight kernel, the co-development of the SMARTMAP intra-node memory mapping technique, and the development and demonstration of a scalable virtualization environment for HPC. Each of these topics is presented in this report through the inclusion of a published or submitted research paper. The results of this project are being leveraged by several ongoing and new research projects.
Shared libraries have become ubiquitous and are used to achieve great resource efficiencies on many platforms. The same properties that enable efficiencies on time-shared computers and convenience on small clusters prove to be great obstacles to scalability on large clusters and High Performance Computing platforms. In addition, lightweight operating systems such as Catamount have historically not supported shared libraries precisely because they hinder scalability. In this report we outline the methods we investigated for supporting shared libraries on High Performance Computing platforms that use lightweight kernels. The considerations necessary to evaluate utility in this area are many and sometimes conflicting. While our initial path forward has been determined based on this evaluation, we consider this effort ongoing and remain prepared to re-evaluate any technology that might provide a scalable solution. This report is an evaluation of a range of possible methods of supporting dynamically linked executables on capability-class High Performance Computing platforms. Efforts are ongoing, and extensive testing at scale is necessary to evaluate performance. While performance is a critical driving factor, supporting whatever method is used in a production environment is an equally important and challenging task.
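To make the discussion concrete, the following is a minimal sketch, in C, of the POSIX dynamic-loading interface (dlfcn.h) that any method of supporting dynamically linked executables on a lightweight kernel ultimately has to satisfy. It is illustrative only and is not drawn from the report; the library name libexample.so and the symbol compute are hypothetical placeholders.

    /* Minimal sketch of the POSIX dynamic-loading interface (dlfcn.h).
     * On a full OS, dlopen() is resolved by the system dynamic loader; on a
     * lightweight kernel, the same call must be satisfied by whatever
     * shared-library support method is chosen (for example, staging the
     * library into node-local memory before launch). */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* "libexample.so" and "compute" are hypothetical placeholders. */
        void *handle = dlopen("libexample.so", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        double (*compute)(double) = (double (*)(double))dlsym(handle, "compute");
        if (!compute) {
            fprintf(stderr, "dlsym failed: %s\n", dlerror());
            dlclose(handle);
            return 1;
        }

        printf("compute(2.0) = %f\n", compute(2.0));
        dlclose(handle);
        return 0;
    }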
This report summarizes our investigations into multi-core processors and programming models for parallel scientific applications. The motivation for this study was to better understand the landscape of multi-core hardware, future trends, and the implications on system software for capability supercomputers. The results of this study are being used as input into the design of a new open-source light-weight kernel operating system being targeted at future capability supercomputers made up of multi-core processors. A goal of this effort is to create an agile system that is able to adapt to and efficiently support whatever multi-core hardware and programming models gain acceptance by the community.
Catamount is designed to be a low-overhead operating system for a parallel computing environment. Functionality is limited to the minimum set needed to run a scientific computation. The design choices and implementations are presented. A massively parallel processor (MPP), high performance computing (HPC) system is particularly sensitive to operating system overhead. Traditional, multi-purpose operating systems are designed to support a wide range of usage models and requirements. To support this range of needs, a large number of system processes are provided, and these are often interdependent on one another. The overhead of these processes leads to an unpredictable amount of processor time being available to a parallel application. Except for the most embarrassingly parallel of applications, an MPP application must share interim results with its peers before it can make further progress. These synchronization events occur at specific points in the application code. If one processor takes longer than the others to reach such a point, all of the others must wait, and the overall finish time is increased. Sandia National Laboratories began addressing this problem more than a decade ago with an architecture based on node specialization: sets of nodes in an MPP are designated to perform specific tasks, each running an operating system best suited to its specialized function. Sandia chose not to use a multi-purpose operating system on the computational nodes and instead began developing its first lightweight operating system, SUNMOS, which ran on the compute nodes of the Intel Paragon system. Based on its viability, the architecture evolved into the PUMA operating system. Intel ported PUMA to the ASCI Red TFLOPS system, creating the Cougar operating system. Most recently, Cougar has been ported to Cray's XT3 system and renamed Catamount. As the references indicate, there are a number of descriptions of the predecessor operating systems. While the majority of those discussions still apply to Catamount, this paper takes a fresh look at the architecture as it is currently implemented.
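As a concrete illustration of the synchronization behavior described above, the following minimal sketch (in C with MPI, not taken from Catamount or any of the applications discussed) shows a bulk-synchronous loop in which no rank can leave the collective until every rank has entered it, so operating-system overhead on one node delays the whole application. The function compute_local_step and the iteration count are hypothetical placeholders.

    /* Bulk-synchronous pattern: compute locally, then exchange interim
     * results with peers before the next step.  The slowest rank gates
     * progress for all ranks at each MPI_Allreduce. */
    #include <mpi.h>

    static double compute_local_step(int step, int rank)
    {
        /* Placeholder for the application's local work in this timestep. */
        return (double)(step + rank);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int step = 0; step < 10; ++step) {
            double local = compute_local_step(step, rank);
            double global = 0.0;

            /* Synchronization event: any rank delayed here (e.g., by OS
             * interference) delays every other rank in the job. */
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }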
Red Storm is a massively parallel processor. The Red Storm design goals are: (1) Balanced system performance - CPU, memory, interconnect, and I/O; (2) Usability - functionality of hardware and software meets the needs of users for massively parallel computing; (3) Scalability - system hardware and software scale from a single-cabinet system to an approximately 30,000-processor system; (4) Reliability - the machine stays up long enough between interrupts to make real progress on completing an application run (at least 50 hours MTBI), which requires full system RAS capability; (5) Upgradability - the system can be upgraded with a processor swap and additional cabinets to 100T or greater; (6) Red/black switching - capability to switch major portions of the machine between classified and unclassified computing environments; (7) Space, power, and cooling - a high-density, low-power system; and (8) Price/performance - excellent performance per dollar, using high-volume commodity parts where feasible.
It has been recognized that documentation for new customers of ASCI Red, also known as Janus or the Intel Teraflops at Sandia National Laboratories, has been sadly lacking. This document has been prepared by a team of subject matter experts to fill that void and to provide a starting point for a similar document for ASCI Red Storm in the future. This document is intended for SNL users who need to jumpstart their use of Janus and Janus-s.
A study has been completed into the RAS features necessary for Massively Parallel Processor (MPP) systems. As part of this research, a use case model was built of how RAS features would be employed in an operational MPP system. Use cases are an effective way to specify requirements so that all involved parties can easily understand them. This technique is in contrast to laundry lists of requirements, which are subject to misunderstanding because they lack context. As documented in the use case model, the study included a look at incorporating system software and end-user applications, as well as hardware, into the RAS system.
The Transportation Surety Center, 6300, has been conducting continuing research into and development of information systems for the Configurable Transportation Security and Information Management System (CTSS) project, an Object-Oriented Framework approach that uses Component-Based Software Development to facilitate rapid deployment of new systems while improving software cost containment, development reliability, compatibility, and extensibility. The direction has been to develop a Fleet Management System (FMS) framework using object-oriented technology. The goal of the current development is to provide a software and hardware environment that will demonstrate and support object-oriented development common to the FMS Central Command Center and Vehicle domains.