Publications

114 Results

A-SST Initial Specification

Rodrigues, Arun; Hammond, Simon D.; Hemmert, Karl S.; Hughes, Clayton H.; Kenny, Joseph P.; Voskuilen, Gwendolyn R.

The U.S. Army Research Office (ARO), in partnership with IARPA, is investigating innovative, efficient, and scalable computer architectures that are capable of executing next-generation large-scale data-analytic applications. These applications are increasingly sparse, unstructured, non-local, and heterogeneous. Under the Advanced Graphic Intelligence Logical computing Environment (AGILE) program, Performer teams will be asked to design computer architectures to meet the future needs of the DoD and the Intelligence Community (IC). This design effort will require flexible, scalable, and detailed simulation to assess the performance, efficiency, and validity of their designs. To support AGILE, Sandia National Labs will be providing the AGILE-enhanced Structural Simulation Toolkit (A-SST). This toolkit is a computer architecture simulation framework designed to support fast, parallel, and multi-scale simulation of novel architectures. This document describes the A-SST framework, some of its library of simulation models, and how it may be used by AGILE Performers.
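To give a concrete sense of what a simulation model in this framework looks like, the sketch below shows the general shape of a minimal clocked SST component in C++. It is an approximation based on the public SST-core element API (registration macro, clock handler, primary-component hooks) and is not taken from the A-SST specification; names such as ToyCore and toyLib are placeholders.

```cpp
#include <sst/core/component.h>
#include <sst/core/params.h>

// Minimal clocked SST component (illustrative sketch; not an A-SST model).
class ToyCore : public SST::Component {
public:
    SST_ELI_REGISTER_COMPONENT(
        ToyCore,                          // C++ class
        "toyLib",                         // element library (placeholder)
        "ToyCore",                        // component name (placeholder)
        SST_ELI_ELEMENT_VERSION(1, 0, 0),
        "Minimal clocked component",
        COMPONENT_CATEGORY_PROCESSOR)

    ToyCore(SST::ComponentId_t id, SST::Params& params) : SST::Component(id) {
        cyclesToRun_ = params.find<uint64_t>("cycles", 1000);
        registerClock("1GHz",
                      new SST::Clock::Handler<ToyCore>(this, &ToyCore::tick));
        registerAsPrimaryComponent();
        primaryComponentDoNotEndSim();
    }

    bool tick(SST::Cycle_t cycle) {
        if (cycle >= cyclesToRun_) {
            primaryComponentOKToEndSim();
            return true;                  // true unregisters the clock handler
        }
        return false;
    }

private:
    uint64_t cyclesToRun_ = 0;
};
```

In a typical SST workflow this class is compiled into a loadable element library and instantiated, along with its links and parameters, from a Python configuration script.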

ERAS: Enabling the Integration of Real-World Intellectual Properties (IPs) in Architectural Simulators

Nema, Shubham N.; Razdan, Rohin R.; Rodrigues, Arun; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Adak, Debratim A.; Hammond, Simon D.; Awad, Amro A.; Hughes, Clayton H.

Sandia National Laboratories is investigating scalable architectural simulation capabilities with a focus on simulating and evaluating highly scalable supercomputers for high performance computing applications. There is a growing demand for RTL model integration to provide the capability to simulate customized node architectures and heterogeneous systems. This report describes the first steps integrating the ESSENTial Signal Simulation Enabled by Netlist Transforms (ESSENT) tool with the Structural Simulation Toolkit (SST). ESSENT can emit C++ models from hardware designs written in FIRRTL, which can be used to automatically generate components. The integration workflow will automatically generate the SST component and the necessary interfaces to 'plug' the ESSENT model into the SST framework.

Multiscale System Modeling of Single-Event-Induced Faults in Advanced Node Processors

IEEE Transactions on Nuclear Science

Cannon, Matthew J.; Rodrigues, Arun; Black, Dolores A.; Black, Jeff; Bustamante, Luis G.; Breeding, Matthew; Feinberg, Benjamin F.; Skoufis, Michael; Quinn, Heather; Clark, Lawrence T.; Brunhaver, John S.; Barnaby, Hugh; McLain, Michael L.; Agarwal, Sapan A.; Marinella, Matthew J.

Integration-technology feature shrink increases computing-system susceptibility to single-event effects (SEE). While modeling SEE faults will be critical, an integrated processor's scope makes physically correct modeling computationally intractable. Without useful models, presilicon evaluation of fault-tolerance approaches becomes impossible. To incorporate accurate transistor-level effects at a system scope, we present a multiscale simulation framework. Charge collection at the 1) device level determines 2) circuit-level transient duration and state-upset likelihood. Circuit effects, in turn, impact 3) register-transfer-level architecture-state corruption visible at 4) the system level. Thus, the physically accurate effects of SEEs in large-scale systems, executed on a high-performance computing (HPC) simulator, could be used to drive cross-layer radiation hardening by design. We demonstrate the capabilities of this model with two case studies. First, we determine a D flip-flop's sensitivity at the transistor level in 14-nm FinFET technology, validating the model against published cross sections. Second, we track and estimate faults in a microprocessor without interlocked pipelined stages (MIPS) processor for Adams' 90% worst-case environment in an isotropic space environment.
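As a loose illustration of how a circuit-level upset likelihood propagates up to architecture-visible state, the toy injector below flips register bits with a per-cycle upset probability of the kind a circuit-level analysis would supply. It is a schematic of the idea only, not the framework from the paper, and the probability used is an arbitrary placeholder.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <random>

// Toy single-event-upset injector: every bit of a 32-entry register file
// has some probability of flipping each cycle.  In a real multiscale flow
// that probability would come from device- and circuit-level simulation.
struct ToySeuInjector {
    double bitUpsetProbPerCycle;            // placeholder value, not from the paper
    std::mt19937_64 rng{0xC0FFEE};

    int step(std::array<uint32_t, 32>& regs) {
        std::bernoulli_distribution upset(bitUpsetProbPerCycle);
        int flips = 0;
        for (auto& r : regs)
            for (int b = 0; b < 32; ++b)
                if (upset(rng)) { r ^= (1u << b); ++flips; }
        return flips;
    }
};

int main() {
    ToySeuInjector inj{1e-6};               // hypothetical per-bit, per-cycle rate
    std::array<uint32_t, 32> regs{};
    int total = 0;
    for (int cycle = 0; cycle < 100000; ++cycle) total += inj.step(regs);
    std::printf("injected %d upsets over 1e5 cycles\n", total);
    return 0;
}
```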

Challenges & Roadmap for Beyond CMOS Computing Simulation

Rodrigues, Arun; Frank, Michael P.

Simulating HPC systems is a difficult task and the emergence of “Beyond CMOS” architectures and execution models will increase that difficulty. This document presents a “tutorial” on some of the simulation challenges faced by conventional and non-conventional architectures (Section 1) and goals and requirements for simulating Beyond CMOS systems (Section 2). These provide background for proposed short- and long-term roadmaps for simulation efforts at Sandia (Sections 3 and 4). Additionally, a brief explanation of a proof-of-concept integration of a Beyond CMOS architectural simulator is presented (Section 2.3).

Performance analysis for using non-volatile memory DIMMs: Opportunities and challenges

ACM International Conference Proceeding Series

Awad, Amro A.; Hammond, Simon D.; Hughes, Clayton H.; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

DRAM scalability is becoming more challenging, pushing the focus of the research community towards alternative memory technologies. Many emerging non-volatile memory (NVM) devices are proving themselves to be good candidates to replace DRAM in the coming years. For example, the recently announced 3D-XPoint memory by Intel/Micron promises latencies that are comparable to DRAM, while being non-volatile and much more dense. While emerging NVMs can be fabricated in different form factors, the most promising (from a performance perspective) are NVM-based DIMMs. Unfortunately, there is a shortage of studies that explore the design options for NVM-based DIMMs. Because of the read and write asymmetries in both power consumption and latency, as well as limited write endurance, which often requires wear-leveling techniques, NVMs require a specialized controller. The fact that future on-die memory controllers are expected to handle different memory technologies pushes future hardware towards on-DIMM controllers. In this paper, we propose an architectural model for NVM-based DIMMs with internal controllers, explore their design space, evaluate different optimizations, and arrive at several architectural suggestions. Finally, we make our model publicly available and integrate it with a widely used architectural simulator.
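Limited write endurance is one reason an on-DIMM NVM controller needs specialized logic; the sketch below shows a generic start-gap-style wear-leveling remap of the sort such a controller might implement. It illustrates the technique class with arbitrary parameters and is not the model proposed in the paper.

```cpp
#include <cstdint>

// Generic "start-gap" style wear-leveling address remap.  The DIMM holds
// lines_ + 1 physical lines for lines_ logical lines; the spare "gap" line
// rotates through the array so writes are spread evenly over time.
class StartGapRemap {
public:
    explicit StartGapRemap(uint64_t lines, uint64_t gapMovePeriod = 100)
        : lines_(lines), period_(gapMovePeriod), gap_(lines) {}

    // Logical line address -> physical line address.
    uint64_t translate(uint64_t logical) const {
        uint64_t pa = (logical + start_) % lines_;   // 0 .. lines_-1
        return (pa >= gap_) ? pa + 1 : pa;           // skip over the gap line
    }

    // Call on every write; every period_ writes the gap moves down one line
    // (a real controller would also copy the displaced line's data).
    void onWrite() {
        if (++writes_ % period_ != 0) return;
        if (gap_ == 0) { gap_ = lines_; start_ = (start_ + 1) % lines_; }
        else           { --gap_; }
    }

private:
    uint64_t lines_, period_;
    uint64_t start_ = 0, gap_, writes_ = 0;
};
```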

Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation

Journal of Parallel and Distributed Computing

Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.; Hemmert, Karl S.; McCauley, Samuel; Moore, Branden J.; Moseley, Benjamin; Phillips, Cynthia A.; Resnick, David R.; Rodrigues, Arun

A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC), Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” and if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies of near-memory and DRAM to be similar. Thus, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient. In this paper, we explore the design space for a user-controlled multi-level main memory. Our work identifies situations in which rewriting application kernels can provide significant performance gains when using near-memory. We present algorithms designed for two-level main memory, using divide-and-conquer to partition computations and streaming to exploit data locality. We consider algorithms for the fundamental application of sorting and for the data analysis kernel k-means. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories’ SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance improvements for our sorting algorithm. In contrast, the k-means algorithm is generally CPU bound and does not improve when using near-memory except under extreme conditions. These conditions require large instances that rule out SST simulation, but we demonstrate improvements by running on a customized machine with high and low bandwidth memory. These case studies in co-design serve as positive and cautionary templates, respectively, for the major task of optimizing the computational kernels of many fundamental applications for two-level main memory systems.
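A stripped-down version of the divide-and-conquer-plus-streaming pattern described here can be written in a few lines; the sketch below sorts scratchpad-sized chunks and then merges them with sequential passes. It is only a schematic of the approach under an assumed scratchpad capacity, not the multi-threaded algorithms from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Schematic two-level-memory sort: sort near-memory-sized chunks in place,
// then merge them with streaming passes over far memory.  NEAR_CAPACITY is
// an arbitrary stand-in for the user-addressable scratchpad size.
constexpr std::size_t NEAR_CAPACITY = 1 << 20;   // elements, hypothetical

void twoLevelSort(std::vector<double>& data) {
    // Phase 1: divide into chunks that fit in near memory and sort each.
    for (std::size_t lo = 0; lo < data.size(); lo += NEAR_CAPACITY) {
        std::size_t hi = std::min(lo + NEAR_CAPACITY, data.size());
        std::sort(data.begin() + lo, data.begin() + hi);
    }
    // Phase 2: merge sorted runs, doubling run length each pass.  The merges
    // read and write far memory sequentially -- the streaming access pattern
    // that higher near-memory bandwidth is meant to feed.
    for (std::size_t run = NEAR_CAPACITY; run < data.size(); run *= 2) {
        for (std::size_t lo = 0; lo + run < data.size(); lo += 2 * run) {
            std::size_t mid = lo + run;
            std::size_t hi  = std::min(lo + 2 * run, data.size());
            std::inplace_merge(data.begin() + lo, data.begin() + mid,
                               data.begin() + hi);
        }
    }
}
```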

Messier: A Detailed NVM-Based DIMM Model for the SST Simulation Framework

Awad, Amro A.; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hammond, Simon D.; Hoekstra, Robert J.; Hughes, Clayton H.

DRAM technology is the main building block of main memory; however, DRAM scaling is becoming very challenging. The main issues for DRAM scaling are the increasing error rates with each new generation, the geometric and physical constraints of scaling the capacitor part of the DRAM cells, and the high power consumption caused by the continuous need for refreshing cell values. At the same time, emerging Non-Volatile Memory (NVM) technologies, such as Phase-Change Memory (PCM), are promising replacements for DRAM. NVMs, when compared to current technologies, e.g., NAND-based flash, have latencies comparable to DRAM. Additionally, NVMs are non-volatile, which eliminates the need for refresh power and enables persistent memory applications. Finally, NVMs have promising densities and the potential for multi-level cell (MLC) storage.

Multi-level memory policies: What you add is more important than what you take out

ACM International Conference Proceeding Series

Hammond, Simon D.; Rodrigues, Arun; Voskuilen, Gwendolyn R.

Multi-Level Memory (MLM) will be an increasingly common organization for main memory. Hybrid main memories that combine conventional DDR and "fast" memory will allow higher peak bandwidth at an attainable cost. However, the chief hurdle for MLM systems is the management of data placement. While user-directed placement may work for some applications, it imposes a heavy burden on the programmer. To avoid this burden while still benefiting from MLM, we propose a number of automated management policies. Our results show that several possible policies offer performance and implementation trade-offs. Also, unlike conventional cache or paged memory policies, the addition policy is much more important than the replacement policy.
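As an example of what an addition policy (as opposed to a replacement policy) looks like, the sketch below promotes a page into fast memory only after a threshold number of touches. It is a generic illustration of the policy class, not one of the specific policies evaluated in the paper.

```cpp
#include <cstdint>
#include <unordered_map>

// Toy threshold-based "addition policy" for a hybrid DDR + fast-memory
// system: a page is promoted into fast memory only after it has been
// touched threshold_ times, rather than on first touch.
class ThresholdAdditionPolicy {
public:
    explicit ThresholdAdditionPolicy(unsigned threshold = 4) : threshold_(threshold) {}

    // Called on every access to a page currently resident in DDR.
    // Returns true when the page should be migrated into fast memory.
    bool onSlowAccess(uint64_t pageAddr) {
        unsigned& touches = touchCount_[pageAddr];
        if (++touches < threshold_) return false;
        touchCount_.erase(pageAddr);   // reset bookkeeping once promoted
        return true;
    }

private:
    unsigned threshold_;
    std::unordered_map<uint64_t, unsigned> touchCount_;
};
```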

Analyzing allocation behavior for multi-level memory

ACM International Conference Proceeding Series

Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hammond, Simon D.

Managing multi-level memories will require different policies from those used for cache hierarchies, as memory technologies differ in latency, bandwidth, and volatility. To this end we analyze application data allocations and main memory accesses to determine whether an application-driven approach to managing a multi-level memory system comprising stacked and conventional DRAM is viable. Our early analysis shows that the approach is viable, but some applications may require dynamic allocations (i.e., migration) while others are amenable to static allocation.

Abstract Machine Models and Proxy Architectures for Exascale Computing

Ang, James A.; Barrett, Richard F.; Benner, R.E.; Burke, Daniel B.; Chan, Cy P.; Cook, Jeanine C.; Daley, Christopher D.; Donofrio, Dave D.; Hammond, Simon D.; Hemmert, Karl S.; Hoekstra, Robert J.; Ibrahim, Khaled I.; Kelly, Suzanne M.; Le, Hoang L.; Leung, Vitus J.; Michelogiannakis, George M.; Resnick, David R.; Rodrigues, Arun; Shalf, John S.; Stark, Dylan S.; Unat, D.U.; Wright, Nick W.; Voskuilen, Gwendolyn R.

To achieve exascale computing, fundamental hardware architectures must change. The most significant consequence of this assertion is the impact on the scientific and engineering applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems.
In order to adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency into the future. While many details of the exascale architectures are undefined, an abstract machine model is designed to allow application developers to focus on the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. We use the term proxy architecture to describe a parameterized version of an abstract machine model, with the parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models are formulated to enable discussion between the developers of analytic models and simulators and computer hardware architects. They allow for application performance analysis and hardware optimization opportunities. In this report our goal is to provide the application development community with a set of models that can help software developers prepare for exascale. In addition, through the use of proxy architectures, we can enable a more concrete exploration of how well new and evolving application codes map onto future architectures. This second version of the document addresses system scale considerations and provides a system-level abstract machine model with proxy architecture information.
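In code form, a proxy architecture is essentially an abstract machine model with numbers attached; a placeholder parameter set might look like the struct below. The field names and the sample values are hypothetical illustrations, not figures from the report.

```cpp
#include <cstdint>

// Placeholder proxy-architecture parameter set: an abstract machine model
// plus speeds and capacities for its key components.  All names and values
// here are invented for illustration.
struct ProxyNodeArchitecture {
    unsigned cores;                 // throughput cores per node
    double   nearMemBandwidthGBs;   // on-package memory bandwidth
    double   farMemBandwidthGBs;    // DDR memory bandwidth
    uint64_t nearMemCapacityGiB;
    uint64_t farMemCapacityGiB;
    double   nicInjectionGBs;       // network injection bandwidth per node
};

// Example instantiation used purely for discussion and analysis:
constexpr ProxyNodeArchitecture kExampleNode{64, 1000.0, 100.0, 16, 256, 25.0};
```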

Optical networks for high-performance computing: Promises and perils

5th IEEE Photonics Society Optical Interconnects Conference, OI 2016

Rodrigues, Arun

Optical networks hold great promise for improving the performance of supercomputers, yet they have always proven just out of reach. This talk will examine the potential of optical interconnects, barriers to adoption, and possible solutions from hardware/software co-design.

The potential and perils of multi-level memory

ACM International Conference Proceeding Series

Jayaraj, Jagan J.; Rodrigues, Arun; Hammond, Simon D.; Voskuilen, Gwendolyn R.

The future of memory systems is Multi-Level Memory (MLM). In an MLM system the main memory is comprised of two or more types of memory instead of a conventional DDR-DRAM-only main memory. By combining different memory technologies, an MLM system can potentially offer more usable bandwidth and more capacity for a similar cost as a conventional memory system. However, substantial software and hardware design challenges must be overcome to make this potential real. It is our position that the diversity of application access patterns precludes any simple "one size fits all" approach and that better tools and design processes will be needed to fulfill the potential of MLM. Efficient implementations of MLM will require a high degree of co-design and coordination between hardware and software. The simulation framework we have built for this study can aid tool building to solve the programming challenges.

Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation

Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015

Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.; Hemmert, Karl S.; McCauley, Samuel; Moore, Branden J.; Moseley, Benjamin; Phillips, Cynthia A.; Resnick, David R.; Rodrigues, Arun

A fundamental challenge for supercomputer architecture is that processors cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. As the number of cores per chip increases, and traditional DDR DRAM speeds stagnate, the problem is only getting worse. A variety of non-DDR 3D memory technologies (Wide I/O 2, HBM) offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. However, such a packaging scheme cannot contain sufficient memory capacity for a node. It seems likely that future systems will require at least two levels of main memory: high-bandwidth, low-power memory near the processor and low-bandwidth high-capacity memory further away. This near memory will probably not have significantly faster latency than the far memory. This, combined with the large size of the near memory (multiple GB) and power constraints, may make it difficult to treat it as a standard cache. In this paper, we explore some of the design space for a user-controlled multi-level main memory. We present algorithms designed for the heterogeneous bandwidth, using streaming to exploit data locality. We consider algorithms for the fundamental application of sorting. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories' SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance. This co-design effort suggests implementing two-level main memory systems may improve memory performance in fundamental applications.

Trends in Microfabrication Capabilities & Device Architectures

Bauer, Todd B.; Jones, Adam J.; Lentine, Anthony L.; Mudrick, John M.; Okandan, Murat O.; Rodrigues, Arun

Design methodology for optimizing optical interconnection networks in high performance systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Rumley, Sébastien; Glick, Madeleine; Hammond, Simon D.; Rodrigues, Arun; Bergman, Keren

Modern high performance computers connect hundreds of thousands of endpoints and employ thousands of switches. This allows for a great deal of freedom in the design of the network topology. At the same time, due to the sheer numbers and complexity involved, it becomes more challenging to easily distinguish between promising and improper designs. With ever increasing line rates and advances in optical interconnects, there is a need for renewed design methodologies that comprehensively capture the requirements and expose tradeoffs expeditiously in this complex design space. We introduce a systematic approach, based on Generalized Moore Graphs, allowing one to quickly gauge the ideal level of connectivity required for a given number of end-points and traffic hypothesis, and to collect insight on the role of the switch radix in the topology cost. Based on this approach, we present a methodology for the identification of Pareto-optimal topologies. We apply our method to a practical case with 25,000 nodes and present the results.
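The "ideal level of connectivity" intuition comes from the classical Moore bound, which Generalized Moore Graphs approach; the short program below computes that bound for a given switch degree and network diameter. This is a simplified fragment of the kind of sizing calculation the methodology builds on, not the paper's full Pareto-optimal analysis.

```cpp
#include <cstdint>
#include <cstdio>

// Moore bound: the maximum number of switches reachable in a network of
// degree-d switches within diameter k.  It gives a quick upper bound on
// network size (and hence a lower bound on required connectivity) for a
// given switch radix.
uint64_t mooreBound(uint64_t d, uint64_t k) {
    uint64_t nodes = 1, frontier = d;
    for (uint64_t hop = 0; hop < k; ++hop) {
        nodes += frontier;
        frontier *= (d - 1);
    }
    return nodes;
}

int main() {
    // Example: switches dedicating d = 32 ports to the fabric, diameter 2.
    std::printf("Moore bound (d=32, k=2): %llu switches\n",
                (unsigned long long)mooreBound(32, 2));
    return 0;
}
```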

Abstract machine models and proxy architectures for exascale computing

Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Ang, James A.; Barrett, R.F.; Benner, R.E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, Simon D.; Hemmert, Karl S.; Kelly, Suzanne M.; Le, H.; Leung, Vitus J.; Resnick, D.R.; Rodrigues, Arun; Shalf, J.; Stark, Dylan S.; Unat, D.; Wright, N.J.

To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to the application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion among the developers of analytic models and simulators and computer hardware architects and they allow for application performance analysis, system software development, and hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.

Using a complementary emulation-simulation co-design approach to assess application readiness for Processing-in-Memory systems

Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Stelle, George; Olivier, Stephen L.; Stark, Dylan S.; Rodrigues, Arun; Hemmert, Karl S.

Disruptive changes to computer architecture are paving the way toward extreme scale computing. The co-design strategy of collaborative research and development among computer architects, system software designers, and application teams can help to ensure that applications not only cope but thrive with these changes. In this paper, we present a novel combined co-design approach of emulation and simulation in the context of investigating future Processing in Memory (PIM) architectures. PIM enables co-location of data and computation to decrease data movement, to provide increases in memory speed and capacity compared to existing technologies and, perhaps most importantly for extreme scale, to improve energy efficiency. Our evaluation of PIM focuses on three mini-applications representing important production applications. The emulation and simulation studies examine the effects of locality-aware versus locality-oblivious data distribution and computation, and they compare PIM to conventional architectures. Both studies contribute in their own way to the overall understanding of the application-architecture interactions, and our results suggest that PIM technology shows great potential for efficient computation without negatively impacting productivity.

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467, Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

SNL software manual for the ACS Data Analytics Project

Stearley, Jon S.; Robinson, David G.; Hooper, Russell H.; Stickland, Michael S.; McLendon, William C.; Rodrigues, Arun

In the ACS Data Analytics Project (also known as 'YumYum'), a supercomputer is modeled as a graph of components and dependencies, jobs and faults are simulated, and component fault rates are estimated using the graph structure and job pass/fail outcomes. This report documents the successful completion of all SNL deliverables and tasks, describes the software written by SNL for the project, and presents the data it generates. Readers should understand what the software tools are, how they fit together, and how to use them to reproduce the presented data and additional experiments as desired. The SNL YumYum tools provide the novel simulation and inference capabilities desired by ACS. SNL also developed and implemented a new algorithm, which provides faster estimates, at finer component granularity, on arbitrary directed acyclic graphs.

Using reconfigurable functional units in conventional microprocessors

Rodrigues, Arun

Scientific applications use highly specialized data structures that require complex, latency-sensitive graphs of integer instructions for memory address calculations. Working with the University of Wisconsin, we have demonstrated significant differences between Sandia's applications and the industry-standard SPEC-FP (Standard Performance Evaluation Corporation floating-point) suite. Specifically, integer dataflow performance is critical to overall system performance. To improve this performance, we have developed a configurable functional unit design that is capable of accelerating integer dataflow.

On the path to exascale

International Journal of Distributed Systems and Technologies

Alvin, Kenneth F.; Barrett, Brian B.; Brightwell, Ronald B.; Dosanjh, Sudip S.; Geist, Al; Hemmert, Karl S.; Heroux, Michael; Kothe, Doug; Murphy, Richard C.; Nichols, Jeff; Oldfield, Ron A.; Rodrigues, Arun; Vetter, Jeffrey S.

There is considerable interest in achieving a 1000 fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing (a million, trillion operations per second). Key architectural challenges include power, memory, interconnection networks and resilience. The paper summarizes ongoing research aimed at overcoming these hurdles. Topics of interest are architecture aware and scalable algorithms, system simulation, 3D integration, new approaches to system-directed resilience and new benchmarks. Although significant progress is being made, a broader international program is needed.

An architecture to perform NIC based MPI matching

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Hemmert, Karl S.; Underwood, Keith; Rodrigues, Arun

Modern supercomputers aggregate thousands of microprocessors through a high performance network. Many of these systems place a processor on the network interface controller (NIC) to handle some portion of the MPI processing. This processing involves traversing a linked list and invoking a matching function for each item. Although this task is critical to the performance of the system, microprocessors perform it extremely poorly. Furthermore, the traditional network processor approaches of multicore and multithreading map poorly to the problem because the list is a shared data structure. While match processing can be implemented directly in hardware, hardware implementations can be extremely inflexible and lead to extremely high risk. This paper presents a novel, programmable architecture for a processor to handle the matching function. The matching engine approaches the performance of a direct hardware implementation while maintaining a high degree of flexibility and programmability. More importantly, it requires a dramatically smaller area than a conventional processor. © 2007 IEEE.
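For reference, the matching function the NIC engine must apply at each list entry amounts to the comparison sketched below: walk the posted-receive list in order and take the first entry whose communicator, source, and tag match, honoring wildcards. The data structure and field names are illustrative, not an actual MPI implementation's internals.

```cpp
#include <cstdint>
#include <list>
#include <optional>

// Sketch of MPI receive matching on message arrival.
struct PostedRecv {
    int source;        // or ANY_SOURCE wildcard
    int tag;           // or ANY_TAG wildcard
    int comm;
    void* buffer;
};

constexpr int ANY_SOURCE = -1;
constexpr int ANY_TAG    = -1;

std::optional<PostedRecv> matchArrival(std::list<PostedRecv>& posted,
                                       int src, int tag, int comm) {
    for (auto it = posted.begin(); it != posted.end(); ++it) {
        bool ok = (it->comm == comm) &&
                  (it->source == ANY_SOURCE || it->source == src) &&
                  (it->tag == ANY_TAG || it->tag == tag);
        if (ok) {
            PostedRecv hit = *it;
            posted.erase(it);      // MPI semantics: consume the matched entry
            return hit;
        }
    }
    return std::nullopt;           // no match: message joins the unexpected queue
}
```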

Implications of application usage characteristics for collective communication offload

International Journal of High Performance Computing and Networking

Brightwell, Ronald B.; Goudy, Sue P.; Rodrigues, Arun; Underwood, Keith D.

The global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. One approach improves collective performance using a programmable network interface to directly implement collectives. While these implementations improve micro-benchmark performance, accelerating applications will require deeper understanding of application behaviour. We describe several characteristics of applications that impact collective communication performance. We analyse network resource usage data to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI. © 2006 Inderscience Enterprises Ltd.
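Non-blocking collectives of the kind analyzed here were later standardized in MPI-3 (e.g., MPI_Iallreduce); a minimal overlap pattern looks like the sketch below, where do_independent_work stands in for a hypothetical computation that does not depend on the reduction result.

```cpp
#include <mpi.h>

void do_independent_work();   // hypothetical application kernel

// Overlap a reduction with independent computation using a non-blocking
// collective (MPI_Iallreduce, MPI-3).  The result in 'global' is only
// valid after MPI_Wait completes.
void overlappedAllreduce(const double* local, double* global, int n, MPI_Comm comm) {
    MPI_Request req;
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);
    do_independent_work();
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```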

Enhancing NIC performance for MPI using processing-in-memory

Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005

Rodrigues, Arun; Murphy, Richard; Brightwell, Ronald B.; Underwood, Keith D.

Processing-in-Memory (PIM) technology encompasses a range of research leveraging a tight coupling of memory and processing. The most unique features of the technology are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multi-threading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of posted receive and unexpected message queues. Without adding cost or increasing clock frequency, using PIMs in the network interface can enhance performance. The results are a significant decrease in latency and increase in small message bandwidth, particularly when long queues are present.

Programming future architectures : dusty decks, memory walls, and the speed of light

Rodrigues, Arun

Due to advances in CMOS fabrication technology, high performance computing capabilities have continually grown. More capable hardware has allowed a range of complex scientific applications to be developed. However, these applications present a bottleneck to future performance. Entrenched 'legacy' codes - 'Dusty Decks' - demand that new hardware must remain compatible with existing software. Additionally, conventional architectures face increasing challenges. Many of these challenges revolve around the growing disparity between processor and memory speed - the 'Memory Wall' - and difficulties scaling to large numbers of parallel processors. To a large extent, these limitations are inherent to the traditional computer architecture. As data is consumed more quickly, moving that data to the point of computation becomes more difficult. Barring any upward revision in the speed of light, this will continue to be a fundamental limitation on the speed of computation. This work focuses on solving these problems in the context of Light Weight Processing (LWP). LWP is an innovative technique which combines Processing-In-Memory, short vector computation, multithreading, and extended memory semantics. It applies these techniques to try to answer the questions 'What will a next-generation supercomputer look like?' and 'How will we program it?' To that end, this work presents four contributions: (1) An implementation of MPI which uses features of LWP to substantially improve message processing throughput; (2) A technique leveraging extended memory semantics to improve message passing by overlapping computation and communication; (3) An OpenMP library modified to allow efficient partitioning of threads between a conventional CPU and LWPs - greatly improving cost/performance; and (4) An algorithm to extract very small 'threadlets' which can overcome the inherent disadvantages of a simple processor pipeline.

Accelerating list management for MPI

Hemmert, Karl S.; Rodrigues, Arun; Underwood, Keith

The latency and throughput of MPI messages are critically important to a range of parallel scientific applications. In many modern networks, both of these performance characteristics are largely driven by the performance of a processor on the network interface. Because of the semantics of MPI, this embedded processor is forced to traverse a linked list of posted receives each time a message is received. As this list grows long, the latency of message reception grows and the throughput of MPI messages decreases. This paper presents a novel hardware feature to handle list management functions on a network interface. By moving functions such as list insertion, list traversal, and list deletion to the hardware unit, latencies are decreased by up to 20% in the zero-length-queue case, with dramatic improvements in the presence of long queues. Similarly, the throughput is increased by up to 10% in the zero-length-queue case and by nearly 100% in the presence of queues of 30 messages.

The implications of working set analysis on supercomputing memory hierarchy design

Underwood, Keith; Rodrigues, Arun

Supercomputer architects strive to maximize the performance of scientific applications. Unfortunately, the large, unwieldy nature of most scientific applications has led to the creation of artificial benchmarks, such as SPEC-FP, for architecture research. Given the impact that these benchmarks have on architecture research, this paper seeks an understanding of how they relate to real-world applications within the Department of Energy. Since the memory system has been found to be a particularly key issue for many applications, the focus of the paper is on the relationship between how the SPEC-FP benchmarks and DOE applications use the memory system. The results indicate that while the SPEC-FP suite is a well-balanced suite, supercomputing applications typically demand more from the memory system and must perform more 'other work' (in the form of integer computations) along with the floating-point operations. The SPEC-FP suite generally demonstrates slightly more temporal locality, leading to somewhat lower bandwidth demands. The most striking result is the cumulative difference between the benchmarks and the applications in terms of the requirements to sustain the floating-point operation rate: the DOE applications require significantly more data from main memory (not cache) per FLOP and dramatically more integer instructions per FLOP.
