Publications Search

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.

More Details

TYPE Journal Article YEAR 2014

Scopus OSTI DOI

The Structural Simulation Toolkit

Rodrigues, Arun; Moore, Branden J.; Hammond, Simon D.; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI DOI

The Sandia Co-design Ecosystem

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Abstract machine models and proxy architectures for exascale computing

Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Ang, James A.; Barrett, R.F.; Benner, R.E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, Simon D.; Hemmert, Karl S.; Kelly, Suzanne M.; Le, H.; Leung, Vitus J.; Resnick, D.R.; Rodrigues, Arun; Shalf, J.; Stark, Dylan S.; Unat, D.; Wright, N.J.

To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to the application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion among the developers of analytic models and simulators and computer hardware architects and they allow for application performance analysis, system software development, and hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.

More Details

TYPE Conference Poster YEAR 2014

Scopus OSTI DOI

Using a complementary emulation-simulation co-design approach to assess application readiness for Processing-in-Memory systems

Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Stelle, George; Olivier, Stephen L.; Stark, Dylan S.; Rodrigues, Arun; Hemmert, Karl S.

Disruptive changes to computer architecture are paving the way toward extreme scale computing. The co-design strategy of collaborative research and development among computer architects, system software designers, and application teams can help to ensure that applications not only cope but thrive with these changes. In this paper, we present a novel combined co-design approach of emulation and simulation in the context of investigating future Processing in Memory (PIM) architectures. PIM enables co-location of data and computation to decrease data movement, to provide increases in memory speed and capacity compared to existing technologies and, perhaps most importantly for extreme scale, to improve energy efficiency. Our evaluation of PIM focuses on three mini-applications representing important production applications. The emulation and simulation studies examine the effects of locality-aware versus locality-oblivious data distribution and computation, and they compare PIM to conventional architectures. Both studies contribute in their own way to the overall understanding of the application-architecture interactions, and our results suggest that PIM technology shows great potential for efficient computation without negatively impacting productivity.

More Details

TYPE Conference Poster YEAR 2014

Scopus OSTI DOI

XGC Overview

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Extreme-scale Computing Grand Challenge LDRD (XGC)

Hemmert, Karl S.; Barrett, Brian B.; Barrett, Richard F.; Lentine, Anthony L.; Rodrigues, Arun; Denton-Hill, Kim M.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Using the Cray Gemini Performance Counters

Pedretti, Kevin P.; Vaughan, Courtenay T.; Barrett, Richard F.; Devine, Karen D.; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

The Impact of Hybrid-Core Processors on MPI Message Rate

Barrett, Brian B.; Brightwell, Ronald B.; Hemmert, Karl S.; Hammond, Simon D.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

The Impact of Hybrid-Core Processors on MPI Message Rate

Barrett, Brian B.; Brightwell, Ronald B.; Hammond, Simon D.; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

The portals 4.0.1 network programming interface

Barrett, Brian B.; Brightwell, Ronald B.; Pedretti, Kevin P.; Hemmert, Karl S.

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandias Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities. 3

More Details

TYPE SAND Report YEAR 2013

OSTI DOI

Interconnect Challenges at Exascale

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Using the Cray Gemini Performance Counters

Pedretti, Kevin P.; Vaughan, Courtenay T.; Hemmert, Karl S.; Barrett, Richard F.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

The Structural Simulation Toolkit

Proposed for publication in SIGMETRICS Performance Evaluation Review.

Rodrigues, Arun; Hemmert, Karl S.; Barrett, Brian B.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Journal Article YEAR 2012

OSTI

The Portals 4.0 network programming interface

Brightwell, Ronald B.; Pedretti, Kevin P.; Wheeler, Kyle B.; Hemmert, Karl S.; Barrett, Brian B.

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandias Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

More Details

TYPE SAND Report YEAR 2012

OSTI DOI