Publications

Results 26–50 of 104
Toward an evolutionary task parallel integrated MPI + X programming model

Proceedings of the 6th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015

Barrett, Richard F.; Stark, Dylan S.; Vaughan, Courtenay T.; Grant, Ryan E.; Olivier, Stephen L.; Pedretti, Kevin P.

The Bulk Synchronous Parallel programming model is showing performance limitations at high processor counts. We propose over-decomposition of the domain, operated on as tasks, to smooth out utilization of the computing resources, in particular the node interconnect and processing cores, and to hide intra- and inter-node data movement. Our approach maintains the existing coding style commonly employed in computational science and engineering applications. Although we show improved performance on existing computers, up to 131,072 processor cores, the effectiveness of this approach on expected future architectures will require the continued evolution of capabilities throughout the codesign stack. Success then would not only decrease time to solution, but would also make better use of the hardware capabilities and reduce power and energy requirements, while fundamentally maintaining the current code configuration strategy.

More Details

PerDome: A performance model for heterogeneous computing systems

Simulation Series

Tang, Li; Hu, X.S.; Barrett, Richard F.

Heterogeneous systems, consisting of different types of processors, have the potential to offer higher performance at lower energy cost than homogeneous systems. However, it is rather challenging to actually achieve the high execution efficiency promised by such a system, due to the larger design space and the lack of reliable performance/energy models for aiding design space exploration. This paper fills this gap by proposing a performance model for heterogeneous systems. At the processor level, the roofline model [1] gives an upper bound on the performance of executed code based on its ratio of computation to memory traffic. Our model, referred to as PerDome, builds on the roofline model and can reliably predict the system performance for both homogeneous execution (where each processor either executes the entire application code or none) and heterogeneous execution (where each processor executes part of the application code). Two case studies are carried out to demonstrate the effectiveness of PerDome. The results show that PerDome can indeed provide a good estimate for performance comparisons, which can then be used for heterogeneous system design space exploration.
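As a rough illustration (not taken from the paper), the per-processor roofline bound the abstract refers to can be sketched as follows; the function name and parameters are hypothetical, and real roofline analysis would use measured peak rates and traffic:

```python
def roofline_bound(peak_gflops, mem_bw_gbs, flops, bytes_moved):
    """Attainable performance (GFLOP/s) of a kernel under the roofline model:
    the minimum of the processor's peak compute rate and the rate implied by
    memory bandwidth times the kernel's arithmetic intensity."""
    intensity = flops / bytes_moved  # arithmetic intensity, FLOPs per byte
    return min(peak_gflops, mem_bw_gbs * intensity)
```

A kernel with low arithmetic intensity is bandwidth-bound (the second term dominates), while a compute-heavy kernel saturates at the peak rate.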

More Details

Performance and Energy Implications for Heterogeneous Computing Systems: A MiniFE Case Study

Barrett, Richard F.; Tang, Li T.; Hu, Sharon X.

Heterogeneous computing systems, which employ a mix of general-purpose (GP) processors and accelerators such as graphics processing units (GPUs) or Field Programmable Gate Arrays (FPGAs), have the potential to offer much higher performance and lower energy usage than homogeneous systems. However, designing heterogeneous computing systems to achieve high performance and low energy usage is a challenging task. Designs that offer higher performance do not necessarily lead to lower energy consumption. Furthermore, the mapping of applications to different computing devices can play a key role in the performance/energy tradeoff. In this report, we present a detailed performance and energy study of executing a specific mini-application on different heterogeneous systems. The results show that hardware choices, application implementations, and the mapping of applications to hardware can all significantly impact system performance and energy consumption, and that the impact on performance and energy can be quite different. This study forms a basis for modeling the interdependencies of program structures and hardware execution units, which could be used to guide design space exploration.

More Details

FY14 Codesign Milestone Summary

Hoekstra, Robert J.; Barrett, Richard F.; Howell, Louis H.; Daniel, David D.

This milestone was the second in a series of Tri-Lab Co-Design L2 milestones supporting co-design efforts in the ASC program. It is a crucial step toward evaluating the effectiveness of proxy applications in exploring code performance on next-generation architectures. All three laboratories evaluated the performance of two proxy applications on modern architectures and/or testbeds for pre-production hardware. The results are captured in this document, along with annotated presentations from all three laboratories.

More Details

MPPM, Viewed as a co-design effort

Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Woodward, Paul R.; Jayaraj, Jagan J.; Barrett, Richard F.

The Piecewise Parabolic Method (PPM) was designed as a means of exploring compressible gas dynamics problems of interest in astrophysics, including supersonic jets, compressible turbulence, stellar convection, and turbulent mixing and burning of gases in stellar interiors. Over time, the capabilities encapsulated in PPM have co-evolved with the availability of a series of high performance computing platforms. Implementation of the algorithm has adapted to and advanced with the architectural capabilities and characteristics of these machines. This adaptability of our PPM codes has enabled targeted astrophysical applications of PPM to exploit these scarce resources to explore complex physical phenomena. Here we describe the means by which this was accomplished, and set a path forward, with a new miniapp, mPPM, for continuing this process in a diverse and dynamic architecture design environment. Adaptations in mPPM for the latest high performance machines are discussed that address the important issue of limited bandwidth from locally attached main memory to the microprocessor chip.

More Details

Early Experiences Co-Scheduling Work and Communication Tasks for Hybrid MPI+X Applications

Proceedings of ExaMPI 2014: Exascale MPI 2014 - held in conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Stark, Dylan S.; Barrett, Richard F.; Grant, Ryan E.; Olivier, Stephen L.; Pedretti, Kevin P.; Vaughan, Courtenay T.

Advances in node-level architecture and interconnect technology needed to reach extreme scale necessitate a reevaluation of long-standing models of computation, in particular bulk synchronous processing. The end of Dennard scaling and the subsequent increases in CPU core counts with each successive generation of general-purpose processors have made the ability to leverage parallelism for communication an increasingly critical aspect of future extreme-scale application performance. But the use of massive multithreading in combination with MPI is an open research area, with many proposed approaches requiring code changes that can be infeasible for important large legacy applications already written in MPI. This paper covers the design and initial evaluation of an extension of a massive multithreading runtime system supporting dynamic parallelism to interface with MPI to handle fine-grain parallel communication and communication-computation overlap. Our initial evaluation of the approach uses the ubiquitous stencil computation, in three dimensions, with the halo exchange as the driving example that has a demonstrated tie to real code bases. The preliminary results suggest that even for a very well-studied and balanced workload and message exchange pattern, co-scheduling work and communication tasks is effective at significant levels of decomposition using up to 131,072 cores. Furthermore, we demonstrate useful communication-computation overlap when handling blocking send and receive calls, and show evidence suggesting that we can decrease the burstiness of network traffic, with a corresponding decrease in the rate of stalls (congestion) seen on the host link and network.
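The communication-computation overlap described above can be sketched, very loosely, with plain threads standing in for the runtime's task scheduler; all names here are hypothetical, and a real MPI+X implementation would issue actual MPI receives rather than a callback:

```python
from concurrent.futures import ThreadPoolExecutor

def exchange_then_update(recv_halo, update_interior, update_boundary):
    """Overlap a (possibly blocking) halo receive with interior work.

    recv_halo stands in for a blocking MPI receive; update_interior computes
    cells that need no remote data; update_boundary consumes the halo once
    it arrives. Returns the combined interior + boundary results.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(recv_halo)   # communication task runs concurrently
        interior = update_interior()   # computation proceeds without waiting
        halo = fut.result()            # join: wait for the halo data
    return interior + update_boundary(halo)
```

The point of the sketch is only the scheduling shape: the blocking receive is wrapped in a task so interior computation is not serialized behind it.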

More Details

Reducing the bulk of the bulk synchronous parallel model

Parallel Processing Letters

Barrett, Richard F.; Vaughan, Courtenay T.; Hammond, Simon D.

For over two decades the dominant means for enabling portable performance of computational science and engineering applications on parallel processing architectures has been the bulk-synchronous parallel programming (BSP) model. Code developers, motivated by performance considerations to minimize the number of messages transmitted, have typically pursued a strategy of aggregating message data into fewer, larger messages. Emerging and future high-performance architectures, especially those seen as targeting Exascale capabilities, provide motivation and capabilities for revisiting this approach. In this paper we explore alternative configurations within the context of a large-scale complex multi-physics application and a proxy that represents its behavior, presenting results that demonstrate some important advantages as the number of processors increases in scale.
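The message-aggregation strategy the abstract describes, packing many small per-cell values into one contiguous buffer before a single send, can be sketched as follows; this is an illustrative fragment, not code from the paper:

```python
import struct

def pack_halo(values):
    """Aggregate many small double-precision values into one contiguous
    send buffer: the classic BSP-era trade of message count for message size."""
    return struct.pack(f"{len(values)}d", *values)

def unpack_halo(buf):
    """Recover the individual values from an aggregated receive buffer."""
    n = len(buf) // struct.calcsize("d")
    return list(struct.unpack(f"{n}d", buf))
```

The alternative configurations the paper explores relax this pattern, sending smaller messages earlier rather than waiting to fill one large buffer.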

More Details