Publications

Results 101–125 of 218
Skip to search filters

High Performance Computing - Power Application Programming Interface Specification

Laros, James H.; Kelly, Suzanne M.; Pedretti, Kevin P.; Grant, Ryan E.; Olivier, Stephen L.; Levenhagen, Michael J.; DeBonis, David D.; Laros, James H.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [131, 3, 5, 11), 4, a, B, Ili, 7, T71,, a 11 11, 1, 6, IA, ]112]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager. KC

More Details

Overtime: A tool for analyzing performance variation due to network interference

Proceedings of the 3rd ExaMPI Workshop at the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2015

Grant, Ryan E.; Pedretti, Kevin P.; Gentile, Ann C.

Shared networks create unique challenges in obtaining con-sistent performance across jobs for large systems when not using exclusive system-wide allocations. In order to provide good system utilization, resource managers allocate system space to multiple jobs. These multiple independent node al-locations can interfere with each other through their shared network. This work provides a method of observing and measuring the impact of network contention due to interfer-ence from other jobs through a continually running bench-mark application and the use of network performance coun-Ters. This is the first work to measure network interfer-ence using specially designed benchmarks and network per-formance counters.

More Details

Early experiences with node-level power capping on the cray XC40 platform

Proceedings of E2SC 2015: 3rd International Workshop on Energy Efficient Supercomputing - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

Pedretti, Kevin P.; Olivier, Stephen L.; Ferreira, Kurt B.; Shipman, Galen; Shu, Wei

Power consumption of extreme-scale supercomputers has become a key performance bottleneck. Yet current practices do not leverage power management opportunities, instead running at maximum power. This is not sustainable. Future systems will need to manage power as a critical resource, directing it to where it has greatest benefit. Power capping is one mechanism for managing power budgets, however its behavior is not well understood. This paper presents an empirical evaluation of several key HPC workloads running under a power cap on a Cray XC40 system, and provides a comparison of this technique with p-state control, demonstrating the performance differences of each. These results show: 1.) Maximum performance requires ensuring the cap is not reached; 2.) Performance slowdown under a cap can be attributed to cascading delays which result in unsynchronized performance variability across nodes; and, 3.) Due to lag in reaction time, considerable time is spent operating above the set cap. This work provides a timely and much needed comparison of HPC application performance under a power cap and attempts to enable users and system administrators to understand how to best optimize application performance on power-constrained HPC systems.

More Details

Early Experiences with Node-Level Power Capping on the Cray XC40 Platform

Pedretti, Kevin P.; Olivier, Stephen L.; Ferreira, Kurt B.; Shipman, Galen S.; Shu, Wei S.

Power consumption of extreme-scale supercomputers has become a key performance bottleneck. Yet current practices do not leverage power management opportunities, instead running at ''maximum power''. This is not sustainable. Future systems will need to manage power as a critical resource, directing where it has greatest benefit. Power capping is one mechanism for managing power budgets, however its behavior is not well understood. This paper presents an empirical evaluation of several key HPC workloads running under a power cap on a Cray XC40 system, and provides a comparison of this technique with p-state control, demonstrating the performance differences of each. These results show: 1. Maximum performance requires ensuring the cap is not reached; 2. Performance slowdown under a cap can be attributed to cascading delays which result in unsynchronized performance variability across nodes; and, 3. Due to lag in reaction time, considerable time is spent operating above the set cap. This work provides a timely and much needed comparison of HPC application performance under a power cap and attempts to enable users and system administrators to understand how to best optimize application performance on power-constrained HPC systems.

More Details

System-level support for composition of applications

Proceedings of the 5th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2015 - In conjunction with HPDC 2015

Kocoloski, Brian; Lange, John; Abbasi, Hasan; Bernholdt, David E.; Jones, Terry R.; Dayal, Jai; Evans, Noah; Lang, Michael; Lofstead, Jay; Pedretti, Kevin P.; Bridges, Patrick G.

Current HPC system software lacks support for emerging application deployment scenarios that combine one or more simulations with in situ analytics, sometimes called multi-component or multi-enclave applications. This paper presents an initial design study, implementation, and evaluation of mechanisms supporting composite multi-enclave applications in the Hobbes exascale operating system. These mechanisms include virtualization techniques isolating application custom enclaves while using the vendor-supplied host operating system and high-performance inter-VM communication mechanisms. Our initial single-node performance evaluation of these mechanisms on multi-enclave science applications, both real and proxy, demonstrate the ability to support multi-enclave HPC job composition with minimal performance overhead.

More Details

Achieving performance isolation with lightweight co-kernels

HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Ouyang, Jiannan; Kocoloski, Brian; Lange, John; Pedretti, Kevin P.

Performance isolation is emerging as a requirement for High Performance Computing (HPC) applications, particularly as HPC architectures turn to in situ data processing and application composition techniques to increase system throughput. These approaches require the co-location of disparate workloads on the same compute node, each with different resource and runtime requirements. In this paper we claim that these workloads cannot be effectively managed by a single Operating System/Runtime (OS/R). Therefore, we present Pisces, a system software architecture that enables the co-existence of multiple independent and fully isolated OS/Rs, or enclaves, that can be customized to address the disparate requirements of next generation HPC workloads. Each enclave consists of a specialized lightweight OS cokernel and runtime, which is capable of independently managing partitions of dynamically assigned hardware resources. Contrary to other co-kernel approaches, in this work we consider performance isolation to be a primary requirement and present a novel co-kernel architecture to achieve this goal. We further present a set of design requirements necessary to ensure performance isolation, including: (1) elimination of cross OS dependencies, (2) internalized management of I/O, (3) limiting cross enclave communication to explicit shared memory channels, and (4) using virtualization techniques to provide missing OS features. The implementation of the Pisces co-kernel architecture is based on the Kitten Lightweight Kernel and Palacios Virtual Machine Monitor, two system software architectures designed specifically for HPC systems. Finally we will show that lightweight isolated co-kernels can provide better performance for HPC applications, and that isolated virtual machines are even capable of outperforming native environments in the presence of competing workloads.

More Details

Toward an evolutionary task parallel integrated MPI + X Programming Model

Proceedings of the 6th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015

Barrett, Richard F.; Stark, Dylan S.; Vaughan, Courtenay T.; Grant, Ryan E.; Olivier, Stephen L.; Pedretti, Kevin P.

The Bulk Synchronous Parallel programming model is showing performance limitations at high processor counts. We propose over-decomposition of the domain, operated on as tasks, to smooth out utilization of the computing resource, in particular the node interconnect and processing cores, and hide intra- and inter-node data movement. Our approach maintains the existing coding style commonly employed in computational science and engineering applications. Although we show improved performance on existing computers, up to 131,072 processor cores, the effectiveness of this approach on expected future architectures will require the continued evolution of capabilities throughout the codesign stack. Success then will not only result in decreased time to solution, but would also make better use of the hardware capabilities and reduce power and energy requirements, while fundamentally maintaining the current code configuration strategy.

More Details

Demonstrating improved application performance using dynamic monitoring and task mapping

2014 IEEE International Conference on Cluster Computing, CLUSTER 2014

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Pedretti, Kevin P.

This work demonstrates the integration of monitoring, analysis, and feedback to perform application-to-resource mapping that adapts to both static architecture features and dynamic resource state. In particular, we present a framework for mapping MPI tasks to compute resources based on run-time analysis of system-wide network data, architecture-specific routing algorithms, and application communication patterns. We address several challenges. Within each node, we collect local utilization data. We consolidate that information to form a global view of system performance, accounting for system-wide factors including competing applications. We provide an interface for applications to query the global information. Then we exploit the system information to change the mapping of tasks to nodes so that system bottlenecks are avoided. We demonstrate the benefit of this monitoring and feedback by remapping MPI tasks based on route-length, bandwidth, and credit-stalls metrics for a parallel sparse matrix-vector multiplication kernel. In the best case, remapping based on dynamic network information in a congested environment recovered 48.9% of the time lost to congestion, reducing matrix-vector multiplication time by 7.8%. Our experiments focus on the Cray XE/XK platform, but the integration concepts are generally applicable to any platform for which applicable metrics and route knowledge can be obtained.

More Details

Using architecture information and real-time resource state to reduce power consumption and communication costs in parallel applications

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Pedretti, Kevin P.; Rajamanickam, Sivasankaran R.; Bunde, David P.; Deveci, Mehmet D.; Catalyurek, Umit V.

As computer systems grow in both size and complexity, the need for applications and run-time systems to adjust to their dynamic environment also grows. The goal of the RAAMP LDRD was to combine static architecture information and real-time system state with algorithms to conserve power, reduce communication costs, and avoid network contention. We devel- oped new data collection and aggregation tools to extract static hardware information (e.g., node/core hierarchy, network routing) as well as real-time performance data (e.g., CPU uti- lization, power consumption, memory bandwidth saturation, percentage of used bandwidth, number of network stalls). We created application interfaces that allowed this data to be used easily by algorithms. Finally, we demonstrated the benefit of integrating system and application information for two use cases. The first used real-time power consumption and memory bandwidth saturation data to throttle concurrency to save power without increasing application execution time. The second used static or real-time network traffic information to reduce or avoid network congestion by remapping MPI tasks to allocated processors. Results from our work are summarized in this report; more details are available in our publications [2, 6, 14, 16, 22, 29, 38, 44, 51, 54].

More Details
Results 101–125 of 218
Results 101–125 of 218