Publications

Results 26–50 of 64
Skip to search filters

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Cassella, Paul; Enos, Jeremy; Fullop, Joshi; Gentile, Ann C.; Monk, Stephen T.; Naksinehaboon, Nichamon; Ogden, Jeffry B.; Rajan, Mahesh R.; Showerman, Michael; Stevenson, Joel O.; Taerat, Narate; Tucker, Tom

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

More Details

Investigating the impact of the Cielo Cray XE6 architecture on scientific application codes

IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

Vaughan, Courtenay T.; Rajan, Mahesh R.; Barrett, Richard F.; Doerfler, Douglas W.; Pedretti, Kevin P.

Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus enabling a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo. © 2011 IEEE.

More Details

Investigating the impact of the cielo cray XE6 architecture on scientific application codes

Vaughan, Courtenay T.; Rajan, Mahesh R.; Barrett, Richard F.; Doerfler, Douglas W.; Pedretti, Kevin P.

Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus enabling a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo.

More Details
Results 26–50 of 64
Results 26–50 of 64