Publications Search

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool, which produced 1 TB of application data and 50 TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.

TYPE SAND Report YEAR 2021

OSTI DOI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine C.; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Tom T.; Tucker, Nick T.; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Integrating Systems Operations into CoDesign

Gentile, Ann C.; Brandt, James M.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

OSTI DOI

Integrating Systems Management into CoDesign

Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Integrating Systems Management into CoDesign

Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Integrating Systems Management into CoDesign

Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Enabling Application and System Data Fusion

Gentile, Ann C.; Brandt, James M.; Cook, Jeanine C.; Hammond, Simon D.; Poliakoff, David Z.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Tucker, Tom

Abstract not provided.

TYPE Conference Presentation YEAR 2021

OSTI DOI

Including Operations Analytics & Communication In Next Generation CoDesign:

Brandt, James M.; Enos, Jeremy E.; Gentile, Ann C.; Kramer, William K.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

OSTI DOI

AI/ML for HPC Operations

Gentile, Ann C.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

OSTI DOI

Supporting Dynamic Event Monitoring in the Lightweight Distributed Metric Service (LDMS)

Tucker, Tom T.; Gentile, Ann C.; Brandt, James M.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

LDMS v4: Writing Sampler and Store Plugins (updated for LDMSCON2020)

Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

ALAMO: Autonomous Lightweight Allocation Management and Optimization

Brightwell, Ronald B.; Ferreira, Kurt B.; Grant, Ryan E.; Levy, Scott L.; Lofstead, Gerald F.; Olivier, Stephen L.; Pedretti, Kevin P.; Younge, Andrew J.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Attributing Performance Variation from Integrated Application and System Data

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Olivier, Stephen L.; Pedretti, Kevin P.; Tucker, Tom T.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Design Installation and Operation of the Vortex ART Platform

Gauntt, Nathan E.; Davis, Kevin D.; Repik, Jason; Brandt, James M.; Gentile, Ann C.; Hammond, Simon D.

Abstract not provided.

TYPE Other Report YEAR 2019

OSTI DOI

A study of network congestion in two supercomputing high-speed interconnects

Proceedings - 2019 IEEE Symposium on High-Performance Interconnects, HOTI 2019

Jha, Saurabh; Patke, Archit; Brandt, James M.; Gentile, Ann C.; Showerman, Mike; Roman, Eric; Kalbarczyk, Zbigniew T.; Kramer, Bill; Iyer, Ravishankar K.

Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology.

TYPE Conference Poster YEAR 2019

Scopus OSTI

Exploring New Monitoring and Analysis Capabilities on Cray's Software Preview System

Brandt, James M.; Brown, Connor J.; Gentile, Ann C.; Greenseid, Joe G.; Kramer, William K.; Langer, Patti L.; Rashid, Aamir R.; Rhem, Kevin R.; Showerman, Michael S.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Exploring New Monitoring and Analysis Capabilities on Cray's Software Preview System (Final Version)

Brandt, James M.; Brown, Connor J.; Gentile, Ann C.; Greenseid, Joe G.; Kramer, William K.; Langer, Patti L.; Rashid, Aamir R.; Rhem, Kevin R.; Showerman, Michael S.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Holistic Measurement Driven System Assessment

Kramer, Bill K.; Bauer, Greg B.; Bode, Brett B.; Showerman, Mike S.; Enos, Jeremy E.; Saxton, Aaron S.; Jha, Saurabh J.; Kalbarczyk, Zbigniew K.; Iyer, Ravi I.; Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

Large-Scale System Monitoring Experiences and Recommendations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, Richard; Gienger, Michael; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, Kaki; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin P.; Piccinali, Jean G.; Repik, Jason; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim

Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.

TYPE Conference Poster YEAR 2018

Scopus OSTI DOI

Characterizing Supercomputer Traffic Networks Through Link-Level Analysis

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Iyer, Ravishankar

We present techniques for characterizing bandwidth and congestion characteristics of supercomputer High-Speed Networks (HSN). By utilizing a link-level perspective, we gain generality over analyses which are tied to specific topologies. We illustrate these techniques using five months of a Blue Waters production dataset consisting of network utilization and congestion counters. We find that: (i) execution time of the communication-heavy applications is highly correlated to network stalls observed in the network topology, and the increase in application runtime can be as high as 1.7x with a nominal increase in stalls, (ii) heterogeneity in the available link bandwidth in the network can lead to backpressure and congestion even when the network is not underprovisioned, and (iii) links connected to I/O nodes are no more likely to observe congestion during operational hours than any other link in the system.

TYPE Conference Poster YEAR 2018

Scopus OSTI DOI

Application and System Performance Metrics

Gentile, Ann C.; Brandt, James M.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.A.; Andersson, S.A.; Brandt, James M.; Cardo, N.C.; Chunduri, S.C.; Enos, J.E.; Fields, P.F.; Gentile, Ann C.; Gerber, R.B.; Gienger, M.G.; Greenseid, J.G.; Greiner, A.G.; Hadri, B.H.; He, Y.H.; Hoppe, D.H.; Kaila, U.K.; Kelly, K.K.; Klein, M.K.; Kristiansen, A.K.; Leak, S.L.; Mason, M.M.; Pedretti, Kevin P.; Piccinali, J-G.P.; Repik, Jason; Rogers, J.R.; Salminen, S.S.; Showerman, M.S.; Whitney, C.W.; Williams, J.W.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI DOI