Research

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments ( EMPIRE ) modeling and design tool and resulted in 1TB of application data and 50TB of system data. EMPIRE developers remarked this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase total scientific output of Sandia's HPC users.

More Details

TYPE SAND Report YEAR 2021

OSTI DOI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine C.; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Tom T.; Tucker, Nick T.; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

Integrating Systems Operations into CoDesign

Gentile, Ann C.; Brandt, James M.

Abstract not provided.

More Details

TYPE Conference Presenation YEAR 2021

OSTI DOI

Integrating Systems Management into CoDesign

Gentile, Ann C.

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

Integrating Systems Management into CoDesign

Gentile, Ann C.

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

Integrating Systems Management into CoDesign

Gentile, Ann C.

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

Enabling Application and System Data Fusion

Gentile, Ann C.; Brandt, James M.; Cook, Jeanine C.; Hammond, Simon D.; Poliakoff, David Z.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Tucker, Tom

Abstract not provided.

More Details

TYPE Conference Presenation YEAR 2021

OSTI DOI

Including Operations Analytics & Communication In Next Generation CoDesign:

Brandt, James M.; Enos, Jeremy E.; Gentile, Ann C.; Kramer, William K.

Abstract not provided.

More Details

TYPE Conference Presenation YEAR 2021

OSTI DOI

AI/ML for HPC Operations

Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference Presenation YEAR 2020

OSTI DOI

Supporting Dynamic Event Monitoring in the Lightweight Distributed Metric Service (LDMS)

Tucker, Tom T.; Gentile, Ann C.; Brandt, James M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

LDMS v4: Writing Sampler and Store Plugins (updated for LDMSCON2020)

Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

ALAMO: Autonomous Lightweight Allocation Management and Optimization

Brightwell, Ronald B.; Ferreira, Kurt B.; Grant, Ryan E.; Levy, Scott L.; Lofstead, Gerald F.; Olivier, Stephen L.; Pedretti, Kevin P.; Younge, Andrew J.; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Attributing Performance Variation from Integrated Application and System Data

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Olivier, Stephen L.; Pedretti, Kevin P.; Tucker, Tom T.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Design Installation and Operation of the Vortex ART Platform

Gauntt, Nathan E.; Davis, Kevin D.; Repik, Jason; Brandt, James M.; Gentile, Ann C.; Hammond, Simon D.

Abstract not provided.

More Details

TYPE Other Report YEAR 2019

OSTI DOI

A study of network congestion in two supercomputing high-speed interconnects

Proceedings - 2019 IEEE Symposium on High-Performance Interconnects, HOTI 2019

Jha, Saurabh; Patke, Archit; Brandt, James M.; Gentile, Ann C.; Showerman, Mike; Roman, Eric; Kalbarczyk, Zbigniew T.; Kramer, Bill; Iyer, Ravishankar K.

Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology.

More Details

TYPE Conference Poster YEAR 2019

Scopus OSTI

Exploring New Monitoring and Analysis Capabilities on Cray?s Software Preview System

Brandt, James M.; Brown, Connor J.; Gentile, Ann C.; Greenseid, Joe G.; Kramer, William K.; Langer, PAtti L.; Rashid, Aamir R.; Rhem, Kevin R.; Showerman, Michael S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Exploring New Monitoring and Analysis Capabilities on Cray's Software Preview System (Final Version)

Brandt, James M.; Brown, Connor J.; Gentile, Ann C.; Greenseid, Joe G.; Kramer, William K.; Langer, PAtti L.; Rashid, Aamir R.; Rhem, Kevin R.; Showerman, Michael S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Holistic Measurement Driven System Assessment

Kramer, Bill K.; Bauer, Greg B.; Bode, Brett B.; Showerman, Mike S.; Enos, Jeremy E.; Saxton, Aaron S.; Jha, Saurabh J.; Kalbarczyk, Zbigniew K.; Iyer, Ravi I.; Brandt, James M.; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Large-Scale System Monitoring Experiences and Recommendations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, Richard; Gienger, Michael; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, Kaki; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin P.; Piccinali, Jean G.; Repik, Jason; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim

Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.

More Details

TYPE Conference Poster YEAR 2018

Scopus OSTI DOI

Characterizing Supercomputer Traffic Networks Through Link-Level Analysis

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Iyer, Ravishankar

We present techniques for characterizing bandwidth and congestion characteristics of supercomputer High-Speed Networks (HSN). By utilizing a link-level perspective, we gain generality over analyses which are tied to specific topologies. We illustrate these techniques using five months of a Blue Waters production dataset consisting of network utilization and congestion counters. We find that: i) execution time of the communicationheavy applications is highly correlated to network stalls observed in the network topology and increase in application runtime can be as high as 1.7x with nominal increase in stalls, ii) heterogeneity in the available link bandwidth in the network can lead to backpressure and congestion even when the network is not underprovisioned, and (iii) links connected to I/O nodes are no more likely to observe congestion during operational hours than any other link in the system.

More Details

TYPE Conference Poster YEAR 2018

Scopus OSTI DOI

Application and System Performance Metrics

Gentile, Ann C.; Brandt, James M.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.A.; Andersson, S.A.; Brandt, James M.; Cardo, N.C.; Chunduri, S.C.; Enos, J.E.; Fields, P.F.; Gentile, Ann C.; Gerber, R.B.; Gienger, M.G.; Greenseid, J.G.; Greiner, A.G.; Hadri, B.H.; He, Y.H.; Hoppe, D.H.; Kaila, U.K.; Kelly, K.K.; Klein, M.K.; Kristiansen, A.K.; Leak, S.L.; Mason, M.M.; Pedretti, Kevin P.; Piccinali, J-G.P.; Repik, Jason; Rogers, J.R.; Salminen, S.S.; showerman, m.s.; Whitney, C.W.; Williams, J.W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI DOI

Integrating low-latency analysis into HPC system monitoring

ACM International Conference Proceeding Series

Izadpanah, Ramin; Naksinehaboon, Nichamon; Brandt, James M.; Gentile, Ann C.; Dechev, Damian

The growth of High Performance Computer (HPC) systems increases the complexity with respect to understanding resource utilization, system management, and performance issues. While raw performance data is increasingly exposed at the component level, the usefulness of the data is dependent on the ability to do meaningful analysis on actionable timescales. However, current system monitoring infrastructures largely focus on data collection, with analysis performed off-system in post-processing mode. This increases the time required to provide analysis and feedback to a variety of consumers. In this work, we enhance the architecture of a monitoring system used on large-scale computational platforms, to integrate streaming analysis capabilities at arbitrary locations within its data collection, transport, and aggregation facilities. We leverage the flexible communication topology of the monitoring system to enable placement of transformations based on overhead concerns, while still enabling low-latency exposure on node. Our design internally supports and exposes the raw and transformed data uniformly for both node level and off-system consumers. We show the viability of our implementation for a case with production-relevance: run-time determination of the relative per-node files system demands.

More Details

TYPE Conference Poster YEAR 2018

Scopus OSTI DOI

OVIS Update 08/24/18

Brandt, James M.; Tucker, Thomas T.; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Application Performance Insights via System Monitoring

Brandt, James M.; Gentile, Ann C.; Hammond, Simon D.; Cook, Jeanine C.; Allan, Benjamin A.; Tucker, Thomas T.; Naksinehaboon, Nichamon N.; Taerat, Narate T.; Cook, Jonathan C.; Aaziz, Omar R.; Ates, Emre A.; Tuncer, Ozan T.; Egele, Manuel E.; Turk, Ata T.; Coskun, Ayse K.; izadpanah, ramin i.; Dechev, Damian D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Cray System Monitoring: Successes Requirements and Priorities

Ahlgren, Ville A.; Andersson, Stefan A.; Brandt, James M.; Cardo, Nicholas C.; Chunduri, Sudheer C.; Enos, Jeremy E.; Fields, Parks F.; Gentile, Ann C.; Gerber, Richard G.; Greenseid, Joe G.; Greiner, Annette G.; Hadri, Bilel H.; He, Yun H.; Hoppe, Dennis H.; Kaila, Urpo K.; Kelly, Kaki K.; Klein, Mark K.; Kristiansen, Alex K.; Leak, Steve L.; Mason, Mike M.; Pedretti, Kevin P.; Piccinali, Jean-Guillaume P.; Repik, Jason; Rogers, Jim R.; Salminen, Susanna S.; Showerman, Mike S.; Whitney, Cary W.; Williams, Jim W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Application Performance Insights via System Monitoring

Brandt, James M.; Enos, Jeremy E.; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Supporting Failure Analysis with Discoverable Annotated Log Datasets

Leak, Stephen L.; Greiner, Annette G.; Brandt, James M.; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Cray System Monitoring: Successes Requirements and Priorities

Ahlgren, Ville A.; Andersson, Stefan A.; Brandt, James M.; Cardo, Nicholas C.; Chunduri, Sudheer C.; Enos, Jeremy E.; Fields, Parks F.; Gentile, Ann C.; Gerber, Richard G.; Greenseid, Joe G.; Greiner, Annette G.; Hadri, Bilel H.; He, Yun H.; Hoppe, Dennis H.; Kaila, Urpo K.; Kelly, Kaki K.; Klein, Mark K.; Kristiansen, Alex K.; Leak, Steve L.; Mason, Mike M.; Pedretti, Kevin P.; Piccinali, Jean-Guillaume P.; Repik, Jason; Rogers, Jim R.; Salminen, Susanna S.; Showerman, Mike S.; Whitney, Cary W.; Williams, Jim W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Runtime HPC System and Application Performance Assessment and Diagnostics

Brandt, James M.; Gentile, Ann C.; Cook, Jonathan E.; Allan, Benjamin A.; Cook, Jeanine C.; Aaziz, Omar R.; Tucker, Thomas T.; Nichamon, Naksinehaboon N.; Taerat, Narate T.; Ates, Emre A.; Tuncer, Ozan T.; Egele, Manuel E.; Turk, Ata T.; Coskun, Ayse K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Enhanced Profiling for Kokkos Applications

Hammond, Simon D.; Trott, Christian R.; Ibanez-Granados, Daniel A.; Edwards, Harold C.; Sunderland, Daniel S.; Ellingwood, Nathan D.; Brandt, James M.; Gentile, Ann C.; Cook, Jeanine C.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Continuous Performance Tracking for Kokkos Applications Using LDMS

Brandt, James M.; Hammond, Simon D.; Tucker, Thomas T.; Gentile, Ann C.; Cook, Jeanine C.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Dynamic Assessment and Feedback

Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Live feed Sandia CAPVIZ HPC cluster performance analysis & visualization demonstration

Allan, Benjamin A.; Schmitz, Mark E.; Walsh, Edward J.; Aguilar, Michael J.; Brandt, James M.; Gentile, Ann C.; Ogden, Jeffry B.; Monk, Stephen T.; Noe, John P.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Holistic measurement-driven system assessment

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Bauer, Greg; Enos, Jeremy; Showerman, Michael; Kaplan, Larry; Bode, Brett; Greiner, Annette; Bonnie, Amanda; Mason, Mike; Iyer, Ravishankar K.; Kramer, William

In high-performance computing systems, application performance and throughput are dependent on a complex interplay of hardware and software subsystems and variable workloads with competing resource demands. Data-driven insights into the potentially widespread scope and propagationof impact of events, such as faults and contention for shared resources, can be used to drive more effective use of resources, for improved root cause diagnosis, and for predicting performance impacts. We present work developing integrated capabilities for holistic monitoring and analysis to understand and characterize propagation of performance-degrading events. These characterizations can be used to determine and invoke mitigating responses by system administrators, applications, and system software.

More Details

TYPE Conference Poster YEAR 2017

Scopus OSTI DOI

Final Review of FY17 ASC CSSE L2 Milestone #6018 entitled "Analyzing Power Usage Characteristics of Workloads Running on Trinity"

Hoekstra, Robert J.; Hammond, Simon D.; Hemmert, Karl S.; Gentile, Ann C.; Oldfield, Ron A.; Lang, Mike L.; Martin, Steve M.

The presentation documented the technical approach of the team and summary of the results with sufficient detail to demonstrate both the value and the completion of the milestone. A separate SAND report was also generated with more detail to supplement the presentation.

More Details

TYPE Other Report YEAR 2017

OSTI DOI

Task Placement to Reduce Application Communication Costs

Devine, Karen D.; Brandt, James M.; Deveci, Mehmet D.; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Pedretti, Kevin P.; Rajamanickam, Sivasankaran R.; Taylor, Mark A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Holistic Measurement Driven System Assessment

Jha, Saurabh J.; Brandt, James M.; Gentile, Ann C.; Karlbarczyk, Zbigniew K.; Bauer, Greg B.; Enos, Jeremy E.; Showerman, Michael S.; Kaplan, Larry K.; Bode, Brett B.; Greiner, Annette G.; Bonnie, Amanda B.; Mason, Mike M.; Iyer, Ravishankar I.; Kramer, William K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI DOI

Discovering Metrics of Network Contention

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II

DeConinck, Adam D.; Nam, Hai A.; Mortin, Dave M.; Bonnie, Amanda B.; Lueninghoener, Cory L.; Brandt, James M.; Gentile, Ann C.; Pedretti, Kevin P.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon D.; Allan, Benjamin A.; Davis, Michael C.; Repik, Jason

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

Formicola, Valerio F.; Jha, Saurabh J.; Chen, Daniel C.; Dong, Wen D.; Bonnie, Amanda B.; Mason, Mike M.; Brandt, James M.; Gentile, Ann C.; Kaplan, Larry K.; Repik, Jason; Enos, Jeremy E.; Showerman, Mike S.; Greiner, Annette G.; Kalbarczyk, Zbigniew K.; Iyer, Ravishankar I.; Kramer, Bill K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II (Paper)

DeConinck, Adam D.; Nam, Hai A.; Morton, David P.; Bonnie, Amanda B.; Lueninghoener, Cory L.; Brandt, James M.; Gentile, Ann C.; Pedretti, Kevin P.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon D.; Allan, Benjamin A.; Davis, Mike D.; Repik, Jason

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Contention and Congestion: Challenges and Approaches to Understanding Application Impact

Gentile, Ann C.; Brandt, James M.; Agelastos, Anthony M.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Understanding and Avoiding Performance Variability in High Performance Networks

Grant, Ryan E.; Groves, Taylor G.; Pedretti, Kevin P.; Gentile, Ann C.; Arnold, Dorian A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Defining Metrics to Distill Large-Scale HPC Platform and Application Performance Data into Actionable Quantities ? Resource Contention of File System and Aries Interconnect

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Discovery interpretation and communication of meaningful information in HPC monitoring data

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources are critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

More Details

TYPE Journal Article YEAR 2016

Scopus OSTI DOI

Publications