Publications

Results 1–50 of 73
Skip to search filters

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James E.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Nick T.; Tucker, Tom T.; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments ( EMPIRE ) modeling and design tool and resulted in 1TB of application data and 50TB of system data. EMPIRE developers remarked this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase total scientific output of Sandia's HPC users.

More Details

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine C.; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Tom T.; Tucker, Nick T.; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

HPC System Data Pipeline to Enable Meaningful Insights through Analysis-Driven Visualizations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Schwaller, Benjamin S.; Tucker, Nick; Tucker, Tom; Allan, Benjamin A.; Brandt, James M.

The increasing complexity of High Performance Computing (HPC) systems has created a growing need for facilitating insight into system performance and utilization for administrators and users. The strides made in HPC system monitoring data collection have produced terabyte/day sized time-series data sets rich with critical information, but it is onerous to extract and construe meaningful information from these metrics. We have designed and developed an architecture that enables flexible, as-needed, run-time analysis and presentation capabilities for HPC monitoring data. Our architecture enables quick and efficient data filtration and analysis. Complex runtime or historical analyses can be expressed as Python-based computations. Results of analyses and a variety of HPC oriented summaries are displayed in a Grafana front-end interface. To demonstrate our architecture, we have deployed it in production for a 1500-node HPC system and have developed analyses and visualizations requested by system administrators, and later employed by users, to track key metrics about the cluster at a job, user, and system level. Our architecture is generic, applicable to any*-nix based system, and it is extensible to supporting multi-cluster HPC centers. We structure it with easily replaced modules that allow unique customization across clusters and centers. In this paper, we describe the data collection and storage infrastructure, the application created to query and analyze data from a custom database, and the visual displays created to provide clear insights into HPC system behavior.

More Details

LDMS Monitoring of EDR InfiniBand Networks

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Allan, Benjamin A.; Aguilar, Michael J.; Schwaller, Benjamin S.; Langer, Steven

We introduce a new HPC system high-speed network fabric production monitoring tool, the ibnet sampler plugin for LDMS version 4. Large-scale testing of this tool is our work in progress. When deployed appropriately, the ibnet sampler plugin can provide extensive counter data, at frequencies up to 1 Hz. This allows the LDMS monitoring system to be useful for tracking the impact of new network features on production systems. We present preliminary results concerning reliability, performance impact, and usability of the sampler.

More Details

Figures of merit for production HPC

Allan, Benjamin A.

This report summarizes a set of figures of merit of interest in monitoring the hardware and hardware usage in a Sandia high performance computing (HPC) center. These figures are computable from high frequency monitoring data and other non-metric data and may aid administrators and customer support personnel in their decision processes. The figures are derived from interviews of the HPC center staff. The figures are in many cases simplistic data reductions, but they are our initial targets in creating dashboards that turn voluminous monitoring data into actionable information. Because simplistic reductions may obscure as well as reveal the situation under study, we also document the necessary 'drill-down' and %60exploration' views needed to make the data better understood quickly. These figures of merit may be compared to dashboarding tools documented by other HPC centers. ACKNOWLEDGEMENTS We thank the staff of Sandia's production HPC department for their survey input.

More Details

Standardized Environment for Monitoring Heterogeneous Architectures

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Brown, Connor J.; Schwaller, Benjamin S.; Gauntt, Nathan E.; Allan, Benjamin A.; Davis, Kevin D.

Increasingly diverse architectures and operating systems continue to emerge in the HPC industry. As such, HPC centers are becoming more heterogeneous which introduces a variety of challenges for system administrators. Monitoring a wide array of different platforms by itself is difficult, but the problem compounds in an environment where new platforms are frequently added. Creating a standard monitoring environment across these platforms that allows for simple administration with minimal setup becomes necessary in such situations.This paper presents the solutions introduced in the HPC Development department at Sandia National Laboratories to meet these challenges. This includes our adoption of a multi-stage data-collection pipeline across our clusters that is implemented from the ground up with our Golden Image. We also discuss our infrastructure to support a heterogeneous environment and activities in progress to improve our center. These advances simplify system standup and make monitoring integration easier and faster for new systems which is necessary for our center's domain.

More Details

Production application performance data streaming for system monitoring

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Izadpanah, Ramin; Allan, Benjamin A.; Dechev, Damian; Brandt, James M.

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. We leverage LDMS to collect system-level performance data and explore the correlation between the system and application events. Also, we demonstrate how our tool-set can help detect anomalies with a low latency. We run tests on two different architectures: a system enabled with Intel Xeon Phi and another system equipped with Intel Xeon processor. Our overhead study shows our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.

More Details

Measuring minimum switch port metric retrieval time and impact for multi-layer infiniband fabrics

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Aguilar, Michael J.; Allan, Benjamin A.; Polevitzky, Sergei I.

In this work, we seek to gain an understanding of the InfiniBand network processing limitations that might exist in gathering performance metric information from InfiniBand switches using our new LDMS ibfabric sampler. The limitations studied consist of delays in gathering InfiniBand metric information from a specific switch device due to the switch's processor response delays or RDMA contention for network bandwidth.

More Details

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources are critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

More Details

Toward rapid understanding of production HPC applications and systems

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC application's resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

More Details

Emerging techniques for field device security

IEEE Security and Privacy

Schwartz, Moses D.; Mulder, John M.; Chavez, Adrian R.; Allan, Benjamin A.

Industrial control systems (ICSs) rely on embedded devices to control essential processes. State-of-the-art security solutions can't detect attacks on these devices at the hardware or firmware level. To improve ICS cybersecurity, defensive measures should focus on inspectability, trustworthiness, and diversity.

More Details

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Cassella, Paul; Enos, Jeremy; Fullop, Joshi; Gentile, Ann C.; Monk, Stephen T.; Naksinehaboon, Nichamon; Ogden, Jeffry B.; Rajan, Mahesh R.; Showerman, Michael; Stevenson, Joel O.; Taerat, Narate; Tucker, Tom

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

More Details
Results 1–50 of 73
Results 1–50 of 73