Publications Search

Infrastructure for In Situ System Monitoring and Application Data Analysis

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI DOI

Toward rapid understanding of production HPC applications and systems

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC application's resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI

New systems, new behaviors, new patterns: Monitoring insights from system standup

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Brandt, James M.; Gentile, Ann C.; Martin, Cindy; Repik, Jason; Taerat, Narate

Disentangling significant and important log messages from those that are routine and unimportant can be a difficult task. Further, on a new system, understanding correlations between significant and possibly new types of messages and conditions that cause them can require significant effort and time. The initial standup of a machine can provide opportunities for investigating the parameter space of events and operations and thus for gaining insight into the events of interest. In particular, failure inducement and investigation of corner case conditions can provide knowledge of system behavior for significant issues that will enable easier diagnosis and mitigation of such issues for when they may actually occur during the platform lifetime. In this work, we describe the testing process and monitoring results from a testbed system in preparation for the ACES Trinity system. We describe how events in the initial standup including changes in configuration and software and corner case testing has provided insights that can inform future monitoring and operating conditions, both of our test systems and the eventual large-scale Trinity system.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI DOI

Extending LDMS to enable performance monitoring in multi-core applications

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Feldman, Steven; Zhang, Deli; Dechev, Damian D.; Brandt, James M.

Identifying design patterns that limit the performance of multi-core algorithms is a challenging task. There are many known methods by which threads synchronize their actions and each method may exhibit different behavior in different use cases. These use cases may vary in regards to the workload being executed, number of parallel tasks, dependencies between these tasks, and the behavior of the system scheduler. Restructuring algorithms to overcome performance limitations requires intimate knowledge on how these algorithms utilize the hardware. In our experience, we have found a lack of adequate tools to gain such knowledge. To address this, we have enhanced and implemented additional data sampler modules for OVIS's Lightweight Distributed Metric Service (LDMS) to enable scalable distributed collection of hardware performance counter data. These modules provide an interface by which LDMS can utilize the PAPI library, Linux perf tools, and RAPL to collect hardware performance data of interest. Using these samplers, we plan to monitor the intra-node behavior, including contention for node level shared resources, of multi-core applications for a diverse set of use cases. We are currently exploring how the values reported are affected by the level of concurrency, the synchronization methodologies, and progress guarantees. We hope to use this information to identify ways to restructure algorithms to increase their performance.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI

New Systems New Behaviors New Patterns: Monitoring Insights from System Standup [PowerPoint]

Brandt, James M.; Gentile, Ann C.; Martin, Cindy M.; Repik, Jason; Taerat, Narate T.

Abstract not provided.