Publications Search

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

TYPE Conference Poster YEAR 2014

Scopus OSTI DOI

Lightweight Distributed Metric Service (LDMS): Run-time Resource Utilization Monitoring

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

LANL_Monitoring_summit_7-23-2013

Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

Copy of High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6 (Overheads)

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

High Fidelity Data Collection and Transport Service Applied to the Cray XE6/XK6 (Paper)

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

OVIS Suite of Tools

Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

LDMS: Lightweight Distributed Metric Service for HPC Monitoring

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

SOS_High_Level_Documentation

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Lightweight Distributed Metric Service

Gentile, Ann C.; Brandt, James M.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Copy of Lightweight Distributed Metric Service

Gentile, Ann C.; Brandt, James M.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels, allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467, Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

TYPE SAND Report YEAR 2012

OSTI DOI

Modeling Failures in Large-Scale Computer Systems

Mayo, Jackson M.; Brandt, James M.; Gentile, Ann C.; Wong, Matthew H.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Develop feedback system for intelligent dynamic resource allocation to improve application performance

Brandt, James M.; Gentile, Ann C.

This report provides documentation for the completion of the Sandia Level II milestone 'Develop feedback system for intelligent dynamic resource allocation to improve application performance'. This milestone demonstrates the use of a scalable data collection, analysis, and feedback system that provides lightweight insight into how an application is utilizing the hardware resources of a high performance computing (HPC) platform. Further, we demonstrate using the same mechanisms that transport data for remote analysis and visualization to provide low-latency run-time feedback to applications. The ultimate goal of this body of work is performance optimization in the face of the ever-increasing size and complexity of HPC systems.

TYPE SAND Report YEAR 2011

OSTI DOI

Framework for Enabling System Understanding

Brandt, James M.; Chen, Frank X.; Gentile, Ann C.; Mayo, Jackson M.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Baler: Deterministic, lossless log message clustering tool

Computer Science - Research and Development

Taerat, Narate; Brandt, Jim; Gentile, Ann C.; Wong, Matthew H.; Leangsuksun, Chokchai

The rate of failures in HPC systems continues to increase as the number of components comprising the systems increases. System logs are a valuable information source for analyzing system failures and their root causes. However, system log files are usually too large and complex to analyze manually. Some existing log clustering tools seek to help analysts explore these logs; however, they fail to satisfy our needs with respect to scalability, usability, and quality of results. Thus, we have developed a log clustering tool to better address these needs. In this paper we present our novel approach and initial experimental results. © Springer-Verlag 2011.

TYPE Conference YEAR 2011

Scopus OSTI