Publications

Results 51–100 of 157

Taxonomist: Application Detection Through Rich Monitoring Data

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Ates, Emre; Tuncer, Ozan; Turk, Ata; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

Modern supercomputers are shared among thousands of users running a variety of applications. Knowing which applications are running in the system can bring substantial benefits: knowledge of applications that intensively use shared resources can aid scheduling; unwanted applications such as cryptocurrency mining or password cracking can be blocked; system architects can make design decisions based on system usage. However, identifying applications on supercomputers is challenging because applications are executed using esoteric scripts along with binaries that are compiled and named by users. This paper introduces a novel technique to identify applications running on supercomputers. Our technique, Taxonomist, is based on the empirical evidence that applications have different and characteristic resource utilization patterns. Taxonomist uses machine learning to classify known applications and also detect unknown applications. We test our technique with a variety of benchmarks and cryptocurrency miners, and also with applications that users of a production supercomputer ran during a 6 month period. We show that our technique achieves nearly perfect classification for this challenging data set.

TYPE Presentation YEAR 2018

Scopus OSTI DOI

Live feed Sandia CAPVIZ HPC cluster performance analysis & visualization demonstration

Allan, Benjamin A.; Schmitz, Mark E.; Walsh, Edward J.; Aguilar, Michael J.; Brandt, James M.; Gentile, Ann C.; Ogden, Jeffry B.; Monk, Stephen T.; Noe, John P.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Holistic measurement-driven system assessment

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Bauer, Greg; Enos, Jeremy; Showerman, Michael; Kaplan, Larry; Bode, Brett; Greiner, Annette; Bonnie, Amanda; Mason, Mike; Iyer, Ravishankar K.; Kramer, William

In high-performance computing systems, application performance and throughput are dependent on a complex interplay of hardware and software subsystems and variable workloads with competing resource demands. Data-driven insights into the potentially widespread scope and propagation of impact of events, such as faults and contention for shared resources, can be used to drive more effective use of resources, for improved root cause diagnosis, and for predicting performance impacts. We present work developing integrated capabilities for holistic monitoring and analysis to understand and characterize propagation of performance-degrading events. These characterizations can be used to determine and invoke mitigating responses by system administrators, applications, and system software.

TYPE Conference Poster YEAR 2017

Scopus OSTI DOI

Task Placement to Reduce Application Communication Costs

Devine, Karen D.; Brandt, James M.; Deveci, Mehmet D.; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Pedretti, Kevin P.; Rajamanickam, Sivasankaran R.; Taylor, Mark A.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Holistic Measurement Driven System Assessment

Jha, Saurabh J.; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew K.; Bauer, Greg B.; Enos, Jeremy E.; Showerman, Michael S.; Kaplan, Larry K.; Bode, Brett B.; Greiner, Annette G.; Bonnie, Amanda B.; Mason, Mike M.; Iyer, Ravishankar I.; Kramer, William K.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI DOI

Discovering Metrics of Network Contention

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II

DeConinck, Adam D.; Nam, Hai A.; Mortin, Dave M.; Bonnie, Amanda B.; Lueninghoener, Cory L.; Brandt, James M.; Gentile, Ann C.; Pedretti, Kevin P.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon D.; Allan, Benjamin A.; Davis, Michael C.; Repik, Jason

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

Formicola, Valerio F.; Jha, Saurabh J.; Chen, Daniel C.; Dong, Wen D.; Bonnie, Amanda B.; Mason, Mike M.; Brandt, James M.; Gentile, Ann C.; Kaplan, Larry K.; Repik, Jason; Enos, Jeremy E.; Showerman, Mike S.; Greiner, Annette G.; Kalbarczyk, Zbigniew K.; Iyer, Ravishankar I.; Kramer, Bill K.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II (Paper)

DeConinck, Adam D.; Nam, Hai A.; Morton, David P.; Bonnie, Amanda B.; Lueninghoener, Cory L.; Brandt, James M.; Gentile, Ann C.; Pedretti, Kevin P.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon D.; Allan, Benjamin A.; Davis, Mike D.; Repik, Jason

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Diagnosing Performance Variations in HPC Architectures Using Machine Learning

Tuncer, Ozan T.; Ates, Emre A.; Zhang, Yijia Z.; Turk, Ata T.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Contention and Congestion: Challenges and Approaches to Understanding Application Impact

Gentile, Ann C.; Brandt, James M.; Agelastos, Anthony M.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Diagnosing performance variations in HPC applications using machine learning

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

With the growing complexity and scale of high performance computing (HPC) systems, application performance variation has become a significant challenge in efficient and resilient system management. Application performance variation can be caused by resource contention as well as software- and firmware-related problems, and can lead to premature job termination, reduced performance, and wasted compute platform resources. To effectively alleviate this problem, system administrators must detect and identify the anomalies that are responsible for performance variation and take preventive actions. However, diagnosing anomalies is often a difficult task given the vast amount of noisy and high-dimensional data being collected via a variety of system monitoring infrastructures. In this paper, we present a novel framework that uses machine learning to automatically diagnose previously encountered performance anomalies in HPC systems. Our framework leverages resource usage and performance counter data collected during application runs. We first convert the collected time series data into statistical features that retain application characteristics to significantly reduce the computational overhead of our technique. We then use machine learning algorithms to learn anomaly characteristics from this historical data and to identify the types of anomalies observed while running applications. We evaluate our framework both on an HPC cluster and on a public cloud, and demonstrate that our approach outperforms current state-of-the-art techniques in detecting anomalies, reaching an F-score over 0.97.

TYPE Presentation YEAR 2017

Scopus OSTI DOI

Defining Metrics to Distill Large-Scale HPC Platform and Application Performance Data into Actionable Quantities: Resource Contention of File System and Aries Interconnect

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Discovery interpretation and communication of meaningful information in HPC monitoring data

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

TYPE Journal Article YEAR 2016

Scopus OSTI DOI

High Performance Computing Metrics to Enable Application-Platform Communication

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Sandia has invested heavily in scientific/engineering application development and in the research, development, and deployment of large scale HPC platforms to support the computational needs of these applications. As application developers continually expand the capabilities of their software and spend more time on performance tuning of applications for these platforms, HPC platform resources are at a premium as they are a heavily shared resource serving the varied needs of many users. To ensure that the HPC platform resources are being used efficiently and perform as designed, it is necessary to obtain reliable data on resource utilization that will allow us to investigate the occurrence, severity, and causes of performance-affecting contention between applications. The work presented in this paper was an initial step to determine if resource contention can be understood and minimized through monitoring, modeling, planning and infrastructure. This paper describes the set of metric definitions, identified in this research, that can be used as meaningful and potentially actionable indicators of performance-affecting contention between applications. These metrics were verified using the observed slowdown of IOR, IMB, and CTH in operating scenarios that forced contention. This paper also describes system/application monitoring activities that are critical to distilling vast amounts of data into quantities that hold the key to understanding for an application's performance under production conditions and that will ultimately aid in Sandia's efforts to succeed in extreme-scale computing.

TYPE SAND Report YEAR 2016

OSTI DOI

Large-scale persistent numerical data source monitoring system experiences

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Brandt, James M.; Gentile, Ann C.; Showerman, M.; Enos, J.; Fullop, J.; Bauer, G.

Issues of High Performance Computer (HPC) system diagnosis, automated system management, and resource-aware computing are all dependent on high fidelity, system wide, persistent monitoring. Development and deployment of an effective persistent system wide monitoring service at large-scale presents a number of challenges, particularly when collecting data at the granularities needed to resolve features of interest and obtain early indication of significant events on the system. In this paper we provide experiences from our developments on and two-year deployment of our Lightweight Distributed Metric Service (LDMS) monitoring system on NCSA's 27,648 node Blue Waters system. We present monitoring related challenges and issues and their effects on the major functional components of general monitoring infrastructures and deployments: Data Sampling, Data Aggregation, Data Storage, Analysis Support, Operations, and Data Stewardship. Based on these experiences, we provide recommendations for effective development and deployment of HPC monitoring systems.

TYPE Conference Poster YEAR 2016

Scopus OSTI DOI

Design and implementation of a scalable HPC monitoring system

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Sanchez, S.; Bonnie, A.; Van Heule, G.; Robinson, C.; Deconinck, A.; Kelly, K.; Snead, Q.; Brandt, James M.

Over the past decade, platforms at Los Alamos National Laboratory (LANL) have experienced large increases in complexity and scale to reach computational targets. The changes to the compute platforms have presented new challenges to the production monitoring systems in which they must not only cope with larger volumes of monitoring data, but also must provide new capabilities for the management, distribution, and analysis of this data. This schema must support both real-time analysis for alerting on urgent issues, as well as analysis of historical data for understanding performance issues and trends in system behavior. This paper presents the design of our proposed next-generation monitoring system, as well as implementation details for an initial deployment. This design takes the form of a multi-stage data processing pipeline, including a scalable cluster for data aggregation and early analysis, a message broker for distribution of this data to varied consumers, and an initial selection of consumer services for alerting and analysis. We will also present estimates of the capabilities and scale required to monitor two upcoming compute platforms at LANL.

TYPE Conference Poster YEAR 2016

Scopus OSTI DOI

Dynamic Machine Specific Register (MSR) Data Collection as a System Service

Bauer, Gregory B.; Brandt, James M.; Gentile, Ann C.; Kot, Andriy K.; Showerman, Mike S.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Network Performance Counter Monitoring and Analysis on the Cray XC Platform

Brandt, James M.; Froese, Edwin F.; Gentile, Ann C.; Kaplan, Larry K.; Allan, Benjamin A.; Walsh, Edward J.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Dynamic Machine Specific Register (MSR) Data Collection as a System Service

Bauer, Gregory B.; Brandt, James M.; Gentile, Ann C.; Kot, Andriy K.; Showerman, Mike S.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Network Performance Counter Monitoring and Analysis on the Cray XC Platform

Brandt, James M.; Froese, Edwin F.; Gentile, Ann C.; Kaplan, Larry K.; Allan, Benjamin A.; Walsh, Edward J.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Smart HPC Centers: data analysis feedback and response

Brandt, James M.; Gentile, Ann C.; Martin, C. M.; Allan, Benjamin A.; Devine, Karen D.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Monitoring High Speed Network Fabrics: Experiences and Needs

Brandt, James M.; Gentile, Ann C.; Allan, Benjamin A.; Lefantzi, Sophia L.; Aguilar, Michael J.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Design and Implementation of a Scalable Monitoring System for Trinity

DeConinck, A.D.; Bonnie, A.B.; Kelly, K.K.; Sanchez, S.S.; Martin, C.M.; Mason, M.M.; Brandt, James M.; Gentile, Ann C.; Allan, Benjamin A.; Agelastos, Anthony M.; Davis, M.D.; Berry, M.B.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Infrastructure for In Situ System Monitoring and Application Data Analysis

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI DOI

Toward rapid understanding of production HPC applications and systems

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC applications' resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

TYPE Conference Poster YEAR 2015

Scopus OSTI

New systems, new behaviors, new patterns: Monitoring insights from system standup

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Brandt, James M.; Gentile, Ann C.; Martin, Cindy; Repik, Jason; Taerat, Narate

Disentangling significant and important log messages from those that are routine and unimportant can be a difficult task. Further, on a new system, understanding correlations between significant and possibly new types of messages and conditions that cause them can require significant effort and time. The initial standup of a machine can provide opportunities for investigating the parameter space of events and operations and thus for gaining insight into the events of interest. In particular, failure inducement and investigation of corner case conditions can provide knowledge of system behavior for significant issues that will enable easier diagnosis and mitigation of such issues when they actually occur during the platform lifetime. In this work, we describe the testing process and monitoring results from a testbed system in preparation for the ACES Trinity system. We describe how events in the initial standup, including changes in configuration and software and corner case testing, have provided insights that can inform future monitoring and operating conditions, both of our test systems and the eventual large-scale Trinity system.

TYPE Conference Poster YEAR 2015

Scopus OSTI DOI

Extending LDMS to enable performance monitoring in multi-core applications

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Feldman, Steven; Zhang, Deli; Dechev, Damian D.; Brandt, James M.

Identifying design patterns that limit the performance of multi-core algorithms is a challenging task. There are many known methods by which threads synchronize their actions and each method may exhibit different behavior in different use cases. These use cases may vary with regard to the workload being executed, the number of parallel tasks, dependencies between these tasks, and the behavior of the system scheduler. Restructuring algorithms to overcome performance limitations requires intimate knowledge of how these algorithms utilize the hardware. In our experience, we have found a lack of adequate tools to gain such knowledge. To address this, we have enhanced and implemented additional data sampler modules for OVIS's Lightweight Distributed Metric Service (LDMS) to enable scalable distributed collection of hardware performance counter data. These modules provide an interface by which LDMS can utilize the PAPI library, Linux perf tools, and RAPL to collect hardware performance data of interest. Using these samplers, we plan to monitor the intra-node behavior, including contention for node level shared resources, of multi-core applications for a diverse set of use cases. We are currently exploring how the values reported are affected by the level of concurrency, the synchronization methodologies, and progress guarantees. We hope to use this information to identify ways to restructure algorithms to increase their performance.

TYPE Conference Poster YEAR 2015

Scopus OSTI

New Systems New Behaviors New Patterns: Monitoring Insights from System Standup [PowerPoint]

Brandt, James M.; Gentile, Ann C.; Martin, Cindy M.; Repik, Jason; Taerat, Narate T.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI DOI

Monitoring for Cori and Trinity

Gentile, Ann C.; Brandt, James M.; Lujan, Jim L.; Martin, Cindy M.; Wright, Nick W.; Butler, Tina B.

Abstract not provided.

TYPE Presentation YEAR 2015

OSTI

Infrastructure for In Situ System Monitoring and Application Data Analysis

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI DOI

Uncovering Bottlenecks in Data Transfer from a Filesystem to HPSS using the Lightweight Distributed Metric Service (LDMS)

Brandt, James M.; Collins, William C.; Gentile, Ann C.; Martinez II, Michael A.; Mcree, Susan R.; Sands, Daniel S.; Yaklin, Allan C.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Monitoring and Analysis Tools for Numeric and Log Data

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2015

OSTI

Sandia's Advanced Architecture Test Beds

Laros, James H.; Ang, James A.; Hammond, Simon D.; Brandt, James M.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Enabling Advanced Operational Analysis Through Multi-subsystem Data Integration on Trinity

Brandt, James M.; DeBonis, David D.; Gentile, Ann C.; Lujan, Jim L.; Martin, Cindy M.; Martinez, David J.; Olivier, Stephen L.; Pedretti, Kevin P.; Taerat, Narate T.; Velarde, Ron V.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Scalable Integrated High-Fidelity Continuous Monitoring

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity

Brandt, James M.; DeBonis, David D.; Gentile, Ann C.; Lujan, James L.; Martin, Cindy M.; Martinez, David J.; Olivier, Stephen L.; Pedretti, Kevin P.; Taerat, Narate T.; Velarde, Ron V.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Pedretti, Kevin P.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Monitoring Application Resource Utilization on the

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2015

OSTI

Demonstrating improved application performance using dynamic monitoring and task mapping

2014 IEEE International Conference on Cluster Computing, CLUSTER 2014

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Pedretti, Kevin P.

This work demonstrates the integration of monitoring, analysis, and feedback to perform application-to-resource mapping that adapts to both static architecture features and dynamic resource state. In particular, we present a framework for mapping MPI tasks to compute resources based on run-time analysis of system-wide network data, architecture-specific routing algorithms, and application communication patterns. We address several challenges. Within each node, we collect local utilization data. We consolidate that information to form a global view of system performance, accounting for system-wide factors including competing applications. We provide an interface for applications to query the global information. Then we exploit the system information to change the mapping of tasks to nodes so that system bottlenecks are avoided. We demonstrate the benefit of this monitoring and feedback by remapping MPI tasks based on route-length, bandwidth, and credit-stalls metrics for a parallel sparse matrix-vector multiplication kernel. In the best case, remapping based on dynamic network information in a congested environment recovered 48.9% of the time lost to congestion, reducing matrix-vector multiplication time by 7.8%. Our experiments focus on the Cray XE/XK platform, but the integration concepts are generally applicable to any platform for which applicable metrics and route knowledge can be obtained.

TYPE Conference Poster YEAR 2014

Scopus OSTI DOI

Using architecture information and real-time resource state to reduce power consumption and communication costs in parallel applications

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Pedretti, Kevin P.; Rajamanickam, Sivasankaran R.; Bunde, David P.; Deveci, Mehmet D.; Catalyurek, Umit V.

As computer systems grow in both size and complexity, the need for applications and run-time systems to adjust to their dynamic environment also grows. The goal of the RAAMP LDRD was to combine static architecture information and real-time system state with algorithms to conserve power, reduce communication costs, and avoid network contention. We developed new data collection and aggregation tools to extract static hardware information (e.g., node/core hierarchy, network routing) as well as real-time performance data (e.g., CPU utilization, power consumption, memory bandwidth saturation, percentage of used bandwidth, number of network stalls). We created application interfaces that allowed this data to be used easily by algorithms. Finally, we demonstrated the benefit of integrating system and application information for two use cases. The first used real-time power consumption and memory bandwidth saturation data to throttle concurrency to save power without increasing application execution time. The second used static or real-time network traffic information to reduce or avoid network congestion by remapping MPI tasks to allocated processors. Results from our work are summarized in this report; more details are available in our publications [2, 6, 14, 16, 22, 29, 38, 44, 51, 54].

TYPE SAND Report YEAR 2014

OSTI DOI

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Pedretti, Kevin P.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI DOI

Lightweight Distributed Metric Service (LDMS)

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

SNL-Monitoring-Overview_talk

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI DOI

Toward Rapid Understanding of Production HPC Applications and Systems

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI DOI

Large Scale HPC Monitoring

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Large Scale System Monitoring and Analysis on Blue Waters using OVIS (presentation)

Brandt, James M.; Gentile, Ann C.; Allan, Benjamin A.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Large Scale System Monitoring and Analysis on Blue Waters using OVIS

Brandt, James M.; Gentile, Ann C.; Allan, Benjamin A.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI
