Publications / SAND Report

High Performance Computing Metrics to Enable Application-Platform Communication

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Sandia has invested heavily in scientific/engineering application development and in the research, development, and deployment of large-scale HPC platforms to support the computational needs of these applications. As application developers continually expand the capabilities of their software and spend more time on performance tuning of applications for these platforms, HPC platform resources are at a premium because they are heavily shared and serve the varied needs of many users. To ensure that HPC platform resources are used efficiently and perform as designed, it is necessary to obtain reliable data on resource utilization that will allow us to investigate the occurrence, severity, and causes of performance-affecting contention between applications. The work presented in this paper was an initial step to determine whether resource contention can be understood and minimized through monitoring, modeling, planning, and infrastructure. This paper describes the set of metric definitions, identified in this research, that can be used as meaningful and potentially actionable indicators of performance-affecting contention between applications. These metrics were verified using the observed slowdown of IOR, IMB, and CTH in operating scenarios that forced contention. This paper also describes system/application monitoring activities that are critical to distilling vast amounts of data into quantities that hold the key to understanding an application's performance under production conditions and that will ultimately aid Sandia's efforts to succeed in extreme-scale computing.