Publications
Defining metrics to distill large-scale HPC platform and application performance data into actionable quantities
Application performance data that accounts for resource contention and other external influences is highly coveted and extremely difficult to obtain. "Why did my application's performance change from the last time it ran?" is a question shared by application developers, program analysts, and system administrators. The answer to this question impacts nearly all programmatic and R&D efforts related to high-performance computing (HPC). Lightweight, right-fidelity monitoring infrastructures that can gather relevant application and resource performance data across the entire HPC platform can help answer it. This short technical paper formally describes an ongoing research effort to define the metrics and methods needed to distill the vast quantities of available data into a minimum set of actionable and interpretable quantities that application developers, system administrators, production analysts, and HPC platform designers can use in their respective production and R&D focus areas.