Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications or in vendor development strategies. In this paper, we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on the needs and requirements for enabling vendors or users to develop effective, sharable, end-to-end monitoring capabilities.