Publications
Design and implementation of a scalable HPC monitoring system
Sanchez, S.; Bonnie, A.; Van Heule, G.; Robinson, C.; Deconinck, A.; Kelly, K.; Snead, Q.; Brandt, James M.
Over the past decade, platforms at Los AlamosNational Laboratory (LANL) have experienced large increases in complexity and scale to reach computational targets. The changes to the compute platforms have presented new challenges to the production monitoring systems in which they must not only cope with larger volumes of monitoring data, but also must provide new capabilities for the management, distribution, and analysis of this data. This schema must support both real-time analysis for alerting on urgent issues, as well as analysis of historical data for understanding performance issues and trends in systembehavior. This paper presents the design of our proposed next-generation monitoring system, as well as implementation details for an initial deployment. This design takes the form of a multi-stage data processing pipeline, including a scalable cluster for data aggregation and early analysis, a message broker for distribution of this data to varied consumers, and an initial selection of consumer services for alerting and analysis. We will also present estimates of the capabilities and scale required to monitor two upcoming compute platforms at LANL.