Standardized Environment for Monitoring Heterogeneous Architectures
Proceedings - IEEE International Conference on Cluster Computing, ICCC
Increasingly diverse architectures and operating systems continue to emerge in the HPC industry. As a result, HPC centers are becoming more heterogeneous, which introduces a variety of challenges for system administrators. Monitoring a wide array of platforms is difficult by itself, and the problem compounds in an environment where new platforms are added frequently. In such an environment, it becomes necessary to create a standard monitoring environment across these platforms that allows for simple administration with minimal setup.

This paper presents the solutions introduced in the HPC Development department at Sandia National Laboratories to meet these challenges. These include our adoption of a multi-stage data-collection pipeline across our clusters, implemented from the ground up in our Golden Image. We also discuss our infrastructure for supporting a heterogeneous environment and the activities in progress to improve our center. These advances simplify system standup and make monitoring integration easier and faster for new systems, which our center's domain requires.