Publications

10 Results

Large-Scale System Monitoring Experiences and Recommendations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, Richard; Gienger, Michael; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, Kaki; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin P.; Piccinali, Jean G.; Repik, Jason; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim

Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications or in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on the needs and requirements for enabling vendors or users to develop effective, sharable end-to-end monitoring capabilities.

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.A.; Andersson, S.A.; Brandt, James M.; Cardo, N.C.; Chunduri, S.C.; Enos, J.E.; Fields, P.F.; Gentile, Ann C.; Gerber, R.B.; Gienger, M.G.; Greenseid, J.G.; Greiner, A.G.; Hadri, B.H.; He, Y.H.; Hoppe, D.H.; Kaila, U.K.; Kelly, K.K.; Klein, M.K.; Kristiansen, A.K.; Leak, S.L.; Mason, M.M.; Pedretti, Kevin P.; Piccinali, J-G.P.; Repik, Jason; Rogers, J.R.; Salminen, S.S.; Showerman, M.S.; Whitney, C.W.; Williams, J.W.

Abstract not provided.

Cray System Monitoring: Successes, Requirements, and Priorities

Ahlgren, Ville A.; Andersson, Stefan A.; Brandt, James M.; Cardo, Nicholas C.; Chunduri, Sudheer C.; Enos, Jeremy E.; Fields, Parks F.; Gentile, Ann C.; Gerber, Richard G.; Greenseid, Joe G.; Greiner, Annette G.; Hadri, Bilel H.; He, Yun H.; Hoppe, Dennis H.; Kaila, Urpo K.; Kelly, Kaki K.; Klein, Mark K.; Kristiansen, Alex K.; Leak, Steve L.; Mason, Mike M.; Pedretti, Kevin P.; Piccinali, Jean-Guillaume P.; Repik, Jason; Rogers, Jim R.; Salminen, Susanna S.; Showerman, Mike S.; Whitney, Cary W.; Williams, Jim W.

Abstract not provided.

New systems, new behaviors, new patterns: Monitoring insights from system standup

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Brandt, James M.; Gentile, Ann C.; Martin, Cindy; Repik, Jason; Taerat, Narate

Disentangling significant and important log messages from those that are routine and unimportant can be a difficult task. Further, on a new system, understanding correlations between significant and possibly new types of messages and the conditions that cause them can require significant effort and time. The initial standup of a machine can provide opportunities for investigating the parameter space of events and operations and thus for gaining insight into the events of interest. In particular, failure inducement and investigation of corner case conditions can provide knowledge of system behavior for significant issues, enabling easier diagnosis and mitigation of such issues when they actually occur during the platform lifetime. In this work, we describe the testing process and monitoring results from a testbed system in preparation for the ACES Trinity system. We describe how events in the initial standup, including changes in configuration and software and corner case testing, have provided insights that can inform future monitoring and operating conditions, both of our test systems and the eventual large-scale Trinity system.
