Publications
Using Cloud constructs and predictive analysis to enable pre-failure process migration in HPC systems
Brandt, James M.; Chen, F.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, P.; Roe, D.; Thompson, D.; Wong, M.
Accurate failure prediction, in conjunction with efficient process migration facilities that include some Cloud constructs, can enable failure avoidance in large-scale high-performance computing (HPC) platforms. In this work, we demonstrate a prototype system that combines our probabilistic failure prediction system with virtualization mechanisms and techniques to provide a whole-system approach to failure avoidance. The demonstration uses a failure scenario based on a real-world HPC case study. © 2010 IEEE.