Publications

Using probabilistic characterization to reduce runtime faults in HPC systems

Brandt, James M.; Debusschere, Bert D.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, Philippe; Thompson, David; Wong, Matthew H.

The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the relationship between Mean Time To Interrupt and the number of processor elements involved is quite similar across a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close-to-optimal set of processing elements given the available pool and the reliability requirements of the application. © 2008 IEEE.
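
The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which hardware metrics or statistical model the system uses, so the single metric per node, the z-score characterization, the threshold, and the names `rank_nodes` and `select_nodes` are all assumptions made for illustration.

```python
from statistics import mean, stdev


def rank_nodes(readings):
    """Rank nodes by how far a monitored metric lies from the pool mean.

    `readings` maps a node name to one hypothetical hardware metric
    (for example a temperature or an error rate). Returns a list of
    (node, z_score) pairs, most typical node first.
    """
    values = list(readings.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs >= 2 nodes
    if sigma == 0:
        # All nodes report the same value, so none is an outlier.
        return [(node, 0.0) for node in readings]
    scores = [(node, (value - mu) / sigma) for node, value in readings.items()]
    return sorted(scores, key=lambda pair: abs(pair[1]))


def select_nodes(readings, count, max_abs_z):
    """Pick `count` nodes whose z-score magnitude is at most `max_abs_z`.

    `max_abs_z` stands in for an application's reliability requirement
    (an assumption): a stricter application passes a smaller value.
    """
    eligible = [node for node, z in rank_nodes(readings) if abs(z) <= max_abs_z]
    if len(eligible) < count:
        raise ValueError("not enough nodes satisfy the reliability requirement")
    return eligible[:count]


if __name__ == "__main__":
    # Hypothetical readings; node n4 sits far out in the distribution.
    pool = {"n0": 50.0, "n1": 51.0, "n2": 49.0, "n3": 50.5, "n4": 80.0}
    print(select_nodes(pool, count=3, max_abs_z=1.0))
```

A scheduler or resource manager built along these lines would call `select_nodes` with the currently free pool and a threshold derived from the job's reliability needs. A production system would presumably combine several metrics and use a more robust model than a single z-score, which an extreme outlier can skew.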