Publications / Conference Poster

Versioned distributed arrays for resilience in scientific applications: Global View Resilience

Chien, A.; Balaji, P.; Beckman, P.; Dun, N.; Fang, A.; Fujita, H.; Iskra, K.; Rubenstein, Z.; Zheng, Z.; Schreiber, R.; Hammond, J.; Dinan, J.; Laguna, I.; Richards, D.; Dubey, A.; Van Straalen, B.; Hoemmen, M.; Heroux, Michael A.; Teranishi, Keita T.; Siegel, A.

Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort required to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, and require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads below 2% are generally achieved. We conclude that GVR's interfaces and implementation are flexible and portable, and that they create a gentle-slope path to tolerating growing error rates in future systems.
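
The idea of versioning an array and rolling it back after an error can be illustrated with a minimal single-process sketch. This is not GVR's interface: the class VersionedArray and its methods put, get, version_inc, and restore are hypothetical names chosen for illustration, and the sketch ignores distribution across nodes entirely.

```python
import math


class VersionedArray:
    """Toy, single-process stand-in for a versioned array.

    Hypothetical illustration only; this is not GVR's API and it is not distributed.
    """

    def __init__(self, size, fill=0.0):
        self._current = [fill] * size
        self._versions = []  # immutable snapshots, oldest first

    def put(self, index, value):
        self._current[index] = value

    def get(self, index):
        return self._current[index]

    def version_inc(self):
        """Snapshot the current contents and return the new version number."""
        self._versions.append(tuple(self._current))
        return len(self._versions) - 1

    def restore(self, version):
        """Roll the current contents back to a previously saved version."""
        self._current = list(self._versions[version])


# The application decides when to create a version and when to recover.
state = VersionedArray(3)
state.put(0, 1.5)
good = state.version_inc()      # application-chosen checkpoint

state.put(0, float("nan"))      # simulate a corrupted value
if math.isnan(state.get(0)):    # application-level error check
    state.restore(good)         # recover from the last good version
```

In this sketch the application, not the runtime, chooses the versioning points and the error check, which mirrors the application-controlled approach the abstract describes.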