Publications / Conference Poster

Comparing, contrasting, generalizing, and integrating two current designs for fault-tolerant MPI

Hassani, Amin; Skjellum, Anthony; Brightwell, Ronald B.; Bangalore, Purushotham V.

We compare and contrast the approaches and key features of two proposals for fault-tolerant MPI: User-Level Failure Mitigation (ULFM) and Fault-Aware MPI (FA-MPI). We show how they are complementary and how each could leverage the other through modifications and/or extensions. We show how to "weaken" and extend ULFM to help integrate it with FA-MPI, with the corollary benefit of broadening ULFM's applicability. We also consider the reducibility of each proposal to the other. This helps identify which components of each are minimally "required" for standardization and which could instead be layered on top of a future MPI specification.