The UQ Toolkit (UQTk) is a collection of libraries and tools for the quantification of uncertainty in numerical model predictions. Version 3.0.3 offers intrusive and non-intrusive methods for propagating input uncertainties through computational models, tools for sen- sitivity analysis, methods for sparse surrogate construction, and Bayesian inference tools for inferring parameters from experimental data. This manual discusses the download and installation process for UQTk, provides pointers to the UQ methods used in the toolkit, and describes some of the examples provided with the toolkit.
Proceedings of ScalA 2016: 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis
We present a resilient task-based domain-decomposition preconditioner for partial differential equations (PDEs) built on top of User Level Fault Mitigation Message Passing Interface (ULFM-MPI). The algorithm reformulates the PDE as a sampling problem, followed by a robust regression-based solution update that is resilient to silent data corruptions (SDCs). We adopt a server-client model where all state information is held by the servers, while clients only serve as computational units. The task-based nature of the algorithm and the capabilities of ULFM complement each other to support missing tasks, making the application resilient to clients failing.We present weak and strong scaling results on Edison, National Energy Research Scientific Computing Center (NERSC), for a nominal and a fault-injected case, showing that even in the presence of faults, scalability tested up to 50k cores is within 90%. We then quantify the variability of weak and strong scaling due to the presence of faults. Finally, we discuss the performance of our application with respect to subdomain size, server/client configuration, and the interplay between energy and resilience.
The UQ Toolkit (UQTk) is a collection of libraries and tools for the quantification of uncer- tainty in numerical model predictions. Version 3.0 offers intrusive and non-intrusive methods for propagating input uncertainties through computational models, tools for sensitivity anal- ysis, methods for sparse surrogate construction, and Bayesian inference tools for inferring parameters from experimental data. This manual discusses the download and installation process for UQTk, provides pointers to the UQ methods used in the toolkit, and describes some of the examples provided with the toolkit.
We discuss algorithm-based resilience to silent data corruption (SDC) in a task- based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDC. The imple- mentation is based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run up to [?] 51 K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single bit-flip to demonstrate the resilience of the application to synthetically injected SDC. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corrup- tion of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults only yields a 2% change in the overhead to overcome their presence, from 7% to 9%. We then discuss potential savings in energy consumption via dynamics voltage/frequency scaling, and its interplay with fault-rates, and application overhead. [?] Sandia National Laboratories, Livermore, CA ( fnrizzi@sandia.gov ). + Sandia National Laboratories, Livermore, CA ( knmorri@sandia.gov ). ++ Sandia National Laboratories, Livermore, CA ( ksargsy@sandia.gov ). SS Duke University, Durham, NC ( paul.mycek@duke.edu ). P Sandia National Laboratories, Livermore, CA ( csafta@sandia.gov ). k Laboratoire d'Informatique pour la M'ecanique et les Sciences de l'Ing'enieur, Orsay, France ( olm@limsi.fr ). [?][?] Duke University, Durham, NC ( omar.knio@duke.edu ). ++ Sandia National Laboratories, Livermore, CA ( bjdebus@sandia.gov ).
We present a domain-decomposition-based pre-conditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm is based on the following steps: first, the computational domain is split into overlapping subdomains, second, the target PDE is solved on each subdomain for sampled values of the local current boundary conditions, third, the subdomain solution samples are collected and fed into a regression step to build maps between the subdomains' boundary conditions, finally, the intersection of these maps yields the updated state at the subdomain boundaries. This reformulation allows us to recast the problem as a set of independent tasks. The implementation relies on an asynchronous server-client framework, where one or more reliable servers hold the data, while the clients ask for tasks and execute them. This framework provides resiliency to hard faults such that if a client crashes, it stops asking for work, and the servers simply distribute the work among all the other clients alive. Erroneous subdomain solves (e.g. due to soft faults) appear as corrupted data, which is either rejected if that causes a task to fail, or is seamlessly filtered out during the regression stage through a suitable noise model. Three different types of faults are modeled: hard faults modeling nodes (or clients) crashing, soft faults occurring during the communication of the tasks between server and clients, and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE, and explore the effect of the faults at various failure rates.