This report is a sequel to [PB16], in which we provided a first progress report on research and development towards a scalable, asynchronous many-task, in situ statistical analysis engine using the Legion runtime system. This earlier work included a prototype implementation of a proposed solution, using a proxy mini-application as a surrogate for a full-scale scientific simulation code. The first scalability studies were conducted with the above on modestly-sized experimental clusters. In contrast, in the current work we have integrated our in situ analysis engines with a full-size scientific application (S3D, using the Legion-SPMD model), and have conducted numerical tests on the largest computational platform currently available for DOE science applications. We also provide details regarding the design and development of a light-weight asynchronous collectives library. We describe how this library is utilized within our SPMD-Legion S3D workflow, and compare the data aggregation technique deployed herein to the approach taken within our previous work.
Proceedings of ESPM2 2016: 2nd International Workshop on Extreme Scale Programming Models and Middleware - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis
Task-based execution models have received considerable attention in recent years to meet the performance challenges facing high-performance computing (HPC). In this paper we introduce MetaPASS (Metaprogramming-enabled Parallelism from Apparently Sequential Semantics), a proof-of-concept, non-intrusive header library that enables implicit task-based parallelism in sequential C++ code. MetaPASS is a data-driven model, relying on dependency analysis of variable read/write accesses to derive a directed acyclic graph (DAG) of the computation to be performed. MetaPASS enables embedding of runtime dependency analysis directly in C++ applications using only template metaprogramming. Rather than requiring verbose task-based code or source-to-source compilers, a native C++ code can be made task-based with minimal modifications. We present an overview of the programming model enabled by MetaPASS and the C++ runtime API required to support it. Details are provided regarding how standard template metaprogramming is used to capture task dependencies. We finally discuss how the programming model can be deployed both in an MPI+X and in a standalone distributed-memory context.
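A minimal, hypothetical sketch of the underlying idea (plain C++, not the MetaPASS API; the Task, ToyRuntime, and submit names are our own): dependencies are derived automatically from declared read/write accesses, so apparently sequential code yields a task DAG. MetaPASS captures these accesses via template metaprogramming rather than through an explicit runtime structure like this.

    // Toy illustration: derive task dependencies from read/write sets.
    #include <cstdio>
    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    struct Task {
        std::string name;
        std::vector<std::string> reads, writes;
        std::function<void()> body;
        std::vector<size_t> deps;   // indices of tasks this task depends on
    };

    class ToyRuntime {
        std::vector<Task> tasks_;
        std::map<std::string, size_t> last_writer_;  // variable name -> task index
    public:
        void submit(Task t) {
            // read-after-write and write-after-write edges against the last writer
            for (const auto& v : t.reads)
                if (last_writer_.count(v)) t.deps.push_back(last_writer_[v]);
            for (const auto& v : t.writes)
                if (last_writer_.count(v)) t.deps.push_back(last_writer_[v]);
            size_t id = tasks_.size();
            for (const auto& v : t.writes) last_writer_[v] = id;
            tasks_.push_back(std::move(t));
        }
        void run() {
            // Sequential execution in submission order; a real runtime would
            // schedule tasks with no unfinished dependencies concurrently.
            for (const auto& t : tasks_) {
                std::printf("%s depends on:", t.name.c_str());
                for (size_t d : t.deps) std::printf(" %s", tasks_[d].name.c_str());
                std::printf("\n");
                t.body();
            }
        }
    };

    int main() {
        double a = 0, b = 0;
        ToyRuntime rt;
        rt.submit({"init_a", {}, {"a"}, [&] { a = 1.0; }});
        rt.submit({"init_b", {}, {"b"}, [&] { b = 2.0; }});
        rt.submit({"sum", {"a", "b"}, {"a"}, [&] { a += b; }});
        rt.run();
        std::printf("a = %g\n", a);
    }

Here the two initialization tasks are independent and only the final task carries edges to both, which is exactly the information a scheduler needs to run the independent work in parallel.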
The flame structure corresponding to lean hydrogen–air premixed flames in intense sheared turbulence in the thin reaction zone regime is quantified from flame thickness and conditional scalar dissipation rate statistics, obtained from recent direct numerical simulation data of premixed temporally-evolving turbulent slot jet flames [1]. It is found that, on average, these sheared turbulent flames are thinner than their corresponding planar laminar flames. Extensive analysis is performed to identify the reason for this counter-intuitive thinning effect. The factors controlling the flame thickness are analyzed through two different routes: the kinematic route, and the transport and chemical kinetics route. The kinematic route is examined by comparing the statistics of the normal strain rate due to fluid motion with the statistics of the normal strain rate due to varying flame displacement speed, i.e., self-propagation. It is found that while the fluid normal straining is positive and tends to separate iso-scalar surfaces, the dominant normal strain rate due to self-propagation is negative and tends to bring the iso-scalar surfaces closer, resulting in overall thinning of the flame. The transport and chemical kinetics route is examined by studying the non-unity Lewis number effect on the premixed flames. The effects from the kinematic route are found to couple with the transport and chemical kinetics route. In addition, the intermittency of the conditional scalar dissipation rate is also examined. It is found to exhibit a unique non-monotonicity of the exponent of the stretched exponential function conventionally used to describe probability density function tails of such variables. The non-monotonicity is attributed to the detailed chemical structure of hydrogen-air flames, in which heat release occurs close to the unburnt reactants at near free-stream temperatures.
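For reference, the kinematic argument can be summarized with the usual decomposition of the total normal strain rate acting between adjacent iso-scalar surfaces. The notation below is assumed, not quoted from the paper: n is the flame normal, u the fluid velocity, and S_d the flame displacement speed.

    a_n \;=\; \underbrace{\mathbf{n} \cdot \nabla \mathbf{u} \cdot \mathbf{n}}_{\text{fluid motion}}
        \;+\; \underbrace{\frac{\partial S_d}{\partial n}}_{\text{self-propagation}}

When the fluid-motion term is positive but the self-propagation term is negative and larger in magnitude, adjacent iso-scalar surfaces approach one another, which is consistent with the net flame thinning reported above.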
Formulas for incremental or parallel computation of second order central moments have long been known, and recent extensions of these formulas to univariate and multivariate moments of arbitrary order have been developed. Such formulas are of key importance in scenarios where incremental results are required and in parallel and distributed systems where communication costs are high. We survey these recent results, and improve them with arbitrary-order, numerically stable one-pass formulas which we further extend with weighted and compound variants. We also develop a generalized correction factor for standard two-pass algorithms that enables the maintenance of accuracy over nearly the full representable range of the input, avoiding the need for extended-precision arithmetic. We then empirically examine algorithm correctness for pairwise update formulas up to order four as well as condition number and relative error bounds for eight different central moment formulas, each up to degree six, to address the trade-offs between numerical accuracy and speed of the various algorithms. Finally, we demonstrate the use of the most elaborate of these formulas, the compound moments, in a practical large-scale scientific application.
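As a concrete second-order instance of the pairwise update formulas discussed here (notation ours: n_A, \bar{x}_A, and M_{2,A} denote the count, mean, and sum of squared deviations over partition A), two partitions A and B can be combined without revisiting the data:

    \delta = \bar{x}_B - \bar{x}_A, \qquad
    \bar{x}_{A \cup B} = \bar{x}_A + \delta \, \frac{n_B}{n_A + n_B}, \qquad
    M_{2,\,A \cup B} = M_{2,A} + M_{2,B} + \delta^2 \, \frac{n_A\, n_B}{n_A + n_B}

The arbitrary-order, weighted, and compound variants referred to above follow the same pairwise pattern with additional correction terms.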
Dissipation spectra of velocity and reactive scalars—temperature and fuel mass fraction—in turbulent premixed flames are studied using direct numerical simulation data of a temporally evolving lean hydrogen-air premixed planar jet (PTJ) flame and a statistically stationary planar lean methane-air (SP) flame. The equivalence ratio in both cases was 0.7 and the pressure 1 atm, while the unburned temperature was 700 K for the hydrogen-air PTJ case and 300 K for the methane-air SP case, resulting in data sets with density ratios of 3 and 5, respectively. The turbulent Reynolds numbers for the cases ranged from 200 to 428.4, the Damköhler numbers from 3.1 to 29.1, and the Karlovitz numbers from 0.1 to 4.5. The dissipation spectra collapse when normalized by the respective Favre-averaged dissipation rates. However, the normalized dissipation spectra in all the cases deviate noticeably from those predicted by classical scaling laws for constant-density turbulent flows and bear a clear influence of the chemical reactions on the dissipative range of the energy cascade.
Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
Pébaÿ, Philippe; Bennett, Janine C.; Hollman, David S.; Treichler, Sean; McCormick, Patrick S.; Sweeney, Christine M.; Kolla, Hemanth K.; Aiken, Alex
We explore the use of asynchronous many-task (AMT) programming models for the implementation of in situ analysis towards the goal of maximizing programmer productivity and overall performance on next generation platforms. We describe how a broad class of statistics algorithms can be transformed from a traditional single-program multiple-data (SPMD) implementation to an AMT implementation, demonstrating with a concrete example: a computation of descriptive statistics implemented in Legion. Our experiments to quantify the benefit and possible drawbacks of this approach are in progress, and we present some encouraging initial results on the (minimal) impact of the AMT-based approach on code complexity, task scheduling, and application scalability.
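The decomposition this transformation relies on can be illustrated with a small, hedged sketch in plain C++ using std::async rather than Legion (the Summary, summarize, and combine names are ours): blocks of samples are summarized by independent leaf tasks, and partial results are merged by pairwise combine tasks that an AMT runtime can schedule as their inputs become ready.

    // Toy illustration: descriptive statistics as leaf tasks plus pairwise combines.
    #include <cstdio>
    #include <future>
    #include <vector>

    struct Summary { long n = 0; double mean = 0.0, M2 = 0.0; };

    // One-pass accumulation over a block of samples (a "leaf task").
    Summary summarize(const std::vector<double>& x) {
        Summary s;
        for (double v : x) {
            ++s.n;
            double d = v - s.mean;
            s.mean += d / s.n;
            s.M2   += d * (v - s.mean);
        }
        return s;
    }

    // Pairwise combination of two partial summaries (a "reduction task").
    Summary combine(const Summary& a, const Summary& b) {
        if (a.n == 0) return b;
        if (b.n == 0) return a;
        Summary r;
        r.n = a.n + b.n;
        double delta = b.mean - a.mean;
        r.mean = a.mean + delta * b.n / r.n;
        r.M2   = a.M2 + b.M2 + delta * delta * (double)a.n * b.n / r.n;
        return r;
    }

    int main() {
        std::vector<double> block1 = {1, 2, 3, 4}, block2 = {5, 6, 7, 8};
        // Leaf tasks may run concurrently; the combine waits on both futures.
        auto f1 = std::async(std::launch::async, summarize, block1);
        auto f2 = std::async(std::launch::async, summarize, block2);
        Summary total = combine(f1.get(), f2.get());
        std::printf("n=%ld mean=%g variance=%g\n", total.n, total.mean,
                    total.M2 / (total.n - 1));
    }

In an AMT setting the leaf and combine functions become tasks whose execution order is constrained only by data dependencies, which is what allows the analysis to overlap with the simulation.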
DARMA (Distributed Asynchronous Resilient Models and Applications) is intended to 1) bridge asynchronous many-task (AMT) programming models and hardware idiosyncrasies, 2) improve application programmer interface (API) characterization and definition, and 3) translate application co-design activities into meaningful requirements, accelerating the development of AMT runtime systems. DARMA is a translation layer between an application-facing front end API and a back end runtime API. The application-facing front end API is embedded in C++, inheriting the generic language constructs of C++ and adding user-level semantics that facilitate expressing distributed asynchronous parallel programs. Though the implementation of the front end uses C++ constructs unfamiliar to many programmers to provide the front end semantics, it is nonetheless fully embedded in the C++ language and leverages a widely supported subset of C++14 functionality (gcc >= 4.9, clang >= 3.5, icc >= 16). The translation layer leverages C++ template metaprogramming to map the user's code onto the back end runtime API. The back end API is a set of abstract classes and function signatures that runtime system developers must implement in accordance with the specification requirements in order to interface with application code written to the front end. Executable DARMA applications must link to a runtime system that implements the abstract back end runtime API. It is intended that these implementations will be external, drawing upon existing technologies; however, a reference implementation will be provided in the DARMA code distribution. The front end API, translation layer, and back end API are detailed herein. We also include a list of application requirements driving the specification (along with a list of the applications contributing to the requirements to date), a brief history of changes between previous versions of the specification, and a summary of the planned changes in upcoming versions of the specification. Appendices walk the user through a more detailed set of examples of applications written in the DARMA front end API and provide additional technical details for the interested reader.
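To make the front end/back end split concrete, the following is a purely illustrative C++ sketch (not the DARMA APIs; Runtime, SerialRuntime, register_task, and run_all are invented names) of a specification style in which the back end is a set of abstract classes a runtime must implement and an application links against whichever implementation is available.

    // Toy illustration of an abstract back end contract and one trivial implementation.
    #include <cstdio>
    #include <functional>
    #include <memory>
    #include <vector>

    // "Back end" contract: abstract classes/function signatures a runtime must provide.
    struct Runtime {
        virtual ~Runtime() = default;
        virtual void register_task(std::function<void()> body) = 0;
        virtual void run_all() = 0;
    };

    // A minimal, serial implementation that executes registered tasks in order.
    struct SerialRuntime : Runtime {
        std::vector<std::function<void()>> queue;
        void register_task(std::function<void()> body) override {
            queue.push_back(std::move(body));
        }
        void run_all() override {
            for (auto& t : queue) t();
        }
    };

    int main() {
        // In practice the concrete runtime is a link-time (or plugin) choice.
        std::unique_ptr<Runtime> rt = std::make_unique<SerialRuntime>();
        rt->register_task([] { std::printf("hello from a task\n"); });
        rt->run_all();
    }

The point of the sketch is only the layering: application code targets the abstract interface, and different runtime systems can be swapped in behind it without changing the application.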
Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.
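The local-recovery idea can be sketched with a toy, single-process example (our own construction, with no MPI or fault-tolerance library): each stencil block keeps a local checkpoint every iteration, and when a failure is injected only the affected block rolls back and recomputes its step instead of the whole computation restarting.

    // Toy illustration of local recovery for a blocked 1D stencil.
    #include <cstdio>
    #include <vector>

    struct Block {
        std::vector<double> u, checkpoint;
        int iter = 0;
    };

    void step(Block& b) {                      // trivial 3-point stencil update
        std::vector<double> next(b.u.size());
        for (size_t i = 1; i + 1 < b.u.size(); ++i)
            next[i] = 0.25 * b.u[i - 1] + 0.5 * b.u[i] + 0.25 * b.u[i + 1];
        b.u = std::move(next);
        ++b.iter;
    }

    int main() {
        std::vector<Block> blocks(4, Block{std::vector<double>(16, 1.0), {}, 0});
        for (int it = 0; it < 10; ++it) {
            for (auto& b : blocks) b.checkpoint = b.u;        // local (in-memory) checkpoint
            for (size_t r = 0; r < blocks.size(); ++r) {
                if (it == 5 && r == 2) {                      // injected failure:
                    blocks[r].u.assign(blocks[r].u.size(), -1.0);  // this block's data is lost
                    blocks[r].u = blocks[r].checkpoint;       // local recovery: restore this block only
                    step(blocks[r]);                          // recompute the lost step
                    std::printf("iteration %d: block %zu recovered locally\n", it, r);
                } else {
                    step(blocks[r]);
                }
            }
        }
        std::printf("done after %d iterations per block\n", blocks[0].iter);
    }

In a real distributed setting the surviving ranks keep their state and only exchange the halo data the recovering rank needs, which is what keeps the overhead of each failure low.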
Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial location. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. Our work greatly extends the usability of distance fields for demanding applications.
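As a point of reference for the quantity being computed, the sketch below is a serial, illustrative multi-source breadth-first distance transform on a small 2D grid (our own toy, not the parallel distance tree of the paper): cells on a chosen "surface" column are seeded at distance zero and grid distances propagate outward.

    // Toy illustration: multi-source BFS distance transform from a seed surface.
    #include <cstdio>
    #include <queue>
    #include <vector>

    int main() {
        const int nx = 8, ny = 8;
        std::vector<int> dist(nx * ny, -1);
        std::queue<int> q;
        // Seed: cells on the surface of interest (here, column i = 3) start at distance 0.
        for (int j = 0; j < ny; ++j) { dist[j * nx + 3] = 0; q.push(j * nx + 3); }
        const int dx[4] = {1, -1, 0, 0}, dy[4] = {0, 0, 1, -1};
        while (!q.empty()) {
            int c = q.front(); q.pop();
            int ci = c % nx, cj = c / nx;
            for (int k = 0; k < 4; ++k) {
                int ni = ci + dx[k], nj = cj + dy[k];
                if (ni < 0 || ni >= nx || nj < 0 || nj >= ny) continue;
                int n = nj * nx + ni;
                if (dist[n] < 0) { dist[n] = dist[c] + 1; q.push(n); }
            }
        }
        for (int j = 0; j < ny; ++j) {
            for (int i = 0; i < nx; ++i) std::printf("%2d ", dist[j * nx + i]);
            std::printf("\n");
        }
    }

The distributed method in the paper solves the same kind of problem for time-varying 3D surfaces, with the parallel distance tree limiting how much data must be recomputed and communicated as the surface evolves.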
This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) workload requirements. The focus herein is on asynchronous many-task (AMT) programming models and runtime systems, which are of great interest in the context of exascale computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AMT runtime systems: Charm++, Legion, and Uintah, all of which are in use as part of the Predictive Science Academic Alliance Program II (PSAAP-II) ASC Centers. The studies focus on each runtime's programmability, performance, and mutability. Through the experiments and analysis presented, several overarching findings emerge. From a performance perspective, AMT runtimes show tremendous potential for addressing extreme-scale challenges. Empirical studies show an AMT runtime can mitigate performance heterogeneity inherent to the machine itself, and that Message Passing Interface (MPI) and AMT runtimes perform comparably under balanced conditions. From a programmability and mutability perspective, however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co-design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the community as a whole, with widespread community engagement mitigating risk for both application developers and high-performance computing runtime system developers.