This report is a sequel to [PB16], in which we provided a first progress report on research and development towards a scalable, asynchronous many-task, in situ statistical analysis engine using the Legion runtime system. This earlier work included a prototype implementation of a proposed solution, using a proxy mini-application as a surrogate for a full-scale scientific simulation code. The first scalability studies were conducted with the above on modestly-sized experimental clusters. In contrast, in the current work we have integrated our in situ analysis engines with a full-size scientific application (S3D, using the Legion-SPMD model), and have conducted numerical tests on the largest computational platform currently available for DOE science applications. We also provide details regarding the design and development of a light-weight asynchronous collectives library. We describe how this library is utilized within our SPMD-Legion S3D workflow, and compare the data aggregation technique deployed herein to the approach taken within our previous work.
Proceedings of ESPM2 2016: 2nd International Workshop on Extreme Scale Programming Models and Middleware - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis
Task-based execution models have received considerable attention in recent years to meet the performance challenges facing high-performance computing (HPC). In this paper we introduce MetaPASS (Metaprogramming-enabled Parallelism from Apparently Sequential Semantics), a proof-of-concept, non-intrusive header library that enables implicit task-based parallelism in a sequential C++ code. MetaPASS is a data-driven model, relying on dependency analysis of variable read/write accesses to derive a directed acyclic graph (DAG) of the computation to be performed. MetaPASS enables embedding of runtime dependency analysis directly in C++ applications using only template metaprogramming. Rather than requiring verbose task-based code or source-to-source compilers, a native C++ code can be made task-based with minimal modifications. We present an overview of the programming model enabled by MetaPASS and the C++ runtime API required to support it. Details are provided regarding how standard template metaprogramming is used to capture task dependencies. We finally discuss how the programming model can be deployed in both an MPI+X and in a standalone distributed memory context.
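As an illustration of how read/write accesses can yield a task DAG, the following is a minimal Python sketch. It is not the MetaPASS API, which is a C++ header library built on template metaprogramming. The function name `build_dag`, the task tuples, and the choice to track read-after-write, write-after-write, and write-after-read orderings are all assumptions made for this example.

```python
def build_dag(tasks):
    """Derive task dependencies from declared variable accesses.

    tasks: list of (name, reads, writes) in sequential program order.
    Returns a dict mapping each task name to the set of task names
    that must complete before it may run.
    (Hypothetical helper for illustration only.)
    """
    last_writer = {}          # variable -> task that last wrote it
    readers_since_write = {}  # variable -> tasks that read it since that write
    deps = {}
    for name, reads, writes in tasks:
        preds = set()
        for var in reads:
            if var in last_writer:
                preds.add(last_writer[var])          # read-after-write
        for var in writes:
            if var in last_writer:
                preds.add(last_writer[var])          # write-after-write
            preds.update(readers_since_write.get(var, ()))  # write-after-read
        deps[name] = preds
        # Record this task's accesses for later tasks.
        for var in reads:
            readers_since_write.setdefault(var, set()).add(name)
        for var in writes:
            last_writer[var] = name
            readers_since_write[var] = set()
    return deps


# A small "apparently sequential" program expressed as access lists.
tasks = [
    ("init_a", [], ["a"]),
    ("init_b", [], ["b"]),
    ("sum", ["a", "b"], ["c"]),
    ("scale_a", ["a"], ["a"]),
    ("report", ["c"], []),
]
deps = build_dag(tasks)
```

In this example `init_a` and `init_b` have no predecessors and could run concurrently, while `scale_a` must wait for `sum` because it overwrites a variable that `sum` reads.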
The flame structure corresponding to lean hydrogen–air premixed flames in intense sheared turbulence in the thin reaction zone regime is quantified from flame thickness and conditional scalar dissipation rate statistics, obtained from recent direct numerical simulation data of premixed temporally-evolving turbulent slot jet flames [1]. It is found that, on average, these sheared turbulent flames are thinner than their corresponding planar laminar flames. Extensive analysis is performed to identify the reason for this counter-intuitive thinning effect. The factors controlling the flame thickness are analyzed through two different routes i.e., the kinematic route, and the transport and chemical kinetics route. The kinematic route is examined by comparing the statistics of the normal strain rate due to fluid motion with the statistics of the normal strain rate due to varying flame displacement speed or self-propagation. It is found that while the fluid normal straining is positive and tends to separate iso-scalar surfaces, the dominating normal strain rate due to self-propagation is negative and tends to bring the iso-scalar surfaces closer resulting in overall thinning of the flame. The transport and chemical kinetics route is examined by studying the non-unity Lewis number effect on the premixed flames. The effects from the kinematic route are found to couple with the transport and chemical kinetics route. In addition, the intermittency of the conditional scalar dissipation rate is also examined. It is found to exhibit a unique non-monotonicity of the exponent of the stretched exponential function, conventionally used to describe probability density function tails of such variables. The non-monotonicity is attributed to the detailed chemical structure of hydrogen-air flames in which heat release occurs close to the unburnt reactants at near free-stream temperatures.
Formulas for incremental or parallel computation of second order central moments have long been known, and recent extensions of these formulas to univariate and multivariate moments of arbitrary order have been developed. Such formulas are of key importance in scenarios where incremental results are required and in parallel and distributed systems where communication costs are high. We survey these recent results, and improve them with arbitrary-order, numerically stable one-pass formulas which we further extend with weighted and compound variants. We also develop a generalized correction factor for standard two-pass algorithms that enables the maintenance of accuracy over nearly the full representable range of the input, avoiding the need for extended-precision arithmetic. We then empirically examine algorithm correctness for pairwise update formulas up to order four as well as condition number and relative error bounds for eight different central moment formulas, each up to degree six, to address the trade-offs between numerical accuracy and speed of the various algorithms. Finally, we demonstrate the use of the most elaborate among the above mentioned formulas, with the utilization of the compound moments for a practical large-scale scientific application.
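For readers unfamiliar with the long-known second-order case mentioned above, here is a minimal Python sketch of the classical pairwise update that merges the count, mean, and sum of squared deviations of two disjoint data partitions. It does not reproduce the arbitrary-order, weighted, or compound formulas of the paper. The names `Moments`, `combine`, and `from_values` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    n: int = 0         # number of observations
    mean: float = 0.0  # running mean
    m2: float = 0.0    # sum of squared deviations from the mean


def combine(a, b):
    """Merge the statistics of two disjoint partitions in O(1)."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


def from_values(values):
    """One-pass accumulation: fold in each value as a partition of size 1."""
    acc = Moments()
    for x in values:
        acc = combine(acc, Moments(1, float(x), 0.0))
    return acc


# Two partitions, e.g. held by two different processes.
left = from_values([1.0, 2.0, 3.0])
right = from_values([4.0, 5.0, 6.0, 7.0])
merged = combine(left, right)
variance = merged.m2 / (merged.n - 1)  # unbiased sample variance
```

Only the three numbers in each `Moments` record need to be communicated between partitions, which is why such formulas suit distributed settings where communication is expensive.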
Dissipation spectra of velocity and reactive scalars—temperature and fuel mass fraction—in turbulent premixed flames are studied using direct numerical simulation data of a temporally evolving lean hydrogen-air premixed planar jet (PTJ) flame and a statistically stationary planar lean methane-air (SP) flame. The equivalence ratio in both cases was 0.7, the pressure 1 atm while the unburned temperature was 700 K for the hydrogen-air PTJ case and 300 K for methane-air SP case, resulting in data sets with a density ratio of 3 and 5, respectively. The turbulent Reynolds numbers for the cases ranged from 200 to 428.4, the Damköhler number from 3.1 to 29.1, and the Karlovitz number from 0.1 to 4.5. The dissipation spectra collapse when normalized by the respective Favre-averaged dissipation rates. However, the normalized dissipation spectra in all the cases deviate noticeably from those predicted by classical scaling laws for constant-density turbulent flows and bear a clear influence of the chemical reactions on the dissipative range of the energy cascade.
Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
Pébaÿ, Philippe; Bennett, Janine C.; Hollman, David S.; Treichler, Sean; McCormick, Patrick S.; Sweeney, Christine M.; Kolla, Hemanth K.; Aiken, Alex
We explore the use of asynchronous many-task (AMT) programming models for the implementation of in situ analysis towards the goal of maximizing programmer productivity and overall performance on next generation platforms. We describe how a broad class of statistics algorithms can be transformed from a traditional single-program multiple-data (SPMD) implementation to an AMT implementation, demonstrating with a concrete example: a measurement of descriptive statistics implemented in Legion. Our experiments to quantify the benefit and possible drawbacks of this approach are in progress, and we present some encouraging initial results on the (minimal) impact of the AMT-based approach on code complexity, task scheduling, and application scalability.
DARMA (Distributed Asynchronous Resilient Models and Applications) [...] asynchronous many-task (AMT) programming models [...] and hardware idiosyncrasies, 2) improve application programmer interface (API) characterization and definition, accelerating the development of [...] application co-design activities into meaningful requirements for runtime systems. The DARMA API is a translation layer between an application-facing front end [...] back end [...]. The application-facing user-level front end [...] in C++, inheriting the generic language constructs of C++ and adding [...] that facilitate expressing distributed asynchronous parallel programs. Though the implementation of the front end uses C++ constructs unfamiliar to many programmers to provide the front end semantics, it is nonetheless fully embedded in the C++ language and leverages a widely supported subset of C++14 functionality (gcc >= 4.9, clang >= 3.5, icc >= 16). The translation layer leverages C++ template [...] to map the user's code onto the back end runtime API. The back end API is a set of abstract classes and function signatures that runtime system developers must implement in accordance with the specification requirements in order to interface with application code written to the DARMA front end. Executable DARMA applications must link to a runtime system that implements the abstract runtime API. It is intended that these implementations will be external, drawing upon existing technologies. However, a reference implementation will be provided in the DARMA code distribution. The front end, translation layer, and back end API are detailed herein. We also include a list of application requirements driving the specification (along with a list of the applications contributing to the requirements to date), a brief history of changes between previous versions of the specification, and a summary of the planned changes in upcoming versions of the specification.
Appendices walk the user through a more detailed set of examples of applications written in the DARMA front end API and provide additional technical details for the interested reader.
Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.
Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial locations. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. Our work greatly extends the usability of distance fields for demanding applications.
This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) workload requirements. The focus herein is on asynchronous many-task (AMT) model and runtime systems, which are of great interest in the context of "exascale" computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AMT runtime systems – Charm++, Legion, and Uintah, all of which are in use as part of the ASC Predictive Science Academic Alliance Program II (PSAAP-II) Centers. The studies focus on each of the runtimes' programmability, performance, and mutability. Through the experiments and analysis presented, several overarching findings emerge. From a performance perspective, AMT runtimes show tremendous potential for addressing extreme-scale challenges. Empirical studies show an AMT runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MPI) and AMT runtimes perform comparably under balanced conditions. From a programmability and mutability perspective however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co-design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the community as a whole, with widespread community engagement mitigating risk for both application developers and high-performance computing runtime system developers.
In this position paper, we argue for improved fault-tolerance of an MPI code by introducing lightweight virtualization into the MPI interface. In particular, we outline key-value store semantics for MPI send/recv calls, thereby creating a far more expressive programming model. The general message passing semantics and imperative style of MPI application codes would remain essentially unchanged. However, the additional expressibility of the programming model 1) enables the underlying transport layer to handle fault-tolerance more transparently to the application developer, and 2) provides an evolutionary code path towards more declarative asynchronous programming models. The core contribution of this paper is an initial implementation of the DHARMA transport layer that provides the new, required functionality to support the MPI key-value store model.
Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the overheads due to failures. Specifically we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.
This paper reports the results of a joint experimental and numerical study of the flow characteristics and flame structure of a hydrogen rich jet injected normal to a turbulent, vitiated crossflow of lean methane combustion products. Simultaneous high-speed stereoscopic PIV and OH PLIF measurements were obtained and analyzed alongside three-dimensional direct numerical simulations of inert and reacting JICF with detailed H2/CO chemistry. Both the experiment and the simulation reveal that, contrary to most previous studies of reacting JICF stabilized in low-to-moderate temperature air crossflow, the present conditions lead to a burner-attached flame that initiates uniformly around the burner edge. Significant asymmetry is observed, however, between the reaction zones located on the windward and leeward sides of the jet, due to the substantially different scalar dissipation rates. The windward reaction zone is much thinner in the near field, while also exhibiting significantly higher local and global heat release than the much broader reaction zone found on the leeward side of the jet. The unsteady dynamics of the windward shear layer, which largely control the important jet/crossflow mixing processes in that region, are explored in order to elucidate the important flow stability implications arising in the inert and reacting JICF. The paper concludes with an analysis of the ignition, flame characteristics, and global structure of the burner-attached flame. Chemical explosive mode analysis (CEMA) shows that the entire windward shear layer, and a large region on the leeward side of the jet, are highly explosive prior to ignition and are dominated by non-premixed flame structures after ignition. The predominantly mixing limited nature of the flow after ignition is examined by computing the Takeno flame index, which shows that ~70% of the heat release occurs in non-premixed regions.
Three-dimensional direct numerical simulation results of a transverse syngas fuel jet in turbulent cross-flow of air are analyzed to study the influence of varying volume fractions of CO relative to H2 in the fuel composition on the near field flame stabilization. The mean flame stabilizes at a similar location for CO-lean and CO-rich cases despite the trend suggested by their laminar flame speed, which is higher for the CO-lean condition. To identify local mixtures having favorable mixture conditions for flame stabilization, explosive zones are defined using a chemical explosive mode timescale. The explosive zones related to flame stabilization are located in relatively low velocity regions. The explosive zones are characterized by excess hydrogen transported solely by differential diffusion, in the absence of intense turbulent mixing or scalar dissipation rate. The conditional averages show that differential diffusion is negatively correlated with turbulent mixing. Moreover, the local turbulent Reynolds number is insufficient to estimate the magnitude of the differential diffusion effect. Alternatively, the Karlovitz number provides a better indicator of the importance of differential diffusion. A comparison of the variations of differential diffusion, turbulent mixing, heat release rate and probability of encountering explosive zones demonstrates that differential diffusion predominantly plays an important role for mixture preparation and initiation of chemical reactions, closely followed by intense chemical reactions sustained by sufficient downstream turbulent mixing. The mechanism by which differential diffusion contributes to mixture preparation is investigated using the Takeno Flame Index. The mean Flame Index, based on the combined fuel species, shows that the overall extent of premixing is not intense in the upstream regions. 
However, the Flame Index computed based on individual contribution of H2 or CO species reveals that hydrogen contributes significantly to premixing, particularly in explosive zones in the upstream leeward region, i.e. at the preferred flame stabilization location. Therefore, a small amount of H2 diffuses much faster than CO, creating relatively homogeneous mixture pockets depending on the competition with turbulent mixing. These pockets, together with high H2 reactivity, contribute to stabilizing the flame at a consistent location regardless of the CO concentration in the fuel for the present range of DNS conditions.
This paper presents the results of DNS of a partially premixed turbulent syngas/air flame at atmospheric pressure. The objective was to assess the importance and possible effects of molecular transport on flame behavior and structure. To this purpose DNS were performed with two proprietary DNS codes and with three different molecular diffusion transport models: fully multi-component, mixture averaged, and imposing the Lewis number of all species to be unity. Results indicate that: (1) at the Reynolds numbers of the simulations (Re_turb = 600, Re = 8000), the choice of molecular diffusion model significantly affects the temperature and concentration fields; (2) assuming Le = 1 for all species predicts temperatures up to 250 K higher than the physically realistic multi-component model; (3) faster molecular transport of lighter species changes the local concentration field and affects reaction pathways and chemical kinetics. A possible explanation for these observations is provided in terms of species diffusion velocity, which is a strong function of gradients: thus, at sufficiently large Reynolds numbers, gradients and their effects tend to be large. The preliminary conclusion from these simulations seems to indicate molecular diffusion as the third important mechanism active in flames besides convective transport and kinetics. If confirmed by further DNS and measurements, molecular transport in high intensity turbulent flames will have to be realistically modeled to accurately predict emissions (gaseous and particulates) and other combustor performance metrics.
The topology of turbulent premixed flames is analysed using data from Direct Numerical Simulation (DNS), with emphasis on the statistical geometry of flame-flame interaction. A general method for obtaining the critical points of line, surface and volume fields is outlined, and the method is applied to isosurfaces of reaction progress variable in a DNS configuration involving a pair of freely-propagating hydrogen-air flames in a field of intense shear-generated turbulence. A complete set of possible flame-interaction topologies is derived using the eigenvalues of the scalar Hessian, and the topologies are parametrised using a pair of shape factors. The frequency of occurrence of each type of topology is evaluated from the DNS dataset for two different Damköhler numbers. Different types of flame-interaction topology are found to be favoured in various regions of the turbulent flame, and the physical significance of each interaction is discussed.
Extreme-scale computing will bring significant changes to high performance computing system architectures. In particular, the increased number of system components is creating a need for software to demonstrate 'pervasive parallelism' and resiliency. Asynchronous, many-task programming models show promise in addressing both the scalability and resiliency challenges; however, they introduce an enormously challenging distributed, resilient consistency problem. In this work, we explore the viability of resilient collective communication in task scheduling and work stealing and, through simulation with SST/macro, the performance of these collectives on speculative extreme-scale architectures.