Overview of the Structural Simulation Toolkit (SST)
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
The power and procurement cost of bandwidth in system-wide networks has forced a steady drop in the byte/flop ratio. This trend of computation becoming faster relative to the network is expected to hold. In this paper, we explore how cost-oriented task placement enables reducing the cost of system-wide networks by enabling high performance even on tapered topologies where more bandwidth is provisioned at lower levels. We describe APHiD, an efficient hierarchical placement algorithm that uses new techniques to improve the quality of heuristic solutions and reduces the demand on high-level, expensive bandwidth in hierarchical topologies. We apply APHiD to a tapered fat-Tree, demonstrating that APHiD maintains application scalability even for severely tapered network configurations. Using simulation, we show that for tapered networks APHiD improves performance by more than 50% over random placement and even 15% in some cases over costlier, state-of-The-Art placement algorithms.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Proceedings of ESPM2 2016: 2nd International Workshop on Extreme Scale Programming Models and Middleware - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis
Task-based execution models have received considerable attention in recent years to meet the performance challenges facing high-performance computing (HPC). In this paper we introduce MetaPASS-Metaprogramming-enabled Para-llelism from Apparently Sequential Semantics-a proof-of-concept, non-intrusive header library that enables implicit task-based parallelism in a sequential C++ code. MetaPASS is a data-driven model, relying on dependency analysis of variable read-/write accesses to derive a directed acyclic graph (DAG) of the computation to be performed. MetaPASS enables embedding of runtime dependency analysis directly in C++ applications using only template metaprogramming. Rather than requiring verbose task-based code or source-to-source compilers, a native C++ code can be made task-based with minimal modifications. We present an overview of the programming model enabled by MetaPASS and the C++ runtime API required to support it. Details are provided regarding how standard template metaprogramming is used to capture task dependencies. We finally discuss how the programming model can be deployed in both an MPI+X and in a standalone distributed memory context.
Proceedings of COM-HPC 2016: 1st Workshop on Optimization of Communication in HPC Runtime Systems - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis
We introduce a topology-aware performance optimization and modeling workflow for AMR simulation that includes two new modeling tools, ProgrAMR and Mota Mapper, which interface with the BoxLib AMR framework and the SSTmacro network simulator. ProgrAMR allows us to generate and model the execution of task dependency graphs from high-level specifications of AMR-based applications, which we demonstrate by analyzing two example AMR-based multigrid solvers with varying degrees of asynchrony. Mota Mapper generates multiobjective, network topology-aware box mappings, which we apply to optimize the data layout for the example multigrid solvers. While the sensitivity of these solvers to layout and execution strategy appears to be modest for balanced scenarios, the impact of better mapping algorithms can be significant when performance is highly constrained by network hop latency. Furthermore, we show that network latency in the multigrid bottom solve is the main contributing factor preventing good scaling on exascale-class machines.
Abstract not provided.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
We investigate possible options of creating a Dragonfly topology capable of accommodating a specified number of end-points. We first observe that any Dragonfly topology can be described with two main parameters, imbalance and density, dictating the distribution of routers in groups, and the inter-group connectivity, respectively. We then introduce an algorithm that generates a dragonfly topology by taking the desired number of end-points and these two parameters as input. We calculate a variety of metrics on the generated topologies resulting from a large set of parameter combinations. Based on these metrics, we isolate the subset of topologies that present the best economical and performance trade-off. We conclude by summarizing guidelines for Dragonfly topology design and dimensioning.
Abstract not provided.
Proceedings of the AISTECS' 17 Workshop
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
International Conference for High Performance Computing, Networking, Storage and Analysis, SC
The Dragonfly topology provides low-diameter connectivity for high-performance computing with all-to-all global links at the inter-group level. Our traffic matrix characterization of various scientific applications shows consistent mismatch between the imbalanced group-to-group traffic and the uniform global bandwidth allocation of Dragonfly. Though adaptive routing has been proposed to utilize bandwidth of non-minimal paths, increased hops and cross-group interference lower efficiency. This work presents a photonic architecture, Flexfly, which 'trades' global links among groups using low-radix Silicon photonic switches. With transparent optical switching, Flexfly reconfigures the inter-group topology based on traffic pattern, stealing additional direct bandwidth for communication-intensive group pairs. Simulations with applications such as GTC, Nekbone and LULESH show up to 1.8× speedup over Dragonfly paired with UGAL routing, along with halved hop count and latency for cross-group messages. We built a 32-node Flexfly prototype using a Silicon photonic switch connecting four groups and demonstrated 820 ns interconnect reconfiguration time.
Abstract not provided.
PARMA (Distributed Asynchronous Resilient Models and ApH asynchronous many-task (AMT) rmogramming models and hardware idiosyncrasies, 2) improve application programmer interface (API) plication Ico-desiga activities into meaningful requirements for characterization and definition, accelerating the development of pARMAI APT is a rranslation layer runtime systems Am' 11 between an application-facing . The application-facing user-level iting the generic language constructs of C++ and adding parallel programs. Though the implementation of the provide the front end semantics, it is nonetheless fully embedded in the C++ language and leverages a widely supported front end fiack end in C++, inher- that facilitate expressing distributed asynchronous uses C++ constructs unfamiliar to many programmers to subset of C++14 functionality (gcc >= 4.9, clang >= 3.5, icc > = 16). The rranslation layer leverages C++ to map the user's code onto the fiack encI runtime APT. The fiack end APT is a set of abstract classes and function signatures that iuntime systenr developers must implement in accordance with the specification require- ments in order to interface with application code written to the must link to a iuntime systenr that implements the abstract mentations will be external, drawing upon existing provided in the pARMAI code distribution. IDARMAI fiack end templatO front end. Executable 1DARMA applications runtime APT. It is intended that these imple- technologies. However, a reference implementation will be The front end rranslation layer, and iback end APT are detailed herein. We also include a list of application requirements driving the specification (along with a list of the applications contributing to the requirements to date), a brief history of changes between previous versions of the specification, and summary of the planned changes in up- coming versions of the specification. Appendices walk the user through a more detailed set of examples of applications written in the PARMA front encI APII and provide additional technical details for those the interested reader.
Abstract not provided.
Abstract not provided.
ACM Transactions on Modeling and Performance Evaluation of Computing Systems
Simulation is a widely adopted method to analyze and predict the performance of large-scale parallel applications. Validating the hardware model is highly important for complex simulations with a large number of parameters. Common practice involves calculating the percent error between the projected and the real execution time of a benchmark program. However, in a high-dimensional parameter space, this coarse-grained approach often suffers from parameter insensitivity, which may not be known a priori. Moreover, the traditional approach cannot be applied to the validation of software models, such as application skeletons used in online simulations. In this work, we present a methodology and a toolset for validating both hardware and software models by quantitatively comparing fine-grained statistical characteristics obtained from execution traces. Although statistical information has been used in tasks like performance optimization, this is the first attempt to apply it to simulation validation. Lastly, our experimental results show that the proposed evaluation approach offers significant improvement in fidelity when compared to evaluation using total execution time, and the proposed metrics serve as reliable criteria that progress toward automating the simulation tuning process.