Publications

Asymptotically compatible reproducing kernel collocation and meshfree integration for the peridynamic Navier equation

Computer Methods in Applied Mechanics and Engineering

Leng, Yu; Tian, Xiaochuan; Trask, Nathaniel A.; Foster, John T.

In this work, we study the reproducing kernel (RK) collocation method for the peridynamic Navier equation. In the first part, we apply a linear RK approximation to both the displacement and the dilatation, then back-substitute the dilatation and solve the peridynamic Navier equation in pure displacement form. The RK collocation scheme converges to the nonlocal limit for a fixed nonlocal interaction length and to the local limit as nonlocal interactions vanish. Stability is shown by comparing the collocation scheme with the standard Galerkin scheme using Fourier analysis. In the second part, we apply RK collocation to the quasi-discrete peridynamic Navier equation and show its convergence to the correct local limit when the ratio between the nonlocal length scale and the discretization parameter is fixed. The analysis is carried out on a special family of rectilinear Cartesian grids for the RK collocation method with a designated kernel with finite support. We assume the Lamé parameters satisfy λ≥μ to avoid extra assumptions on the nonlocal kernel. Finally, numerical experiments validate the theoretical results.
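
As background (a standard equation, not quoted from the abstract): the local limit referred to above is the classical Navier equation of linear elastostatics, where λ and μ are the Lamé parameters and the dilatation enters through the ∇(∇·u) term:

```latex
% Classical (local) Navier equation recovered as nonlocal interactions vanish;
% lambda and mu are the Lame parameters, f is the body force.
\mu \,\Delta \mathbf{u}(\mathbf{x})
  + (\lambda + \mu)\,\nabla\big(\nabla \cdot \mathbf{u}(\mathbf{x})\big)
  + \mathbf{f}(\mathbf{x}) = \mathbf{0}
```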

PreFAM: Understanding the Impact of Prefetching in Fabric-Attached Memory Architectures

ACM International Conference Proceeding Series

Kommareddy, Vamsee R.; Kotra, Jagadish; Hughes, Clayton H.; Hammond, Simon D.; Awad, Amro

With many recent advances in interconnect technologies and memory interfaces, disaggregated memory systems are approaching industrial adoption. For instance, the recent Gen-Z consortium focuses on a new memory-semantic protocol that enables fabric-attached memories (FAMs), where memory and other compute units can be attached directly to fabric interconnects. Decoupling memory from compute units becomes feasible as data-transfer rates increase with the emergence of novel interconnect technologies, such as silicon photonic interconnects. Disaggregated memories not only enable more efficient use of capacity (minimizing under-utilization) but also allow easy integration of evolving technologies; additionally, they simplify the programming model while allowing efficient sharing of data. However, the latency of accessing data in FAMs depends on the latency imposed by the fabric interfaces. To reduce memory access latency and improve the performance of FAM systems, in this paper we explore techniques to prefetch data from FAMs to the local memory present in the node (PreFAM). Because memory access latency is high in FAMs, prefetching a single cache block (64 bytes) from a FAM can be inefficient: a demand request to the same FAM location is likely to be issued before the prefetch completes. Hence, we explore predicting and prefetching FAM blocks at a distance, i.e., prefetching blocks that will be accessed in the future but not immediately. We show that, with prefetching, the performance of FAM architectures increases by 38.84% while memory access latency improves by 39.6%, with only a 17.65% increase in the number of accesses to the FAM, on average. Further, prefetching at a distance yields a performance improvement of 72.23%.
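
As a rough illustration of "prefetching at a distance" (an illustrative sketch, not PreFAM's actual predictor or parameters), the following stride prefetcher requests the block expected several accesses ahead so the high fabric latency can be overlapped:

```python
# Minimal sketch of distance prefetching: on a stream of accesses, a stable
# stride triggers a request for the block DISTANCE strides ahead rather than
# the next block. BLOCK and DISTANCE are illustrative values, not PreFAM's.

BLOCK = 64          # cache block size in bytes
DISTANCE = 8        # how many strides ahead to prefetch

class DistancePrefetcher:
    def __init__(self, distance=DISTANCE):
        self.distance = distance
        self.last_addr = None
        self.last_stride = None

    def on_access(self, addr):
        """Return an address to prefetch, or None if no stable stride yet."""
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                # Stable stride: fetch far enough ahead that the block arrives
                # before its demand request despite the fabric latency.
                prefetch = addr + self.distance * stride
            self.last_stride = stride
        self.last_addr = addr
        return prefetch

# A unit-stride scan over blocks triggers prefetches 8 blocks ahead.
pf = DistancePrefetcher()
for a in range(0, 10 * BLOCK, BLOCK):
    target = pf.on_access(a)
    if target is not None:
        print(f"access {a:#06x} -> prefetch {target:#06x}")
```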

Modeling assisted room temperature operation of atomic precision advanced manufacturing devices

International Conference on Simulation of Semiconductor Processes and Devices, SISPAD

Gao, Xujiao G.; Tracy, Lisa A.; Anderson, Evan M.; Campbell, DeAnna M.; Ivie, Jeffrey A.; Lu, Tzu-Ming L.; Mamaluy, Denis M.; Schmucker, Scott W.; Misra, Shashank M.

One big challenge for the emerging atomic precision advanced manufacturing (APAM) technology in microelectronics applications is realizing APAM devices that operate at room temperature (RT). We demonstrate that a semiclassical technology computer-aided design (TCAD) device simulation tool can be employed to understand current leakage and improve APAM device design for RT operation. To establish the applicability of semiclassical simulation, we first show that a semiclassical impurity scattering model with Fermi-Dirac statistics explains the very low mobility in APAM devices quite well; we also show that semiclassical TCAD reproduces measured sheet resistances when proper mobility values are used. We then apply semiclassical TCAD to simulate current leakage in realistic APAM wires. With insights from modeling, we were able to improve the device design, fabricate Hall bars, and demonstrate RT operation for the first time.
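
For orientation (a textbook relation with placeholder numbers, not values from the paper), the sheet-resistance comparison described above amounts to R_sheet = 1/(q·n_s·μ) for a doped layer:

```python
# Back-of-envelope sheet resistance of a 2D doped layer: R_sheet = 1/(q n_s mu).
# The carrier density and mobility below are illustrative placeholders only.

q = 1.602e-19          # elementary charge, C
n_s = 2.0e14 * 1e4     # sheet carrier density: 2e14 cm^-2 converted to m^-2
mu = 40.0 * 1e-4       # mobility: 40 cm^2/(V s) converted to m^2/(V s)

r_sheet = 1.0 / (q * n_s * mu)   # ohms per square
print(f"R_sheet ~ {r_sheet:.0f} ohm/sq")
```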

Physics-informed graph neural network for circuit compact model development

International Conference on Simulation of Semiconductor Processes and Devices, SISPAD

Gao, Xujiao G.; Huang, Andy H.; Trask, Nathaniel A.; Reza, Shahed R.

We present a Physics-Informed Graph Neural Network (pigNN) methodology for rapid and automated compact model development. It brings together the inherent strengths of data-driven machine learning, high-fidelity physics from TCAD simulations, and knowledge contained in existing compact models. In this work, we focus on developing a neural network (NN) based compact model for a non-ideal PN diode, which represents one nonlinear edge in a pigNN graph. The model accurately captures the smooth transition between the exponential and quasi-linear response regions. By learning the voltage-dependent non-ideality factor with an NN and employing an inverse response function in the NN loss function, the model also accurately captures the voltage-dependent recombination effect. This NN compact model serves as a basis model for a PN diode that can be a single device or an isolated diode within a complex device identified by topological data analysis (TDA) methods. The pigNN methodology is also applicable to deriving reduced-order models in other engineering areas.
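
A minimal sketch of the compact-model form described (the placeholder n(V) below is hand-picked for illustration; in the paper it is learned by a neural network):

```python
import numpy as np

# Shockley-type diode with a voltage-dependent non-ideality factor n(V),
# giving a smooth transition between response regions. I_S and the shape of
# n(V) are illustrative assumptions, not the paper's fitted model.

V_T = 0.02585          # thermal voltage at ~300 K, volts
I_S = 1e-12            # saturation current, amps (illustrative)

def n_of_v(v):
    # Hypothetical smooth transition from n ~ 2 (recombination-dominated at
    # low bias) toward n ~ 1 (diffusion-dominated at higher bias).
    return 1.0 + 1.0 / (1.0 + np.exp((v - 0.3) / 0.05))

def diode_current(v):
    return I_S * (np.exp(v / (n_of_v(v) * V_T)) - 1.0)

for v in (0.1, 0.3, 0.5, 0.7):
    print(f"V = {v:.1f} V  ->  I = {diode_current(v):.3e} A")
```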

SparTen: Leveraging Kokkos for On-node Parallelism in a Second-Order Method for Fitting Canonical Polyadic Tensor Models to Poisson Data

2020 IEEE High Performance Extreme Computing Conference, HPEC 2020

Teranishi, Keita T.; Dunlavy, Daniel D.; Myers, Jeremy M.; Barrett, Richard F.

Canonical polyadic tensor decomposition using alternating Poisson regression (CP-APR) is an effective analysis tool for large sparse count datasets. One variant, using projected damped Newton optimization for the row subproblems (PDNR), offers quadratic convergence and is amenable to parallelization. Despite its potential effectiveness, PDNR performance on modern high-performance computing (HPC) systems is not well understood. To remedy this, we have developed a parallel implementation of PDNR using Kokkos, a performance-portable parallel programming framework that enables efficient execution of a single code base on multiple HPC systems. We demonstrate that the performance of parallel PDNR can be poor if the load imbalance associated with the irregular distribution of nonzero entries in the tensor data is not addressed. Preliminary results using tensors from the FROSTT data set indicate that using multiple kernels to address this imbalance when solving the PDNR row subproblems in parallel can improve performance, with up to 80% speedup on CPUs and 10-fold speedup on NVIDIA GPUs.
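
For readers unfamiliar with the objective CP-APR minimizes, the sketch below evaluates the Poisson negative log-likelihood of a rank-R CP model on a small dense tensor (shapes and rank are illustrative; SparTen itself operates on sparse tensors and solves the row subproblems with damped Newton steps):

```python
import numpy as np

# Poisson loss for a CP model M with entries m_ijk = sum_r A[i,r] B[j,r] C[k,r]:
#   f(M) = sum_ijk (m_ijk - x_ijk * log m_ijk)
# CP-APR minimizes this over nonnegative factor matrices A, B, C.

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
A, B, C = (rng.random((n, R)) + 0.1 for n in (I, J, K))   # nonnegative factors
X = rng.poisson(2.0, size=(I, J, K))                      # synthetic count data

def cp_model(A, B, C):
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def poisson_loss(X, M, eps=1e-12):
    return float(np.sum(M - X * np.log(M + eps)))

print("Poisson NLL:", poisson_loss(X, cp_model(A, B, C)))
```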

Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software

2020 IEEE High Performance Extreme Computing Conference, HPEC 2020

Myers, Jeremy M.; Dunlavy, Daniel D.; Teranishi, Keita T.; Hollman, David S.

Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank canonical polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library that computes such a low-rank decomposition using one of several solvers, a first-order quasi-Newton method or a second-order damped Newton method, together with an appropriate choice of runtime parameters. Because SparTen's default parameters were tuned to experimental results from prior published work, obtained with MATLAB implementations of these methods on a single real-world dataset, it remains unclear whether the defaults are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated from many initial states to establish generalized profiles of algorithm convergence behavior.

Evaluating MPI Message Size Summary Statistics

ACM International Conference Proceeding Series

Ferreira, Kurt B.; Levy, Scott

The Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on today's high-performance computing (HPC) systems. This dominance stems from MPI's powerful semantics for inter-process communication, which have enabled scientists to write applications that simulate important physical phenomena. MPI does not, however, specify how messaging and synchronization should be carried out; those details typically depend on low-level architectural details and the message characteristics of the application. Therefore, analyzing an application's MPI usage is critical to tuning MPI's performance on a particular platform. The results of this analysis are typically presented as average message sizes for a workload or set of workloads. While the average may be the most intuitive summary statistic, it is not necessarily the most useful for representing the entire message-size dataset of an application. Using a previously developed MPI trace collector, we analyze MPI message traces for a number of key MPI workloads. Through this analysis, we demonstrate that the average, while easy and efficient to calculate, may not represent all subsets of application message sizes well, with the median and mode of message sizes being superior choices in most cases. We show that the problem with the average relates to the multi-modal nature of the distribution of point-to-point message sizes. Finally, we show that while scaling a workload has little discernible impact on which measures of central tendency are representative of the underlying data, different input descriptions can significantly change which metric is most effective. The results and analysis in this paper can provide valuable guidance on how we as a community should discuss and analyze MPI message data for scientific applications.
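
A toy example of the central observation (synthetic sizes, not data from the analyzed workloads): for a bimodal message-size distribution, the mean lands between the clusters and describes almost no actual message, while the median and mode track a real one:

```python
import statistics

# Many small point-to-point messages plus a few large ones: a bimodal mix.
sizes = [8] * 900 + [1_048_576] * 100

print(f"mean   = {statistics.mean(sizes):,.0f} bytes")    # ~104,865: no message has this size
print(f"median = {statistics.median(sizes):,.0f} bytes")  # 8: the dominant cluster
print(f"mode   = {statistics.mode(sizes):,.0f} bytes")    # 8
```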

Granular packings with sliding, rolling, and twisting friction

Physical Review E

Santos, Andrew P.; Bolintineanu, Dan S.; Grest, Gary S.; Lechman, Jeremy B.; Plimpton, Steven J.; Srivastava, Ishan; Silbert, Leonardo E.

Intuition tells us that a rolling or spinning sphere will eventually stop due to friction and other dissipative interactions. The resistance to rolling and to spinning or twisting torques that stops a sphere also changes the microstructure of a granular packing of frictional spheres by increasing the number of constraints on the degrees of freedom of motion. We perform discrete element modeling simulations to construct sphere packings with a range of frictional constraints under a pressure-controlled protocol. Mechanically stable packings are achievable at volume fractions and average coordination numbers as low as 0.53 and 2.5, respectively, when the particles experience high resistance to sliding, rolling, and twisting. Only when the particle model includes rolling and twisting friction were experimental volume fractions reproduced.
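
A standard Maxwell constraint-counting estimate (background intuition, not an argument taken from the paper) suggests why added frictional channels permit such low coordination numbers; each contact is shared by two particles, and in three dimensions each particle must balance forces and torques:

```latex
% Isostatic coordination number z_iso = 2 * (equations per particle)
%                                         / (unknowns per contact), in d = 3.
% Frictionless: torque balance is trivial, 3 equations, 1 normal force:
z_{\mathrm{iso}} = 2 \cdot 3 / 1 = 6
% Sliding friction: 6 balance equations, 3 contact-force components:
z_{\mathrm{iso}} = 2 \cdot 6 / 3 = 4
% Sliding + rolling + twisting: 3 force + 3 torque components per contact:
z_{\mathrm{iso}} = 2 \cdot 6 / 6 = 2
```

The reported average coordination number of 2.5 sits just above the last bound, consistent with packings stabilized by all three frictional channels.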

LDMS-GPU: Lightweight Distributed Metric Service (LDMS) for NVIDIA GPGPUs

Elwazir, Ammar E.; Badawy, Abdel-Hameed B.; Aaziz, Omar R.; Cook, Jeanine C.

GPUs are now a fundamental accelerator for many high-performance computing applications. They are viewed by many as a technology facilitator for the surge in fields like machine learning and convolutional neural networks. To deliver the best performance on a GPU, we need monitoring tools that help ensure code is optimized for the most performance and efficiency the GPU can offer. Since NVIDIA GPUs are currently the most common in HPC applications and systems, NVIDIA tools are the solution for performance monitoring. The Lightweight Distributed Metric Service (LDMS) at Sandia is an infrastructure widely adopted for large-scale system and application monitoring, and Sandia has developed CPU application monitoring capability within LDMS. We therefore chose to develop a GPU monitoring capability within the same framework. In this report, we discuss the current limitations of the NVIDIA monitoring tools and how we overcame them, present an overview of the tool we built to monitor GPU performance in LDMS and its capabilities, and discuss our current validation results. Most of the performance-counter values collected through LDMS by our tool match those reported by the vendor tools. Furthermore, our tool provides these statistics as a time series over the entire application run, not just as aggregate statistics at the end of the run, allowing the user to observe how application behavior evolves over its lifetime.
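
In the same spirit (a sketch using NVML directly through the pynvml bindings; this is not the LDMS-GPU sampler's actual code or API), time-series collection rather than end-of-run aggregation looks like:

```python
import time
import pynvml  # NVIDIA Management Library bindings (pip install nvidia-ml-py)

# Sample GPU utilization and memory once per second and keep the time series,
# instead of reporting only aggregate statistics when the application exits.

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    samples.append((time.time(), util.gpu, util.memory, mem.used))
    time.sleep(1.0)

pynvml.nvmlShutdown()
for t, gpu_pct, mem_pct, used in samples:
    print(f"{t:.0f}  gpu={gpu_pct}%  mem={mem_pct}%  used={used / 2**20:.0f} MiB")
```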

On mixed-integer programming formulations for the unit commitment problem

INFORMS Journal on Computing

Knueven, Ben; Ostrowski, James; Watson, Jean-Paul W.

We provide a comprehensive overview of mixed-integer programming formulations for the unit commitment (UC) problem. UC formulations have been an especially active area of research over the past 12 years due to their practical importance in power grid operations, and this paper serves as a capstone for this line of work. We additionally provide publicly available reference implementations of all formulations examined. We computationally test existing and novel UC formulations on a suite of instances drawn from both academic and real-world data sources. Driven by our computational experience from this and previous work, we contribute some additional formulations for both generator production upper bounds and piecewise linear production costs. By composing new UC formulations using existing components found in the literature and new components introduced in this paper, we demonstrate that performance can be significantly improved—and in the process, we identify a new state-of-the-art UC formulation.
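
For readers new to UC models, a minimal three-binary core (illustrative background, not any one of the paper's benchmarked formulations) uses binaries u, v, w for the commitment, startup, and shutdown of generator g at time t, with production p:

```latex
% UT_g / DT_g are minimum up/down times; \underline{P}_g / \overline{P}_g are
% the production limits of generator g when committed.
u_{g,t} - u_{g,t-1} = v_{g,t} - w_{g,t}                      % commitment logic
\sum_{s = t - UT_g + 1}^{t} v_{g,s} \le u_{g,t}              % minimum uptime
\sum_{s = t - DT_g + 1}^{t} w_{g,s} \le 1 - u_{g,t}          % minimum downtime
\underline{P}_g \, u_{g,t} \le p_{g,t} \le \overline{P}_g \, u_{g,t}  % production bounds
```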

Analog architectures for neural network acceleration based on non-volatile memory

Applied Physics Reviews

Xiao, T.P.; Bennett, Christopher H.; Feinberg, Benjamin F.; Agarwal, Sapan A.; Marinella, Matthew J.

Analog hardware accelerators, which perform computation within a dense memory array, have the potential to overcome the major bottlenecks faced by digital hardware for data-heavy workloads such as deep learning. Exploiting the intrinsic computational advantages of memory arrays, however, has proven to be challenging principally due to the overhead imposed by the peripheral circuitry and due to the non-ideal properties of memory devices that play the role of the synapse. We review the existing implementations of these accelerators for deep supervised learning, organizing our discussion around the different levels of the accelerator design hierarchy, with an emphasis on circuits and architecture. We explore and consolidate the various approaches that have been proposed to address the critical challenges faced by analog accelerators, for both neural network inference and training, and highlight the key design trade-offs underlying these techniques.
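
A small sketch of the core analog operation these accelerators exploit (the weight mapping, quantization level count, and noise magnitude are illustrative assumptions, not any surveyed design):

```python
import numpy as np

# A crossbar of programmable conductances computes a matrix-vector product by
# Ohm's and Kirchhoff's laws: column currents are I = G @ V. Signed weights are
# mapped onto two nonnegative arrays (G+ and G-), a common differential scheme.

rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, size=(4, 8))   # trained weights
x = rng.uniform(0, 1, size=8)         # input activations encoded as voltages

levels = 32                           # illustrative conductance resolution
G_pos = np.round(np.clip(W, 0, None) * (levels - 1)) / (levels - 1)
G_neg = np.round(np.clip(-W, 0, None) * (levels - 1)) / (levels - 1)

# Hypothetical device-to-device programming noise on each conductance.
G_pos += rng.normal(0, 0.01, G_pos.shape)
G_neg += rng.normal(0, 0.01, G_neg.shape)

y_analog = G_pos @ x - G_neg @ x      # differential column-current readout
y_ideal = W @ x
print("ideal :", np.round(y_ideal, 3))
print("analog:", np.round(y_analog, 3))
```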
