Publications

Results 1–25 of 42

Validating the simulation of large-scale parallel applications using statistical characteristics

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Dechev, Damian D.; Zhang, Deli Z.; Hendry, Gilbert H.; Wilke, Jeremiah J.

Simulation is a widely adopted method for analyzing and predicting the performance of large-scale parallel applications. Validating the hardware model is highly important for complex simulations with a large number of parameters. Common practice involves calculating the percent error between the projected and the real execution time of a benchmark program. However, in a high-dimensional parameter space, this coarse-grained approach often suffers from parameter insensitivity, which may not be known a priori. Moreover, the traditional approach cannot be applied to the validation of software models, such as application skeletons used in online simulations. In this work, we present a methodology and a toolset for validating both hardware and software models by quantitatively comparing fine-grained statistical characteristics obtained from execution traces. Although statistical information has been used in tasks such as performance optimization, this is the first attempt to apply it to simulation validation. Our experimental results show that the proposed approach offers a significant improvement in fidelity over evaluation based on total execution time, and that the proposed metrics serve as reliable criteria for progressing toward an automated simulation-tuning process.
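
The contrast can be made concrete with a small sketch. The C++ below (not the paper's toolset) shows the two evaluation styles side by side: the coarse percent-error check on total execution time, and a finer distribution-level comparison of per-event durations drawn from real and simulated traces. The two-sample Kolmogorov-Smirnov statistic is used here only as one plausible distance measure; the paper defines its own metrics.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Coarse-grained check: percent error between projected and real runtime.
double percent_error(double simulated_time, double real_time) {
    return 100.0 * std::abs(simulated_time - real_time) / real_time;
}

// Finer-grained check: maximum gap between the empirical CDFs of two
// samples of per-event durations (two-sample Kolmogorov-Smirnov statistic).
// Assumes both samples are non-empty.
double ks_statistic(std::vector<double> a, std::vector<double> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    std::size_t i = 0, j = 0;
    double d = 0.0;
    while (i < a.size() && j < b.size()) {
        if (a[i] <= b[j]) ++i; else ++j;
        double fa = static_cast<double>(i) / a.size();
        double fb = static_cast<double>(j) / b.size();
        d = std::max(d, std::abs(fa - fb));
    }
    return d;   // 0 = identical distributions, 1 = fully disjoint
}
```

A simulator can match total runtime (low percent error) while producing a very different distribution of per-event durations, which is exactly the parameter insensitivity the finer check exposes.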


An Efficient Wait-Free Vector

IEEE Transactions on Parallel and Distributed Systems

Dechev, Damian D.; Feldman, Steven F.; Valeraleon, Carlos V.

The vector is a fundamental data structure that provides constant-time access to a dynamically resizable range of elements. Currently, no wait-free vector exists. The only non-blocking version supports just a subset of the sequential vector API and exhibits significant synchronization overhead caused by supporting opposing operations. Since many applications operate in phases of execution, where in each phase only a subset of operations is used, this overhead is unnecessary for most of the application's lifetime. To address the limitations of the non-blocking version, we present a new design that is wait-free, supports more of the operations provided by the sequential vector, and provides alternative implementations of key operations. These alternatives allow the developer to balance the performance and functionality of the vector as requirements change throughout execution. Compared to the known non-blocking version and the concurrent vector found in Intel’s TBB library, our design outperforms or matches both in the majority of tested scenarios. Over all tested scenarios, the presented design performs an average of 4.97 times more operations per second than the non-blocking vector and 1.54 times more than the TBB vector. In a scenario designed to simulate the filling of a vector, the improvements increase to 13.38 and 1.16 times, respectively. This work also presents the first ABA-free non-blocking vector. Finally, unlike the other non-blocking approach, all operations are wait-free and bounds-checked, and elements are stored contiguously in memory.
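
The paper's algorithm is not reproduced here, but the flavor of wait-free progress shows up in a much simpler structure: a bounded, append-only buffer in which fetch_add reserves each pusher a unique slot in a bounded number of steps and a per-slot flag publishes the element to readers. Everything below, names included, is an illustrative sketch; the published design additionally handles resizing, opposing operations such as pop_back, and a fuller sequential-vector API.

```cpp
#include <atomic>
#include <cstddef>

template <typename T, std::size_t Capacity>
class BoundedWaitFreeLog {
    T buf[Capacity];                           // contiguous element storage
    std::atomic<bool> ready[Capacity] = {};    // per-slot publication flags
    std::atomic<std::size_t> count{0};
public:
    // Wait-free: fetch_add claims a unique slot in a bounded number of steps,
    // regardless of what other threads do.
    bool push_back(const T& v) {
        std::size_t idx = count.fetch_add(1, std::memory_order_relaxed);
        if (idx >= Capacity) return false;     // bounds-checked, no resizing
        buf[idx] = v;
        ready[idx].store(true, std::memory_order_release);  // publish
        return true;
    }
    // Wait-free, bounds-checked read; fails if the slot is not yet published.
    bool at(std::size_t idx, T& out) const {
        if (idx >= Capacity || !ready[idx].load(std::memory_order_acquire))
            return false;
        out = buf[idx];
        return true;
    }
};
```

The release/acquire pair on the per-slot flag is what lets readers safely observe an element written by another thread without locks.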


Extending LDMS to enable performance monitoring in multi-core applications

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Feldman, Steven; Zhang, Deli; Dechev, Damian D.; Brandt, James M.

Identifying design patterns that limit the performance of multi-core algorithms is a challenging task. There are many known methods by which threads synchronize their actions, and each method may exhibit different behavior in different use cases. These use cases may vary with regard to the workload being executed, the number of parallel tasks, the dependencies between those tasks, and the behavior of the system scheduler. Restructuring algorithms to overcome performance limitations requires intimate knowledge of how those algorithms utilize the hardware. In our experience, adequate tools for gaining such knowledge have been lacking. To address this, we have enhanced and implemented additional data-sampler modules for OVIS's Lightweight Distributed Metric Service (LDMS) to enable scalable distributed collection of hardware performance counter data. These modules provide an interface by which LDMS can utilize the PAPI library, Linux perf tools, and RAPL to collect hardware performance data of interest. Using these samplers, we plan to monitor the intra-node behavior of multi-core applications, including contention for node-level shared resources, across a diverse set of use cases. We are currently exploring how the reported values are affected by the level of concurrency, the synchronization methodologies, and the progress guarantees. We hope to use this information to identify ways to restructure algorithms to increase their performance.
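
For reference, the snippet below shows the kind of raw PAPI counter read that such a sampler module wraps and publishes periodically. This is plain PAPI usage, not LDMS sampler code, and the event choices and workload are illustrative.

```cpp
#include <cstdio>
#include <papi.h>

int main() {
    int event_set = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    if (PAPI_create_eventset(&event_set) != PAPI_OK) return 1;
    PAPI_add_event(event_set, PAPI_TOT_CYC);   // total cycles
    PAPI_add_event(event_set, PAPI_TOT_INS);   // total instructions retired

    PAPI_start(event_set);
    volatile double x = 0.0;                   // stand-in workload to measure
    for (int i = 0; i < 1000000; ++i) x += i * 0.5;
    PAPI_stop(event_set, values);

    std::printf("cycles=%lld instructions=%lld\n", values[0], values[1]);
    return 0;
}
```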


Static analysis techniques for semiautomatic synthesis of message passing software skeletons

ACM Transactions on Modeling and Computer Simulation

Sottile, Matthew; Dagit, Jason; Zhang, Deli; Hendry, Gilbert; Dechev, Damian D.

The design of high-performance computing architectures requires performance analysis of large-scale parallel applications to derive various parameters concerning hardware design and software development. The process of analyzing and benchmarking an application's performance can be carried out in several ways with varying degrees of fidelity. One of the most cost-effective approaches is a coarse-grained study of large-scale parallel applications through the use of program skeletons. The "program skeleton" we discuss in this article is an abstracted program derived from a larger program by removing source code determined to be irrelevant to the skeleton's purpose. In this work, we develop a semiautomatic approach for extracting program skeletons based on compiler program analysis. We demonstrate the correctness of our skeleton-extraction process by comparing details from communication traces, and we show the performance speedup of using skeletons by running simulations in the SST/macro simulator.
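
As a hand-written illustration of the idea (the paper derives skeletons semiautomatically via compiler analysis), the sketch below shows one step of a ring exchange after skeletonization: the local computation is removed and replaced by a call to a hypothetical compute-delay hook, while the MPI communication structure that the simulator needs is preserved.

```cpp
#include <mpi.h>

// Hypothetical stand-in for a simulator's compute-model hook; in an online
// simulation this delay would be charged as simulated time, not executed.
static void model_compute_delay(double /*seconds*/) {}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;
    double send_halo[n] = {0}, recv_halo[n];

    for (int step = 0; step < 10; ++step) {
        // heavy_stencil_update(grid, n);   // removed: irrelevant to comms
        model_compute_delay(1.5e-3);        // modeled cost of removed code

        int left  = (rank + nprocs - 1) % nprocs;
        int right = (rank + 1) % nprocs;
        MPI_Sendrecv(send_halo, n, MPI_DOUBLE, right, 0,   // communication
                     recv_halo, n, MPI_DOUBLE, left,  0,   // kept verbatim
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```

Because the skeleton executes only the communication and a cheap delay model, it runs far faster under simulation while still exercising the message-passing behavior the simulator must validate.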


A lock-free priority queue design based on multi-dimensional linked lists

IEEE Transactions on Parallel and Distributed Systems

Dechev, Damian D.; Zhang, Deli Z.

The throughput of concurrent priority queues is pivotal to multiprocessor applications such as discrete event simulation, best-first search, and task scheduling. Existing lock-free priority queues are mostly based on skiplists, which probabilistically create shortcuts in an ordered list for fast insertion of elements. The use of skiplists eliminates the need for global rebalancing in balanced search trees and ensures logarithmic sequential search time on average, but the worst-case performance is linear in the input size. In this paper, we propose a quiescently consistent lock-free priority queue based on a multi-dimensional list that guarantees a worst-case search time of O(log N) for a key universe of size N. The novel multi-dimensional list (MDList) is composed of nodes that contain multiple links to child nodes arranged by their dimensionality. The insertion operation works by first injectively mapping the scalar key to a high-dimensional vector, then uniquely locating the target position by using the vector as coordinates. Nodes in an MDList are ordered by their coordinate prefixes, and this ordering property is readily maintained during insertion without rebalancing or randomization. In our experimental evaluation using a micro-benchmark, our priority queue achieves an average of 50% speedup over state-of-the-art approaches under high concurrency.
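
The key-to-coordinate mapping is easy to picture in code. Below is a minimal sketch of the assumed form of that mapping: with a key universe of size N = 2^32 and D = 8 dimensions, each coordinate is one base-16 digit of the key (16 = N^(1/D)), so distinct keys map to distinct coordinate vectors and lexicographic order on coordinates matches numeric order on keys. The constants and names are illustrative, not the paper's.

```cpp
#include <array>
#include <cstdint>

constexpr int      D    = 8;    // dimensionality (illustrative choice)
constexpr uint32_t BASE = 16;   // N^(1/D) for a key universe of N = 2^32

// Injectively map a scalar key to a D-dimensional coordinate vector by
// splitting it into base-BASE digits, most significant digit first. Nodes
// that share a coordinate prefix share an ancestor in the MDList.
std::array<uint32_t, D> key_to_coords(uint32_t key) {
    std::array<uint32_t, D> c{};
    for (int i = D - 1; i >= 0; --i) {
        c[i] = key % BASE;
        key /= BASE;
    }
    return c;
}
```

For example, key_to_coords(0x12345678) yields {1, 2, 3, 4, 5, 6, 7, 8}; a search then descends one dimension per matched coordinate, bounding the path length by D = log_BASE(N).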


Tools for enabling automatic validation of large-scale parallel application simulations

Proceedings - 30th International Conference on Software Maintenance and Evolution, ICSME 2014

Zhang, Deli; Hendry, Gilbert; Dechev, Damian D.

Validation is highly important for parallel-application simulations with a large number of parameters; the process can vary depending on the structure of the simulator and the granularity of the models used. Common practice involves calculating the percentage error between the projected and the real execution time of a benchmark program. However, this coarse-grained approach often suffers from a parameter-insensitivity problem in regions of a high-dimensional parameter space. In this work we demonstrate the use of our fine-grained validation toolset to capture and compare the statistical characteristics of a parallel application's execution. It is the first toolset to apply fine-grained statistics to large-scale simulation validation, and our experimental evaluation shows that it offers a significant improvement in fidelity compared to validation using total execution time.
