Publications Search

Validating the simulation of large-scale parallel applications using statistical characteristics

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Dechev, Damian D.; Zhang, Deli Z.; Hendry, Gilbert H.; Wilke, Jeremiah J.

Simulation is a widely adopted method to analyze and predict the performance of large-scale parallel applications. Validating the hardware model is highly important for complex simulations with a large number of parameters. Common practice involves calculating the percent error between the projected and the real execution time of a benchmark program. However, in a high-dimensional parameter space, this coarse-grained approach often suffers from parameter insensitivity, which may not be known a priori. Moreover, the traditional approach cannot be applied to the validation of software models, such as application skeletons used in online simulations. In this work, we present a methodology and a toolset for validating both hardware and software models by quantitatively comparing fine-grained statistical characteristics obtained from execution traces. Although statistical information has been used in tasks like performance optimization, this is the first attempt to apply it to simulation validation. Lastly, our experimental results show that the proposed evaluation approach offers significant improvement in fidelity when compared to evaluation using total execution time, and the proposed metrics serve as reliable criteria that progress toward automating the simulation tuning process.

More Details

TYPE Journal Article YEAR 2016

OSTI DOI

An Efficient Wait-Free Vector

IEEE Transactions on Parallel and Distributed Systems

Dechev, Damian D.; Feldman, Steven F.; Valeraleon, Carlos V.

The vector is a fundamental data structure, which provides constant-time access to a dynamically-resizable range of elements. Currently, there exist no wait-free vectors. The only non-blocking version supports only a subset of the sequential vector API and exhibits significant synchronization overhead caused by supporting opposing operations. Since many applications operate in phases of execution, wherein each phase only a subset of operations are used, this overhead is unnecessary for the majority of the application. To address the limitations of the non-blocking version, we present a new design that is wait-free, supports more of the operations provided by the sequential vector, and provides alternative implementations of key operations. These alternatives allow the developer to balance the performance and functionality of the vector as requirements change throughout execution. Compared to the known non-blocking version and the concurrent vector found in Intel’s TBB library, our design outperforms or provides comparable performance in the majority of tested scenarios. Over all tested scenarios, the presented design performs an average of 4.97 times more operations per second than the non-blocking vector and 1.54 more than the TBB vector. In a scenario designed to simulate the filling of a vector, performance improvement increases to 13.38 and 1.16 times. This work presents the first ABA-free non-blocking vector. Finally, unlike the other non-blocking approach, all operations are wait-free and bounds-checked and elements are stored contiguously in memory.

More Details

TYPE Journal Article YEAR 2016

OSTI DOI

Extending LDMS to enable performance monitoring in multi-core applications

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Feldman, Steven; Zhang, Deli; Dechev, Damian D.; Brandt, James M.

Identifying design patterns that limit the performance of multi-core algorithms is a challenging task. There are many known methods by which threads synchronize their actions and each method may exhibit different behavior in different use cases. These use cases may vary in regards to the workload being executed, number of parallel tasks, dependencies between these tasks, and the behavior of the system scheduler. Restructuring algorithms to overcome performance limitations requires intimate knowledge on how these algorithms utilize the hardware. In our experience, we have found a lack of adequate tools to gain such knowledge. To address this, we have enhanced and implemented additional data sampler modules for OVIS's Lightweight Distributed Metric Service (LDMS) to enable scalable distributed collection of hardware performance counter data. These modules provide an interface by which LDMS can utilize the PAPI library, Linux perf tools, and RAPL to collect hardware performance data of interest. Using these samplers, we plan to monitor the intra-node behavior, including contention for node level shared resources, of multi-core applications for a diverse set of use cases. We are currently exploring how the values reported are affected by the level of concurrency, the synchronization methodologies, and progress guarantees. We hope to use this information to identify ways to restructure algorithms to increase their performance.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI

Application Modeling for Scalable Simulation of Massively Parallel Systems

Wilke, Jeremiah J.; Dechev, Damian D.; Yalamanchili, Sudhakar Y.; Anger, Eric A.; Hendry, Gilbert H.

Abstract not provided.