Sandia National Laboratories is a premier United States national security laboratory that develops science-based technologies in areas such as nuclear deterrence, energy production, and climate change. Computing plays a key role in its diverse missions, and within that environment, Research Software Engineers (RSEs) and other scientific software developers use testing automation to ensure the quality and maintainability of their work. We conducted a Participatory Action Research study to explore the challenges of and strategies for testing automation through the lens of the academic literature. From the experiences collected and a comparison with the open literature, we identify challenges in testing automation and present mitigation strategies, grounded in evidence-based practice and experience reports, that other, similar institutions can assess for their own automation needs.
A discussion of systems like ChatGPT: the legal issues they may raise, how they are affecting society, and the ethical considerations surrounding their existence and use.
It is essential to Sandia National Laboratories' continued success in scientific and technological advances and mission delivery to embrace a hybrid workforce culture under which current and future employees can thrive. This report focuses on the findings of the Hybrid Work Team for the Center for Computing Research, which met weekly from March to June 2023 and conducted a survey across the Center at Sandia. Conclusions in this report are drawn from the 9 authors of this report, who comprise the Hybrid Work Team, and 15 responses to a center-wide survey, as well as numerous conversations with colleagues. A major finding was widespread dissatisfaction with the quantity, execution, and tooling surrounding formal meetings with remote participants. While there was consensus that remote work enables people to produce high-quality individual and technical work, there was also consensus that there was widespread social disconnect, with particular concern about hires made after the onset of the COVID-19 pandemic. There were many concerns about tooling and policy to facilitate remote collaboration both within Sandia and with its external collaborators. This report includes recommendations for mitigating these problems. For problems for which obvious recommendations cannot be made, ideas of what a successful solution might look like are presented.
Scientific discovery increasingly relies on interoperable, multimodular workflows generating intermediate data. The complexity of managing intermediate data may cause performance losses or unexpected costs. This paper defines an approach to composing these scientific workflows on cloud services, focusing on workflow data orchestration, management, and scalability. We demonstrate the effectiveness of our approach with the SOMOSPIE scientific workflow that deploys machine learning (ML) models to predict high-resolution soil moisture using an HPC service (LSF) and an open-source cloud-native service (K8s) and object storage. Our approach enables scientists to scale from coarse-grained to fine-grained resolution and from a small to a larger region of interest. Using our empirical observations, we generate a cost model for the execution of workflows with hidden intermediate data on cloud services.
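The abstract above mentions a cost model for executing workflows with hidden intermediate data on cloud services. The sketch below is not the paper's actual model; it is a minimal, hedged illustration of how such a cost might be decomposed, with hypothetical cost components (per-node compute rate, object-storage retention charge for intermediate data, and egress) and made-up rates.

```python
# Hypothetical cost-model sketch (not the paper's model): estimate the cost of
# one workflow execution when intermediate data is retained on object storage.

def workflow_cost(stage_hours, node_rate_usd_per_hour,
                  intermediate_gb, storage_rate_usd_per_gb_month,
                  retention_days, egress_gb, egress_rate_usd_per_gb):
    """Return (compute, storage, egress, total) cost in USD."""
    compute = sum(stage_hours) * node_rate_usd_per_hour
    # Intermediate data is billed for the fraction of a month it is retained.
    storage = intermediate_gb * storage_rate_usd_per_gb_month * (retention_days / 30.0)
    egress = egress_gb * egress_rate_usd_per_gb
    return compute, storage, egress, compute + storage + egress

# Example with invented numbers: three stages and 200 GB of intermediate data.
print(workflow_cost([2.0, 5.5, 1.0], 3.06, 200, 0.023, 7, 50, 0.09))
```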
The use of containerization technology in high performance computing (HPC) workflows has substantially increased recently because it makes workflows much easier to develop and deploy. Although many HPC workflows involve multiple datasets and multiple applications, they have traditionally all been bundled together into one monolithic container. This hinders the ability to trace the thread of execution, preventing scientists from establishing data provenance or achieving workflow reproducibility. To provide a solution to this problem, we extend the functionality of a popular HPC container runtime, Singularity. We implement both the ability to compose fine-grained containerized workflows and to execute these workflows within the Singularity runtime with automatic metadata collection. Specifically, the new functionality collects a record trail of execution and creates data provenance. The use of our augmented Singularity is demonstrated with an earth science workflow, SOMOSPIE. The workflow is composed via our augmented Singularity, which creates fine-grained containers and collects the metadata to trace, explain, and reproduce the prediction of soil moisture at a fine resolution.
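The paper extends the Singularity runtime itself; the stand-alone wrapper below is only an illustrative sketch of the kind of provenance record such an extension could collect around each containerized stage. The function name, record fields, and file names are assumptions, not the authors' implementation.

```python
# Illustrative provenance-collection sketch around `singularity exec` (the real
# work modifies the runtime; this wrapper only demonstrates the metadata idea).
import hashlib, json, subprocess, time
from pathlib import Path

def run_stage(image, command, inputs, record_path="provenance.json"):
    """Run `singularity exec <image> <command...>` and append a metadata record."""
    start = time.time()
    result = subprocess.run(["singularity", "exec", image] + command,
                            capture_output=True, text=True)
    record = {
        "image": image,
        "image_sha256": hashlib.sha256(Path(image).read_bytes()).hexdigest(),
        "command": command,
        "inputs": {p: hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in inputs},
        "start": start,
        "duration_s": time.time() - start,
        "returncode": result.returncode,
    }
    records = json.loads(Path(record_path).read_text()) if Path(record_path).exists() else []
    records.append(record)
    Path(record_path).write_text(json.dumps(records, indent=2))
    return result

# Example (hypothetical image and script names):
# run_stage("preprocess.sif", ["python", "extract_terrain.py"], ["dem.tif"])
```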
The National Academy of Sciences, Engineering, and Medicine (NASEM) defines reproducibility as 'obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis,' and replicability as 'obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data' [1]. Due to an increasing number of applications of artificial intelligence and machine learning (AI/ML) to fields such as healthcare and digital medicine, there is a growing need for verifiable AI/ML results, and therefore reproducible research and replicable experiments. This paper establishes examples of irreproducible AI/ML applications to medical sciences and quantifies the variance of common AI/ML models (Artificial Neural Network, Naive Bayes, and Random Forest classifiers) for tasks on medical data sets.
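A minimal sketch of how run-to-run variance of the named model families can be quantified across random seeds. It uses scikit-learn's built-in breast-cancer data set as a stand-in for the medical data sets studied in the paper; it is not the authors' experimental setup.

```python
# Quantify variance of common models across random seeds on a stand-in data set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "ANN": lambda seed: MLPClassifier(max_iter=1000, random_state=seed),
    "NaiveBayes": lambda seed: GaussianNB(),  # deterministic once the split is fixed
    "RandomForest": lambda seed: RandomForestClassifier(random_state=seed),
}
for name, make in models.items():
    scores = []
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
        scores.append(make(seed).fit(X_tr, y_tr).score(X_te, y_te))
    print(f"{name}: mean accuracy={np.mean(scores):.3f}  std={np.std(scores):.3f}")
```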
This work seeks to advance the state of the art in HPC I/O performance analysis and interpretation. In particular, we demonstrate effective techniques to: (1) model output performance in the presence of I/O interference from production loads; (2) build features from write patterns and key parameters of the system architecture and configurations; (3) employ suitable machine learning algorithms to improve model accuracy. We train models with five popular regression algorithms and conduct experiments on two distinct production HPC platforms. We find that the lasso and random forest models predict output performance with high accuracy on both of the target systems. We also explore use of the models to guide adaptation in I/O middleware systems, and show potential for improvements of at least 15% from model-guided adaptation on 70% of samples, and improvements up to 10× on some samples for both of the target systems.
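A hedged sketch of the modeling approach named above: train the two best-performing model families (lasso and random forest) on features describing write patterns and system configuration. The feature set and data here are synthetic stand-ins; the paper's features and targets come from production logs on the two platforms.

```python
# Train lasso and random forest regressors on synthetic I/O-performance features.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical features: aggregate write size, writer count, stripe count, load.
X = np.column_stack([
    rng.uniform(1, 1024, n),     # GiB written
    rng.integers(1, 4096, n),    # client processes
    rng.integers(1, 64, n),      # stripe count
    rng.uniform(0, 1, n),        # background load fraction
])
write_time = 0.02 * X[:, 0] / X[:, 2] + 0.001 * X[:, 1] + 5 * X[:, 3] + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, write_time, random_state=0)
for model in (Lasso(alpha=0.1), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, "R2 =", round(r2_score(y_te, model.predict(X_te)), 3))
```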
Several recent workshops conducted by the DOE Advanced Scientific Computing Research program have established that the complexity of developing applications and executing them on high-performance computing (HPC) systems is rising at a rate that will make it nearly impossible to continue to achieve higher levels of performance and scalability. Absent an alternative approach to managing this ever-growing complexity, HPC systems will become increasingly difficult to use. A more holistic approach to designing and developing applications and managing system resources is required. This paper outlines a research strategy for managing this increasing complexity by providing the programming environment, software stack, and hardware capabilities needed for autonomous resource management of HPC systems. Developing portable applications for a variety of HPC systems of varying scale requires a paradigm shift from the current approach, where applications are painstakingly mapped to individual machine resources, to an approach where machine resources are automatically mapped and optimized to applications as they execute. Achieving such automated resource management for HPC systems is a daunting challenge that requires significant sustained investment in exploring new approaches and novel capabilities in software and hardware that span the spectrum from programming systems to device-level mechanisms. This paper provides an overview of the functionality needed to enable autonomous resource management and optimization and describes the components currently being explored at Sandia National Laboratories to help support this capability.
Proceedings of PDSW 2021: IEEE/ACM 6th International Parallel Data Systems Workshop, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis
I/O performance in a multi-user environment is difficult to predict. Users do not know what I/O performance to expect when running and tuning applications. We propose to use the IO500 benchmark as a way to guide user expectations of their application's performance and to aid in identifying root causes of I/O problems that might come from the system. Our experiments describe how we manage user expectations with IO500 and provide a mechanism for system fault identification. This work also provides us with information about the tail latency problem that needs to be addressed and granular information about the impact of I/O technique choices (POSIX and MPI-IO).
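The sketch below illustrates the expectation-setting idea in the abstract, not the authors' tooling: compare an application's measured bandwidth against an IO500-derived reference for the matching access style and flag runs that fall far below expectation as candidates for system-side root-cause analysis. The reference numbers and access-style labels are invented.

```python
# Compare measured application bandwidth against a site IO500-derived reference.
IO500_REFERENCE_GIBS = {          # hypothetical site results for ior-style phases
    "posix_easy_write": 120.0,
    "posix_hard_write": 4.0,
    "mpiio_shared_write": 35.0,
}

def check_expectation(access_style, measured_gibs, tolerance=0.5):
    expected = IO500_REFERENCE_GIBS[access_style]
    ratio = measured_gibs / expected
    if ratio < tolerance:
        return (f"{access_style}: {measured_gibs} GiB/s is {ratio:.0%} of the "
                f"{expected} GiB/s reference; investigate system-side causes")
    return f"{access_style}: within expectation ({ratio:.0%} of reference)"

print(check_expectation("mpiio_shared_write", 9.0))
```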
Proceedings of PDSW 2021: IEEE/ACM 6th International Parallel Data Systems Workshop, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis
In high-performance computing (HPC), scientific applications often manage a massive amount of data using I/O libraries. These libraries provide convenient data model abstractions, help ensure data portability, and, most important, empower end users to improve I/O performance by tuning configurations across multiple layers of the HPC I/O stack. We propose SCTuner, an autotuner integrated within the I/O library itself to dynamically tune both the I/O library and the underlying I/O stack at application runtime. To this end, we introduce a statistical benchmarking method to profile the behaviors of individual supercomputer I/O subsystems with varied configurations across I/O layers. We use the benchmarking results as the built-in knowledge in SCTuner, implement an I/O pattern extractor, and plan to implement an online performance tuner as the SCTuner runtime. We conducted a benchmarking analysis on the Summit supercomputer and its GPFS file system Alpine. The preliminary results show that our method can effectively extract the consistent I/O behaviors of the target system under production load, building the base for I/O autotuning at application runtime.
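A conceptual sketch of the flow the SCTuner abstract describes: built-in knowledge from statistical benchmarking maps an extracted I/O pattern to tuned settings at runtime. The table contents, pattern classifier, and knob values below are invented for illustration (the knob names are standard ROMIO hints), and the real extractor and knowledge base are far richer.

```python
# Map an extracted I/O pattern to benchmark-derived tuning settings.
BENCHMARK_KNOWLEDGE = {
    # (pattern, aggregate-size bucket) -> best observed settings (illustrative)
    ("shared_file", "large"):      {"romio_cb_write": "enable",  "cb_nodes": 64},
    ("shared_file", "small"):      {"romio_cb_write": "disable", "cb_nodes": 1},
    ("file_per_process", "large"): {"romio_cb_write": "disable", "cb_nodes": 1},
}

def extract_pattern(write_log):
    """Toy pattern extractor: classify by file sharing and total bytes written."""
    shared = len({w["file"] for w in write_log}) == 1
    total = sum(w["bytes"] for w in write_log)
    return ("shared_file" if shared else "file_per_process",
            "large" if total > 1 << 30 else "small")

def tune(write_log):
    return BENCHMARK_KNOWLEDGE.get(extract_pattern(write_log), {})

print(tune([{"file": "out.h5", "bytes": 2 << 30}, {"file": "out.h5", "bytes": 1 << 30}]))
```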
Persistent memory (PMEM) devices can achieve comparable performance to DRAM while providing significantly more capacity. This has made the technology compelling as an expansion to main memory. Rethinking PMEM as storage devices can offer a high performance buffering layer for HPC applications to temporarily, but safely store data. However, modern parallel I/O libraries, such as HDF5 and pNetCDF, are complicated and introduce significant software and metadata overheads when persisting data to these storage devices, wasting much of their potential. In this work, we explore the potential of PMEM as storage through pMEMCPY: a simple, lightweight, and portable I/O library for storing data in persistent memory. We demonstrate that our approach is up to 2x faster than other popular parallel I/O libraries under real workloads.
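pMEMCPY itself is a parallel I/O library; the snippet below is not its API, only a minimal illustration of the underlying idea: treat persistent memory exposed as a DAX-mounted file as a byte-addressable buffer and persist data with a plain memory copy plus a flush, avoiding heavyweight library and metadata overheads. Paths and sizes are assumptions.

```python
# Persist an array by memory-mapping a (PMEM/DAX-backed) file and copying bytes.
import mmap, os
import numpy as np

def pmem_store(path, array, size=1 << 20):
    data = array.tobytes()
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        os.ftruncate(fd, size)
        with mmap.mmap(fd, size) as buf:
            buf[:len(data)] = data   # the "memcpy"
            buf.flush()              # push the data toward the persistence domain
    finally:
        os.close(fd)

# Example (a real deployment would point at a DAX/PMEM mount; any path works here):
pmem_store("/tmp/checkpoint.bin", np.arange(1000, dtype=np.float64))
```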
International Journal of High Performance Computing Applications
Childs, Hank; Ahern, Sean D.; Ahrens, James; Bauer, Andrew C.; Bennett, Janine C.; Bethel, E.W.; Bremer, Peer-Timo; Brugger, Eric; Cottam, Joseph; Dorier, Matthieu; Dutta, Soumya; Favre, Jean M.; Fogal, Thomas; Frey, Steffen; Garth, Christoph; Geveci, Berk; Godoy, William F.; Hansen, Charles D.; Harrison, Cyrus; Insley, Joseph; Johnson, Chris R.; Klasky, Scott; Knoll, Aaron; Kress, James; Foulk, James W.; Lofstead, Gerald F.; Ma, Kwan-Liu; Malakar, Preeti; Meredith, Jeremy; Moreland, Kenneth D.; Navratil, Paul; O'Leary, Patrick; Parashar, Manish; Pascucci, Valerio; Patchett, John; Peterka, Tom; Petruzza, Steve; Pugmire, David; Rasquin, Michel; Rizzi, Silvio; Rogers, David M.; Sane, Sudhanshu; Sauer, Franz; Sisneros, Johnny R.; Shen, Han-Wei; Usher, Will; Vickery, Rhonda; Vishwanath, Venkatram; Wald, Ingo; Wang, Ruonan; Weber, Gunther H.; Whitlock, Brad; Wolf, Matthew; Yu, Hongfeng; Ziegeler, Sean B.
The term “in situ processing” has evolved over the last decade to mean both a specific strategy for visualizing and analyzing data and an umbrella term for a processing paradigm. The resulting confusion makes it difficult for visualization and analysis scientists to communicate with each other and with their stakeholders. To address this problem, a group of over 50 experts convened with the goal of standardizing terminology. This paper summarizes their findings and proposes a new terminology for describing in situ systems. An important finding from this group was that in situ systems are best described via multiple, distinct axes: integration type, proximity, access, division of execution, operation controls, and output type. Here, they discuss these axes, evaluate existing systems within the axes, and explore how currently used terms relate to the axes.
A new in transit Data Service is presented and compared to the traditional file-based workflow and the newly refactored in situ Catalyst workflow. Each workflow is enabled by the IOSS mesh interface equipped with data management layers for Exodus and CGNS (file-based), Catalyst (in situ), and FAODEL (in transit). FAODEL is a distributed object store that can transmit data across MPI allocations. Catalyst is a ParaView-based visualization capability developed as part of the CSSE Data Services effort. The workflows considered here take SPARC data into Catalyst for visualization post-processing. Although still in unoptimized form, we show that the in transit approach is a viable alternative to file-based and in situ workflows and offers several advantages to both simulation and post-processing developers. Since IOSS is a mature interface with wide adoption across Sandia and externally, each workflow can be reconfigured to use different simulations that generate mesh data and post-processing tools that consume it.
Generally, scientific simulations load the entire simulation domain into memory because most, if not all, of the data changes with each time step. This has driven application structures that have, in turn, affected the design of popular IO libraries, such as HDF5, ADIOS, and NetCDF. This assumption makes sense for many cases, but there is also a significant collection of simulations where this approach results in vast swaths of unchanged data written each time step. This paper explores a new IO approach that is capable of stitching together a coherent global view of the total simulation space at any given time. This benefit is achieved with no performance penalty compared to running with the full data set in memory, with a radically smaller process count, and results in radical data reduction with no fidelity loss. Additionally, the structures employed enable online simulation monitoring.
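A minimal sketch (not the paper's library) of the stitching idea: persist only the blocks that changed between time steps, while keeping a per-step index that points unchanged blocks back at earlier copies so a coherent global view can be reassembled for any step. Block size, file layout, and index format are assumptions.

```python
# Write only changed blocks per step; keep an index that stitches the global view.
import hashlib, json, os
import numpy as np

BLOCK = 1024  # elements per block

def write_step(step, field, prev_hashes, outdir="steps"):
    """Persist only the blocks of `field` that changed since the previous step."""
    os.makedirs(outdir, exist_ok=True)
    prev_index = {}
    if step > 0:
        with open(f"{outdir}/index_{step - 1}.json") as f:
            prev_index = json.load(f)
    index, hashes = {}, {}
    for b in range(0, field.size, BLOCK):
        chunk = field[b:b + BLOCK]
        h = hashlib.sha256(chunk.tobytes()).hexdigest()
        hashes[b] = h
        if prev_hashes.get(b) != h:                 # changed: write a new copy
            fname = f"s{step}_b{b}.npy"
            np.save(f"{outdir}/{fname}", chunk)
            index[str(b)] = fname
        else:                                       # unchanged: reference older copy
            index[str(b)] = prev_index[str(b)]
    with open(f"{outdir}/index_{step}.json", "w") as f:
        json.dump(index, f)                         # coherent global view for this step
    return hashes

# Example: step 1 re-writes only the single block that changed.
field = np.zeros(4096)
h0 = write_step(0, field, {})
field[2048] = 1.0
write_step(1, field, h0)
```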
Xie, Bing; Oral, Sarp; Zimmer, Christopher; Choi, Jong Y.; Dillow, David; Klasky, Scott A.; Lofstead, Gerald F.; Chase, Jeffrey
This article studies the I/O write behaviors of the Titan supercomputer and its Lustre parallel file stores under production load. The results can inform the design, deployment, and configuration of file systems along with the design of I/O software in the application, operating system, and adaptive I/O libraries. We propose a statistical benchmarking methodology to measure write performance across I/O configurations, hardware settings, and system conditions. Moreover, we introduce two relative measures to quantify the write-performance behaviors of hardware components under production load. In addition to designing experiments and benchmarking on Titan, we verify the experimental results on one real application and one real application I/O kernel, XGC and HACC IO, respectively. These two are representative and widely used to address the typical I/O behaviors of applications. In summary, we find that Titan's I/O system is variable across the machine at fine time scales. This variability has two major implications. First, stragglers lessen the benefit of coupled I/O parallelism (striping). Peak median output bandwidths are obtained with parallel writes to many independent files, with no striping or write sharing of files across clients (compute nodes). I/O parallelism is most effective when the application (or its I/O libraries) distributes the I/O load so that each target stores files for multiple clients and each client writes files on multiple targets in a balanced way with minimal contention. Second, our results suggest that the potential benefit of dynamic adaptation is limited. In particular, it is not fruitful to attempt to identify "good locations" in the machine or in the file system: component performance is driven by transient load conditions and past performance is not a useful predictor of future performance. For example, we do not observe diurnal load patterns that are predictable.
Trusting simulation output is crucial for Sandia's mission objectives. Here, we rely on these simulations to perform our high-consequence mission tasks given national treaty obligations. Other science and modeling applications, while they may not be high-consequence, still require the strongest levels of trust to enable using the result as the foundation for both practical applications and future research. To this end, the computing community has developed workflow and provenance systems to aid both in automating simulation and modeling execution and in determining exactly how some output was created so that conclusions can be drawn from the data. Current approaches for workflows and provenance systems are all at the user level and have little to no system-level support, making them fragile, difficult to use, and incomplete solutions. The introduction of container technology is a first step towards encapsulating and tracking artifacts used in creating data and resulting insights, but their current implementation is focused solely on making it easy to deploy an application in an isolated "sandbox" and maintaining a strictly read-only mode to avoid any potential changes to the application. All storage activities still use the system-level shared storage. This project explores extending the container concept to include storage as a new container type we call data pallets. Data Pallets are potentially writeable, auto-generated by the system based on IO activities, and usable as a way to link the contained data back to the application and input deck used to create it.
Trusting simulation output is crucial for Sandia's mission objectives. We rely on these simulations to perform our high-consequence mission tasks given our treaty obligations. Other science and modelling needs, while they may not be high-consequence, still require the strongest levels of trust to enable using the result as the foundation for both practical applications and future research. To this end, the computing community has developed workflow and provenance systems to aid both in automating simulation and modelling execution and in determining exactly how some output was created so that conclusions can be drawn from the data. Current approaches for workflows and provenance systems are all at the user level and have little to no system-level support, making them fragile, difficult to use, and incomplete solutions. The introduction of container technology is a first step towards encapsulating and tracking artifacts used in creating data and resulting insights, but their current implementation is focused solely on making it easy to deploy an application in an isolated "sandbox" and maintaining a strictly read-only mode to avoid any potential changes to the application. All storage activities still use the system-level shared storage. This project was an initial exploration into extending the container concept to also include storage and to use writable containers, auto-generated by the system, as a way to link the contained data back to the simulation and input deck used to create it.
We introduce quiho, a framework for profiling application performance that can be used in automated performance regression tests. quiho profiles an application by applying sensitivity analysis, in particular statistical regression analysis (SRA), using application-independent performance feature vectors that characterize the performance of machines. The result of the SRA, feature importance specifically, is used as a proxy to identify hardware and low-level system software behavior. The relative importance of these features serves as a performance profile of an application (termed inferred resource utilization profile or IRUP), which is used to automatically validate performance behavior across multiple revisions of an application's code base without having to instrument code or obtain performance counters. We demonstrate that quiho can successfully discover performance regressions by showing its effectiveness in profiling application performance for synthetically introduced regressions as well as those found in real-world applications.
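A sketch of the core idea named above, under invented data and feature names: fit a regression model mapping machine performance-feature vectors to application runtime, and read its feature importances as the application's profile (the IRUP). This is an illustration of the technique, not quiho's implementation.

```python
# Derive an IRUP-style profile from regression feature importances.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

features = ["mem_bw", "mem_lat", "l2_bw", "flops", "rand_access"]  # hypothetical
rng = np.random.default_rng(1)
# One row of microbenchmark-derived features per machine, plus the runtime of
# the application under test on that machine (synthetic, memory-bandwidth bound).
X = rng.uniform(0.5, 2.0, size=(40, len(features)))
runtime = 3.0 / X[:, 0] + 1.0 / X[:, 3] + rng.normal(0, 0.05, 40)

model = RandomForestRegressor(random_state=0).fit(X, runtime)
irup = dict(zip(features, model.feature_importances_.round(3)))
print("inferred resource utilization profile:", irup)
# A regression test could compare this profile across code revisions and flag
# large shifts (e.g., by cosine distance) as potential performance regressions.
```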
HPDC 2017 - Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
Xie, Bing; Huang, Yezhou; Chase, Jeffrey S.; Choi, Jong Y.; Klasky, Scott; Lofstead, Gerald F.; Oral, Sarp
In this paper, we develop a predictive model useful for output performance prediction of supercomputer file systems under production load. Our target environment is Titan, the 3rd fastest supercomputer in the world, and its Lustre-based multi-stage write path. We observe from Titan that although output performance is highly variable at small time scales, the mean performance is stable and consistent over typical application run times. Moreover, we find that output performance is non-linearly related to its correlated parameters due to interference and saturation on individual stages on the path. These observations enable us to build a predictive model of expected write times of output patterns and I/O configurations, using feature transformations to capture non-linear relationships. We identify the candidate features based on the structure of the Lustre/Titan write path, and use feature transformation functions to produce a model space with 135,000 candidate models. By searching for the minimal mean square error in this space we identify a good model and show that it is effective.
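A toy version of the model-space search described above: apply a small set of candidate transformations to each feature, fit a linear model for every combination, and keep the one with the lowest mean squared error. The features, transformations, and data are synthetic stand-ins, and the real search covers a far larger space.

```python
# Search a small space of per-feature transformations for the lowest-MSE model.
from itertools import product
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

transforms = {"id": lambda v: v, "log": np.log1p, "sqrt": np.sqrt}
rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(500, 3))          # stand-in I/O configuration features
y = 2 * np.log1p(X[:, 0]) + 0.5 * np.sqrt(X[:, 1]) + 0.01 * X[:, 2]

best = None
for combo in product(transforms, repeat=X.shape[1]):
    Xt = np.column_stack([transforms[t](X[:, i]) for i, t in enumerate(combo)])
    mse = mean_squared_error(y, LinearRegression().fit(Xt, y).predict(Xt))
    if best is None or mse < best[0]:
        best = (mse, combo)
print("best transformation per feature:", best[1], "mse:", round(best[0], 6))
```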
Integrated Application Workflows (IAWs) run multiple simulation workflow components concurrently on an HPC resource connecting these components using compute area resources and compensating for any performance or data processing rate mismatches. These IAWs require high frequency and high volume data transfers between compute nodes and staging area nodes during the lifetime of a large parallel computation. The available network bandwidth between the two areas may not be enough to efficiently support the data movement. As the processing power available to compute resources increases, the requirements for this data transfer will become more difficult to satisfy and perhaps will not be satisfiable at all since network capabilities are not expanding at a comparable rate. Furthermore, energy consumption in HPC environments is expected to grow by an order of magnitude as exascale systems become a reality. The energy cost of moving large amounts of data frequently will contribute to this issue. It is necessary to reduce the volume of data without reducing the quality of data when it is being processed and analyzed. Delta resolves the issue by addressing the lifetime data transfer operations. Delta removes subsequent identical copies of already transmitted data during transfers and restores those copies once the data has reached the destination. Delta is able to identify duplicated information and determine the most space efficient way to represent it. Initial tests show about 50% reduction in data movement while maintaining the same data quality and transmission frequency.
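A simplified illustration of the deduplication idea behind Delta (not its actual wire protocol): split a buffer into fixed-size blocks, transmit each distinct block once alongside an ordering index, and rebuild the original buffer at the destination. Block size and the fixed-block chunking strategy are assumptions.

```python
# Deduplicate repeated blocks before transfer and restore them at the destination.
import hashlib

BLOCK = 4096

def encode(data: bytes):
    unique, order = {}, []
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        h = hashlib.sha256(block).hexdigest()
        unique.setdefault(h, block)      # each distinct block is sent once
        order.append(h)
    return unique, order, len(data)

def decode(unique, order, length):
    return b"".join(unique[h] for h in order)[:length]

payload = b"timestep-data" * 10000        # highly repetitive, like unchanged fields
unique, order, n = encode(payload)
assert decode(unique, order, n) == payload
print(f"blocks sent: {len(unique)} of {len(order)} "
      f"({1 - len(unique) / len(order):.0%} reduction)")
```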
ASC Level 2 Milestone, FY 2013 continuation. The L2 milestone revealed memory pressures from using in situ analysis; we developed tools to determine memory usage and reduced the memory footprint by more than 50%.
The Trilinos Project is an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. A unique design feature of Trilinos is its focus on packages. While the abstractions make it easy to incorporate advanced processing and data manipulation tools, it is not always obvious how to take advantage of these features. The Trios package, incorporated two years ago, offers general data management services but has yet to offer integrated support for core Trilinos data structures, such as those offered in the Tpetra package. An initial attempt to incorporate native Trilinos data structure support into Trios services revealed the complexity, from a non-mathematician's perspective, of using Trilinos. This project sought to understand the complexities and potential barriers not just for non-mathematicians that want to contribute to or use Trilinos, but potentially for new mathematically-inclined users as well who may want to offer services to support users. This report documents the challenges for Trios to offer some simple data manipulation required as a precursor to any direct data services integration and makes recommendations for clarifying the performance implications and general approach to use.
This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467, Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.
Discoveries driven by scientific computing frequently flow from workflows that use persistent storage as a staging area for data between operations. With bandwidth falling progressively further behind data sizes as we continue towards exascale, eliminating persistent storage through techniques like data staging will not only enable these workflows to continue operating, but also enable more interactive workflows, reducing the time to scientific discoveries. Data staging has been shown to be an effective way for applications running on high-end computing platforms to offload expensive I/O operations and to manage the tremendous amounts of data they produce. This data staging approach, however, lacks the ACID-style guarantees traditional straight-to-disk methods provide. Distributed transactions are a proven way to add ACID properties to data movements; however, distributed transactions follow 1xN data movement semantics, whereas our highly parallel HPC environments employ MxN data movement semantics. In this paper we present a novel protocol that extends distributed transaction terminology to include MxN semantics, which allows our data staging areas to benefit from ACID properties. We show that with our protocol we can provide resilient data staging with a limited performance penalty over current data staging implementations.
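Below is a highly simplified, in-process sketch of what MxN transactional semantics mean, not the paper's protocol: M writer ranks each push data to several of N staging servers, and the transaction commits only if every sub-operation votes yes (a two-phase-commit-style decision extended across the MxN fan-out). Class and function names are invented for illustration.

```python
# Toy MxN commit decision: all sub-operations must vote yes or the whole
# transaction aborts; the real protocol is distributed and fault-tolerant.
class StagingServer:
    def __init__(self):
        self.pending, self.committed = {}, {}
    def prepare(self, txid, rank, data):
        self.pending.setdefault(txid, {})[rank] = data
        return True                           # vote yes
    def commit(self, txid):
        self.committed.update(self.pending.pop(txid, {}))
    def abort(self, txid):
        self.pending.pop(txid, None)

def mxn_transaction(txid, writers, servers):
    votes = []
    for rank, data in writers.items():        # M writer ranks
        for s in servers:                      # fan out to N staging servers
            votes.append(s.prepare(txid, rank, data))
    if all(votes):
        for s in servers:
            s.commit(txid)
        return "committed"
    for s in servers:
        s.abort(txid)
    return "aborted"

servers = [StagingServer() for _ in range(3)]
print(mxn_transaction(1, {0: b"chunk-a", 1: b"chunk-b"}, servers))
```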