The 2022 National Defense Strategy of the United States listed climate change as a serious threat to national security. Climate intervention methods, such as stratospheric aerosol injection, have been proposed as mitigation strategies, but the downstream effects of such actions on a complex climate system are not well understood. The development of algorithmic techniques for quantifying relationships between source and impact variables related to a climate event (i.e., a climate pathway) would help inform policy decisions. Data-driven deep learning models have become powerful tools for modeling highly nonlinear relationships and may provide a route to characterize climate variable relationships. In this paper, we explore the use of an echo state network (ESN) for characterizing climate pathways. ESNs are a computationally efficient neural network variation designed for temporal data, and recent work proposes ESNs as a useful tool for forecasting spatiotemporal climate data. However, like other neural networks, ESNs are noninterpretable black-box models. The lack of model transparency poses a hurdle for understanding variable relationships. We address this issue by developing feature importance methods for ESNs in the context of spatiotemporal data to quantify variable relationships captured by the model. We conduct a simulation study to assess and compare the feature importance techniques, and we demonstrate the approach on reanalysis climate data. In the climate application, we consider a time period that includes the 1991 volcanic eruption of Mount Pinatubo. This event was a significant stratospheric aerosol injection, which acts as a proxy for an anthropogenic stratospheric aerosol injection. We are able to use the proposed approach to characterize relationships between pathway variables associated with this event that agree with relationships previously identified by climate scientists.
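As a concrete illustration of the kind of feature importance computation described above, the following minimal Python sketch applies permutation importance to a generic fitted temporal model. The model object, data shapes, and mean-squared-error loss are illustrative assumptions, not the paper's specific ESN implementation or its spatiotemporal importance method.

import numpy as np

def permutation_importance(model, X, y, n_repeats=10, rng=None):
    """Estimate feature importance by measuring the increase in prediction
    error when each input feature is randomly permuted across time.

    X : array (n_times, n_features) of lagged climate inputs (illustrative)
    y : array (n_times,) of the impact variable
    model : any fitted object with a .predict(X) method (hypothetical wrapper)
    """
    rng = np.random.default_rng(rng)
    base_error = np.mean((model.predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        errors = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # permuting one feature breaks its link to the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            errors.append(np.mean((model.predict(X_perm) - y) ** 2))
        # rise in MSE relative to the unpermuted baseline = importance
        importances[j] = np.mean(errors) - base_error
    return importances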
As global temperatures continue to rise, climate mitigation strategies such as stratospheric aerosol injections (SAI) are increasingly discussed, but the downstream effects of these strategies are not well understood. As such, there is interest in developing statistical methods to quantify the evolution of climate variable relationships during the time period surrounding an SAI. Feature importance applied to echo state network (ESN) models has been proposed as a way to understand the effects of SAI using a data-driven model. This approach depends on the ESN fitting the data well. If it does not, the feature importance may place importance on features that are not representative of the underlying relationships. Typically, time series prediction models such as ESNs are assessed using out-of-sample performance metrics that divide the time series into separate training and testing sets. However, this model assessment approach is geared towards forecasting applications and not scenarios such as the motivating SAI example, where the objective is using a data-driven model to capture variable relationships. In this paper, we demonstrate a novel use of climate model replicates to investigate the applicability of the commonly used repeated hold-out model assessment approach for the SAI application. Simulations of an SAI are generated using a simplified climate model, and different initialization conditions are used to provide independent training and testing sets containing the same SAI event. The climate model replicates enable out-of-sample measures of model performance, which are compared to the single time series hold-out validation approach. For our case study, we find that the repeated hold-out performance is comparable to, though conservative relative to, the replicate out-of-sample performance when the training set contains enough time after the aerosol injection.
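For illustration, the sketch below contrasts the two assessment schemes discussed above: a single-series hold-out split versus out-of-sample evaluation on an independent climate model replicate of the same SAI event. The fitting callable, data layout, and RMSE metric are assumptions for the example, not the study's exact procedure.

import numpy as np

def holdout_rmse(fit_model, series_x, series_y, train_frac=0.8):
    """Single-series hold-out: train on the early portion, test on the rest."""
    cut = int(train_frac * len(series_y))
    model = fit_model(series_x[:cut], series_y[:cut])
    pred = model.predict(series_x[cut:])
    return np.sqrt(np.mean((pred - series_y[cut:]) ** 2))

def replicate_rmse(fit_model, train_rep, test_rep):
    """Replicate out-of-sample: train on one climate model replicate of the
    SAI event, test on an independent replicate of the same event."""
    x_train, y_train = train_rep
    x_test, y_test = test_rep
    model = fit_model(x_train, y_train)
    pred = model.predict(x_test)
    return np.sqrt(np.mean((pred - y_test) ** 2))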
Physical experiments are often expensive and time-consuming. Test engineers must certify the compatibility of aircraft and their weapon systems before they can be deployed in the field, but the testing required is time-consuming, expensive, and resource-limited. Adopting Bayesian adaptive designs is a promising way to borrow from the successes seen in the clinical trials domain. The use of predictive probability (PP) to stop testing early and make faster decisions is particularly appealing given the aforementioned constraints. Given the high-consequence nature of the tests performed in the national security space, a strong understanding of new methods is required before they are deployed. Although PP has been thoroughly studied for binary data, there is less work with continuous data, where many reliability studies are interested in certifying the specification limits of components. A simulation study evaluating the robustness of this approach indicates that early stopping based on PP is reasonably robust to minor assumption violations, especially when only a few interim analyses are conducted. The simulation study also compares PP to conditional power, showing its relative strengths and weaknesses. A post-hoc analysis exploring whether release requirements of a weapon system from an aircraft are within specification with the desired reliability resulted in stopping the experiment early and saving 33% of the experimental runs.
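A minimal Monte Carlo sketch of the predictive probability idea for continuous data follows; the normal sampling model, the bootstrap-style posterior approximation, and the simplified success rule (empirical reliability of the completed data meeting a target) are illustrative assumptions rather than the analysis actually used.

import numpy as np

def predictive_probability(y_obs, n_total, spec_limit, rel_target=0.90,
                           n_sims=5000, rng=None):
    """Monte Carlo sketch of predictive probability: simulate the remaining
    runs, then check a simplified success rule (empirical reliability of the
    completed data meets the target). A full analysis would instead use a
    posterior probability statement under a chosen prior."""
    rng = np.random.default_rng(rng)
    n_rem = n_total - len(y_obs)
    successes = 0
    for _ in range(n_sims):
        # approximate posterior uncertainty in (mu, sigma) by resampling
        boot = rng.choice(y_obs, size=len(y_obs), replace=True)
        y_future = rng.normal(boot.mean(), boot.std(ddof=1), size=n_rem)
        y_full = np.concatenate([y_obs, y_future])
        successes += np.mean(y_full < spec_limit) >= rel_target
    # consider stopping early if the returned value is near 0 or 1
    return successes / n_sims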
Deep learning (DL) models have enjoyed increased attention in recent years because of their powerful predictive capabilities. While many successes have been achieved, standard deep learning methods suffer from a lack of uncertainty quantification (UQ). While the development of methods for producing UQ from DL models is an active area of current research, little attention has been given to the quality of the UQ produced by such methods. In order to deploy DL models to high-consequence applications, high-quality UQ is necessary. This report details the research and development conducted as part of a Laboratory Directed Research and Development (LDRD) project at Sandia National Laboratories. The focus of this project is to develop a framework of methods and metrics for the principled assessment of UQ quality in DL models. This report presents an overview of UQ quality assessment in traditional statistical modeling and describes why this approach is difficult to apply in DL contexts. An assessment on relatively simple simulated data is presented to demonstrate that UQ quality can differ greatly between DL models trained on the same data. A method for simulating image data that can then be used for UQ quality assessment is described. A general method for simulating realistic data for the purpose of assessing a model’s UQ quality is also presented. A Bayesian uncertainty framework for understanding uncertainty and existing metrics is described. Research that came out of collaborations with two university partners is discussed along with a software toolkit that is currently being developed to implement the UQ quality assessment framework as well as serve as a general guide to incorporating UQ into DL applications.
Physical fatigue can have adverse effects on humans in extreme environments. Therefore, being able to predict fatigue using easy-to-measure metrics such as heart rate (HR) signatures has the potential to make an impact in real-life scenarios. We apply a functional logistic regression model that uses HR signatures to predict physical fatigue, where physical fatigue is defined in a data-driven manner. Data were collected using commercially available wearable devices on 47 participants hiking the 20.7-mile Grand Canyon rim-to-rim trail in a single day. The fitted model provides good predictions and interpretable parameters for real-life application.
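A rough sketch of one way to fit a functional logistic regression of this kind is shown below: each heart-rate curve is projected onto a small basis and ordinary logistic regression is fit to the basis scores. The Fourier basis, common time grid, and scikit-learn fitter are assumptions made for brevity, not the paper's estimation procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_functional_logistic(hr_curves, fatigue_labels, n_basis=10):
    """Approximate functional logistic regression by projecting each HR curve
    onto a Fourier basis and fitting logistic regression on the scores.

    hr_curves      : array (n_subjects, n_timepoints), HR on a common grid
    fatigue_labels : array (n_subjects,) of 0/1 fatigue indicators
    """
    n, t = hr_curves.shape
    grid = np.linspace(0, 1, t)
    # build a simple Fourier basis evaluated on the time grid
    basis = [np.ones(t)]
    for k in range(1, n_basis // 2 + 1):
        basis.append(np.sin(2 * np.pi * k * grid))
        basis.append(np.cos(2 * np.pi * k * grid))
    B = np.column_stack(basis[:n_basis])          # (t, n_basis)
    scores = hr_curves @ B / t                    # approximate inner products
    model = LogisticRegression(max_iter=1000).fit(scores, fatigue_labels)
    return model, B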
This project evaluated the use of emerging spintronic memory devices for robust and efficient variational inference schemes. Variational inference (VI) schemes, which constrain the distribution for each weight to be a Gaussian distribution with a mean and standard deviation, are a tractable method for calculating posterior distributions of weights in a Bayesian neural network such that the network can still be trained using the powerful backpropagation algorithm. Our project focuses on domain-wall magnetic tunnel junctions (DW-MTJs), a powerful multi-functional spintronic synapse design that can achieve low-power switching while also opening a pathway towards repeatable, analog operation using fabricated notches. Our initial efforts to employ DW-MTJs as an all-in-one stochastic synapse encoding both a mean and a standard deviation did not meet the quality metrics for hardware-friendly VI; new device stacks and methods for expressive anisotropy modification may yet make this idea possible. However, as a fallback that immediately satisfies our requirements, we invented and detailed how the combination of a DW-MTJ synapse encoding the mean and a probabilistic Bayes-MTJ device, programmed via a ferroelectric or ionically modifiable layer, can robustly and expressively implement VI. This design includes a small physics-informed circuit model, which was scaled up to demonstrate rigorous uncertainty quantification applications, up to and including small convolutional networks on a grayscale image classification task and larger (residual) networks implementing multi-channel image classification. Lastly, because these results all assume an inference application in which weights (spintronic memory states) remain non-volatile, the retention of these synapses for the notched case was further interrogated. These investigations revealed and emphasized the importance of both notch geometry and anisotropy modification for further enhancing the endurance of written spintronic states. In the near future, these results will be mapped to effective predictions of DW-MTJ memory retention at room and elevated operating temperatures and experimentally verified when devices become available.
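For readers unfamiliar with mean-field VI, the following PyTorch sketch shows the software analogue of the weight parameterization described above, with a learnable mean and standard deviation per weight sampled via the reparameterization trick. The layer definition, initialization values, and softplus transform are illustrative assumptions; the hardware mapping of mean and spread to DW-MTJ and Bayes-MTJ devices is only noted in the comments.

import torch
import torch.nn as nn

class GaussianVILinear(nn.Module):
    """Minimal mean-field VI linear layer: each weight has a learnable mean and
    a learnable standard deviation (sigma = softplus(rho)), and weights are
    sampled with the reparameterization trick so backpropagation still applies.
    In the proposed hardware, the mean would map to a DW-MTJ state and the
    spread to a probabilistic Bayes-MTJ device; this is a software analogue."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.rho = nn.Parameter(torch.full((out_features, in_features), -3.0))

    def forward(self, x):
        sigma = torch.nn.functional.softplus(self.rho)
        eps = torch.randn_like(sigma)        # fresh noise each forward pass
        weight = self.mu + sigma * eps       # reparameterized weight sample
        return torch.nn.functional.linear(x, weight)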
Inverse prediction models have commonly been developed to handle scalar data from physical experiments. However, it is not uncommon for data to be collected in functional form. When data are collected in functional form, they must be aggregated to fit the form of traditional methods, which often results in a loss of information. For expensive experiments, this loss of information can be costly. In this study, we introduce the functional inverse prediction (FIP) framework, a general approach that uses the full information in functional response data to provide inverse predictions with probabilistic prediction uncertainties obtained with the bootstrap. The FIP framework can be modified by practitioners to accommodate many different applications and types of data. We demonstrate the framework, highlighting points of flexibility, with a simulation example and applications to weather data and to nuclear forensics. Results show how functional models can improve the accuracy and precision of predictions.
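The bootstrap-based inverse prediction idea can be sketched as follows: refit a user-supplied forward model on resampled training data, invert each refit by grid search against the new functional response, and summarize the resulting distribution. The fitting callable, L2 discrepancy, grid search, and percentile interval are assumptions chosen for illustration; the FIP framework leaves these choices to the practitioner.

import numpy as np

def functional_inverse_predict(fit_forward, X_train, Y_train, y_new,
                               x_grid, n_boot=200, rng=None):
    """Bootstrap sketch of inverse prediction from a functional response.

    X_train : (n, p) scalar inputs; Y_train : (n, t) response curves
    y_new   : (t,) observed curve to invert
    x_grid  : (m, p) candidate inputs searched during inversion
    """
    rng = np.random.default_rng(rng)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_train), size=len(X_train))
        model = fit_forward(X_train[idx], Y_train[idx])   # user-supplied fitter
        curves = model.predict(x_grid)                    # (m, t) predicted curves
        errs = np.sum((curves - y_new) ** 2, axis=1)      # L2 discrepancy
        preds.append(x_grid[np.argmin(errs)])             # grid-search inverse
    preds = np.array(preds)
    # bootstrap mean and 95% percentile interval for the inverse prediction
    return preds.mean(axis=0), np.percentile(preds, [2.5, 97.5], axis=0)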
Neural networks (NN) have become nearly ubiquitous in image classification, but in their standard form produce point estimates, with no measure of confidence. Bayesian neural networks (BNN) provide uncertainty quantification (UQ) for NN predictions and estimates through the posterior distribution. As NN are applied in more high-consequence applications, UQ is becoming a requirement. Automating systems can save time and money, but only if the operator can trust what the system outputs. BNN provide a solution to this problem by not only giving accurate predictions and estimates, but also an interval that includes reasonable values within a desired probability. Despite their positive attributes, BNN are notoriously difficult and time-consuming to train. Traditional Bayesian methods use Markov Chain Monte Carlo (MCMC), but this is often brushed aside as being too slow. The most common method is variational inference (VI) due to its fast computation, but there are multiple concerns with its efficacy. MCMC is the gold standard and, given enough time, will produce the correct result. VI, alternatively, is an approximation that converges asymptotically. Unfortunately (or fortunately), high-consequence problems often do not live in the land of asymptopia, so solutions like MCMC are preferable to approximations. We apply and compare MCMC- and VI-trained BNN in the context of target detection in hyperspectral imagery (HSI), where materials of interest can be identified by their unique spectral signature. This is a challenging field due to the numerous perturbing effects that practical collection of HSI has on measured spectra. Both models are trained using out-of-the-box tools on a high-fidelity HSI target detection scene. Both MCMC- and VI-trained BNN perform well overall at target detection on a simulated HSI scene. Splitting the test set predictions into two classes, high-confidence and low-confidence predictions, presents a path to automation. For the MCMC-trained BNN, the high-confidence predictions have a 0.95 probability of detection with a false alarm rate of 0.05 when considering pixels with target abundance of 0.2. VI-trained BNN have a 0.25 probability of detection for the same, but their performance on high-confidence sets matched MCMC for abundances >0.4. However, the VI-trained BNN on this scene required significant expert tuning to get these results, while MCMC worked immediately. On neither scene was MCMC prohibitively time-consuming, as is often assumed, but the networks we used were relatively small. This paper provides an example of how to utilize the benefits of UQ and also aims to increase awareness that different training methods can give different results for the same model. If sufficient computational resources are available, the best approach, rather than the fastest or most efficient, should be used, especially for high-consequence problems.
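The confidence-based split mentioned above can be illustrated with the short sketch below, which divides pixels into high- and low-confidence sets from posterior predictive target probabilities and reports detection and false alarm rates on the high-confidence set. The 0.9 cutoff, the 0.5 detection threshold, and the input layout are illustrative assumptions, not the exact procedure used in the paper.

import numpy as np

def confidence_split_metrics(post_probs, labels, conf_cut=0.9):
    """Split BNN test predictions into high- and low-confidence sets using the
    posterior predictive target probability, then report probability of
    detection (PD) and false alarm rate (FAR) on the high-confidence set.

    post_probs : (n_draws, n_pixels) posterior samples of P(target) per pixel
    labels     : (n_pixels,) true 0/1 target indicators
    """
    mean_p = post_probs.mean(axis=0)
    # "high confidence" = predictive probability far from the 0.5 boundary
    high = np.abs(mean_p - 0.5) >= (conf_cut - 0.5)
    detect = mean_p >= 0.5
    pd = np.mean(detect[high & (labels == 1)])   # detections among true targets
    far = np.mean(detect[high & (labels == 0)])  # detections among background
    return {"n_high_conf": int(high.sum()), "PD": pd, "FAR": far}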
Traditional deep learning (DL) models are powerful classifiers, but many approaches do not provide uncertainties for their estimates. Uncertainty quantification (UQ) methods for DL models have received increased attention in the literature due to their usefulness in decision making, particularly for high-consequence decisions. However, there has been little research on how to evaluate the quality of such methods. We use the statistical methods of frequentist interval coverage and interval width to evaluate the quality of credible intervals, and expected calibration error to evaluate classification predicted confidence. These metrics are evaluated on Bayesian neural networks (BNN) fit using Markov Chain Monte Carlo (MCMC) and variational inference (VI), bootstrapped neural networks (NN), Deep Ensembles (DE), and Monte Carlo (MC) dropout. We apply these UQ methods for DL to a hyperspectral image target detection problem and show the inconsistency of the different methods' results and the necessity of a UQ quality metric. To reconcile these differences and choose a UQ method that appropriately quantifies the uncertainty, we create a simulated data set with a fully parameterized probability distribution for a two-class classification problem. The gold standard MCMC performs the best overall, and the bootstrapped NN is a close second, requiring the same computational expense as DE. Through this comparison, we demonstrate that, for a given data set, different models can produce uncertainty estimates of markedly different quality. This in turn points to a great need for principled assessment methods of UQ quality in DL applications.
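For reference, the metrics named above can be computed with a few lines of NumPy, as in the hedged sketch below; the binary-classification form of expected calibration error and the equal-width confidence bins are simplifying assumptions.

import numpy as np

def interval_coverage_width(lower, upper, truth):
    """Frequentist coverage and average width of credible/prediction intervals."""
    covered = (truth >= lower) & (truth <= upper)
    return covered.mean(), (upper - lower).mean()

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for binary classification: bin predictions by confidence and compare
    average confidence to observed accuracy within each bin."""
    conf = np.where(probs >= 0.5, probs, 1 - probs)
    pred = (probs >= 0.5).astype(int)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf >= lo) & (conf <= hi) if hi == 1.0 else (conf >= lo) & (conf < hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece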
Deep neural networks (NNs) typically outperform traditional machine learning (ML) approaches for complicated, non-linear tasks. It is expected that deep learning (DL) should offer superior performance for the important non-proliferation task of predicting explosive device configuration based upon observed optical signature, a task with which human experts struggle. However, supervised machine learning is difficult to apply in this mission space because most recorded signatures are not associated with the corresponding device description, or “truth labels.” This is challenging for NNs, which traditionally require many samples for strong performance. Semi-supervised learning (SSL), low-shot learning (LSL), and uncertainty quantification (UQ) for NNs are emerging approaches that could bridge the mission gaps of few labels and rare samples of importance. NN explainability techniques are important in gaining insight into the inferential feature importance of such a complex model. In this work, SSL, LSL, and UQ are merged into a single framework, a combination not previously demonstrated and a significant technical hurdle. Exponential Average Adversarial Training (EAAT) and Pairwise Neural Networks (PNNs) are chosen as the SSL and LSL methods of choice. Permutation feature importance (PFI) for functional data is used to provide explainability via the Variable importance Explainable Elastic Shape Analysis (VEESA) pipeline. A variety of uncertainty quantification approaches are explored: Bayesian Neural Networks (BNNs), ensemble methods, concrete dropout, and evidential deep learning. Two final approaches, one utilizing ensemble methods and one utilizing evidential learning, are constructed and compared using a well-quantified synthetic 2D dataset along with the DIRSIG Megascene.
Deep learning (DL) has been widely proposed for target detection in hyperspectral image (HSI) data. Yet, standard DL models produce point estimates at inference time, with no associated measure of uncertainty, which is vital in high-consequence HSI applications. In this work, we develop an uncertainty quantification (UQ) framework using deep ensemble (DE) learning, which builds upon the successes of DL-based HSI target detection while simultaneously providing UQ metrics. Specifically, we train an ensemble of convolutional deep learning detection models using one spectral prototype at a particular time of day and atmospheric condition. We find that our proposed framework is capable of accurate target detection in additional atmospheric conditions and times of day despite not being exposed to them during training. Furthermore, in comparison to Bayesian Neural Networks, another DL-based UQ approach, we find that DEs provide increased target detection performance while achieving comparable probabilities of detection at constant false alarm rates.
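A minimal sketch of the deep ensemble UQ step is given below: each independently trained detector scores the same pixels, the ensemble mean serves as the point estimate, and the across-member spread serves as a simple uncertainty measure. The detector interface and the use of the standard deviation as the UQ metric are assumptions for illustration.

import numpy as np

def ensemble_predict(models, X):
    """Deep-ensemble UQ: run each independently trained detector on the same
    pixels and summarize agreement across members.

    models : list of trained detectors, each with .predict(X) -> (n_pixels,)
    X      : (n_pixels, n_bands) hyperspectral pixels
    """
    scores = np.stack([m.predict(X) for m in models])   # (n_members, n_pixels)
    # mean detection score = point estimate; member spread = uncertainty
    return scores.mean(axis=0), scores.std(axis=0)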
Measurements of energy balance components (energy intake, energy expenditure, changes in energy stores) are often plagued with measurement error. Doubly-labeled water can measure energy intake (EI) with negligible error, but is expensive and cumbersome. An alternative approach that is gaining popularity is to use the energy balance principle, by measuring energy expenditure (EE) and change in energy stores (ES) and then back-calculating EI. Gold standard methods for EE and ES exist and are known to give accurate measurements, albeit at a high cost. We propose a joint statistical model to assess the measurement error in cheaper, non-intrusive measures of EE and ES. We let the unknown true EE and ES for individuals be latent variables, and model them using a bivariate distribution. We try both a bivariate Normal and a Dirichlet Process Mixture Model, and compare the results via simulation. Our approach is the first to account for the dependencies that exist in individuals’ daily EE and ES. We employ semiparametric regression with free-knot splines for measurements with error, and linear components for error-free covariates. We adopt a Bayesian approach to estimation and inference and use Reversible Jump Markov Chain Monte Carlo to generate draws from the posterior distribution. Based on the semiparametric regression, we develop a calibration equation that adjusts a cheaper, less reliable estimate closer to the true value. Along with this calibrated value, our method also gives credible intervals to assess uncertainty. A simulation study shows our calibration helps produce a more accurate estimate. Our approach compares favorably in terms of prediction to other commonly used models.
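As a simplified illustration of applying a fitted calibration equation with uncertainty, the sketch below propagates posterior draws of calibration coefficients to a cheap measurement and reports a posterior mean and 95% credible interval. The linear intercept/slope form is an assumption made for brevity; the paper uses free-knot spline calibration fit by Reversible Jump MCMC.

import numpy as np

def calibrate(cheap_measurement, posterior_draws):
    """Apply a fitted calibration equation to a cheap EE or ES measurement.

    posterior_draws : (n_draws, 2) array of (intercept, slope) samples from the
                      posterior of a calibration model (illustrative linear form)
    Returns the calibrated posterior mean and a 95% credible interval.
    """
    calibrated = posterior_draws[:, 0] + posterior_draws[:, 1] * cheap_measurement
    lower, upper = np.percentile(calibrated, [2.5, 97.5])
    return calibrated.mean(), (lower, upper)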