In this paper we study the efficacy of combining machine-learning methods with projection-based model reduction techniques for creating data-driven surrogate models of computationally expensive, high-fidelity physics models. Such surrogate models are essential for many-query applications, e.g., engineering design optimization and parameter estimation, where the high-fidelity model must be invoked many times in sequence. Surrogate models are usually constructed for individual scalar quantities. However, there are scenarios where a spatially varying field needs to be modeled as a function of the model's input parameters. Here we develop a method to do so, using projections to represent spatial variability while a machine-learned model captures the dependence of the model's response on the inputs. The method is demonstrated by modeling the heat flux and pressure on the surface of the HIFiRE-1 geometry in a Mach 7.16 turbulent flow. The surrogate model is then used to perform Bayesian estimation of freestream conditions and parameters of the SST (Shear Stress Transport) turbulence model embedded in the high-fidelity (Reynolds-Averaged Navier–Stokes) flow simulator, using shock-tunnel data. The paper provides the first-ever Bayesian calibration of a turbulence model for complex hypersonic turbulent flows. We find that the primary issues in estimating the SST model parameters are the limited information content of the heat flux and pressure measurements and the large model-form error encountered in a certain part of the flow.
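A minimal sketch of the projection-plus-regression idea described above (not the paper's implementation): a POD basis extracted from field snapshots represents spatial variability, and a machine-learned regressor maps input parameters to the modal coefficients. All names, dimensions, and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative data: n_snap simulations, each producing a surface field of n_pts values,
# generated at parameter settings theta (n_snap x n_param). Replace with real model outputs.
rng = np.random.default_rng(0)
n_snap, n_pts, n_param = 50, 400, 3
theta = rng.uniform(-1.0, 1.0, size=(n_snap, n_param))
fields = np.sin(theta[:, :1] * np.linspace(0, np.pi, n_pts)) + 0.1 * theta[:, 1:2]

# 1. POD (SVD of mean-centered snapshots) captures the spatial variability.
mean_field = fields.mean(axis=0)
U, s, Vt = np.linalg.svd(fields - mean_field, full_matrices=False)
k = 5                                        # retained modes
basis = Vt[:k]                               # (k, n_pts) spatial modes
coeffs = (fields - mean_field) @ basis.T     # (n_snap, k) modal coefficients

# 2. A regressor learns the map from input parameters to modal coefficients.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(theta, coeffs)

# 3. Surrogate prediction of the full field at a new parameter setting.
theta_new = rng.uniform(-1.0, 1.0, size=(1, n_param))
field_pred = mean_field + model.predict(theta_new) @ basis
print(field_pred.shape)
```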
The capability to identify emergent technologies based upon easily accessed open-source indicators, such as publications, is important for decision-makers in industry and government. The scientific contribution of this work is the proposition of a machine learning approach to the detection of the maturity of emerging technologies based on publication counts. Time-series of publication counts have universal features that distinguish emerging and growing technologies. We train an artificial neural network classifier, a supervised machine learning algorithm, upon these features to predict the maturity (emergent vs. growth) of an arbitrary technology. With a training set comprising 22 technologies, we obtain a classification accuracy ranging from 58.3% to 100% with an average accuracy of 84.6% for six test technologies. To enhance classifier performance, we augmented the training corpus with synthetic time-series technology life cycle curves, formed by calculating weighted averages of curves in the original training set. Training the classifier on the synthetic data set resulted in improved accuracy, ranging from 83.3% to 100% with an average accuracy of 90.4% for the test technologies. The performance of our classifier exceeds that of competing machine learning approaches in the literature, which report average classification accuracies of at most 85.7%. Moreover, in contrast to current methods our approach does not require subject matter expertise to generate training labels, and it can be automated and scaled.
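A minimal sketch of the augmentation step described above, under the assumption that each technology is represented by a publication-count time series of equal length: random convex combinations of training curves that share a maturity label produce synthetic life-cycle curves with the same label.

```python
import numpy as np

def augment(curves, labels, n_new, rng=np.random.default_rng(0)):
    """Create synthetic time-series as random convex combinations of curves
    that share the same maturity label (emergent vs. growth)."""
    curves, labels = np.asarray(curves, float), np.asarray(labels)
    synth_x, synth_y = [], []
    for _ in range(n_new):
        lab = rng.choice(np.unique(labels))
        pool = curves[labels == lab]
        w = rng.dirichlet(np.ones(len(pool)))    # random weights summing to 1
        synth_x.append(w @ pool)                 # weighted average of curves
        synth_y.append(lab)
    return np.array(synth_x), np.array(synth_y)

# Illustrative usage with toy publication-count curves (rows = technologies).
toy = np.vstack([np.linspace(0, 1, 20) ** 2,      # "emergent"-like shapes
                 np.linspace(0, 1, 20) ** 3,
                 np.sqrt(np.linspace(0, 1, 20)),  # "growth"-like shapes
                 np.linspace(0, 1, 20) ** 0.3])
X_aug, y_aug = augment(toy, np.array([0, 0, 1, 1]), n_new=10)
print(X_aug.shape, y_aug)
```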
Previous efforts determined a set of calibrated, optimal model parameter values for Reynolds-averaged Navier–Stokes (RANS) simulations of a compressible jet in crossflow (JIC) using a k–ε turbulence model. These parameters were derived by comparing simulation results to particle image velocimetry (PIV) data of a complementary JIC experiment under a limited set of flow conditions. Here, a k–ε model using both nominal and calibrated parameters is validated against PIV data acquired from a much wider variety of JIC cases, including a realistic flight vehicle. The results from the simulations using the calibrated model parameters showed considerable improvements over those using the nominal values, even for cases that were not used in the calibration procedure that defined the optimal parameters. This improvement is demonstrated using a number of quality metrics that test the spatial alignment of the jet core, the magnitudes of multiple flow variables, and the location and strengths of vortices in the counter-rotating vortex cores on the PIV planes. These results suggest that the calibrated parameters have applicability well outside the specific flow case used in defining them and that with the right model parameters, RANS solutions for the JIC can be improved significantly over those obtained from the nominal model.
We present a simple, near-real-time Bayesian method to infer and forecast a multiwave outbreak, and demonstrate it on the COVID-19 pandemic. The approach uses timely epidemiological data that has been widely available for COVID-19. It provides short-term forecasts of the outbreak's evolution, which can then be used for medical resource planning. The method postulates one- and multiwave infection models, which are convolved with the incubation-period distribution to yield competing disease models. The disease models' parameters are estimated via Markov chain Monte Carlo sampling, and information-theoretic criteria are used to select between them for use in forecasting. The method is demonstrated on two- and three-wave COVID-19 outbreaks in California, New Mexico and Florida, as observed during Summer-Winter 2020. We find that the method is robust to noise, provides useful forecasts (along with uncertainty bounds), and reliably detected when the initial single-wave COVID-19 outbreaks transformed into successive surges as containment efforts in these states failed by the end of Spring 2020.
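The convolution step can be illustrated with a short sketch (the functional forms and parameter values are placeholders, not the paper's): a per-wave infection-rate curve is convolved with a lognormal incubation-period density to give the expected daily count of newly symptomatic cases.

```python
import numpy as np
from scipy.stats import gamma, lognorm

days = np.arange(0, 120)

def wave(t, t0, N, k, theta):
    """One wave: N people infected, with infection times gamma-distributed after day t0."""
    return N * gamma.pdf(t - t0, a=k, scale=theta)

# Two-wave infection rate (illustrative parameter values).
infection_rate = wave(days, t0=0, N=5e4, k=3, theta=6) + \
                 wave(days, t0=60, N=8e4, k=3, theta=8)

# Lognormal incubation-period density (shape/scale are placeholders).
incubation = lognorm.pdf(days, s=0.5, scale=np.exp(1.6))

# Expected daily symptomatic cases = convolution of the two, truncated to the window.
symptomatic = np.convolve(infection_rate, incubation)[:len(days)]
print(symptomatic.round(1)[:10])
```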
Machine-learned models, specifically neural networks, are increasingly used as "closures" or "constitutive models" in engineering simulators to represent fine-scale physical phenomena that are too computationally expensive to resolve explicitly. However, these neural net models of unresolved physical phenomena tend to fail unpredictably and are therefore not used in mission-critical simulations. In this report, we describe new methods to authenticate them, i.e., to determine the (physical) information content of their training datasets, to qualify the scenarios where they may be used, and to verify that the neural net, as trained, adheres to physics theory. We demonstrate these methods with a neural-net closure of turbulent phenomena used in the Reynolds-Averaged Navier–Stokes equations. We show the types of turbulent physics extant in our training datasets, and, using a test flow of an impinging jet, identify the exact locations where the neural network would be extrapolating, i.e., where it would be used outside the feature space where it was trained. Using Generalized Linear Mixed Models, we also generate explanations of the neural net (à la Local Interpretable Model-agnostic Explanations) at prototypes placed in the training data and compare them with approximate analytical models from turbulence theory. Finally, we verify our findings by reproducing them using two different methods.
In this paper we investigate the utility of one-dimensional convolutional neural network (CNN) models in epidemiological forecasting. Deep learning models, in particular variants of recurrent neural networks (RNNs), have been studied for ILI (Influenza-Like Illness) forecasting, and have achieved a higher forecasting skill compared to conventional models such as ARIMA. In this study, we adapt two neural networks that employ one-dimensional temporal convolutional layers as a primary building block—temporal convolutional networks and simple neural attentive meta-learners—for epidemiological forecasting. We then test them with influenza data from the US collected over 2010–2019. We find that epidemiological forecasting with CNNs is feasible, and that their forecasting skill is comparable to, and at times superior to, that of plain RNNs. Thus CNNs and RNNs bring the power of nonlinear transformations to purely data-driven epidemiological models, a capability that heretofore has been limited to more elaborate mechanistic/compartmental disease models.
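A minimal PyTorch sketch of the kind of dilated, causal one-dimensional convolution that forms the building block of a temporal convolutional network; the layer sizes and stack depth are illustrative and this is not the configuration used in the study.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution that only looks at past time steps (left padding)."""
    def __init__(self, c_in, c_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # pad on the left only
        return self.conv(x)

# A small TCN-style stack: exponentially increasing dilations widen the receptive field.
model = nn.Sequential(
    CausalConv1d(1, 16, kernel_size=3, dilation=1), nn.ReLU(),
    CausalConv1d(16, 16, kernel_size=3, dilation=2), nn.ReLU(),
    CausalConv1d(16, 16, kernel_size=3, dilation=4), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=1),             # map features to a one-step-ahead forecast
)
ili = torch.randn(8, 1, 52)                      # batch of 8 one-year weekly ILI series
print(model(ili).shape)                          # -> torch.Size([8, 1, 52])
```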
For digital twins (DTs) to become a central fixture in mission-critical systems, a better understanding is required of potential failure modes, along with quantification of uncertainty and the ability to explain a model's behavior. These aspects are particularly important as the performance of a digital twin will evolve during model development and deployment for real-world operations.
We demonstrate a Bayesian method for the “real-time” characterization and forecasting of a partially observed COVID-19 epidemic. Characterization is the estimation of infection-spread parameters using daily counts of symptomatic patients. The method is designed to help guide medical resource allocation in the early epoch of the outbreak. The estimation problem is posed as one of Bayesian inference and solved using a Markov chain Monte Carlo technique. The data used in this study were sourced before the arrival of the second wave of infection in July 2020. The proposed modeling approach, when applied at the country level, generally provides accurate forecasts at the regional, state, and country levels. The epidemiological model detected the flattening of the curve in California after public health measures were instituted. The method also detected different disease dynamics when applied to specific regions of New Mexico.
This manuscript comprises the final report for the 1-year, FY19 LDRD project "Rigorous Data Fusion for Computationally Expensive Simulations," wherein an alternative approach to Bayesian calibration was developed, based on a new sampling technique called VoroSpokes. VoroSpokes is a novel quadrature and sampling framework, developed within this project, defined with respect to Voronoi tessellations of bounded domains in R^d. In this work, we first establish local quadrature and sampling results on convex polytopes using randomly directed rays, or spokes, to approximate the quantities of interest for a specified target function. A theoretical justification for both procedures is provided, along with empirical results demonstrating the unbiased convergence of the resulting estimates/samples. The local quadrature and sampling procedures are then extended to global procedures defined on more general domains by applying the local results to the cells of a Voronoi tessellation covering the domain in consideration. We then demonstrate how the proposed global sampling procedure can be used to define a natural framework for adaptively constructing Voronoi Piecewise Surrogate (VPS) approximations based on local error estimates. Finally, we show that the adaptive VPS procedure can be used to form a surrogate model approximation to a specified, potentially unnormalized, density function, and that the global sampling procedure can be used to efficiently draw independent samples from the surrogate density in parallel. The performance of the resulting VoroSpokes sampling framework is assessed on a collection of Bayesian inference problems and is shown to provide highly accurate posterior predictions which align with the results obtained using traditional methods such as Gibbs sampling and random-walk Markov Chain Monte Carlo (MCMC). Importantly, the proposed framework provides a foundation for performing Bayesian inference tasks which is entirely independent from the theory of Markov chains.
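The flavor of quadrature with randomly directed spokes can be conveyed with a small two-dimensional sketch. This is a generic polar-coordinate Monte Carlo quadrature over a convex polygon, not the VoroSpokes algorithm itself; the square domain, integrand, and sample counts are illustrative. Writing the integral over a convex cell in polar coordinates about an interior point, each random direction contributes a one-dimensional integral along a spoke.

```python
import numpy as np

def spoke_quadrature(f, center, A, b, n_spokes=2000, n_r=32, rng=np.random.default_rng(1)):
    """Estimate the integral of f over the convex polygon {x : A x <= b} by writing it in
    polar coordinates about an interior point and averaging 1D integrals along random spokes."""
    center = np.asarray(center, float)
    nodes, weights = np.polynomial.legendre.leggauss(n_r)        # Gauss-Legendre on [-1, 1]
    est = 0.0
    for _ in range(n_spokes):
        theta = rng.uniform(0.0, 2.0 * np.pi)
        u = np.array([np.cos(theta), np.sin(theta)])
        a_dot_u = A @ u
        with np.errstate(divide="ignore"):
            dist = np.where(a_dot_u > 0, (b - A @ center) / a_dot_u, np.inf)
        R = dist.min()                                           # distance to the boundary
        r = 0.5 * R * (nodes + 1.0)                              # map nodes to [0, R]
        vals = np.array([f(center + ri * u) * ri for ri in r])   # integrand times Jacobian r
        est += 0.5 * R * np.dot(weights, vals)                   # 1D integral along the spoke
    return 2.0 * np.pi * est / n_spokes

# Unit square [0,1]^2 as half-planes; the integral of f(x,y) = x*y over it is exactly 0.25.
A = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], float)
b = np.array([1, 0, 1, 0], float)
print(spoke_quadrature(lambda x: x[0] * x[1], [0.5, 0.5], A, b))
```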
We propose herein a probabilistic framework for assessing the consistency of an experimental dataset, i.e., whether the stated experimental conditions are consistent with the measurements provided. In case the dataset is inconsistent, our framework allows one to hypothesize and test sources of inconsistencies. This is crucial in model validation efforts. The framework relies on Bayesian inference to estimate experimental settings deemed uncertain from measurements deemed accurate. The quality of the inferred variables is gauged by their ability to reproduce held-out experimental measurements. We test the correctness of the framework on three double-cone experiments conducted in CUBRC Inc.'s LENS-I shock tunnel, which have also been numerically simulated successfully. Thereafter, we use the framework to investigate two double-cone experiments (executed in the LENS-XX shock tunnel) which have encountered difficulties when used in model validation exercises. We detect an inconsistency with one of the LENS-XX experiments. In addition, we hypothesize two causes for our inability to simulate the LENS-XX experiments accurately and test them using our framework. We find that there is no single cause that explains all the discrepancies between model predictions and experimental data; rather, different causes explain different discrepancies, to a larger or smaller extent. We end by proposing that uncertainty quantification methods be used more widely to understand experiments and characterize facilities, and we cite three different methods to do so, the third of which we present in this paper.
In this study we investigate how an ensemble of disease models can be conditioned on observational data, in a bid to improve its predictive skill. We use the ensemble of influenza forecasting models gathered by the US Centers for Disease Control and Prevention (CDC) as the exemplar. This ensemble is used every year to forecast the annual influenza outbreak in the United States. The models constituting this ensemble draw on very different modeling assumptions and approximations and are a diverse collection of methods to approximate epidemiological dynamics. Currently, each model's predictions are accorded the same importance, or weight, when compiling the ensemble's forecast. We consider this equally-weighted ensemble as the baseline case which has to be improved upon. In this study, we explore whether an ensemble forecast can be improved by "conditioning" the ensemble on whatever observational data are available from the ongoing outbreak. "Conditioning" can imply according the ensemble's members different weights which evolve over time, or simply performing the forecast using the top k (equally-weighted) models. In the latter case, the composition of the "top-k" set of models evolves over time. This is called "model averaging" in statistics. We explore four methods to perform model-averaging, three of which are new. We find that the CDC ensemble responds best to the "top-k-models" approach to model-averaging. All the new model-averaging methods perform better than the baseline equally-weighted ensemble. The four model-averaging methods treat the models as black boxes and simply use their forecasts as inputs, i.e., one does not need access to the models at all, but rather only their forecasts. The model-averaging approaches reviewed in this report thus form a general framework for model-averaging any model ensemble.
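A minimal sketch of the "top-k" flavor of model averaging described above (placeholder data; the actual CDC forecast targets and scoring rules are richer): models are ranked by their recent forecast error, and only the best k, equally weighted, contribute to the ensemble forecast.

```python
import numpy as np

def top_k_forecast(past_forecasts, past_truth, new_forecasts, k=3):
    """Rank models by mean absolute error on recent weeks and average the top k.

    past_forecasts : (n_models, n_weeks) forecasts already verified against past_truth
    new_forecasts  : (n_models,) each model's forecast for the upcoming week
    """
    mae = np.abs(past_forecasts - past_truth).mean(axis=1)   # per-model recent skill
    top = np.argsort(mae)[:k]                                # indices of the k best models
    return new_forecasts[top].mean(), top

# Toy example: 5 models, 6 verified weeks of ILI forecasts, plus next week's forecasts.
rng = np.random.default_rng(3)
truth = np.array([1.0, 1.4, 2.0, 2.8, 3.5, 4.0])
forecasts = truth + rng.normal(0, [[0.1], [0.3], [0.5], [0.8], [1.2]], size=(5, 6))
next_week = np.array([4.3, 4.6, 3.9, 5.2, 2.8])
print(top_k_forecast(forecasts, truth, next_week, k=2))
```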
Compressible jet-in-crossflow interactions are difficult to simulate accurately using Reynolds-averaged Navier-Stokes (RANS) models. This could be due to simplifications inherent in RANS or the use of inappropriate RANS constants estimated by fitting to experiments of simple or canonical flows. Our previous work on Bayesian calibration of a k–ϵ model to experimental data had led to a weak hypothesis that inaccurate simulations could be due to inappropriate constants more than model-form inadequacies of RANS. In this work, Bayesian calibration of k–ϵ constants to a set of experiments that span a range of Mach numbers and jet strengths has been performed. The variation of the calibrated constants has been checked to assess the degree to which parametric estimates compensate for RANS's model-form errors. An analytical model of jet-in-crossflow interactions has also been developed, and estimates of k–ϵ constants that are free of any conflation of parametric and RANS's model-form uncertainties have been obtained. It has been found that the analytical k–ϵ constants provide mean-flow predictions that are similar to those provided by the calibrated constants. Further, both of them provide predictions that are far closer to experimental measurements than those computed using "nominal" values of these constants simply obtained from the literature. It can be concluded that the lack of predictive skill of RANS jet-in-crossflow simulations is mostly due to parametric inadequacies, and our analytical estimates may provide a simple way of obtaining predictive compressible jet-in-crossflow simulations.
This study developed and tested biologically inspired computational methods to detect anomalous signals in data streams that could indicate a pending outbreak or bio-weapon attack. Current large-scale biosurveillance systems are plagued by two principal deficiencies: (1) timely detection of disease-indicating signals in noisy data and (2) anomaly detection across multiple channels. Anomaly detectors and data fusion components modeled after human immune system processes were tested against a variety of natural and synthetic surveillance datasets. A pilot-scale immune-system-based biosurveillance system performed at least as well as traditional statistical anomaly detection and data fusion approaches. Machine learning approaches leveraging Deep Learning recurrent neural networks were developed and applied to challenging unstructured and multimodal health surveillance data. Within the limits imposed by data availability, both immune-system and deep learning methods were found to improve anomaly detection and data fusion performance for particularly challenging data subsets. Acknowledgements: The authors acknowledge the close collaboration of Scott Lee, Jason Thomas, and Chad Heilig from the US Centers for Disease Control (CDC) in this effort. De-identified biosurveillance data provided by Ken Jeter of the New Mexico Department of Health proved to be an important contribution to our work. Discussions with members of the International Society of Disease Surveillance helped the researchers focus on questions relevant to practicing public health professionals. Funding for this work was provided by Sandia National Laboratories' Laboratory Directed Research and Development program.
This report pulls together the documentation produced for the IMPACT tool, a software-based decision support tool that provides situational awareness, incident characterization, and guidance on public health and environmental response strategies for an unfolding bio-terrorism incident.
We demonstrate a statistical procedure for learning a high-order eddy viscosity model (EVM) from experimental data and using it to improve the predictive skill of a Reynolds-averaged Navier-Stokes (RANS) simulator. The method is tested in a three-dimensional (3D), transonic jet-in-crossflow (JIC) configuration. The process starts with a cubic eddy viscosity model (CEVM) developed for incompressible flows. It is fitted to limited experimental JIC data using shrinkage regression. The shrinkage process removes all the terms from the model, except an intercept, a linear term, and a quadratic one involving the square of the vorticity. The shrunk eddy viscosity model is implemented in a RANS simulator and calibrated, using vorticity measurements, to infer three parameters. The calibration is Bayesian and is solved using a Markov chain Monte Carlo (MCMC) method. A three-dimensional joint probability density for the inferred parameters is constructed, thus quantifying the uncertainty in the estimates. The high computational cost of using a 3D flow simulator inside an MCMC loop is mitigated by using surrogate models ("curve-fits"). A support vector machine classifier (SVMC) is used to impose our prior belief regarding parameter values, specifically to exclude nonphysical parameter combinations. The calibrated model is compared, in terms of its predictive skill, to simulations using uncalibrated linear and cubic EVMs. We find that the calibrated model, with one quadratic term, is more accurate than the uncalibrated simulator. The model is also checked at a flow condition at which it was not calibrated.
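The shrinkage step can be illustrated with a small sketch. Synthetic features stand in for the cubic-eddy-viscosity basis terms (this is not the actual CEVM tensor basis): an L1-penalized regression zeroes out most candidate terms, leaving a compact model.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n = 300
# Stand-ins for candidate eddy-viscosity basis terms (strain/vorticity invariants, etc.).
X = rng.normal(size=(n, 8))
# Synthetic "truth": only a linear term and one quadratic (vorticity-squared-like) term matter.
y = 1.5 + 2.0 * X[:, 0] + 0.8 * X[:, 3] ** 2 + 0.05 * rng.normal(size=n)

features = np.column_stack([X, X ** 2])       # linear and quadratic candidate terms
model = Lasso(alpha=0.05).fit(features, y)    # L1 penalty shrinks irrelevant terms to zero

kept = np.flatnonzero(np.abs(model.coef_) > 1e-6)
print("intercept:", round(model.intercept_, 2), "surviving terms:", kept)
```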
In this study we developed an efficient Bayesian inversion framework for interpreting marine seismic Amplitude Versus Angle and Controlled-Source Electromagnetic data for marine reservoir characterization. The framework uses a multi-chain Markov chain Monte Carlo sampler, which is a hybrid of the DiffeRential Evolution Adaptive Metropolis and Adaptive Metropolis samplers. The inversion framework is tested by estimating reservoir-fluid saturations and porosity based on marine seismic and Controlled-Source Electromagnetic data. The multi-chain Markov chain Monte Carlo sampler is scalable in terms of the number of chains, and is useful for computationally demanding Bayesian model calibration in scientific and engineering problems. As a demonstration, the approach is used to efficiently and accurately estimate the porosity and saturations in a representative layered synthetic reservoir. The results indicate that the joint inversion of seismic Amplitude Versus Angle and Controlled-Source Electromagnetic data provides better estimation of reservoir saturations than the seismic Amplitude Versus Angle-only inversion, especially for the parameters in deep layers. The performance of the inversion approach for various levels of noise in observational data was evaluated; reasonable estimates can be obtained with noise levels of up to 25%. The sampling efficiency due to the use of multiple chains was also checked and was found to have almost linear scalability.
We present the development of a parallel Markov Chain Monte Carlo (MCMC) method called SAChES, Scalable Adaptive Chain-Ensemble Sampling. This capability is targeted at the Bayesian calibration of computationally expensive simulation models. SAChES involves a hybrid of two methods: Differential Evolution Monte Carlo followed by Adaptive Metropolis. Both methods involve parallel chains. Differential evolution allows one to explore high-dimensional parameter spaces using loosely coupled (i.e., largely asynchronous) chains. Loose coupling allows the use of large chain ensembles, with far more chains than the number of parameters to explore. This reduces the per-chain sampling burden, and enables high-dimensional inversions and the use of computationally expensive forward models. The large number of chains can also ameliorate the impact of silent errors, which may affect only a few chains. The chain ensemble can also be sampled to provide an initial condition when an aberrant chain is re-spawned. Adaptive Metropolis takes the best points from the differential evolution and efficiently homes in on the posterior density. The multitude of chains in SAChES is leveraged to (1) enable efficient exploration of the parameter space; and (2) ensure robustness to silent errors which may be unavoidable in extreme-scale computational platforms of the future. This report outlines SAChES, describes four papers that are the result of the project, and discusses some additional results.
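A bare-bones sketch of the differential-evolution proposal at the heart of the first SAChES stage (ter Braak-style DE-MC applied to a toy Gaussian target; the actual implementation is parallel, asynchronous, and far more elaborate): each chain proposes a jump along the difference of two other randomly chosen chains, followed by a Metropolis accept/reject step.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_chains, n_steps = 2, 20, 2000
log_post = lambda x: -0.5 * np.sum(x ** 2)         # toy target: standard Gaussian

chains = rng.normal(size=(n_chains, d))            # current state of every chain
gamma = 2.38 / np.sqrt(2 * d)                      # standard DE-MC jump scale
samples = []
for _ in range(n_steps):
    for i in range(n_chains):
        a, b = rng.choice([j for j in range(n_chains) if j != i], size=2, replace=False)
        prop = chains[i] + gamma * (chains[a] - chains[b]) + 1e-4 * rng.normal(size=d)
        if np.log(rng.uniform()) < log_post(prop) - log_post(chains[i]):
            chains[i] = prop                        # Metropolis accept/reject
    samples.append(chains.copy())
samples = np.concatenate(samples[n_steps // 2:])    # discard burn-in
print(samples.mean(axis=0), samples.std(axis=0))    # should be near 0 and 1
```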
The k-ε turbulence model has been described as perhaps “the most widely used complete turbulence model.” This family of heuristic Reynolds Averaged Navier-Stokes (RANS) turbulence closures is supported by a suite of model parameters that have been estimated by requiring that the model reproduce well-established canonical flows, e.g., homogeneous shear flow, log-law behavior, etc. While this procedure does yield a set of so-called nominal parameters, it is abundantly clear that they do not provide a universally satisfactory turbulence model that is capable of simulating complex flows. Recent work on the Bayesian calibration of the k-ε model using jet-in-crossflow wind tunnel data has yielded parameter estimates that are far more predictive than nominal parameter values. In this paper, we develop a self-similar asymptotic solution for axisymmetric jet-in-crossflow interactions and derive analytical estimates of the parameters that were inferred using Bayesian calibration. The self-similar method utilizes a near-field approach to estimate the turbulence model parameters while retaining the classical far-field scaling to model flow field quantities. Our parameter values are seen to be far more predictive than the nominal values, as checked using RANS simulations and experimental measurements. They are also closer to the Bayesian estimates than the nominal parameters are. A traditional simplified jet trajectory model is explicitly related to the turbulence model parameters and is shown to yield good agreement with measurements when utilizing the analytically derived turbulence model coefficients. Finally, the close agreement between the turbulence model coefficients obtained via Bayesian calibration and the analytically estimated coefficients derived in this paper is consistent with the contention that the Bayesian calibration approach is firmly rooted in the underlying physical description.
Yadav, Vineet; Michalak, Anna M.; Ray, Jaideep R.; Shiga, Yoichi P.
Independent verification and quantification of fossil fuel (FF) emissions constitutes a considerable scientific challenge. By coupling atmospheric observations of CO2 with models of atmospheric transport, inverse models offer the possibility of overcoming this challenge. However, disaggregating the biospheric and FF flux components of terrestrial fluxes from CO2 concentration measurements has proven to be difficult, due to observational and modeling limitations. In this study, we propose a statistical inverse modeling scheme for disaggregating wintertime fluxes on the basis of their unique error covariances and covariates, where these covariances and covariates are representative of the underlying processes affecting FF and biospheric fluxes. The application of the method is demonstrated with one synthetic and two real-data prototypical inversions using in situ CO2 measurements over North America. Inversions are performed only for the month of January, as the predominance of the biospheric CO2 signal relative to the FF CO2 signal and observational limitations preclude disaggregation of the fluxes in other months. The quality of disaggregation is assessed primarily through examination of the a posteriori covariance between disaggregated FF and biospheric fluxes at regional scales. Findings indicate that the proposed method is able to robustly disaggregate fluxes regionally at monthly temporal resolution, with an a posteriori cross covariance lower than 0.15 µmol m⁻² s⁻¹ between FF and biospheric fluxes. Error covariance models and covariates based on temporally varying FF inventory data provide a more robust disaggregation than static proxies (e.g., nightlight intensity and population density). However, the synthetic data case study shows that disaggregation is possible even in the absence of detailed temporally varying FF inventory data.
Open-source indicators have been proposed as a way of tracking and forecasting disease outbreaks. Some, such as meteorological data, are readily available as reanalysis products. Others, such as those derived from our online behavior (web searches, media articles, etc.), are gathered easily and are more timely than public health reporting. In this study we investigate how these datastreams may be combined to provide useful epidemiological information. The investigation is performed by building data assimilation systems to track influenza in California and dengue in India. The first does not suffer from incomplete data and was chosen to explore disease modeling needs. The second explores the case where observational data are sparse and disease modeling complexities are beside the point. The two test cases thus lie at opposite ends of the disease-tracking spectrum. We find that data assimilation systems that produce disease activity maps can be constructed. Further, being able to combine multiple open-source datastreams is a necessity, as any one individually is not very informative. The data assimilation systems have very little in common except that they contain disease models, calibration algorithms and some ability to impute missing data. Thus, while the data assimilation systems share the goal of accurate forecasting, they are practically designed to compensate for the shortcomings of the datastreams. Thus we expect them to be disease- and location-specific.
Reynolds-averaged Navier–Stokes models are not very accurate for high-Reynolds-number compressible jet-in-crossflow interactions. The inaccuracy arises from the use of inappropriate model parameters and model-form errors in the Reynolds-averaged Navier–Stokes model. In this study, the hypothesis is pursued that Reynolds-averaged Navier–Stokes predictions can be significantly improved by using parameters inferred from experimental measurements of a supersonic jet interacting with a transonic crossflow.
We present a general technique for solving partial differential equations, called robust stencils, which makes the solver tolerant to soft faults, i.e., bit flips arising in memory or CPU calculations. We show how it can be applied to a two-dimensional Lax-Wendroff solver. The resulting 2D robust stencils are derived using an orthogonal application of their 1D counterparts. Combinations of 3 to 5 base stencils can then be created. We describe how these are implemented in a parallel advection solver. Various robust stencil combinations are explored, representing tradeoffs between performance and robustness. The results indicate that the 3-stencil robust combinations are slightly faster on large parallel workloads than Triple Modular Redundancy (TMR). They also have one third of the memory footprint. We expect the improvement to be significant if suitable optimizations are performed. Because faults are avoided each time new points are computed, the proposed stencils are also about as robust to faults as TMR for a large range of error rates. The technique can be generalized to 3D (or higher dimensions) with similar benefits.
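For context, a small sketch of the standard one-dimensional Lax-Wendroff update together with the triple-modular-redundancy baseline it is compared against; the robust-stencil construction itself is not reproduced here, and the grid, CFL number, and initial condition are illustrative.

```python
import numpy as np

def lax_wendroff_step(u, c):
    """One Lax-Wendroff update for linear advection; c = a*dt/dx is the CFL number."""
    up, um = np.roll(u, -1), np.roll(u, 1)          # periodic neighbours u_{i+1}, u_{i-1}
    return u - 0.5 * c * (up - um) + 0.5 * c**2 * (up - 2 * u + um)

def tmr_step(u, c):
    """Triple modular redundancy baseline: compute the step three times and vote (median)."""
    return np.median([lax_wendroff_step(u, c) for _ in range(3)], axis=0)

x = np.linspace(0, 1, 200, endpoint=False)
u = np.exp(-200 * (x - 0.5) ** 2)                   # initial Gaussian pulse
for _ in range(100):
    u = tmr_step(u, c=0.5)
print(u.max())
```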
Implicit numerical integration of nonlinear ODEs requires solving a system of nonlinear algebraic equations at each time step. Each of these systems is often solved by a Newton-like method, which incurs a sequence of linear-system solves. Most model-reduction techniques for nonlinear ODEs exploit knowledge of a system's spatial behavior to reduce the computational complexity of each linear-system solve. However, the number of linear-system solves for the reduced-order simulation often remains roughly the same as that for the full-order simulation. We propose exploiting knowledge of the model's temporal behavior to (1) forecast the unknown variable of the reduced-order system of nonlinear equations at future time steps, and (2) use this forecast as an initial guess for the Newton-like solver during the reduced-order-model simulation. To compute the forecast, we propose using the Gappy POD technique. The goal is to generate an accurate initial guess so that the Newton solver requires many fewer iterations to converge, thereby decreasing the number of linear-system solves in the reduced-order-model simulation.
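A minimal sketch of the forecasting idea, in the spirit of gappy-POD least squares on a temporal basis (toy data, not the paper's setting): the last few computed values of a reduced coordinate determine coefficients of temporal modes, which are then evaluated one step ahead to seed the Newton solver.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 100
t = np.linspace(0, 1, T)

# Offline: temporal POD basis built from training time histories of a reduced coordinate.
training = np.array([np.sin(2 * np.pi * f * t) for f in (1.0, 1.3, 1.7, 2.1)])
_, _, Vt = np.linalg.svd(training, full_matrices=False)
Phi = Vt[:3].T                                   # (T, 3) temporal modes

# Online: suppose steps 0..59 have already been computed for a new trajectory.
truth = np.sin(2 * np.pi * 1.5 * t)
known = np.arange(50, 60)                        # use the last few computed steps ("gappy" data)
coeffs, *_ = np.linalg.lstsq(Phi[known], truth[known], rcond=None)

forecast = Phi[60] @ coeffs                      # initial guess for the Newton solve at step 60
print(round(forecast, 3), round(truth[60], 3))
```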
We demonstrate a Bayesian method that can be used to calibrate computationally expensive 3D RANS (Reynolds-Averaged Navier–Stokes) models with complex response surfaces. Such calibrations, conditioned on experimental data, can yield turbulence model parameters as probability density functions (PDF), concisely capturing the uncertainty in the parameter estimates. Methods such as Markov chain Monte Carlo (MCMC) estimate the PDF by sampling, with each sample requiring a run of the RANS model. Consequently, a quick-running surrogate is used instead of the RANS simulator. The surrogate can be very difficult to design if the model's response, i.e., the dependence of the calibration variable (the observable) on the parameter being estimated, is complex. We show how the training data used to construct the surrogate can be employed to isolate a promising and physically realistic part of the parameter space, within which the response is well-behaved and easily modeled. We design a classifier, based on treed linear models, to model the "well-behaved" region. This classifier serves as a prior in a Bayesian calibration study aimed at estimating three k–ε parameters (Cμ, Cε2, Cε1) from experimental data of a transonic jet-in-crossflow interaction. The robustness of the calibration is investigated by checking its predictions of variables not included in the calibration data. We also check the limit of applicability of the calibration by testing at off-calibration flow regimes. We find that calibration yields turbulence model parameters which predict the flowfield far better than when the nominal values of the parameters are used. Substantial improvements are still obtained when we use the calibrated RANS model to predict jet-in-crossflow at Mach numbers and jet strengths quite different from those used to generate the experimental (calibration) data. Thus the primary reason for the poor predictive skill of RANS, when using nominal values of the turbulence model parameters, was parametric uncertainty, which was rectified by calibration. Post-calibration, the dominant contribution to model inaccuracies is due to the structural errors in RANS.
Model reduction for dynamical systems is a promising approach for reducing the computational cost of large-scale physics-based simulations to enable high-fidelity models to be used in many-query (e.g., Bayesian inference) and near-real-time (e.g., fast-turnaround simulation) contexts. While model reduction works well for specialized problems such as linear time-invariant systems, it is much more difficult to obtain accurate, stable, and efficient reduced-order models (ROMs) for systems with general nonlinearities. This report describes several advances that enable nonlinear reduced-order models (ROMs) to be deployed in a variety of time-critical settings. First, we present an error bound for the Gauss-Newton with Approximated Tensors (GNAT) nonlinear model reduction technique. This bound allows the state-space error for the GNAT method to be quantified when applied with the backward Euler time-integration scheme. Second, we present a methodology for preserving classical Lagrangian structure in nonlinear model reduction. This technique guarantees that important properties, such as energy conservation and symplectic time-evolution maps, are preserved when performing model reduction for models described by a Lagrangian formalism (e.g., molecular dynamics, structural dynamics). Third, we present a novel technique for decreasing the temporal complexity (defined as the number of Newton-like iterations performed over the course of the simulation) by exploiting time-domain data. Fourth, we describe a novel method for refining projection-based reduced-order models a posteriori using a goal-oriented framework similar to mesh-adaptive h-refinement in finite elements. The technique allows the ROM to generate arbitrarily accurate solutions, thereby providing the ROM with a 'failsafe' mechanism in the event of insufficient training data. Finally, we present the reduced-order model error surrogate (ROMES) method for statistically quantifying reduced-order-model errors. This enables ROMs to be rigorously incorporated in uncertainty-quantification settings, as the error model can be treated as a source of epistemic uncertainty. This work was completed as part of a Truman Fellowship appointment. We note that much additional work was performed as part of the Fellowship. One salient project is the development of the Trilinos-based model-reduction software module Razor, which is currently bundled with the Albany PDE code and allows nonlinear reduced-order models to be constructed for any application supported in Albany. Other important projects include the following: 1. ROMES-equipped ROMs for Bayesian inference: K. Carlberg, M. Drohmann, F. Lu (Lawrence Berkeley National Laboratory), M. Morzfeld (Lawrence Berkeley National Laboratory). 2. ROM-enabled Krylov-subspace recycling: K. Carlberg, V. Forstall (University of Maryland), P. Tsuji, R. Tuminaro. 3. A pseudo balanced POD method using only dual snapshots: K. Carlberg, M. Sarovar. 4. An analysis of discrete vs. continuous optimality in nonlinear model reduction: K. Carlberg, M. Barone, H. Antil (George Mason University). Journal articles for these projects are in progress at the time of this writing.
We present results from the Bayesian calibration of hydrological parameters of the Community Land Model (CLM), which is often used in climate simulations and Earth system models. A statistical inverse problem is formulated for three hydrological parameters, conditional on observations of latent heat surface fluxes over 48 months. Our calibration method uses polynomial and Gaussian process surrogates of the CLM, and solves the parameter estimation problem using a Markov chain Monte Carlo sampler. Posterior probability densities for the parameters are developed for two sites with different soil and vegetation covers. Our method also allows us to examine the structural error in CLM under two error models. We find that surrogate models can be created for CLM in most cases. The posterior distributions are more predictive than the default parameter values in CLM. Climatologically averaging the observations does not modify the parameters' distributions significantly. The structural error model reveals a correlation time-scale which can be used to identify the physical process that could be contributing to it. While the calibrated CLM has a higher predictive skill, the calibration is under-dispersive.
We develop a novel calibration approach to address the problem of predictive k–ε RANS simulations of jet-in-crossflow. Our approach is based on the hypothesis that predictive k–ε parameters can be obtained by estimating them from a strongly vortical flow, specifically, flow over a square cylinder. In this study, we estimate three k–ε parameters, Cμ, Cε2 and Cε1, by fitting 2D RANS simulations to experimental data. We use polynomial surrogates of 2D RANS for this purpose. We conduct an ensemble of 2D RANS runs using samples of (Cμ, Cε2, Cε1) and regress Reynolds stresses to the samples using a simple polynomial. We then use this surrogate of the 2D RANS model to infer a joint distribution for the k–ε parameters by solving a Bayesian inverse problem, conditioned on the experimental data. The calibrated (Cμ, Cε2, Cε1) distribution is used to seed an ensemble of 3D jet-in-crossflow simulations. We compare the ensemble's predictions of the flowfield, at two planes, to PIV measurements and estimate the predictive skill of the calibrated 3D RANS model. We also compare it against 3D RANS predictions using the nominal (uncalibrated) values of (Cμ, Cε2, Cε1), and find that calibration delivers a significant improvement to the predictive skill of the 3D RANS model. We repeat the calibration using surrogate models based on kriging and find that the calibration, based on these more accurate models, is not much better than that obtained with simple polynomial surrogates. We discuss the reasons for this rather surprising outcome.
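A sketch of the surrogate-building step described above, with random placeholder data standing in for the 2D RANS ensemble: responses at sampled (Cμ, Cε2, Cε1) values are regressed on a simple polynomial in the parameters, and the resulting surrogate is cheap enough to call inside a Bayesian sampler.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
# Samples of (C_mu, C_eps2, C_eps1) and a stand-in scalar response (e.g., a Reynolds stress
# at a probe location); in practice the responses come from an ensemble of 2D RANS runs.
params = rng.uniform([0.06, 1.7, 1.2], [0.12, 2.1, 1.6], size=(60, 3))
response = 0.3 * params[:, 0] - 0.1 * params[:, 1] ** 2 + 0.05 * params[:, 2] \
           + 0.01 * rng.normal(size=60)

poly = PolynomialFeatures(degree=2, include_bias=False)
surrogate = LinearRegression().fit(poly.fit_transform(params), response)

# The surrogate replaces the RANS model when evaluating candidate parameters, e.g., in MCMC.
candidate = np.array([[0.09, 1.92, 1.44]])
print(surrogate.predict(poly.transform(candidate)))
```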
The estimation of fossil-fuel CO2 emissions (ffCO2) from limited ground-based and satellite measurements of CO2 concentrations will form a key component of the monitoring of treaties aimed at the abatement of greenhouse gas emissions. The limited nature of the measured data leads to a severely underdetermined estimation problem. If the estimation is performed at fine spatial resolutions, it can also be computationally expensive. In order to enable such estimations, advances are needed in the spatial representation of ffCO2 emissions, scalable inversion algorithms and the identification of observables to measure. To that end, we investigate parsimonious spatial parameterizations of ffCO2 emissions which can be used in atmospheric inversions. We devise and test three random field models, based on wavelets, Gaussian kernels and covariance structures derived from easily observed proxies of human activity. In doing so, we constructed a novel inversion algorithm, based on compressive sensing and sparse reconstruction, to perform the estimation. We also address scalable ensemble Kalman filters as an inversion mechanism and quantify the impact of the Gaussian assumptions inherent in them. We find that the assumption does not impact the estimates of mean ffCO2 source strengths appreciably, but a comparison with Markov chain Monte Carlo estimates shows significant differences in the variance of the source strengths. Finally, we study whether the very different spatial natures of biogenic and ffCO2 emissions can be used to estimate them, in a disaggregated fashion, solely from CO2 concentration measurements, without extra information from products of incomplete combustion, e.g., CO. We find that this is possible during the winter months, though the errors can be as large as 50%.
The estimation of fossil-fuel CO2 emissions (ffCO2) from limited ground-based and satellite measurements of CO2 concentrations will form a key component of the monitoring of treaties aimed at the abatement of greenhouse gas emissions. To that end, we construct a multiresolution spatial parametrization for fossil-fuel CO2 emissions (ffCO2), to be used in atmospheric inversions. Such a parametrization does not currently exist. The parametrization uses wavelets to accurately capture the multiscale, nonstationary nature of ffCO2 emissions and employs proxies of human habitation, e.g., images of lights at night and maps of built-up areas, to reduce the dimensionality of the multiresolution parametrization. The parametrization is used in a synthetic data inversion to test its suitability for use in atmospheric inverse problems. This linear inverse problem is predicated on observations of ffCO2 concentrations collected at measurement towers. We adapt a convex optimization technique, commonly used in the reconstruction of compressively sensed images, to perform sparse reconstruction of the time-variant ffCO2 emission field. We also borrow concepts from compressive sensing to impose boundary conditions, i.e., to limit ffCO2 emissions within an irregularly shaped region (the United States, in our case). We find that the optimization algorithm performs a data-driven sparsification of the spatial parametrization and retains only those wavelets whose weights could be estimated from the observations. Further, our method for the imposition of boundary conditions leads to a roughly tenfold computational saving over conventional means of doing so. We conclude with a discussion of the accuracy of the estimated emissions and the suitability of the spatial parametrization for use in inverse problems with a significant degree of regularization.
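A compact sketch of sparse reconstruction with an L1 penalty (plain ISTA on a synthetic linear system; the actual operator couples wavelet coefficients to tower observations through an atmospheric transport model): soft thresholding drives most coefficients to zero, keeping only those the data can constrain.

```python
import numpy as np

def ista(A, y, lam, n_iter=500):
    """Iterative soft-thresholding for  min_x 0.5*||A x - y||^2 + lam*||x||_1 ."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x

rng = np.random.default_rng(7)
n_obs, n_coef = 80, 400                      # far fewer observations than unknowns
A = rng.normal(size=(n_obs, n_coef))         # stand-in for a transport-times-wavelet operator
x_true = np.zeros(n_coef)
x_true[rng.choice(n_coef, 10, replace=False)] = rng.normal(0, 3, 10)
y = A @ x_true + 0.01 * rng.normal(size=n_obs)

x_hat = ista(A, y, lam=0.5)
print("nonzeros recovered:", np.count_nonzero(np.abs(x_hat) > 1e-3))
```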
We construct and verify a statistical method to nowcast influenza activity from a time-series of the frequency of reports concerning influenza-related topics. Such reports are published electronically by both public health organizations and newspapers/media sources, and thus can be harvested easily via web crawlers. Since media reports are timely, whereas reports from public health organizations are delayed by at least two weeks, using timely, open-source data to compensate for the lag in "official" reports can be useful. We use morbidity data from networks of sentinel physicians (both the Centers for Disease Control's ILINet and France's Sentinelles network) as the gold standard of influenza-like illness (ILI) activity. The time-series of media reports is obtained from HealthMap (http://healthmap.org). We find that the time-series of media reports shows some correlation (~0.5) with ILI activity; further, this can be leveraged into an autoregressive moving average model with exogenous inputs (ARMAX model) to nowcast ILI activity. We find that the ARMAX models have more predictive skill compared to autoregressive (AR) models fitted to ILI data, i.e., it is possible to exploit the information content in the open-source data. We also find that when the open-source data are non-informative, the ARMAX models reproduce the performance of AR models. The statistical models are tested on data from the 2009 swine-flu outbreak as well as the mild 2011-2012 influenza season in the U.S.A.
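A minimal sketch of nowcasting with an ARMAX-type model, using statsmodels' SARIMAX with an exogenous regressor; the synthetic series stand in for ILINet and HealthMap counts, and the model order is illustrative.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(8)
n = 150
media = np.abs(np.sin(np.linspace(0, 6 * np.pi, n))) * 10 + rng.normal(0, 0.5, n)
ili = 0.8 * media + rng.normal(0, 0.5, n)          # toy ILI series correlated with media counts

# Fit an AR(2), MA(1) model with the media report count as an exogenous input.
model = SARIMAX(ili[:-1], exog=media[:-1], order=(2, 0, 1))
result = model.fit(disp=False)

# Nowcast the latest week from the timely media signal, before official ILI data arrive.
nowcast = result.forecast(steps=1, exog=media[-1:].reshape(-1, 1))
print(float(nowcast[0]), float(ili[-1]))
```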
In this report, we proposed, examined and implemented approaches for performing efficient uncertainty quantification (UQ) in climate land models. Specifically, we applied a Bayesian compressive sensing framework to polynomial chaos spectral expansions, enhanced it with an iterative algorithm of basis reduction, and investigated the results on test models as well as on the Community Land Model (CLM). Furthermore, we discussed the construction of efficient quadrature rules for forward propagation of uncertainties from high-dimensional, constrained input spaces to output quantities of interest. The work lays the groundwork for efficient forward UQ for high-dimensional, strongly nonlinear and computationally costly climate models. Moreover, to investigate parameter inference approaches, we applied two variants of the Markov chain Monte Carlo (MCMC) method to a soil moisture dynamics submodel of the CLM. The evaluation of these algorithms gave us a good foundation for further building out the Bayesian calibration framework towards the goal of robust component-wise calibration.
We investigate Bayesian techniques that can be used to reconstruct field variables from partial observations. In particular, we target fields that exhibit spatial structures with a large spectrum of lengthscales. Contemporary methods typically describe the field on a grid and estimate structures which can be resolved by it. In contrast, we address the reconstruction of grid-resolved structures as well as estimation of statistical summaries of subgrid structures, which are smaller than the grid resolution. We perform this in two different ways (a) via a physical (phenomenological), parameterized subgrid model that summarizes the impact of the unresolved scales at the coarse level and (b) via multiscale finite elements, where specially designed prolongation and restriction operators establish the interscale link between the same problem defined on a coarse and fine mesh. The estimation problem is posed as a Bayesian inverse problem. Dimensionality reduction is performed by projecting the field to be inferred on a suitable orthogonal basis set, viz. the Karhunen-Loeve expansion of a multiGaussian. We first demonstrate our techniques on the reconstruction of a binary medium consisting of a matrix with embedded inclusions, which are too small to be grid-resolved. The reconstruction is performed using an adaptive Markov chain Monte Carlo method. We find that the posterior distributions of the inferred parameters are approximately Gaussian. We exploit this finding to reconstruct a permeability field with long, but narrow embedded fractures (which are too fine to be grid-resolved) using scalable ensemble Kalman filters; this also allows us to address larger grids. Ensemble Kalman filtering is then used to estimate the values of hydraulic conductivity and specific yield in a model of the High Plains Aquifer in Kansas. Strong conditioning of the spatial structure of the parameters and the non-linear aspects of the water table aquifer create difficulty for the ensemble Kalman filter. We conclude with a demonstration of the use of multiscale stochastic finite elements to reconstruct permeability fields. This method, though computationally intensive, is general and can be used for multiscale inference in cases where a subgrid model cannot be constructed.
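A small sketch of the dimensionality-reduction device mentioned above: a truncated Karhunen-Loeve expansion of a Gaussian field on a 1D grid, built from the eigendecomposition of a squared-exponential covariance. The grid size, correlation length, and truncation level are illustrative.

```python
import numpy as np

n, corr_len, n_modes = 200, 0.1, 15
x = np.linspace(0, 1, n)

# Squared-exponential covariance and its spectral decomposition.
C = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / corr_len ** 2)
eigvals, eigvecs = np.linalg.eigh(C)
idx = np.argsort(eigvals)[::-1][:n_modes]             # keep the dominant modes
lam, phi = eigvals[idx], eigvecs[:, idx]

# Truncated KL expansion: the field is parameterized by n_modes weights instead of n values.
def kl_field(weights):
    return phi @ (np.sqrt(lam) * weights)

rng = np.random.default_rng(9)
sample = kl_field(rng.normal(size=n_modes))           # one realization of the multiGaussian field
print(sample.shape, round(float(sample.std()), 2))
```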
In this report we describe how we create a model for influenza epidemics from historical data collected from both civilian and military societies. We derive the model when the population of the society is unknown but the size of the epidemic is known. Our interest lies in estimating a time-dependent infection rate to within a multiplicative constant. The model form fitted is chosen for its similarity to published models for HIV and plague, enabling application of Bayesian techniques to discriminate among infectious agents during an emerging epidemic. We have developed models for the progression of influenza in human populations. The model is framed as an integral, and predicts the number of people who exhibit symptoms and seek care over a given time-period. The start and end of the time period form the limits of integration. The disease progression model, in turn, contains parameterized models for the incubation period and a time-dependent infection rate. The incubation period model is obtained from the literature, and the parameters of the infection rate are fitted from historical data including both military and civilian populations. The calibrated infection rate models display a marked difference between the 1918 Spanish Influenza pandemic and both the influenza seasons in the US between 2001-2008 and the progression of H1N1 in Catalunya, Spain. The data for the 1918 pandemic were obtained from military populations, while the rest are country-wide or province-wide data from the twenty-first century. We see that the initial growth of infection was about the same in all cases; however, military populations were able to control the epidemic much faster, i.e., the decay of the infection-rate curve is much steeper. It is not clear whether this was because of the much higher level of organization present in a military society or the seriousness with which the 1918 pandemic was addressed. Each outbreak to which the influenza model was fitted yields a separate set of parameter values. We suggest 'consensus' parameter values for military and civilian populations in the form of normal distributions so that they may be further used in other applications. Representing the parameter values as distributions, instead of point values, allows us to capture the uncertainty and scatter in the parameters. Quantifying the uncertainty allows us to use these models further in inverse problems, predictions under uncertainty and various other studies involving risk.
We present a statistical method, predicated on the use of surrogate models, for the 'real-time' characterization of partially observed epidemics. Observations consist of counts of symptomatic patients, diagnosed with the disease, that may be available in the early epoch of an ongoing outbreak. Characterization, in this context, refers to estimation of epidemiological parameters that can be used to provide short-term forecasts of the ongoing epidemic, as well as to provide gross information on the dynamics of the etiologic agent in the affected population, e.g., the time-dependent infection rate. The characterization problem is formulated as a Bayesian inverse problem, and epidemiological parameters are estimated as distributions using a Markov chain Monte Carlo (MCMC) method, thus quantifying the uncertainty in the estimates. In some cases, the inverse problem can be computationally expensive, primarily due to the epidemic simulator used inside the inversion algorithm. We present a method, based on replacing the epidemiological model with computationally inexpensive surrogates, that can reduce the computational time to minutes, without a significant loss of accuracy. The surrogates are created by projecting the output of an epidemiological model on a set of polynomial chaos bases; thereafter, computations involving the surrogate model reduce to evaluations of a polynomial. We find that the epidemic characterizations obtained with the surrogate models are very close to those obtained with the original model. We also find that the number of projections required to construct a surrogate model is O(10)–O(10²) less than the number of samples required by the MCMC to construct a stationary posterior distribution; thus, depending upon the epidemiological models in question, it may be possible to omit the offline creation and caching of surrogate models, prior to their use in an inverse problem. The technique is demonstrated on synthetic data as well as observations from the 1918 influenza pandemic collected at Camp Custer, Michigan.
We present results from a recently developed multiscale inversion technique for binary media, with emphasis on the effect of subgrid model errors on the inversion. Binary media are a useful fine-scale representation of heterogeneous porous media. Averaged properties of the binary field representations can be used to characterize flow through the porous medium at the macroscale. Both direct measurements of the averaged properties and upscaling are complicated and may not provide accurate results. However, it may be possible to infer upscaled properties of the binary medium from indirect measurements at the coarse scale. Multiscale inversion, performed with a subgrid model to connect disparate scales together, can also yield information on the fine-scale properties. We model the binary medium using truncated Gaussian fields, and develop a subgrid model for the upscaled permeability based on excursion sets of those fields. The subgrid model requires an estimate of the proportion of inclusions at the block scale as well as some geometrical parameters of the inclusions as inputs, and predicts the effective permeability. The inclusion proportion is assumed to be spatially varying, modeled using Gaussian processes and represented using a truncated Karhunen-Loève (KL) expansion. This expansion is used, along with the subgrid model, to pose a Bayesian inverse problem for the KL weights and the geometrical parameters of the inclusions. The model error is represented in two different ways: (1) as a homoscedastic error and (2) as a heteroscedastic error, dependent on the inclusion proportion and geometry. The error models impact the form of the likelihood function in the expression for the posterior density of the objects of inference. The problem is solved using an adaptive Markov chain Monte Carlo method, and joint posterior distributions are developed for the KL weights and inclusion geometry. Effective permeabilities and tracer breakthrough times at a few 'sensor' locations (obtained by simulating a pump test) form the observables used in the inversion. The inferred quantities can be used to generate an ensemble of permeability fields, both upscaled and fine-scale, which are consistent with the observations. We compare the inferences developed using the two error models, in terms of the KL weights and the fine-scale realizations that could be supported by the coarse-scale inferences. Permeability differences are observed mainly in regions where the inclusion proportion is near the percolation threshold and the subgrid model incurs its largest approximation error. These differences are also reflected in the tracer breakthrough times and the geometry of flow streamlines, as obtained from a permeameter simulation. The uncertainty due to subgrid model error is also compared to the uncertainty in the inversion due to incomplete data.
Multi-scale binary permeability field estimation from static and dynamic data is carried out using Markov chain Monte Carlo (MCMC) sampling. The binary permeability field is defined as high-permeability inclusions within a lower-permeability matrix. Static data are obtained as measurements of permeability with support consistent with the coarse-scale discretization. Dynamic data are advective travel times along streamlines calculated through a fine-scale field and averaged for each observation point at the coarse scale. Parameters estimated at the coarse scale (30 x 20 grid) are the spatially varying proportion of the high-permeability phase and the inclusion length and aspect ratio of the high-permeability inclusions. From the non-parametric posterior distributions estimated for these parameters, a recently developed sub-grid algorithm is employed to create an ensemble of realizations representing the fine-scale (3000 x 2000), binary permeability field. Each fine-scale ensemble member is instantiated by convolution of an uncorrelated multiGaussian random field with a Gaussian kernel defined by the estimated inclusion length and aspect ratio. Since the multiGaussian random field is itself a realization of a stochastic process, the procedure for generating fine-scale binary permeability field realizations is also stochastic. Two different methods are hypothesized to perform posterior predictive tests, and different mechanisms for combining multiGaussian random fields with kernels defined from the MCMC sampling are examined. Posterior predictive accuracy of the estimated parameters is assessed against a simulated ground truth for predictions at both the coarse scale (effective permeabilities) and the fine scale (advective travel time distributions). The two techniques for conducting posterior predictive tests are compared by their ability to recover the static and dynamic data. The skill of the inference and the method for generating fine-scale binary permeability fields are evaluated through flow calculations on the resulting fields, using fine-scale realizations and comparing them against results obtained with the ground-truth fine-scale and coarse-scale permeability fields.
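A small sketch of the fine-scale instantiation step described above: an uncorrelated Gaussian field is convolved with an anisotropic Gaussian kernel and thresholded at a quantile chosen to match the target inclusion proportion. All grid sizes, length scales, and permeability values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(10)
ny, nx = 300, 200                      # fine-scale grid (the report uses 3000 x 2000)
proportion = 0.25                      # target fraction of high-permeability inclusions
length, aspect = 12.0, 3.0             # inclusion length scale and aspect ratio (illustrative)

white = rng.normal(size=(ny, nx))                                  # uncorrelated multiGaussian field
smooth = gaussian_filter(white, sigma=(length, length / aspect))   # anisotropic Gaussian kernel
threshold = np.quantile(smooth, 1.0 - proportion)                  # hit the target inclusion proportion
binary = (smooth > threshold).astype(int)                          # 1 = inclusion, 0 = matrix

k_high, k_low = 1e-12, 1e-15                                       # illustrative permeabilities (m^2)
perm = np.where(binary == 1, k_high, k_low)
print(binary.mean(), perm.shape)
```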
These techniques appear promising for constructing an integrated, automated detect-and-characterize capability for epidemics. Working off biosurveillance data, the approach provides information on the particular, ongoing outbreak. Potential uses include crisis management, planning, and resource allocation; the parameter estimation capability is well suited to providing the input parameters for an agent-based model, such as the number of index cases, the time of infection, and the infection rate. Non-communicable diseases are easier to characterize than communicable ones: a small anthrax attack can be characterized well with 7-10 days of post-detection data, plague takes longer, and large attacks are the easiest to characterize.
Results show that a time-series-based classification may be possible. For the test cases considered, the correct model can be selected and the number of index cases captured within ±σ with 5-10 days of data. The low signal-to-noise ratio makes classification difficult for small epidemics. The problem statement is: (1) create Bayesian techniques to classify and characterize epidemics from a time-series of ICD-9 codes (we call this time-series a 'morbidity stream'); and (2) assume the morbidity stream has already set off an alarm (through a Kalman-filter anomaly detector). Starting with a set of putative diseases, we identify which disease or set of diseases fits the data best and infer associated information about it, i.e., the number of index cases, the start time of the epidemic, the spread rate, etc.
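As a rough illustration of this kind of model selection (not the report's actual formulation), the Python sketch below scores a set of putative disease models against a synthetic morbidity stream by grid-marginalizing a Poisson observation likelihood over the number of index cases and the epidemic start time. The log-normal incubation parameters, flat priors, and case counts are all assumptions.

# Hedged sketch: Bayesian classification of an outbreak from daily case counts.
import numpy as np
from scipy.stats import lognorm, poisson

def expected_cases(days, n_index, t0, shape, scale):
    """Mean daily symptomatic cases for n_index infections occurring at day t0."""
    t = np.clip(days - t0, 0, None)
    daily = lognorm.cdf(t + 1, shape, scale=scale) - lognorm.cdf(t, shape, scale=scale)
    return n_index * daily + 1e-9                       # avoid log(0) in the likelihood

def log_evidence(counts, shape, scale):
    """Log evidence, grid-marginalized over index cases and start time (flat priors)."""
    days = np.arange(len(counts))
    grid = [(n, t0) for n in range(1, 200) for t0 in range(-10, 1)]
    logls = np.array([poisson.logpmf(counts, expected_cases(days, n, t0, shape, scale)).sum()
                      for n, t0 in grid])
    return np.logaddexp.reduce(logls) - np.log(len(grid))

counts = np.array([0, 1, 0, 3, 5, 9, 14, 22, 30, 41])           # synthetic morbidity stream
models = {"anthrax": (0.8, 4.0), "plague": (0.5, 3.0)}           # assumed incubation parameters
scores = {name: log_evidence(counts, *params) for name, params in models.items()}
print(max(scores, key=scores.get), scores)                       # best-fitting putative disease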
We present a Bayesian approach for estimating transmission chains and rates in the Abakaliki smallpox epidemic of 1967. The epidemic affected 30 individuals in a community of 74; only the dates of appearance of symptoms were recorded. Our model assumes stochastic transmission of the infections over a social network. Distinct binomial random graphs model intra- and inter-compound social connections, while disease transmission over each link is treated as a Poisson process. Link probabilities and rate parameters are objects of inference. Dates of infection and recovery comprise the remaining unknowns. Distributions for smallpox incubation and recovery periods are obtained from historical data. Using Markov chain Monte Carlo, we explore the joint posterior distribution of the scalar parameters and provide an expected connectivity pattern for the social graph and infection pathway.
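The generative model lends itself to a compact forward simulation. The Python sketch below is a hedged illustration rather than the paper's code: the number of compounds, link probabilities, transmission rates, and incubation distribution are placeholder assumptions, whereas in the actual inference these quantities are the unknowns explored by Markov chain Monte Carlo.

# Forward-model sketch: two-level binomial random graph over compounds,
# Poisson-process transmission along each contact link, sampled incubation
# periods producing dates of appearance of symptoms.
import numpy as np

rng = np.random.default_rng(1)
n_people, n_compounds = 74, 9                  # compound count is an assumption
compound = rng.integers(0, n_compounds, n_people)
p_intra, p_inter = 0.6, 0.05                   # link probabilities (objects of inference)
beta_intra, beta_inter = 0.08, 0.01            # transmission rates per link per day (assumed)

# Binomial random graph: independent Bernoulli draw for each pair of individuals
same = compound[:, None] == compound[None, :]
adj = rng.random((n_people, n_people)) < np.where(same, p_intra, p_inter)
adj = np.triu(adj, 1); adj = adj | adj.T       # symmetric, no self-links

# Poisson-process transmission: one exponential clock per active link
rate = np.where(same, beta_intra, beta_inter) * adj
wait = rng.exponential(1.0 / np.maximum(rate, 1e-12))   # per-link transmission delay (days)
t_infect = np.full(n_people, np.inf); t_infect[0] = 0.0  # index case
for _ in range(n_people):                                # shortest-path-style relaxation
    t_infect = np.minimum(t_infect, np.min(t_infect[:, None] + wait, axis=0))

t_symptoms = t_infect + rng.gamma(14.0, 1.0, n_people)   # assumed incubation distribution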
This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'entity-based simulation' to describe a class of simulation which includes both discrete-event simulation and agent-based simulation. What simulations of this class share, and what distinguishes them from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistics, economic, and social simulations are members of this class, where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that would greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principle makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language, which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale required for realistic results. With the recent upheavals in the financial markets and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.
Structured adaptive mesh refinement (SAMR) methods are widely used for computer simulations of various physical phenomena. Parallel implementations potentially offer realistic simulations of complex three-dimensional applications, but achieving good scalability for large-scale applications is non-trivial. Performance is limited by the partitioner's ability to use the underlying parallel computer's resources efficiently. Designed on sound SAMR principles, Nature+Fable is a hybrid, dedicated SAMR partitioning tool that brings together the advantages of both domain-based and patch-based techniques while avoiding their drawbacks. However, the original bi-level partitioning approach in Nature+Fable is insufficient: for realistic applications it regards frequently occurring bi-levels as 'impossible' and fails. This document describes an improved bi-level partitioning algorithm that successfully copes with all possible bi-levels. The improved algorithm uses the original approach side-by-side with a new, complementary approach and, by means of a new, customized classification method, switches automatically between the two. This document describes the algorithms, discusses implementation issues, and presents experimental results. The improved version of Nature+Fable was found to handle realistic applications and to generate smaller imbalances and a similar box count, but more communication, compared to the native, domain-based partitioner in the SAMR framework AMROC.
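The classify-then-dispatch structure can be pictured schematically. The Python sketch below is purely illustrative: the box representation, the coverage-based classification criterion, its threshold, and both partitioning routines are stand-ins rather than Nature+Fable's actual algorithms, and are shown only to convey how an automatic switch between two complementary approaches might be organized.

# Schematic sketch under stated assumptions: classify a bi-level, then hand it to
# one of two partitioning approaches.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]                # (x0, y0, x1, y1) in coarse-level index space

@dataclass
class BiLevel:
    coarse: List[Box]                          # boxes on the coarse level
    fine: List[Box]                            # refined boxes, projected to coarse indices

def covered_fraction(bl: BiLevel) -> float:
    """Fraction of the coarse boxes' area overlain by refined boxes."""
    def area(b: Box) -> int:
        return (b[2] - b[0]) * (b[3] - b[1])
    return sum(map(area, bl.fine)) / max(sum(map(area, bl.coarse)), 1)

def original_approach(bl: BiLevel, n_procs: int):
    # stand-in: round-robin assignment of coarse boxes to processors
    return [(i % n_procs, box) for i, box in enumerate(bl.coarse)]

def complementary_approach(bl: BiLevel, n_procs: int):
    # stand-in: round-robin assignment of refined boxes to processors
    return [(i % n_procs, box) for i, box in enumerate(bl.fine)]

def partition_bilevel(bl: BiLevel, n_procs: int):
    # customized classification (illustrative threshold): densely refined bi-levels
    # go to one approach, sparsely refined ones to the other
    if covered_fraction(bl) > 0.5:
        return original_approach(bl, n_procs)
    return complementary_approach(bl, n_procs)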