The capability to identify emerging technologies from easily accessed open-source indicators, such as publications, is important for decision-makers in industry and government. The scientific contribution of this work is a machine learning approach to detecting the maturity of emerging technologies from publication counts. Time series of publication counts have universal features that distinguish emerging and growing technologies. We train an artificial neural network classifier, a supervised machine learning algorithm, on these features to predict the maturity (emergent vs. growth) of an arbitrary technology. With a training set comprising 22 technologies, we obtain a classification accuracy ranging from 58.3% to 100%, with an average accuracy of 84.6%, for six test technologies. To enhance classifier performance, we augmented the training corpus with synthetic time-series technology life-cycle curves, formed by calculating weighted averages of curves in the original training set. Training the classifier on the synthetic data set improved accuracy, which ranged from 83.3% to 100% with an average of 90.4% for the test technologies. The performance of our classifier exceeds that of competing machine learning approaches in the literature, which report average classification accuracies of at most 85.7%. Moreover, in contrast to current methods, our approach does not require subject matter expertise to generate training labels, and it can be automated and scaled.
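The augmentation step can be illustrated with a minimal sketch (not the authors' code; the Dirichlet weighting and the restriction to same-label curves are assumptions made here for illustration):

```python
import numpy as np

def augment_with_synthetic_curves(curves, n_synthetic, rng=None):
    """Generate synthetic technology life-cycle curves as random convex
    combinations (weighted averages) of curves in the original training set.

    curves: array of shape (n_curves, n_timesteps), all sharing one label
    (e.g., emergent or growth), since averaging across labels would blur
    the class boundary.
    """
    rng = np.random.default_rng(rng)
    curves = np.asarray(curves, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        # A Dirichlet draw gives non-negative weights that sum to one.
        w = rng.dirichlet(np.ones(len(curves)))
        synthetic.append(w @ curves)
    return np.vstack(synthetic)
```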
Subsurface energy activities such as unconventional resource recovery, enhanced geothermal energy systems, and geologic carbon storage require fast and reliable methods to account for complex, multiphysical processes in heterogeneous fractured and porous media. Although reservoir simulation is considered the industry standard for simulating these subsurface systems with injection and/or extraction operations, it requires incorporating spatio-temporal "Big Data" into the simulation model, which is typically a major challenge during both model development and the computational phase. In this work, we developed and applied various deep neural network-based approaches to (1) perform multiscale image segmentation, (2) generate ensemble members of drainage networks, flow channels, and porous media using deep convolutional generative adversarial networks, (3) construct multiple hybrid neural networks, such as convolutional LSTMs and convolutional neural network-LSTMs, to develop fast and accurate reduced-order models for shale gas extraction, and (4) apply physics-informed neural networks and deep Q-learning to flow and energy production. We hypothesized that physics-based machine learning/deep learning can overcome the shortcomings of traditional machine learning methods, whose data-driven models have faltered beyond the data and physical conditions used for training and validation. We improved and developed novel approaches to demonstrate that physics-based ML allows us to incorporate physical constraints (e.g., scientific domain knowledge) into the ML framework. Outcomes of this project will be readily applicable to many energy and national security problems that are particularly defined by multiscale features and network systems.
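As an illustration of how physical constraints enter a physics-informed loss, consider this minimal sketch with a toy 1-D diffusion equation standing in for the project's multiphysics models (the `model` argument, the collocation points, and the coefficient `D` are assumptions for illustration):

```python
import torch

# Minimal physics-informed loss for u_t = D * u_xx, a stand-in for the
# actual subsurface flow physics.
def pinn_loss(model, x_data, t_data, u_data, x_col, t_col, D=1.0):
    # Data-misfit term: fit the network to observations.
    u_pred = model(torch.cat([x_data, t_data], dim=1))
    data_loss = torch.mean((u_pred - u_data) ** 2)

    # Physics term: penalize the PDE residual at collocation points.
    x = x_col.requires_grad_(True)
    t = t_col.requires_grad_(True)
    u = model(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    residual = u_t - D * u_xx
    return data_loss + torch.mean(residual ** 2)
```

Minimizing this combined loss forces the network to honor both the observations and the governing equation, which is what lets the model extrapolate beyond the training data.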
Many individuals' mobility is characterized by strong patterns of regular movement and is influenced by social relationships. Social networks, in turn, are often organized into overlapping communities that are associated in time or space. We develop a model that can generate the structure of a social network and attribute purpose to individuals' movements, based solely on records of individuals' locations over time. The model distinguishes the attributed purpose of check-ins based on temporal and spatial patterns in check-in data. Because no location-based social network dataset with authoritative ground truth exists to test our entire model, we generate large-scale datasets containing social networks and individual check-in data to test it. We find that our model reliably assigns community purpose to social check-in data and is robust across a variety of situations.
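A toy version of temporal-pattern purpose attribution might look like the following (the rules and thresholds are illustrative assumptions, not the report's model):

```python
def attribute_purpose(checkins):
    """Attribute a purpose to each venue from an individual's check-in times.
    checkins: list of (venue_id, hour, weekday) tuples, weekday 0-6."""
    venue_visits = {}
    for venue, hour, weekday in checkins:
        venue_visits.setdefault(venue, []).append((hour, weekday))
    purposes = {}
    for venue, visits in venue_visits.items():
        # Recurring weekday-daytime visits suggest work; late-night
        # visits suggest home; the remainder defaults to social.
        workday = sum(1 for h, d in visits if 9 <= h <= 17 and d < 5)
        night = sum(1 for h, d in visits if h >= 20 or h <= 5)
        if workday / len(visits) > 0.6 and len(visits) >= 10:
            purposes[venue] = "work"
        elif night / len(visits) > 0.6:
            purposes[venue] = "home"
        else:
            purposes[venue] = "social"
    return purposes
```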
Survey data from the Energy Information Administration (EIA) were combined with data from the Environmental Protection Agency (EPA) to explore ways in which operations might impact water use intensity (both withdrawals and consumption) at thermoelectric power plants. Two disparities between cooling and power system operations were identified that could impact water use intensity: (1) the Idling Gap, where cooling systems continue to operate while their boilers and generators are completely idled; and (2) the Cycling Gap, where cooling systems operate at full capacity while their associated boiler and generator systems cycle over a range of loads. Analysis of the EIA and EPA data indicated that cooling systems operated on average 13% more than their corresponding power systems (Idling Gap), while power systems operated on average 30% below full load when the boiler was reported as operating (Cycling Gap). Regression analysis was then performed to explore whether the degree of power plant idling/cycling could be related to the physical characteristics of the plant, its environment, or the time of year. While the results suggested that individual power plants' operations were unique, weak trends consistently pointed to a plant's place on the dispatch curve as influencing patterns of cooling system, boiler, and generator operation. This insight better positions us to interpret reported power plant water use data and to improve future water use projections.
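A minimal sketch of how the two gaps could be computed from merged hourly records follows (the column names are hypothetical placeholders, not the actual EIA/EPA survey fields):

```python
import pandas as pd

def operation_gaps(df: pd.DataFrame) -> pd.Series:
    """df: one row per plant-hour with boolean 'cooling_on' and 'boiler_on'
    flags plus 'gross_load_mw' and 'nameplate_mw' columns (assumed layout)."""
    # Idling Gap: fraction of hours the cooling system ran while the
    # associated generator reported zero load.
    idling = (df["cooling_on"] & (df["gross_load_mw"] == 0)).mean()
    # Cycling Gap: average shortfall below full load over the hours the
    # boiler was reported as operating.
    operating = df[df["boiler_on"]]
    cycling = 1.0 - (operating["gross_load_mw"] / operating["nameplate_mw"]).mean()
    return pd.Series({"idling_gap": idling, "cycling_gap": cycling})
```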
In this study we investigate how an ensemble of disease models can be conditioned on observational data, in a bid to improve its predictive skill. We use the ensemble of influenza forecasting models gathered by the US Centers for Disease Control and Prevention (CDC) as the exemplar. This ensemble is used every year to forecast the annual influenza outbreak in the United States. The models constituting this ensemble draw on very different modeling assumptions and approximations and are a diverse collection of methods to approximate epidemiological dynamics. Currently, each model's predictions are accorded the same importance, or weight, when compiling the ensemble's forecast. We consider this equally-weighted ensemble as the baseline case to be improved upon. In this study, we explore whether an ensemble forecast can be improved by "conditioning" the ensemble on whatever observational data are available from the ongoing outbreak. "Conditioning" can mean according the ensemble's members different weights that evolve over time, or simply performing the forecast using the top k (equally-weighted) models. In the latter case, the composition of the "top-k" set of models evolves over time. This is called "model averaging" (MA) in statistics. We explore four methods to perform model averaging, three of which are new. We find that the CDC ensemble responds best to the "top-k models" approach to model averaging. All the new MA methods perform better than the baseline equally-weighted ensemble. The four model-averaging methods treat the models as black boxes and simply use their forecasts as inputs; i.e., one does not need access to the models at all, but only to their forecasts. The model-averaging approaches reviewed in this report thus form a general framework for model-averaging any model ensemble.
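The top-k approach is simple enough to sketch directly (a minimal illustration; the skill-scoring rule used to rank the models is an assumption here):

```python
import numpy as np

def top_k_forecast(forecasts, scores, k):
    """Equally weight the k ensemble members with the best scores so far.

    forecasts: (n_models, n_targets) array of each model's current forecast.
    scores:    (n_models,) array of skill scores on the outbreak to date
               (higher is better).
    """
    top = np.argsort(scores)[-k:]           # indices of the k best models
    return np.mean(forecasts[top], axis=0)  # equally-weighted top-k average
```

Because only the `forecasts` array is needed, the models themselves remain black boxes, which is what makes the framework general.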
This project explored coupling modeling and analysis methods from multiple domains to address complex hybrid (cyber and physical) attacks on mission-critical infrastructure. Robust methods to integrate these complex systems are necessary to enable large trade-space exploration, including dynamic and evolving cyber threats and mitigations. Reinforcement learning employing deep neural networks, as in the AlphaGo Zero solution, was used to identify "best" (or approximately optimal) resilience strategies for operation of a cyber/physical grid model. A prototype platform was developed, and the machine learning (ML) algorithm was made to play itself in a game of "Hurt the Grid". This proof of concept shows that machine learning optimization can help us understand and control a complex, multi-dimensional grid space. A simple yet high-fidelity model demonstrates that the data have spatial correlation, which is necessary for any optimization or control. Our prototype analysis showed that reinforcement learning successfully improved both the adversary's and the defender's knowledge of how to manipulate the grid. When expanded to more representative models, this type of machine learning will inform grid operations and defense, supporting the development of mitigations to defend the grid from complex cyber attacks. The same research can be extended to similarly complex domains.
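The structure of the self-play loop can be conveyed with a toy tabular version (the state/action encoding, reward signal, and grid-model transition are placeholders; the project used deep networks rather than a Q-table):

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 64, 8  # toy discretization of grid states and moves

# One value table per player; attacker and defender alternate moves.
Q = {p: np.zeros((N_STATES, N_ACTIONS)) for p in ("attacker", "defender")}

def step(state, action):
    """Placeholder transition: a real implementation would query the
    cyber/physical grid simulation for the next state and damage score."""
    return int(rng.integers(N_STATES)), float(rng.normal())

for episode in range(1000):
    s, player = int(rng.integers(N_STATES)), "attacker"
    for _ in range(50):
        # Epsilon-greedy action selection for the current player.
        a = (int(rng.integers(N_ACTIONS)) if rng.random() < 0.1
             else int(np.argmax(Q[player][s])))
        s2, r = step(s, a)
        # Zero-sum game: the defender's reward is the attacker's negated.
        r = r if player == "attacker" else -r
        Q[player][s, a] += 0.1 * (r + 0.95 * np.max(Q[player][s2]) - Q[player][s, a])
        s, player = s2, ("defender" if player == "attacker" else "attacker")
```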
This report collects the documentation produced for IMPACT, a software-based decision support tool that provides situational awareness, incident characterization, and guidance on public health and environmental response strategies during an unfolding bio-terrorism incident.
The transformation of the distribution grid from a centralized to a decentralized architecture, with bi-directional power and data flows, is made possible by a surge in network intelligence and grid automation. While these changes are largely beneficial, the interface between grid operators and automated technologies is not well understood, nor are the benefits and risks of automation. Quantifying and understanding these benefits and risks is an important facet of grid resilience that needs to be fully investigated. The work described in this document represents the first empirical study aimed at identifying and mitigating the vulnerabilities posed by automation for a grid that, for the foreseeable future, will remain a human-in-the-loop critical infrastructure. Our scenario-based methodology enabled us to conduct a series of experimental studies to identify causal relationships between grid-operator performance and automated technologies and to collect measurements of human performance as a function of automation. Our findings, though preliminary, suggest there are predictive patterns in the interplay between human operators and automation, patterns that can inform the rollout of distribution automation and the hiring and training of operators, and contribute in multiple, significant ways to the field of grid resilience.
Open-source indicators have been proposed as a way of tracking and forecasting disease outbreaks. Some, such as meteorological data, are readily available as reanalysis products. Others, such as those derived from our online behavior (web searches, media articles, etc.), are easily gathered and are more timely than public health reporting. In this study we investigate how these datastreams may be combined to provide useful epidemiological information. The investigation is performed by building data assimilation systems to track influenza in California and dengue in India. The first does not suffer from incomplete data and was chosen to explore disease modeling needs. The second explores the case in which observational data are sparse and disease modeling complexities are beside the point. The two test cases thus sit at opposite ends of the disease tracking spectrum. We find that data assimilation systems that produce disease activity maps can be constructed. Further, the ability to combine multiple open-source datastreams is a necessity, as no single one is very informative. The data assimilation systems have very little in common except that they contain disease models, calibration algorithms, and some ability to impute missing data. Thus, while the data assimilation systems share the goal of accurate forecasting, they are in practice designed to compensate for the shortcomings of their datastreams, and we expect them to be disease- and location-specific.
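The calibration core of such a system can be sketched as importance-weighting a parameter ensemble of a compartmental model against observed incidence (a minimal illustration; the report's systems are more elaborate, and the SIR model, priors, and noise model below are assumptions):

```python
import numpy as np

def sir_trajectory(beta, gamma, s0, i0, n_steps):
    """Discrete-time SIR model; returns daily incidence (fractions)."""
    s, i, inc = s0, i0, []
    for _ in range(n_steps):
        new_inf = beta * s * i
        s, i = s - new_inf, i + new_inf - gamma * i
        inc.append(new_inf)
    return np.array(inc)

def calibrate(obs, n_particles=2000, sigma=0.05, rng=None):
    """Weight random SIR parameter draws by their fit to observed incidence."""
    rng = np.random.default_rng(rng)
    betas = rng.uniform(0.1, 1.0, n_particles)
    gammas = rng.uniform(0.05, 0.5, n_particles)
    log_w = np.empty(n_particles)
    for j in range(n_particles):
        sim = sir_trajectory(betas[j], gammas[j], 0.99, 0.01, len(obs))
        log_w[j] = -np.sum((sim - obs) ** 2) / (2 * sigma**2)
    w = np.exp(log_w - log_w.max())  # stabilize before normalizing
    return betas, gammas, w / w.sum()  # posterior-weighted ensemble
```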