BioRetroSynth: Identifying All Optimal Routes for Synthetic Biological and Hybrid Synthetic Biological/Chemical Production
Our nation's dependence on information networks makes it vital to anticipate disruptions and find weaknesses in these networks. But networks like the Internet are vast and distributed, and there is no mechanism to collect their structure completely. We are restricted to specific data collection schemes (like traceroute samples from router interfaces) that examine tiny portions of such a network. It has been empirically documented and theoretically proven that these measurements have significant biases, and direct inferences from them will be wrong. Moreover, these data collection mechanisms have limited flexibility and cannot be easily modified, and in many applications there are limits on how much data can be collected. How do we make accurate inferences of network properties from biased and limited measurements? The general problem this report deals with is how to work with incompletely observed networks, and we present several different approaches to it. First, we present an approach to estimate the degree distribution of a graph by sampling only a small portion of the vertices; this algorithm provides provably accurate results with sublinear samples. An alternative approach is to enhance the available information by selectively collecting new data, probing for the neighbors of a vertex or for the presence of individual edges. A different setting for working with incomplete networks arises when we have full access to local information but no global view of the graph. Can we still identify critical nodes in such a graph? We present an approach to identify such nodes efficiently. Finally, how can we put these ideas together to identify the structure of a network? We present an approach that complements existing network mapping methods: we start with an estimate of network structure based on existing mapping methods, find a critical router in the network, and use the traffic through this router to selectively collect new data that enhances our prediction.
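As a concrete point of reference, the baseline that vertex sampling refines can be sketched in a few lines. The snippet below is a minimal illustration, assuming a networkx graph: estimate the degree distribution from uniformly sampled vertices. The report's algorithm improves on this naive estimator to achieve provable accuracy with sublinear samples; the function name and the Barabási-Albert test graph are illustrative choices, not from the report.

    import random
    from collections import Counter

    import networkx as nx

    def sampled_degree_distribution(G, num_samples, rng=random):
        """Estimate the degree distribution of G from uniformly sampled vertices."""
        nodes = list(G.nodes())  # assumes vertex ids can be enumerated
        counts = Counter(G.degree(rng.choice(nodes)) for _ in range(num_samples))
        return {d: c / num_samples for d, c in counts.items()}

    # Example: a heavy-tailed test graph (illustrative only).
    G = nx.barabasi_albert_graph(100_000, 3, seed=0)
    estimate = sampled_degree_distribution(G, 2_000)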
Journal of Complex Networks
Network science is a powerful tool for analyzing complex systems in fields ranging from sociology to engineering to biology. This article is focused on generative models of large-scale bipartite graphs, also known as two-way graphs or two-mode networks. We propose two generative models that can be easily tuned to reproduce the characteristics of real-world networks, not just qualitatively but quantitatively. The characteristics we consider are the degree distributions and the metamorphosis coefficient. The metamorphosis coefficient, a bipartite analogue of the clustering coefficient, is the proportion of length-three paths that participate in length-four cycles. Having a high metamorphosis coefficient is a necessary condition for close-knit community structure. We define edge, node and degreewise metamorphosis coefficients, enabling a more detailed understanding of the bipartite connectivity that is not explained by degree distribution alone. Our first model, bipartite Chung-Lu, is able to reproduce real-world degree distributions, and our second model, bipartite block two-level Erdős-Rényi, reproduces both the degree distributions as well as the degreewise metamorphosis coefficients. We demonstrate the effectiveness of these models on several real-world data sets.
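The metamorphosis coefficient has a direct brute-force implementation that makes the definition concrete. The sketch below is an illustration assuming a networkx bipartite graph (the function name is hypothetical); it enumerates every length-three path and checks whether the closing edge is present, which is fine for small graphs but not the intended large-scale use.

    import networkx as nx

    def metamorphosis_coefficient(B):
        """Fraction of length-three paths u-v-w-x in bipartite B that lie on a
        length-four cycle, i.e., for which the closing edge (x, u) is present."""
        paths = closed = 0
        for v, w in B.edges():          # (v, w) is the middle edge of the path
            for u in B[v]:
                if u == w:
                    continue
                for x in B[w]:
                    if x == v:
                        continue
                    paths += 1
                    closed += B.has_edge(x, u)
        return closed / paths if paths else 0.0

    # Example on a classic bipartite data set shipped with networkx.
    print(metamorphosis_coefficient(nx.davis_southern_women_graph()))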
IEEE Transactions on Molecular, Biological and Multi-Scale Communications
We propose algorithms to approximate directed information graphs. Directed information graphs are probabilistic graphical models that depict causal dependencies between stochastic processes in a network. The proposed algorithms identify optimal and near-optimal approximations in terms of Kullback-Leibler divergence. The user-chosen sparsity trades off the quality of the approximation against visual conciseness and computational tractability. One class of approximations contains graphs with specified in-degrees. Another class additionally requires that the graph is connected. For both classes, we propose algorithms to identify the optimal approximations and also near-optimal approximations, using a novel relaxation of submodularity. We also propose algorithms to identify the r-best approximations among these classes, enabling robust decision making.
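The flavor of the in-degree-constrained class can be conveyed with a crude stand-in: score candidate parents of each process by pairwise Gaussian mutual information and keep the k best per node. This sketch is only an assumption-laden illustration; the paper scores whole parent sets with directed information and uses the submodularity relaxation to obtain guarantees.

    import numpy as np

    def greedy_parents(X, k):
        """Keep the k most informative parents per process, scoring candidate
        parent i for node j by the Gaussian mutual information
        -0.5 * log(1 - rho_ij^2). X has shape (T, n): T samples of n processes."""
        n = X.shape[1]
        rho = np.corrcoef(X, rowvar=False)
        mi = -0.5 * np.log(np.clip(1.0 - rho ** 2, 1e-12, None))
        np.fill_diagonal(mi, -np.inf)   # disallow self-parents
        return {j: sorted(np.argsort(mi[:, j])[-k:].tolist()) for j in range(n)}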
Journal of Complex Networks
The study of triangles in graphs is a standard tool in network analysis, leading to measures such as the transitivity, i.e., the fraction of paths of length two that participate in triangles. Real-world networks are often directed, however, and transitivity has no single obvious analogue for directed structure. We propose a collection of directed closure values for measuring triangles in directed graphs in a way that is analogous to transitivity in an undirected graph. Our study of these values reveals much information about directed triadic closure. For instance, we immediately see that reciprocal edges have a high propensity to participate in triangles. We also observe striking similarities between the triadic closure patterns of different web and social networks. We perform mathematical and empirical analysis showing that directed configuration models that preserve reciprocity cannot capture the triadic closure patterns of real networks.
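One representative closure value is easy to state in code: the fraction of directed two-paths u -> v -> w that are closed by an edge between u and w in either direction. The brute-force sketch below (networkx DiGraph, hypothetical function name) is one of several possible variants, not the paper's full taxonomy.

    import networkx as nx

    def directed_closure(G):
        """Fraction of directed two-paths u -> v -> w (u != w) that are closed
        by an edge between u and w in either direction."""
        paths = closed = 0
        for v in G:
            for u in G.predecessors(v):
                for w in G.successors(v):
                    if u == w:
                        continue
                    paths += 1
                    closed += G.has_edge(u, w) or G.has_edge(w, u)
        return closed / paths if paths else 0.0

    print(directed_closure(nx.gnp_random_graph(300, 0.02, directed=True, seed=0)))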
26th International World Wide Web Conference, WWW 2017
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adapted to count any small pattern in a graph, and apply this framework to compute exact counts for all 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them with fewer than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts for graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for 5-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all 4-vertex patterns. This algorithm is typically ten times faster than state-of-the-art 4-vertex counters.
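The degree-orientation trick the framework exploits is easiest to see on the smallest pattern, the triangle: orient every edge from its lower-degree endpoint to its higher-degree endpoint and intersect out-neighborhoods. The sketch below shows only this building block, not the full pattern-cutting framework for 5-vertex counts.

    import networkx as nx

    def triangle_count(G):
        """Count triangles via degree orientation: each triangle is found exactly
        once, from the edge between its two lowest-ordered vertices."""
        order = {v: (d, v) for v, d in G.degree()}      # degree order, ties by id
        out = {v: {u for u in G[v] if order[u] > order[v]} for v in G}
        return sum(len(out[u] & out[v]) for u in G for v in out[u])

    G = nx.gnp_random_graph(2000, 0.01, seed=0)
    assert triangle_count(G) == sum(nx.triangles(G).values()) // 3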
IEEE International Symposium on Information Theory - Proceedings
Given a network of agents interacting over time, which few interactions best characterize the dynamics of the whole network? We propose an algorithm that finds the optimal sparse approximation of a network. The user controls the level of sparsity by specifying the total number of edges. The networks are represented using directed information graphs, a graphical model that depicts causal influences between agents in a network. Goodness of approximation is measured with Kullback-Leibler divergence. The algorithm finds the best approximation with no assumptions on the topology or the class of the joint distribution.
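The goodness-of-approximation measure here is ordinary KL divergence, which has a closed form for zero-mean Gaussians. The sketch below shows only that metric, specialized to the Gaussian case as an illustration; the algorithm itself searches over sparse graph structures against this objective.

    import numpy as np

    def gaussian_kl(S_full, S_approx):
        """KL divergence D( N(0, S_full) || N(0, S_approx) ) in nats:
        0.5 * ( tr(S_approx^{-1} S_full) - n + log det(S_approx)/det(S_full) )."""
        n = S_full.shape[0]
        tr = np.trace(np.linalg.solve(S_approx, S_full))
        _, logdet_a = np.linalg.slogdet(S_approx)
        _, logdet_f = np.linalg.slogdet(S_full)
        return 0.5 * (tr - n + logdet_a - logdet_f)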
Conference Record - Asilomar Conference on Signals, Systems and Computers
Graphs in the real world are often temporal and can be represented as a "stream" of edges. Estimating the number of triangles in a graph observed as a stream of edges is a fundamental problem in data mining. Our goal is to design a single-pass, space-efficient streaming algorithm for estimating triangle counts. While there are numerous algorithms for this problem, they all (implicitly or explicitly) assume that the stream does not contain duplicate edges. However, real graph streams are rife with duplicate edges. The workaround is typically an extra, unaccounted pass (storing all the edges!) just to "clean up" the data. Furthermore, previous work tends to aggregate all edges to construct a graph, discarding the temporal information. It is much more informative to investigate temporal windows, especially multiple time windows simultaneously. Can we estimate triangle counts for multiple time windows in a single pass even when the stream contains repeated edges? In this work, we give the first algorithm for estimating the triangle count of a multigraph stream of edges over arbitrary time windows. We build on existing "wedge sampling" work for triangle counting. Duplicate edges create significant biasing issues for small-space streaming algorithms, which we provably resolve through a subtle debiasing mechanism. Moreover, our algorithm seamlessly handles multiple time windows. The final result is theoretically provable and has excellent performance in practice. Our algorithm discovers fascinating transitivity and triangle trends in real-world temporal graphs.
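The wedge-sampling primitive that this work builds on is simple in the static, single-window, duplicate-free setting: pick a wedge center with probability proportional to its wedge count, pick two of its neighbors, and test closure. The streaming contribution is making this work with duplicates and multiple windows; the sketch below shows only the static primitive.

    import random

    import networkx as nx

    def transitivity_by_wedge_sampling(G, k, rng=random):
        """Estimate transitivity (closed wedges / wedges) from k sampled wedges."""
        centers = [v for v in G if G.degree(v) >= 2]
        weights = [G.degree(v) * (G.degree(v) - 1) // 2 for v in centers]
        closed = 0
        for _ in range(k):
            v = rng.choices(centers, weights=weights)[0]
            u, w = rng.sample(list(G[v]), 2)
            closed += G.has_edge(u, w)
        return closed / k

    G = nx.gnp_random_graph(1000, 0.02, seed=0)
    print(transitivity_by_wedge_sampling(G, 10_000), nx.transitivity(G))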
Proceedings - IEEE International Conference on Data Mining, ICDM
Given two sets of vectors, A = {a_1, ..., a_m} and B = {b_1, ..., b_n}, our problem is to find the top-t dot products, i.e., the t largest values of |a_i · b_j| among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all mn dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of A and B. The probability of selecting a diamond corresponding to pair (i, j) is proportional to (a_i · b_j)^2, amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods.
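A simplified relative of diamond sampling fits in a dozen lines: wedge sampling, whose expected count for pair (i, j) scales with a_i · b_j (first power). Diamond sampling extends the same index-sampling idea so that counts scale with the square, which is what amplifies the largest entries. The sketch below is only this wedge variant under the assumption of nonzero coordinate masses, not the paper's full algorithm.

    import numpy as np

    def wedge_sample_counts(A, B, s, seed=0):
        """Signed sample counts whose expectation for pair (i, j) is
        proportional to a_i . b_j. A is m x d, B is n x d."""
        rng = np.random.default_rng(seed)
        (m, d), (n, _) = A.shape, B.shape
        col_a = np.abs(A).sum(axis=0)           # per-coordinate mass in A
        col_b = np.abs(B).sum(axis=0)
        pk = col_a * col_b / (col_a * col_b).sum()
        counts = np.zeros((m, n))
        for k in rng.choice(d, size=s, p=pk):   # sample a coordinate ...
            i = rng.choice(m, p=np.abs(A[:, k]) / col_a[k])   # ... then a row of A
            j = rng.choice(n, p=np.abs(B[:, k]) / col_b[k])   # ... then a row of B
            counts[i, j] += np.sign(A[i, k]) * np.sign(B[j, k])
        return counts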
Post-Moore's law scaling is creating a disruptive shift in simulation workflows, as saving the entirety of raw data to persistent storage becomes expensive. We are moving away from a post-process centric data analysis paradigm towards a concurrent analysis framework, in which raw simulation data is processed as it is computed. Algorithms must adapt to machines with extreme concurrency, low communication bandwidth, and high memory latency, while operating within the time constraints prescribed by the simulation. Furthermore, input parameters are often data dependent and cannot always be prescribed. The study of sublinear algorithms is a recent development in theoretical computer science and discrete mathematics that has significant potential to provide solutions for these challenges. The approaches of sublinear algorithms address the fundamental mathematical problem of understanding global features of a data set using limited resources. These theoretical ideas align with practical challenges of in-situ and in-transit computation where vast amounts of data must be processed under severe communication and memory constraints. This report details key advancements made in applying sublinear algorithms in-situ to identify features of interest and to enable adaptive workflows over the course of a three year LDRD. Prior to this LDRD, there was no precedent in applying sublinear techniques to large-scale, physics based simulations. This project has definitively demonstrated their efficacy at mitigating high performance computing challenges and highlighted the rich potential for follow-on research opportunities in this space.
WWW 2015 - Proceedings of the 24th International Conference on World Wide Web
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs on truly massive graphs employs clusters and massive parallelization. We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling-based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.
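The core primitive, uniform 3-path sampling, is short: pick the middle edge (u, v) with probability proportional to (d(u)-1)(d(v)-1), then pick a uniform neighbor on each side. The sketch below (networkx, hypothetical names) only produces the uniform 3-paths; the paper's estimator classifies the induced subgraph on each sampled path and rescales to obtain unbiased 4-vertex pattern counts.

    import random

    def sample_three_paths(G, s, rng=random):
        """Sample s three-paths (u2, u, v, v2) uniformly at random. A sampled
        path can be degenerate (u2 == v2), which the full estimator accounts for."""
        edges = [(u, v) for u, v in G.edges()
                 if G.degree(u) > 1 and G.degree(v) > 1]
        weights = [(G.degree(u) - 1) * (G.degree(v) - 1) for u, v in edges]
        paths = []
        for _ in range(s):
            u, v = rng.choices(edges, weights=weights)[0]
            u2 = rng.choice([x for x in G[u] if x != v])
            v2 = rng.choice([x for x in G[v] if x != u])
            paths.append((u2, u, v, v2))
        return paths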
2014 North American Power Symposium, NAPS 2014
Stochastic unit commitment models typically handle uncertainties in forecast demand by considering a finite number of realizations from a stochastic process model for loads. Accurate evaluations of expectations or higher moments for the quantities of interest require a prohibitively large number of model evaluations. In this paper we propose an alternative approach based on using surrogate models valid over the range of the forecast uncertainty. We consider surrogate models based on Polynomial Chaos expansions, constructed using sparse quadrature methods. Considering expected generation cost, we demonstrate that the approach can lead to several orders of magnitude reduction in computational cost relative to using Monte Carlo sampling on the original model, for a given target error threshold.
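A one-dimensional toy version shows the mechanics: expand a hypothetical generation-cost curve of a standardized Gaussian load error in probabilists' Hermite polynomials via Gauss-Hermite quadrature; the expected cost is then just the zeroth coefficient, with no Monte Carlo over the original model. The cost curve and function names below are illustrative assumptions, and the paper uses multi-dimensional sparse quadrature rather than this full one-dimensional rule.

    import math

    import numpy as np
    from numpy.polynomial import hermite_e as He

    def pce_coefficients(f, order, quad_pts=40):
        """Coefficients c_k = E[f(X) He_k(X)] / k! for X ~ N(0, 1),
        computed with Gauss-Hermite quadrature. E[f(X)] is c_0."""
        x, w = He.hermegauss(quad_pts)   # nodes/weights for weight exp(-x^2/2)
        w = w / w.sum()                  # normalize to a probability measure
        fx = f(x)
        return np.array([np.dot(w, fx * He.hermeval(x, [0.0] * k + [1.0]))
                         / math.factorial(k) for k in range(order + 1)])

    # Hypothetical smooth cost curve in the standardized forecast error z.
    cost = lambda z: 50 + 12 * z + 3 * z ** 2 + 0.5 * np.exp(0.3 * z)
    c = pce_coefficients(cost, order=4)
    mc = cost(np.random.default_rng(0).standard_normal(10 ** 6)).mean()
    print("PCE mean:", c[0], " Monte Carlo mean:", mc)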
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization, to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine a hierarchical structure among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any hierarchy. It is also not clear how to account for overlaps in dense substructures. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which allows for discovery of overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-trusses. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
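The special case of the decomposition with the smallest parameters is the classic k-core, which networkx computes directly; the sketch below gives a feel for the containment hierarchy of increasingly dense subgraphs that nuclei generalize (k-trusses arise similarly from the next parameter setting).

    import networkx as nx

    # k-cores: a nested hierarchy of denser and denser subgraphs.
    G = nx.karate_club_graph()
    core = nx.core_number(G)             # largest k with v inside a k-core
    for k in range(1, max(core.values()) + 1):
        sub = nx.k_core(G, k)
        print(f"{k}-core: {sub.number_of_nodes()} nodes, "
              f"{sub.number_of_edges()} edges")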