BioRetroSynth: Identifying All Optimal Routes for Synthetic Biological and Hybrid Synthetic Biological/Chemical Production
Abstract not provided.
Abstract not provided.
Our nation's dependence on information networks makes it vital to anticipate disruptions and find weaknesses in these networks. But networks like the Internet are vast and distributed, and there is no mechanism to collect their structure completely. We are restricted to specific data collection schemes (such as traceroute samples from router interfaces) that examine tiny portions of such a network. It has been empirically documented and theoretically proven that these measurements have significant biases, and direct inferences from them will be wrong. Moreover, these data collection mechanisms have limited flexibility and cannot be easily modified, and in many applications there are limits on how much data can be collected. How do we make accurate inferences of network properties from biased and limited measurements? The general problem this report deals with is how to work with incompletely observed networks. We present several different approaches to this problem. First, we present an approach to estimate the degree distribution of a graph by sampling only a small portion of the vertices. This algorithm provides provably accurate results with sublinear samples. An alternative approach is to enhance the available information by selectively collecting new data, probing for the neighbors of a vertex or for the presence of individual edges. A different setting for working with incomplete data arises when we have full access to local information but no global view of the graph. Can we still identify critical nodes in such a graph? We present an approach to identify such nodes efficiently. Finally, how can we put these ideas together to identify the structure of a network? We present an approach that complements existing approaches for network mapping. We start with an estimate of the network structure based on existing network mapping methods, find a critical router in the network, and use the traffic through this router to selectively collect new data that enhances our prediction.
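As a minimal illustration of the sampling setting described above (this is only a naive uniform-sampling baseline, not the report's provably accurate sublinear algorithm), one can estimate a degree distribution from a small vertex sample:

```python
import random
from collections import Counter

def estimate_degree_distribution(adj, sample_size, rng=random.Random(0)):
    """Naive sketch: estimate the degree distribution by uniformly
    sampling vertices and tallying their degrees."""
    vertices = list(adj)
    sample = rng.sample(vertices, min(sample_size, len(vertices)))
    counts = Counter(len(adj[v]) for v in sample)
    total = sum(counts.values())
    return {deg: c / total for deg, c in counts.items()}

# toy graph: a star on 4 vertices (hypothetical data for illustration)
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
est = estimate_degree_distribution(adj, 4)
```

Uniform sampling like this inherits the biases the report warns about (high-degree vertices are rare in the sample but dominate the edge set); the report's algorithm corrects for exactly such effects.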
Abstract not provided.
Journal of Complex Networks
Network science is a powerful tool for analyzing complex systems in fields ranging from sociology to engineering to biology. This article is focused on generative models of large-scale bipartite graphs, also known as two-way graphs or two-mode networks. We propose two generative models that can be easily tuned to reproduce the characteristics of real-world networks, not just qualitatively but quantitatively. The characteristics we consider are the degree distributions and the metamorphosis coefficient. The metamorphosis coefficient, a bipartite analogue of the clustering coefficient, is the proportion of length-three paths that participate in length-four cycles. Having a high metamorphosis coefficient is a necessary condition for close-knit community structure. We define edge, node and degreewise metamorphosis coefficients, enabling a more detailed understanding of the bipartite connectivity that is not explained by degree distribution alone. Our first model, bipartite Chung-Lu, is able to reproduce real-world degree distributions, and our second model, bipartite block two-level Erdős–Rényi, reproduces both the degree distributions as well as the degreewise metamorphosis coefficients. We demonstrate the effectiveness of these models on several real-world data sets.
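The global metamorphosis coefficient defined above (the fraction of length-three paths closing into length-four cycles) can be sketched by brute force on a small bipartite edge list; this is an illustrative computation, not the paper's models:

```python
from itertools import combinations

def metamorphosis_coefficient(edges):
    """Brute-force sketch: fraction of length-three paths
    u1 - v1 - u2 - v2 in a bipartite graph that close into
    length-four cycles (i.e., the closing edge is also present)."""
    E = set(edges)
    U = {u for u, _ in E}
    nbr = {u: {v for (x, v) in E if x == u} for u in U}
    paths = closed = 0
    for u1, u2 in combinations(U, 2):
        for v1 in nbr[u1] & nbr[u2]:   # middle vertex of the path
            for v2 in nbr[u2] - {v1}:  # path u1 - v1 - u2 - v2
                paths += 1
                closed += (u1, v2) in E
            for v2 in nbr[u1] - {v1}:  # path u2 - v1 - u1 - v2
                paths += 1
                closed += (u2, v2) in E
    return closed / paths if paths else 0.0
```

On the complete bipartite graph K_{2,2} every length-three path closes, giving a coefficient of 1; on a single length-three path nothing closes, giving 0.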
Abstract not provided.
Abstract not provided.
IEEE Transactions on Molecular, Biological and Multi-Scale Communications
We propose algorithms to approximate directed information graphs. Directed information graphs are probabilistic graphical models that depict causal dependencies between stochastic processes in a network. The proposed algorithms identify optimal and near-optimal approximations in terms of Kullback-Leibler divergence. The user-chosen sparsity trades off the quality of the approximation against visual conciseness and computational tractability. One class of approximations contains graphs with specified in-degrees. Another class additionally requires that the graph is connected. For both classes, we propose algorithms to identify the optimal approximations and also near-optimal approximations, using a novel relaxation of submodularity. We also propose algorithms to identify the r-best approximations among these classes, enabling robust decision making.
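The goodness-of-approximation measure referenced above, Kullback-Leibler divergence, can be computed directly for discrete distributions; this small helper only illustrates the metric itself, not the graph-approximation algorithms:

```python
import math

def kl_divergence(p, q):
    """D(p || q) for two discrete distributions given as aligned
    probability vectors; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

KL divergence is zero exactly when the approximation matches the true distribution, and grows as the approximation discards more dependency structure.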
Abstract not provided.
Abstract not provided.
Abstract not provided.
Journal of Complex Networks
The study of triangles in graphs is a standard tool in network analysis, leading to measures such as the transitivity, i.e., the fraction of paths of length two that participate in triangles. Real-world networks are often directed, and it can be difficult to meaningfully understand this network structure. We propose a collection of directed closure values for measuring triangles in directed graphs in a way that is analogous to transitivity in an undirected graph. Our study of these values reveals much information about directed triadic closure. For instance, we immediately see that reciprocal edges have a high propensity to participate in triangles. We also observe striking similarities between the triadic closure patterns of different web and social networks. We perform mathematical and empirical analysis showing that directed configuration models that preserve reciprocity cannot capture the triadic closure patterns of real networks.
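The undirected transitivity that the directed closure values generalize can be stated in a few lines; this sketch computes it by enumerating wedges (it is the baseline notion, not the paper's directed measures):

```python
from itertools import combinations

def transitivity(adj):
    """Fraction of length-two paths (wedges) that are closed into
    triangles, for an undirected graph given as vertex -> neighbor set."""
    wedges = closed = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):  # wedge a - v - b
            wedges += 1
            closed += a in adj[b]
    return closed / wedges if wedges else 0.0
```

A triangle has transitivity 1, a path has transitivity 0; the paper's directed closure values refine this single number into a collection indexed by edge directions.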
26th International World Wide Web Conference, WWW 2017
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and few practical results are known that scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for all 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones, and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have fewer than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for 5-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all 4-vertex patterns. This algorithm is typically ten times faster than state-of-the-art 4-vertex counters.
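The degree-orientation idea mentioned above can be shown in its simplest form, triangle counting: orient each edge from the lower-degree endpoint to the higher (ties broken by vertex id), so every triangle is enumerated exactly once at its lowest-ranked vertex. This is a standard illustration of the technique, not the paper's full 5-vertex framework:

```python
def count_triangles_oriented(adj):
    """Count triangles using a degree orientation of an undirected
    graph given as vertex -> neighbor set."""
    rank = {v: (len(adj[v]), v) for v in adj}          # degree, then id
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    return sum(1 for v in adj
                 for u in out[v]
                 for w in out[u] if w in out[v])
```

The orientation bounds out-degrees by the graph's degeneracy, which is what keeps enumeration cheap even on skewed real-world degree distributions.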
Abstract not provided.
IEEE International Symposium on Information Theory - Proceedings
Given a network of agents interacting over time, which few interactions best characterize the dynamics of the whole network? We propose an algorithm that finds the optimal sparse approximation of a network. The user controls the level of sparsity by specifying the total number of edges. The networks are represented using directed information graphs, a graphical model that depicts causal influences between agents in a network. Goodness of approximation is measured with Kullback-Leibler divergence. The algorithm finds the best approximation with no assumptions on the topology or the class of the joint distribution.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Conference Record - Asilomar Conference on Signals, Systems and Computers
Graphs in the real world are often temporal and can be represented as a "stream" of edges. Estimating the number of triangles in a graph observed as a stream of edges is a fundamental problem in data mining. Our goal is to design a single-pass, space-efficient streaming algorithm for estimating triangle counts. While there are numerous algorithms for this problem, they all (implicitly or explicitly) assume that the stream does not contain duplicate edges. However, real graph streams are rife with duplicate edges. The workaround is typically an extra unaccounted pass (storing all the edges!) just to "clean up" the data. Furthermore, previous work tends to aggregate all edges to construct a graph, discarding the temporal information. It is much more informative to investigate temporal windows, especially multiple time windows simultaneously. Can we estimate triangle counts for multiple time windows in a single pass even when the stream contains repeated edges? In this work, we give the first algorithm for estimating the triangle count of a multigraph stream of edges over arbitrary time windows. We build on existing "wedge sampling" work for triangle counting. Duplicate edges create significant biasing issues for small-space streaming algorithms, which we provably resolve through a subtle debiasing mechanism. Moreover, our algorithm seamlessly handles multiple time windows. The final result is theoretically provable and has excellent performance in practice. Our algorithm discovers fascinating transitivity and triangle trends in real-world temporal graphs.
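The "wedge sampling" primitive this work builds on is easy to sketch in the static, duplicate-free setting: sample wedges with probability proportional to how many are centered at each vertex and report the closed fraction. The streaming multigraph algorithm additionally debiases duplicates and maintains windows, which this sketch does not attempt:

```python
import random

def estimate_transitivity(adj, samples, rng=random.Random(1)):
    """Wedge-sampling sketch on a static undirected graph
    (vertex -> neighbor set): estimate the fraction of wedges
    that close into triangles."""
    verts = list(adj)
    # number of wedges centered at v is C(deg(v), 2)
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in verts]
    closed = 0
    for _ in range(samples):
        v = rng.choices(verts, weights=weights)[0]
        a, b = rng.sample(list(adj[v]), 2)   # a random wedge a - v - b
        closed += a in adj[b]
    return closed / samples
```

Because the triangle count equals (transitivity × wedge count) / 3, an accurate closed-wedge estimate immediately yields a triangle estimate.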
Abstract not provided.
Abstract not provided.
Proceedings - IEEE International Conference on Data Mining, ICDM
Given two sets of vectors, A = {a_1, ..., a_m} and B = {b_1, ..., b_n}, our problem is to find the top-t dot products, i.e., the largest |a_i · b_j| among all possible pairs. This is a fundamental mathematical problem that appears in numerous data applications involving similarity search, link prediction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all mn dot products. We select diamonds (i.e., four-cycles) from the weighted tripartite representation of A and B. The probability of selecting a diamond corresponding to pair (i, j) is proportional to (a_i · b_j)^2, amplifying the focus on the largest-magnitude entries. Experimental results indicate that diamond sampling is orders of magnitude faster than direct computation and requires far fewer samples than any competing approach. We also apply diamond sampling to the special case of maximum inner product search, and get significantly better results than the state-of-the-art hashing methods.
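The brute-force baseline that diamond sampling is designed to avoid, computing all mn dot products and keeping the t largest magnitudes, takes only a few lines; it is shown here purely to fix the problem statement, not as the paper's method:

```python
import heapq

def top_t_dot_products(A, B, t):
    """Exhaustive baseline: score every pair (i, j) by |a_i . b_j|
    and return the t largest as (score, i, j) tuples."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scored = ((abs(dot(a, b)), i, j)
              for i, a in enumerate(A) for j, b in enumerate(B))
    return heapq.nlargest(t, scored)
```

The O(mnd) cost of this loop is exactly what makes a sampling scheme concentrated on large-magnitude pairs attractive at scale.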