Consider a scalar field f : M → R, where M is a triangulated simplicial mesh in R^d. A level set, or contour, at value v is a connected component of f^{-1}(v). As v changes, these contours change topology, merge into each other, or split. Contour trees are concise representations of f that track this contour behavior. The vertices of these trees are the critical points of f, where the gradient is zero, and the edges represent changes in the topology of contours. The contour tree is a fundamental data structure in data analysis and visualization, and there is significant previous work (both theoretical and practical) on algorithms for constructing it. Suppose M has n vertices, N facets, and t critical points. A classic result of Carr, Snoeyink, and Axen (2000) gives an algorithm that takes O(n log n + Nα(N)) time (where α(·) is the inverse Ackermann function). A further improvement to O(t log t + N) time was given by Chiang et al. All these algorithms involve a global sort of the critical points, a significant computational bottleneck. Unfortunately, lower bounds of Ω(t log t) also exist. We present the first algorithm that can avoid the global sort and has a refined time complexity that depends on the contour tree structure. Intuitively, if the tree is short and fat, we get significant improvements in running time. For a partition of the contour tree into a set of descending paths, P, our algorithm runs in O(Σ_{p∈P} |p| log |p| + tα(t) + N) time. This is at most O(t log D + N), where D is the diameter of the contour tree. Moreover, it is O(tα(t) + N) for balanced trees, a significant improvement over the previous complexity. Our algorithm requires numerous ideas: partitioning the contour tree into join and split trees, a local growing procedure to iteratively build contour trees, and the use of heavy path decompositions for the time complexity analysis. There is a crucial use of a family of binomial heaps to maintain priorities, ensuring that any comparison made is between comparable nodes of the contour tree. We also prove lower bounds showing that the Σ_{p∈P} |p| log |p| complexity is inherent to computing contour trees.
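To make the baseline concrete, the following is a minimal sketch (with illustrative names such as join_tree) of the classical join-tree construction via a global sort plus union-find, in the spirit of the Carr-Snoeyink-Axen approach; it is the bottleneck the refined algorithm above avoids, not the paper's algorithm.

```python
# Minimal sketch: join tree of a scalar field on the 1-skeleton of a mesh,
# built by a global descending sort and a union-find over processed vertices.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def add(self, x):
        self.parent[x] = x

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def join_tree(vertices, neighbors, f):
    """vertices: iterable of vertex ids; neighbors: dict v -> adjacent vertices
    (the 1-skeleton of the mesh); f: dict v -> scalar value.
    Returns the join-tree arcs as (higher_node, lower_node) pairs."""
    order = sorted(vertices, key=lambda v: f[v], reverse=True)  # the global sort
    uf = UnionFind()
    lowest = {}          # component representative -> lowest vertex added so far
    processed = set()
    arcs = []
    for v in order:
        uf.add(v)
        for u in neighbors[v]:
            if u not in processed:
                continue
            ru = uf.find(u)
            if ru == uf.find(v):
                continue                  # already merged via another neighbor
            arcs.append((lowest[ru], v))  # this component's arc now ends at v
            uf.union(ru, v)
        lowest[uf.find(v)] = v
        processed.add(v)
    return arcs
```

The split tree is computed symmetrically with an ascending sweep and the two are merged into the contour tree; the paper instead grows these structures locally to avoid sorting all critical points at once.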
The study of triangles in graphs is a standard tool in network analysis, leading to measures such as the transitivity, i.e., the fraction of paths of length two that participate in triangles. Real-world networks are often directed, however, and it is not obvious how to meaningfully measure this kind of closure in a directed graph. We propose a collection of directed closure values for measuring triangles in directed graphs in a way that is analogous to transitivity in an undirected graph. Our study of these values reveals much information about directed triadic closure. For instance, we immediately see that reciprocal edges have a high propensity to participate in triangles. We also observe striking similarities between the triadic closure patterns of different web and social networks. We perform mathematical and empirical analysis showing that directed configuration models that preserve reciprocity cannot capture the triadic closure patterns of real networks.
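As a simple illustration, the sketch below computes one such closure value for a directed edge list: the fraction of directed length-2 paths u → v → w that are closed by an edge between u and w in either direction. The input format and the function name are assumptions; the paper's collection of measures further distinguishes reciprocal from one-way edges.

```python
from collections import defaultdict

def directed_closure(edges):
    """edges: iterable of (u, v) pairs for a simple directed graph.
    Returns the fraction of directed length-2 paths closed by an edge
    (in either direction) between their endpoints."""
    out_nbrs, in_nbrs = defaultdict(set), defaultdict(set)
    for u, v in edges:
        if u != v:
            out_nbrs[u].add(v)
            in_nbrs[v].add(u)

    paths = closed = 0
    for v in set(out_nbrs) | set(in_nbrs):   # v is the middle of the path
        for u in in_nbrs[v]:                 # u -> v
            for w in out_nbrs[v]:            # v -> w, so u -> v -> w is a path
                if u == w:
                    continue
                paths += 1
                if w in out_nbrs[u] or u in out_nbrs[w]:
                    closed += 1
    return closed / paths if paths else 0.0
```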
Listing all triangles is a fundamental graph operation. Triangles can have important interpretations in real-world graphs, especially social and other interaction networks. Despite the lack of provably efficient (linear, or slightly superlinear) worst-case algorithms for this problem, practitioners run simple, efficient heuristics to find all triangles in graphs with millions of vertices. How are these heuristics exploiting the structure of these special graphs to provide major speedups in running time? We study one of the most prevalent algorithms used by practitioners. A trivial algorithm enumerates all paths of length 2 and checks whether each such path closes into a triangle. A good heuristic is to enumerate only those paths of length 2 in which the middle vertex has the lowest degree. It is easily implemented and is empirically known to give remarkable speedups over the trivial algorithm. We study the behavior of this algorithm over graphs with heavy-tailed degree distributions, a defining feature of real-world graphs. The erased configuration model (ECM) efficiently generates a graph with asymptotically (almost) any desired degree sequence. We show that the expected running time of this algorithm over the distribution of graphs created by the ECM is controlled by the ℓ_{4/3}-norm of the degree sequence. Norms of the degree sequence are a measure of the heaviness of the tail, and it is precisely this feature that allows nontrivial speedups of simple triangle enumeration algorithms. As a corollary of our main theorem, we prove expected linear-time performance for degree sequences following a power law with exponent α ≥ 7/3, and nontrivial speedup whenever α ∈ (2, 3).
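The heuristic in question is short enough to state as code. The sketch below (the adjacency-dict input format is an assumption for illustration) orders vertices by degree with ties broken by id, and enumerates each length-2 path only from its lowest-ranked middle vertex, checking whether the path closes into a triangle.

```python
def list_triangles(adj):
    """adj: dict mapping each vertex to the set of its neighbors (undirected).
    Returns each triangle once, as (lowest-rank vertex, then the other two)."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    rank = lambda v: (deg[v], v)     # degree order, ids break ties

    # For each vertex, keep only the neighbors that come later in degree order.
    higher = {v: {u for u in nbrs if rank(u) > rank(v)}
              for v, nbrs in adj.items()}

    triangles = []
    for v in adj:                    # v is the lowest-degree middle vertex
        for u in higher[v]:
            for w in higher[v]:
                if rank(w) > rank(u) and w in higher[u]:
                    triangles.append((v, u, w))
    return triangles
```

Every length-2 path is charged to its lowest-degree middle vertex, which is exactly the pruning whose running time the analysis above controls via the ℓ_{4/3}-norm.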
The primary problem in property testing is to decide whether a given function satisfies a certain property, or is far from any function satisfying it. This crucially requires a notion of distance between functions. The most prevalent notion is the Hamming distance over the uniform distribution on the domain. This restriction to uniformity is rather limiting, and it is important to investigate distances induced by more general distributions. In this paper, we give simple and optimal testers for bounded derivative properties over arbitrary product distributions. Bounded derivative properties include fundamental properties such as monotonicity and Lipschitz continuity. Our results subsume almost all known results (upper and lower bounds) on monotonicity and Lipschitz testing. We prove an intimate connection between bounded derivative property testing and binary search trees (BSTs). We exhibit a tester whose query complexity is the sum of expected depths of optimal BSTs for each marginal. Furthermore, we show this sum-of-depths is also a lower bound. A technical contribution of our work is an optimal dimension reduction theorem for all bounded derivative properties, which relates the distance of a function from the property to the distance of restrictions of the function to random lines. Such a theorem has been elusive even for monotonicity, and our theorem is an exponential improvement to the previous best known result.
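For intuition, here is a minimal one-dimensional illustration of the BST connection, specialized to monotonicity under the uniform distribution: sample a point and test it against every point on its binary-search path, so the query cost per sample is the BST depth. The paper's tester handles arbitrary product distributions over [n]^d and all bounded-derivative properties; the repetition count and the names below are illustrative.

```python
import random

def monotonicity_tester(f, n, repetitions=100):
    """f: function on {0, ..., n-1}. Rejects (returns False) only when a
    genuine monotonicity violation is found along a binary-search path."""
    for _ in range(repetitions):
        x = random.randrange(n)
        lo, hi = 0, n - 1
        while lo <= hi:                 # walk the balanced-BST path toward x
            mid = (lo + hi) // 2
            # f must respect the domain order of mid and x
            if (mid < x and f(mid) > f(x)) or (mid > x and f(mid) < f(x)):
                return False            # reject: violation witnessed
            if mid == x:
                break
            if x < mid:
                hi = mid - 1
            else:
                lo = mid + 1
    return True                         # accept: no violation found
```

Under the uniform distribution the balanced tree has expected depth O(log n); the tester above generalizes by replacing it with the optimal BST for each marginal, which is where the sum-of-depths complexity comes from.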
Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization, to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine a hierarchical structure among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any hierarchy. It is also not clear how to account for overlaps in dense substructures. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which allows for discovery of overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-trusses. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
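As a point of reference, the sketch below implements the classical k-core peeling that nucleus decompositions generalize (one parameter setting of nuclei recovers cores, another recovers trusses). The adjacency-dict input format is an assumption for illustration.

```python
import heapq

def core_numbers(adj):
    """adj: dict mapping each vertex to the set of its neighbors.
    Returns vertex -> core number via min-degree peeling with a lazy heap."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed = set()
    core = {}
    current = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                    # stale heap entry, skip it
        current = max(current, d)       # core number never decreases during peeling
        core[v] = current
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return core
```

Nuclei replace "a vertex inside its incident edges" with "a small clique inside the larger cliques containing it", which is what yields the denser, hierarchically nested subgraphs described above.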
Network data is ubiquitous and growing, yet we lack realistic generative network models that can be calibrated to match real-world data. The recently proposed block two-level Erdős-Rényi (BTER) model can be tuned to capture two fundamental properties: degree distribution and clustering coefficients. The latter is particularly important for reproducing graphs with community structure, such as social networks. In this paper, we compare BTER to other scalable models and show that it gives a better fit to real data. We provide a scalable implementation that requires only O(d_max) storage, where d_max is the maximum number of neighbors for a single node. The generator is trivially parallelizable, and we show results for a Hadoop MapReduce implementation for modeling a real-world Web graph with over 4.6 billion edges. We propose that the BTER model can be used as a graph generator for benchmarking purposes and provide idealized degree distributions and clustering coefficient profiles that can be tuned for user specifications.
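A highly simplified sketch of the two-level idea follows: vertices of similar degree are grouped into small affinity blocks, a dense Erdős-Rényi graph inside each block supplies clustering, and the leftover degree is paired off Chung-Lu style. The block-fill probability and all names here are illustrative assumptions, not the reference BTER implementation.

```python
import random

def bter_like(degrees, block_fill=0.9):
    """degrees: list where degrees[v] is the target degree of vertex v.
    Returns a set of undirected edges from a simplified two-level generator."""
    order = sorted(range(len(degrees)), key=lambda v: degrees[v])
    edges = set()
    excess = list(degrees)

    # Phase 1: affinity blocks of roughly d+1 vertices of degree about d,
    # filled by a dense Erdos-Renyi graph to create triangles/clustering.
    i = 0
    while i < len(order):
        d = max(degrees[order[i]], 1)
        block = order[i:i + d + 1]
        for a in range(len(block)):
            for b in range(a + 1, len(block)):
                if random.random() < block_fill:
                    u, v = block[a], block[b]
                    edges.add((min(u, v), max(u, v)))
                    excess[u] -= 1
                    excess[v] -= 1
        i += d + 1

    # Phase 2: Chung-Lu style pairing of the leftover ("excess") degree.
    stubs = [v for v in range(len(degrees)) for _ in range(max(excess[v], 0))]
    random.shuffle(stubs)
    for u, v in zip(stubs[::2], stubs[1::2]):
        if u != v:
            edges.add((min(u, v), max(u, v)))
    return edges
```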
This is the final SAND report for the Early-Career LDRD (# 158477) “Sublinear Algorithms for Massive Data Sets”. We provide a list of the various publications and achievements in the project.
We design a space-efficient algorithm that approximates the transitivity (global clustering coefficient) and total triangle count with only a single pass through a graph given as a stream of edges. Our procedure is based on a classic probabilistic result, the birthday paradox. When the transitivity is constant and there are more edges than wedges (common properties for social networks), we can prove that our algorithm requires O(√n) space (n is the number of vertices) to provide accurate estimates. We run a detailed set of experiments on a variety of real graphs and demonstrate that the memory requirement of the algorithm is a tiny fraction of the graph. For example, even for a graph with 200 million edges, our algorithm stores just 60,000 edges to give accurate results. Being a single-pass streaming algorithm, our procedure also maintains a real-time estimate of the transitivity/number of triangles of a graph, while storing a minuscule fraction of the edges.
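The sketch below conveys the single-pass flavor under simplifying assumptions (each edge appears once in the stream; no second wedge reservoir or bias corrections, which the actual algorithm needs for its guarantees): edges are reservoir-sampled, and each arriving edge is checked against the sampled adjacency for wedges it closes. Since each triangle contributes exactly one wedge whose closing edge arrives last, the closed fraction is scaled by 3. The reservoir size is illustrative.

```python
import random
from collections import defaultdict

def streaming_transitivity(edge_stream, reservoir_size=10000):
    """edge_stream: iterable of (u, v) pairs, each undirected edge seen once."""
    reservoir = []               # uniform reservoir sample of the edges seen so far
    adj = defaultdict(set)       # adjacency restricted to the sampled edges
    closed = set()               # sampled wedges whose closing edge arrived later
    seen = 0

    for u, v in edge_stream:
        if u == v:
            continue
        seen += 1
        # Record any sampled wedge (u, center, v) that this edge closes.
        for center in adj[u] & adj[v]:
            closed.add((min(u, v), max(u, v), center))
        # Reservoir-sample the edge itself.
        keep = len(reservoir) < reservoir_size or \
               random.random() < reservoir_size / seen
        if keep:
            if len(reservoir) >= reservoir_size:
                ou, ov = reservoir.pop(random.randrange(len(reservoir)))
                adj[ou].discard(ov)
                adj[ov].discard(ou)
            reservoir.append((u, v))
            adj[u].add(v)
            adj[v].add(u)

    # Wedges currently formed by sampled edges, and those detected closed.
    sampled_wedges = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    if sampled_wedges == 0:
        return 0.0
    still_closed = sum(1 for (a, b, c) in closed if a in adj[c] and b in adj[c])
    # Each triangle has exactly one wedge whose third edge arrives last, hence the 3.
    return 3.0 * still_closed / sampled_wedges
```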