Publications

Results 51–100 of 105

Hybrid methods for cybersecurity analysis

Davis, Warren L.; Dunlavy, Daniel D.

Early 2010 saw a significant change in adversarial techniques aimed at network intrusion: a shift from malware delivered via email attachments toward the use of hidden, embedded hyperlinks to initiate sequences of downloads and interactions with web sites and network servers containing malicious software. Enterprise security groups were well poised and experienced in defending the former attacks, but the new types of attacks were larger in number, more challenging to detect, dynamic in nature, and required the development of new technologies and analytic capabilities. The Hybrid LDRD project was aimed at delivering new capabilities in large-scale data modeling and analysis to enterprise security operators and analysts and understanding the challenges of detection and prevention of emerging cybersecurity threats. Leveraging previous LDRD research efforts and capabilities in large-scale relational data analysis, large-scale discrete data analysis and visualization, and streaming data analysis, new modeling and analysis capabilities were quickly brought to bear on the problems in email phishing and spear phishing attacks in the Sandia enterprise security operational groups at the onset of the Hybrid project. As part of this project, a software development and deployment framework was created within the security analyst workflow tool sets to facilitate the delivery and testing of new capabilities as they became available, and machine learning algorithms were developed to address the challenge of dynamic threats. Furthermore, researchers from the Hybrid project were embedded in the security analyst groups for almost a full year, engaged in daily operational activities and routines, creating an atmosphere of trust and collaboration between the researchers and security personnel.
The Hybrid project has altered the way that research ideas can be incorporated into the production environments of Sandia's enterprise security groups, reducing time to deployment from months and years to hours and days for the application of new modeling and analysis capabilities to emerging threats. The development and deployment framework has been generalized into the Hybrid Framework and incorporated into several LDRD, WFO, and DOE/CSL projects and proposals. Most importantly, the Hybrid project has provided Sandia security analysts with new, scalable, extensible analytic capabilities that have resulted in alerts not detectable using their previous workflow tool sets.

Shifted power method for computing tensor eigenpairs

Kolda, Tamara G.; Dunlavy, Daniel D.

Recent work on eigenvalues and eigenvectors for tensors of order m ≥ 3 has been motivated by applications in blind source separation, magnetic resonance imaging, molecular conformation, and more. In this paper, we consider methods for computing real symmetric-tensor eigenpairs of the form Ax^{m-1} = λx subject to ‖x‖ = 1, which is closely related to optimal rank-1 approximation of a symmetric tensor. Our contribution is a novel shifted symmetric higher-order power method (SS-HOPM), which we show is guaranteed to converge to a tensor eigenpair. SS-HOPM can be viewed as a generalization of the power iteration method for matrices or of the symmetric higher-order power method. Additionally, using fixed point analysis, we can characterize exactly which eigenpairs can and cannot be found by the method. Numerical examples are presented, including examples from an extension of the method to finding complex eigenpairs.
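
As an illustration of the kind of iteration the abstract describes, here is a minimal sketch of a shifted power iteration for an order-3 symmetric tensor. The update rule (add a nonnegative shift `alpha` times the current iterate, then normalize), the stopping test, the function names, and the rank-one test tensor are all assumptions made for this sketch; the abstract does not state them, and the conditions on the shift that guarantee convergence are not reproduced here.

```python
import math


def apply_tensor(A, x):
    """Return the vector A x^{m-1} for an order-3 tensor A stored as nested lists."""
    n = len(x)
    return [
        sum(A[i][j][k] * x[j] * x[k] for j in range(n) for k in range(n))
        for i in range(n)
    ]


def normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ss_hopm(A, x0, alpha=1.0, tol=1e-12, max_iter=1000):
    """Shifted power iteration (assumed form): x <- normalize(A x^{m-1} + alpha * x)."""
    x = normalize(x0)
    lam = dot(x, apply_tensor(A, x))  # eigenvalue estimate A x^m
    for _ in range(max_iter):
        shifted = [y + alpha * xi for y, xi in zip(apply_tensor(A, x), x)]
        x = normalize(shifted)
        new_lam = dot(x, apply_tensor(A, x))
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, x


# Hypothetical test tensor: the rank-one symmetric tensor a (x) a (x) a with ||a|| = 1,
# for which (lambda, x) = (1, a) is an eigenpair.
a = [0.6, 0.8]
A = [[[a[i] * a[j] * a[k] for k in range(2)] for j in range(2)] for i in range(2)]
lam, x = ss_hopm(A, [1.0, 0.0])
```

For this test tensor the iterate should approach `a` and the eigenvalue estimate should approach 1.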

ParaText : scalable solutions for processing and searching very large document collections : final LDRD report

Dunlavy, Daniel D.; Crossno, Patricia J.

This report is a summary of the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. Towards that end, we designed, implemented, and demonstrated a scalable framework for text analysis - ParaText - as a major project deliverable. Further, we demonstrated the benefits of using visual analysis in text analysis algorithm development, improved performance of heterogeneous ensemble models in data classification problems, and the advantages of information theoretic methods in user analysis and interpretation in cross language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, with additional projects currently in proposal.

LDRD final report : leveraging multi-way linkages on heterogeneous data

Dunlavy, Daniel D.; Kolda, Tamara G.

This report is a summary of the accomplishments of the 'Leveraging Multi-way Linkages on Heterogeneous Data' LDRD, which ran from FY08 through FY10. The goal was to investigate scalable and robust methods for multi-way data analysis. We developed a new optimization-based method called CPOPT for fitting a particular type of tensor factorization to data; CPOPT was compared against existing methods and found to be more accurate than any faster method and faster than any equally accurate method. We extended this method to computing tensor factorizations for problems with incomplete data; our results show that scientifically meaningful factorizations can be recovered from data with large amounts of missing entries (50% or more). The project has involved 5 members of the technical staff, 2 postdocs, and 1 summer intern. It has resulted in a total of 13 publications, 2 software releases, and over 30 presentations. Several follow-on projects have already begun, with more potential projects in development.

Scalable tensor factorizations with incomplete data

Kolda, Tamara G.; Dunlavy, Daniel D.

The problem of incomplete data - i.e., data with missing or unknown values - in multi-way arrays is ubiquitous in biomedical signal processing, network traffic analysis, bibliometrics, social network analysis, chemometrics, computer vision, communication networks, etc. We consider the problem of how to factorize data sets with missing values with the goal of capturing the underlying latent structure of the data and possibly reconstructing missing values (i.e., tensor completion). We focus on one of the most well-known tensor factorizations that captures multi-linear structure, CANDECOMP/PARAFAC (CP). In the presence of missing data, CP can be formulated as a weighted least squares problem that models only the known entries. We develop an algorithm called CP-WOPT (CP Weighted OPTimization) that uses a first-order optimization approach to solve the weighted least squares problem. Based on extensive numerical experiments, our algorithm is shown to successfully factorize tensors with noise and up to 99% missing data. A unique aspect of our approach is that it scales to sparse large-scale data, e.g., 1000 x 1000 x 1000 with five million known entries (0.5% dense). We further demonstrate the usefulness of CP-WOPT on two real-world applications: a novel EEG (electroencephalogram) application where missing data is frequently encountered due to disconnections of electrodes and the problem of modeling computer network traffic where data may be absent due to the expense of the data collection process.
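
For a three-way tensor, the weighted least squares formulation mentioned above is commonly written as follows. The symbols (data entries x_{ijk}, indicator weights w_{ijk}, factor matrices A, B, C with R columns) are notation assumed here for illustration; the abstract itself does not fix any.

```latex
f_{\mathcal{W}}(A, B, C)
  = \frac{1}{2} \sum_{i,j,k} w_{ijk}
    \Bigl( x_{ijk} - \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} \Bigr)^{2},
\qquad
w_{ijk} =
  \begin{cases}
    1 & \text{if } x_{ijk} \text{ is known}, \\
    0 & \text{if } x_{ijk} \text{ is missing}.
  \end{cases}
```

Because missing entries carry zero weight, they contribute nothing to the objective, so only the known entries are modeled.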

ParaText : scalable text analysis and visualization

Dunlavy, Daniel D.

Automated analysis of unstructured text documents (e.g., web pages, newswire articles, research publications, business reports) is a key capability for solving important problems in areas including decision making, risk assessment, social network analysis, intelligence analysis, scholarly research, and others. However, as data sizes continue to grow in these areas, scalable processing, modeling, and semantic analysis of text collections become essential. In this paper, we present the ParaText text analysis engine, a distributed memory software framework for processing, modeling, and analyzing collections of unstructured text documents. Results on several document collections using hundreds of processors are presented to illustrate the flexibility, extensibility, and scalability of the entire process of text modeling from raw data ingestion to application analysis.

Compressively sensed complex networks

Pinar, Ali P.; Dunlavy, Daniel D.

The aim of this project is to develop low-dimensional parametric (deterministic) models of complex networks, using compressive sensing (CS) and multiscale analysis and exploiting the structure of complex networks (some are self-similar under coarsening). CS provides a new way of sampling and reconstructing networks. The approach is based on multiresolution decomposition of the adjacency matrix and its efficient sampling. It requires preprocessing of the adjacency matrix to make it 'blocky', which is the biggest (combinatorial) algorithmic challenge. The current CS reconstruction algorithm makes no use of the structure of a graph; it is very general (and so not very efficient or customized). Other model-based CS techniques exist but have not yet been adapted to networks. An obvious starting point for future work is to increase the efficiency of reconstruction.

Link prediction on evolving graphs using matrix and tensor factorizations

Kolda, Tamara G.; Dunlavy, Daniel D.

ParaText : scalable text modeling and analysis

Dunlavy, Daniel D.

Automated processing, modeling, and analysis of unstructured text (news documents, web content, journal articles, etc.) is a key task in many data analysis and decision making applications. As data sizes grow, scalability is essential for deep analysis. In many cases, documents are modeled as term or feature vectors and latent semantic analysis (LSA) is used to model latent, or hidden, relationships between documents and terms appearing in those documents. LSA supplies conceptual organization and analysis of document collections by modeling high-dimension feature vectors in many fewer dimensions. While past work on the scalability of LSA modeling has focused on the SVD, the goal of our work is to investigate the use of distributed memory architectures for the entire text analysis process, from data ingestion to semantic modeling and analysis. ParaText is a set of software components for distributed processing, modeling, and analysis of unstructured text. The ParaText source code is available under a BSD license, as an integral part of the Titan toolkit. ParaText components are chained together into data-parallel pipelines that are replicated across processes on distributed-memory architectures. Individual components can be replaced or rewired to explore different computational strategies and implement new functionality. ParaText functionality can be embedded in applications on any platform using the native C++ API, Python, or Java. The ParaText MPI Process provides a 'generic' text analysis pipeline in a command-line executable that can be used for many serial and parallel analysis tasks. ParaText can also be deployed as a web service accessible via a RESTful (HTTP) API. In the web service configuration, any client can access the functionality provided by ParaText using commodity protocols ... from standard web browsers to custom clients written in any language.

Scalable tensor factorizations with missing data

Dunlavy, Daniel D.; Kolda, Tamara G.

The problem of missing data is ubiquitous in domains such as biomedical signal processing, network traffic analysis, bibliometrics, social network analysis, chemometrics, computer vision, and communication networks, all domains in which data collection is subject to occasional errors. Moreover, these data sets can be quite large and have more than two axes of variation, e.g., sender, receiver, time. Many applications in those domains aim to capture the underlying latent structure of the data; in other words, they need to factorize data sets with missing entries. If we cannot address the problem of missing data, many important data sets will be discarded or improperly analyzed. Therefore, we need a robust and scalable approach for factorizing multi-way arrays (i.e., tensors) in the presence of missing data. We focus on one of the most well-known tensor factorizations, CANDECOMP/PARAFAC (CP), and formulate the CP model as a weighted least squares problem that models only the known entries. We develop an algorithm called CP-WOPT (CP Weighted OPTimization) using a first-order optimization approach to solve the weighted least squares problem. Based on extensive numerical experiments, our algorithm is shown to successfully factor tensors with noise and up to 70% missing data. Moreover, our approach is significantly faster than the leading alternative and scales to larger problems. To show the real-world usefulness of CP-WOPT, we illustrate its applicability on a novel EEG (electroencephalogram) application where missing data is frequently encountered due to disconnections of electrodes.
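
To make the idea of "modeling only the known entries" concrete, the sketch below evaluates a masked least squares objective and its first-order derivatives for a small three-way array. It is a dense, pure-Python illustration under assumed names (`cp_entry`, `weighted_objective_and_gradient`) and made-up data; it is not the CP-WOPT implementation, it does not exploit sparsity, and it omits the optimizer that would consume the gradient.

```python
def cp_entry(A, B, C, i, j, k):
    """Value of the rank-R CP model at position (i, j, k)."""
    return sum(A[i][r] * B[j][r] * C[k][r] for r in range(len(A[0])))


def weighted_objective_and_gradient(X, W, A, B, C):
    """Half the sum of squared residuals over known entries (W == 1), with gradients."""
    I, J, K, R = len(A), len(B), len(C), len(A[0])
    f = 0.0
    GA = [[0.0] * R for _ in range(I)]
    GB = [[0.0] * R for _ in range(J)]
    GC = [[0.0] * R for _ in range(K)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                if W[i][j][k] == 0:
                    continue  # missing entries contribute nothing
                resid = cp_entry(A, B, C, i, j, k) - X[i][j][k]
                f += 0.5 * resid * resid
                for r in range(R):
                    GA[i][r] += resid * B[j][r] * C[k][r]
                    GB[j][r] += resid * A[i][r] * C[k][r]
                    GC[k][r] += resid * A[i][r] * B[j][r]
    return f, GA, GB, GC


# Hypothetical rank-1 example: build X from known factors, then hide one entry.
A_true = [[1.0], [2.0]]
B_true = [[1.0], [-1.0]]
C_true = [[0.5], [3.0]]
X = [[[cp_entry(A_true, B_true, C_true, i, j, k) for k in range(2)]
      for j in range(2)] for i in range(2)]
W = [[[1 for _ in range(2)] for _ in range(2)] for _ in range(2)]
W[0][1][1] = 0      # mark one entry as missing
X[0][1][1] = 999.0  # placeholder value; it is never read because W is 0 there

f_true, GA_true, GB_true, GC_true = weighted_objective_and_gradient(
    X, W, A_true, B_true, C_true)

# Perturbing a factor makes the objective positive and the gradient nonzero.
A_off = [[1.5], [2.0]]
f_off, GA_off, GB_off, GC_off = weighted_objective_and_gradient(
    X, W, A_off, B_true, C_true)
```

At the true factors the objective and gradient vanish even though the hidden entry holds a placeholder value, which is the property a first-order method relies on when fitting only the observed data.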

Poblano v1.0 : a Matlab toolbox for gradient-based optimization

Dunlavy, Daniel D.; Kolda, Tamara G.

We present Poblano v1.0, a Matlab toolbox for solving gradient-based unconstrained optimization problems. Poblano implements three optimization methods (nonlinear conjugate gradients, limited-memory BFGS, and truncated Newton) that require only first order derivative information. In this paper, we describe the Poblano methods, provide numerous examples on how to use Poblano, and present results of Poblano used in solving problems from a standard test collection of unconstrained optimization problems.
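
For readers unfamiliar with the first method named above, the following is a minimal Python sketch of a nonlinear conjugate gradient iteration that uses only function values and first derivatives. It does not use or reproduce Poblano's Matlab interface: the function name `nonlinear_cg`, the Polak-Ribiere-plus update, the backtracking line search, and the test problem are all choices made for this illustration.

```python
def nonlinear_cg(f, grad, x0, tol=1e-8, max_iter=2000):
    """Minimize f from x0 using nonlinear CG with a backtracking (Armijo) line search."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]  # start along steepest descent
    for _ in range(max_iter):
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 ** 0.5 < tol:
            break
        slope = sum(gi * di for gi, di in zip(g, d))
        if slope >= 0.0:
            # Not a descent direction: restart with steepest descent.
            d = [-gi for gi in g]
            slope = -gnorm2
        # Halve the step until the sufficient-decrease condition holds.
        step, fx = 1.0, f(x)
        while f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope:
            step *= 0.5
            if step < 1e-20:
                break
        x = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad(x)
        # Polak-Ribiere update, clipped at zero ("PR+").
        beta = max(0.0, sum(gn * (gn - go) for gn, go in zip(g_new, g)) / gnorm2)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x


# Hypothetical test problem: a convex quadratic with minimizer (1, -2).
def quad(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def quad_grad(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_min = nonlinear_cg(quad, quad_grad, [0.0, 0.0])
```

The caller supplies only `f` and `grad`, which mirrors the abstract's point that the methods need first-order derivative information and nothing more.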

An optimization approach for fitting canonical tensor decompositions

Acar Ataman, Evrim N.; Dunlavy, Daniel D.

Tensor decompositions are higher-order analogues of matrix decompositions and have proven to be powerful tools for data analysis. In particular, we are interested in the canonical tensor decomposition, otherwise known as the CANDECOMP/PARAFAC decomposition (CPD), which expresses a tensor as the sum of component rank-one tensors and is used in a multitude of applications such as chemometrics, signal processing, neuroscience, and web analysis. The task of computing the CPD, however, can be difficult. The typical approach is based on alternating least squares (ALS) optimization, which can be remarkably fast but is not very accurate. Previously, nonlinear least squares (NLS) methods have also been recommended; existing NLS methods are accurate but slow. In this paper, we propose the use of gradient-based optimization methods. We discuss the mathematical calculation of the derivatives and further show that they can be computed efficiently, at the same cost as one iteration of ALS. Computational experiments demonstrate that the gradient-based optimization methods are much more accurate than ALS and orders of magnitude faster than NLS.
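
For a three-way tensor, the derivative calculation referred to above can be written in the following standard form. The notation is assumed here rather than taken from the abstract: X_{(1)} is the mode-1 unfolding of the data tensor, \odot is the Khatri-Rao product, and * is the elementwise (Hadamard) product.

```latex
f(A, B, C) = \frac{1}{2} \Bigl\| \mathcal{X} - \sum_{r=1}^{R} a_r \circ b_r \circ c_r \Bigr\|^{2},
\qquad
\frac{\partial f}{\partial A}
  = -X_{(1)} (C \odot B) + A \bigl[ (C^{\mathsf{T}} C) * (B^{\mathsf{T}} B) \bigr],
```

with analogous expressions for B and C. The dominant cost is the product X_{(1)}(C \odot B), which an ALS sweep also forms for each mode; this is consistent with the abstract's statement that the gradient costs about as much as one ALS iteration.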

Multilinear algebra for analyzing data with multiple linkages

Dunlavy, Daniel D.; Kolda, Tamara G.; Kegelmeyer, William P.

Link analysis typically focuses on a single type of connection, e.g., two journal papers are linked because they are written by the same author. However, often we want to analyze data that has multiple linkages between objects, e.g., two papers may have the same keywords and one may cite the other. The goal of this paper is to show that multilinear algebra provides a tool for multilink analysis. We analyze five years of publication data from journals published by the Society for Industrial and Applied Mathematics. We explore how papers can be grouped in the context of multiple link types using a tensor to represent all the links between them. A PARAFAC decomposition on the resulting tensor yields information similar to the SVD decomposition of a standard adjacency matrix. We show how the PARAFAC decomposition can be used to understand the structure of the document space and define paper-paper similarities based on multiple linkages. Examples are presented where the decomposed tensor data is used to find papers similar to a body of work (e.g., related by topic or similar to a particular author's papers), find related authors using linkages other than explicit co-authorship or citations, distinguish between papers written by different authors with the same name, and predict the journal in which a paper was published.
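
The sketch below shows one plausible way to assemble such a multi-link tensor: a papers x papers x link-type array in which each frontal slice is the adjacency matrix for one kind of link. The paper identifiers, link types, link records, and the function `build_link_tensor` are hypothetical; the abstract does not specify how the tensor for the journal data was constructed or weighted.

```python
def build_link_tensor(papers, link_types, links, symmetric_types=()):
    """Return T with T[i][j][t] = total weight of type-t links from paper i to paper j."""
    paper_index = {p: n for n, p in enumerate(papers)}
    type_index = {t: n for n, t in enumerate(link_types)}
    n, m = len(papers), len(link_types)
    T = [[[0.0] * m for _ in range(n)] for _ in range(n)]
    for src, dst, kind, weight in links:
        i, j, t = paper_index[src], paper_index[dst], type_index[kind]
        T[i][j][t] += weight
        if kind in symmetric_types:
            T[j][i][t] += weight  # undirected relations are stored in both directions
    return T


# Hypothetical data: three papers and three kinds of links.
papers = ["p0", "p1", "p2"]
link_types = ["same_author", "same_keyword", "cites"]
links = [
    ("p0", "p1", "same_author", 1.0),
    ("p0", "p2", "same_keyword", 1.0),
    ("p2", "p0", "cites", 1.0),  # directed: p2 cites p0
]
T = build_link_tensor(papers, link_types, links,
                      symmetric_types={"same_author", "same_keyword"})
```

A factorization such as PARAFAC would then be applied to `T`; that step is not shown here.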

CPOPT : optimization for fitting CANDECOMP/PARAFAC models

Kolda, Tamara G.; Acar Ataman, Evrim N.; Dunlavy, Daniel D.

Tensor decompositions (i.e., higher-order analogues of matrix decompositions) are powerful tools for data analysis. In particular, the CANDECOMP/PARAFAC (CP) model has proved useful in many applications such as chemometrics, signal processing, and web analysis. The problem of computing the CP decomposition is typically solved using an alternating least squares (ALS) approach. We discuss the use of optimization-based algorithms for CP, including how to efficiently compute the derivatives necessary for the optimization methods. Numerical studies highlight the positive features of our CPOPT algorithms, as compared with ALS and Gauss-Newton approaches.

Yucca Mountain licensing support network archive assistant

Dunlavy, Daniel D.; Basilico, Justin D.; Verzi, Stephen J.; Bauer, Travis L.

This report describes the Licensing Support Network (LSN) Assistant: a set of tools for categorizing e-mail messages and documents, and for investigating and correcting existing archives of categorized e-mail messages and documents. The two main tools in the LSN Assistant are the LSN Archive Assistant (LSNAA) tool for recategorizing manually labeled e-mail messages and documents and the LSN Realtime Assistant (LSNRA) tool for categorizing new e-mail messages and documents. This report focuses on the LSNAA tool. There are two main components of the LSNAA tool. The first is the Sandia Categorization Framework, which is responsible for providing categorizations for documents in an archive and storing them in an appropriate Categorization Database. The second is the actual user interface, which primarily interacts with the Categorization Database, providing a way of finding and correcting categorization errors in the database. A procedure for applying the LSNAA tool and an example use case of the LSNAA tool applied to a set of e-mail messages are provided. Performance results of the categorization model designed for this example use case are presented.

QCS: A system for querying, clustering and summarizing documents

Information Processing and Management

Dunlavy, Daniel D.; O'Leary, Dianne P.; Conroy, John M.; Schlesinger, Judith D.

Information retrieval systems consist of many complicated components. Research and development of such systems is often hampered by the difficulty in evaluating how each particular component would behave across multiple systems. We present a novel integrated information retrieval system, the Query, Cluster, Summarize (QCS) system, which is portable, modular, and permits experimentation with different instantiations of each of the constituent text analysis components. Most importantly, the combination of the three types of methods in the QCS design improves retrievals by providing users more focused information organized by topic. We demonstrate the improved performance by a series of experiments using standard test sets from the Document Understanding Conferences (DUC) as measured by the best known automatic metric for summarization system evaluation, ROUGE. Although the DUC data and evaluations were originally designed to test multidocument summarization, we developed a framework to extend it to the task of evaluation for each of the three components: query, clustering, and summarization. Under this framework, we then demonstrate that the QCS system (end-to-end) achieves performance as good as or better than the best summarization engines. Given a query, QCS retrieves relevant documents, separates the retrieved documents into topic clusters, and creates a single summary for each cluster. In the current implementation, Latent Semantic Indexing is used for retrieval, generalized spherical k-means is used for the document clustering, and a method coupling sentence "trimming" and a hidden Markov model, followed by a pivoted QR decomposition, is used to create a single extract summary for each cluster. The user interface is designed to provide access to detailed information in a compact and useful format.
Our system demonstrates the feasibility of assembling an effective IR system from existing software libraries, the usefulness of the modularity of the design, and the value of this particular combination of modules. © 2007.

Dakota, a multilevel parallel object-oriented framework for design optimization, parameter estimation, uncertainty quantification, and sensitivity analysis: version 4.0 developers manual

Brown, Shannon L.; Griffin, Joshua G.; Hough, Patricia D.; Kolda, Tamara G.; Martinez-Canales, Monica L.; Williams, Pamela J.; Adams, Brian M.; Dunlavy, Daniel D.; Swiler, Laura P.; Giunta, Anthony A.; Hart, William E.; Watson, Jean-Paul W.; Eddy, John P.

The DAKOTA (Design Analysis Kit for Optimization and Terascale Applications) toolkit provides a flexible and extensible interface between simulation codes and iterative analysis methods. DAKOTA contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quantification with sampling, reliability, and stochastic finite element methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the DAKOTA toolkit provides a flexible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a developers manual for the DAKOTA software and describes the DAKOTA class hierarchies and their interrelationships. It derives directly from annotation of the actual source code and provides detailed class documentation, including all member functions and attributes.

DAKOTA, a multilevel parallel object-oriented framework for design optimization, parameter estimation, uncertainty quantification, and sensitivity analysis: version 4.0 user's manual

Swiler, Laura P.; Giunta, Anthony A.; Hart, William E.; Watson, Jean-Paul W.; Eddy, John P.; Griffin, Joshua G.; Hough, Patricia D.; Kolda, Tamara G.; Martinez-Canales, Monica L.; Williams, Pamela J.; Eldred, Michael S.; Brown, Shannon L.; Adams, Brian M.; Dunlavy, Daniel D.

The DAKOTA (Design Analysis Kit for Optimization and Terascale Applications) toolkit provides a flexible and extensible interface between simulation codes and iterative analysis methods. DAKOTA contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quantification with sampling, reliability, and stochastic finite element methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the DAKOTA toolkit provides a flexible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a user's manual for the DAKOTA software and provides capability overviews and procedures for software execution, as well as a variety of example studies.
