Arithmetic Coding (AC) using Prediction by Partial Matching (PPM) is a compression algorithm that can be used as a machine learning algorithm. This paper describes a new algorithm, NGram PPM. NGram PPM has all the predictive power of AC/PPM, but at a fraction of the computational cost. Unlike compression-based analytics, it is also amenable to a vector space interpretation, which enables integration with traditional machine learning algorithms. AC/PPM is reviewed, including its application to machine learning. NGram PPM is then described, and test results comparing it to AC/PPM are presented.
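To make the flavor of compression-as-learning concrete, the sketch below scores text by its approximate code length under a character n-gram model and classifies a document by choosing the best-compressing class model. The `NGramModel` class, the backoff scheme, and the smoothing are illustrative assumptions of ours; they are not the NGram PPM algorithm described in the paper.

```python
# Illustrative sketch only: a character n-gram model standing in for
# PPM-style prediction, scoring text by approximate code length in bits.
import math
from collections import defaultdict

class NGramModel:
    def __init__(self, order=3):
        self.order = order
        # context string -> {next char -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        padded = "\x00" * self.order + text
        for i in range(self.order, len(padded)):
            for k in range(self.order + 1):        # record contexts of length 0..order
                self.counts[padded[i - k:i]][padded[i]] += 1

    def _prob(self, ctx, ch):
        # Back off from the longest matching context to shorter ones.
        for k in range(len(ctx), -1, -1):
            seen = self.counts.get(ctx[len(ctx) - k:])
            if seen and ch in seen:
                return seen[ch] / sum(seen.values())
        return 1.0 / 256                           # uniform fallback over byte values

    def code_length(self, text):
        """Approximate compressed size of text, in bits, under this model."""
        padded = "\x00" * self.order + text
        return -sum(math.log2(self._prob(padded[i - self.order:i], padded[i]))
                    for i in range(self.order, len(padded)))

def classify(doc, models):
    """Predict the label whose model 'compresses' the document best."""
    return min(models, key=lambda label: models[label].code_length(doc))
```

Training one model per label and calling `classify` reproduces the general compression-as-classification pattern, though not the paper's specific cost savings or vector space interpretation.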
This report describes the results of a seven-day effort to help subject matter experts address a problem related to COVID-19. In the course of this effort, we analyzed the 29K documents provided as part of the White House's call to action. This involved applying a variety of natural language processing techniques and compression-based analytics, in combination with visualization techniques and assessment with subject matter experts, to pursue answers to a specific question. In this paper, we describe the algorithms, the software, the study performed, and the availability of the software developed during the effort.
The flexibility of network communication within Internet protocols is fundamental to network function, yet this same flexibility permits malicious use. In particular, malicious behavior can masquerade as benign traffic, thus evading systems designed to catch misuse of network resources. However, perfect imitation of benign traffic is difficult, meaning that small unintentional deviations from normal can occur. Identifying these deviations requires that the defenders know what features reveal malicious behavior. Herein, we present an application of compression-based analytics to network communication that can reduce the need for defenders to know a priori what features they need to examine. Motivating the approach is the idea that compression relies on the ability to discover and exploit predictable elements in information, thereby highlighting any deviations between expected and received content. We introduce a so-called 'slice compression' score to identify malicious or anomalous communication in two ways. First, we apply normalized compression distances to classification problems and discuss methods for reducing noise by excising application content (as opposed to protocol features) using slice compression. Second, we present a new technique for anomaly detection, referred to as slice compression for anomaly detection. A diverse collection of datasets is analyzed to illustrate the efficacy of the proposed approaches. While our focus is network communication, other types of data are also considered to illustrate the generality of the method.
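As background, the normalized compression distance underlying the classification experiments can be approximated with any off-the-shelf compressor; the sketch below uses zlib as a stand-in and does not reproduce the slice-compression scoring itself.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Smaller distances indicate more shared structure between two messages;
# the payloads below are made-up examples, not data from the paper.
benign = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n" * 10
suspect = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\nX-Pad: AAAA\r\n\r\n" * 10
print(ncd(benign, suspect))
```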
We present a new method for boundary detection within sequential data using compression-based analytics. Our approach is to approximate the information distance between two adjacent sliding windows within the sequence. Large values in the distance metric are indicative of boundary locations. A new algorithm is developed, referred to as sliding information distance (SLID), which provides a fast, accurate, and robust approximation to the normalized information distance. A modified smoothed z-score algorithm is used to locate peaks in the distance metric, indicating boundary locations. A variety of data sources are considered, including text and audio, to demonstrate the efficacy of our approach.
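A simplified sketch of that pipeline is given below, under our own assumptions: NCD between adjacent windows stands in for SLID's faster approximation, and a basic trailing z-score test stands in for the modified smoothed z-score algorithm; window size, lag, and threshold values are illustrative.

```python
import statistics
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte windows."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    return (len(zlib.compress(x + y)) - min(cx, cy)) / max(cx, cy)

def window_distances(data: bytes, window=512, step=64):
    """Distance between the windows immediately left and right of each offset."""
    return [(i, ncd(data[i - window:i], data[i:i + window]))
            for i in range(window, len(data) - window, step)]

def boundary_offsets(scores, lag=10, threshold=3.0):
    """Flag offsets whose distance exceeds mean + threshold * stdev of the
    previous `lag` scores (a simplified smoothed z-score peak detector)."""
    peaks = []
    for k in range(lag, len(scores)):
        history = [s for _, s in scores[k - lag:k]]
        mu, sigma = statistics.mean(history), statistics.pstdev(history)
        if sigma > 0 and scores[k][1] > mu + threshold * sigma:
            peaks.append(scores[k][0])
    return peaks
```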
In this work, we approach topic tracking and meme trending in social media with a temporal focus; rather than analyzing topics, we aim to identify time periods whose content differs significantly from normal. We detail two approaches. The first is an information-theoretic analysis of the distributions of terms emitted during each time period. In the second, we cluster the documents from each time period and analyze the tightness of each clustering. We also discuss a method of combining the scores created by each technique, and we provide ample empirical analysis of our methodology on various Twitter datasets.
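To give a flavor of the first approach, the sketch below compares each time period's term distribution to the pooled background distribution with Jensen-Shannon divergence and flags periods that diverge strongly; the divergence measure, threshold, and helper names are our illustrative choices, and the clustering-tightness score is not reproduced here.

```python
import math
from collections import Counter

def jsd(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence (bits) between two term-count distributions."""
    total_p, total_q = sum(p.values()), sum(q.values())
    m = {t: 0.5 * (p[t] / total_p + q[t] / total_q) for t in set(p) | set(q)}
    def kl(a, total):
        return sum((a[t] / total) * math.log2((a[t] / total) / m[t])
                   for t in a if a[t] > 0)
    return 0.5 * kl(p, total_p) + 0.5 * kl(q, total_q)

def anomalous_periods(period_terms, threshold=0.3):
    """period_terms: {period_id: Counter of terms} -> periods whose term
    distribution diverges from the pooled background by more than threshold."""
    background = Counter()
    for counts in period_terms.values():
        background.update(counts)
    return [pid for pid, counts in period_terms.items()
            if jsd(counts, background) > threshold]
```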
On August 15, 2016, Sandia hosted a visit by Professor Venkatesh Narayanamurti. Prof. Narayanamurti is the Benjamin Peirce Research Professor of Technology and Public Policy at Harvard, a Board Member of the Belfer Center for Science and International Affairs, former Dean of the School of Engineering and Applied Science at Harvard, former Dean of Engineering at UC Santa Barbara, and former Vice President of Division 1000 at Sandia. During the visit, a small, informal, all-day idea exploration session on "Towards an Engineering and Applied Science of Research" was conducted. This document is a brief synopsis or "footprint" of the presentations and discussions at this Idea Exploration Session. The intent of this document is to stimulate further discussion about pathways Sandia can take to improve its research practices.
An underserved niche exists for data mining tools in complex analytical environments. We propose three attributes of analytical tool development that facilitate rapid operationalization of new tools into complex, dynamic environments: accessibility, adaptability, and extendibility. Accessibility we define as the ability to load data into an analytical system quickly and seamlessly. Adaptability we define as the ability to apply a tool rapidly to new, unanticipated use cases. Extendibility we define as the ability to create new functionality “in the field” where it is being used and, if needed, harden that new functionality into a new, more permanent user interface. Distributed “big data” systems generally do not optimize for these attributes, creating an underserved niche for new analytical tools. In this paper we define the problem, examine the three attributes, and describe the architecture of an example system we have built and use, called Citrus, that is especially focused on these three attributes.
In the context of text-based analysis workflows, we propose that an effective analytic tool facilitates triage by a) enabling users to identify and set aside irrelevant content (i.e., reduce the complexity of information in a dataset) and b) enabling them to develop a working mental model of which items are most relevant to the question at hand. This LDRD-funded research developed a dataset that is enabling this team to evaluate the proposed use of normalized compression distance (NCD) as a task-, user-, and context-insensitive measure of categorization outcomes (Shannon entropy is reduced as order is imposed). Effective analytics tools help people impose order, reducing complexity in measurable ways. Our concept and research were documented in a paper accepted to the ACM conference Beyond Time and Error: Novel Methods in Information Visualization Evaluation, part of the IEEE VisWeek Conference, Baltimore, MD, October 16-21, 2016. The paper is included as an appendix to this report.
In this project, we researched new techniques for detecting hidden networks of individuals and institutions by introducing the use of temporal correlations among behaviors, leveraging both information sources and metadata. We validated the algorithms using the Wikipedia edit history. The rapid increase in crowd-sourced applications like Wikipedia is providing a rich set of data with both a record of behaviors and a set of direct interactions among individuals. Data sets with network ground truth are needed to develop and validate models before applying them to national security settings where only content and metadata are available.
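As a minimal sketch of the general idea only: suppose two editors are linked whenever their binned edit-activity time series are strongly correlated. The hourly bins, Pearson correlation, and 0.8 threshold below are illustrative choices of ours, not the project's actual algorithm.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation with a guard for constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def activity_series(timestamps, t0, n_bins, bin_width=3600):
    """Count one editor's events in each time bin."""
    counts = Counter(int(t - t0) // bin_width for t in timestamps)
    return [counts.get(b, 0) for b in range(n_bins)]

def correlated_pairs(editor_timestamps, threshold=0.8, bin_width=3600):
    """editor_timestamps: {editor: [event times in seconds]} -> candidate links."""
    t0 = min(min(ts) for ts in editor_timestamps.values())
    t1 = max(max(ts) for ts in editor_timestamps.values())
    n_bins = int(t1 - t0) // bin_width + 1
    series = {e: activity_series(ts, t0, n_bins, bin_width)
              for e, ts in editor_timestamps.items()}
    editors = sorted(series)
    return [(a, b) for i, a in enumerate(editors) for b in editors[i + 1:]
            if pearson(series[a], series[b]) > threshold]
```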
This paper has three goals. The first is to review Shannon's theory of information and the subsequent advances leading to today's statistics-based text analysis algorithms, showing that the semantics of the text is neglected. The second goal is to propose an extension of Shannon's original model that can take semantics into account, where the 'semantics' of a message is understood in terms of the intended or actual changes in the recipient of the message. The third goal is to propose several lines of research that naturally fall out of the proposed model. Each computational approach to solving a problem rests on an underlying model or set of models that describe how key phenomena in the real world are represented and how they are manipulated. These models are both liberating and constraining. They are liberating in that they suggest a path of development for new tools and algorithms. They are constraining in that they intentionally ignore other potential paths of development. Modern statistical text analysis algorithms have a specific intellectual history and set of underlying models rooted in Shannon's theory of communication. For Shannon, language is treated as a stochastic generator of symbol sequences. Shannon himself, subsequently Weaver, and at least one of his predecessors are all explicit in their decision to exclude semantics from their models. This rejection of semantics as 'irrelevant to the engineering problem' is elegant and, combined with developments by Salton and subsequently by Latent Semantic Analysis, has led to a whole collection of powerful algorithms and an industry for data mining technologies. However, the kinds of problems currently facing us go beyond what can be accounted for by this stochastic model. Today's problems increasingly focus on the semantics of specific pieces of information. And although progress is being made with the old models, it seems natural to develop or extend information theory to account for semantics. By developing such a theory, we can improve the quality of the next generation of analytical tools. Far from being a mere intellectual curiosity, a new theory can provide the means for us to take into account information that has to date been ignored by the algorithms and technologies we develop. This paper begins with an examination of Shannon's theory of communication, discussing the contributions and limitations of the theory and how it gets expanded into today's statistical text analysis algorithms. Next, we expand Shannon's model, suggesting a transactional definition of semantics that focuses on the intended and actual changes that messages have on the recipient. Finally, we examine the implications of the model for algorithm development.
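For reference, the quantity at the center of Shannon's model, and of the statistical text analysis algorithms built on it, is the entropy of the symbol source; note that nothing in the formula refers to what the symbols mean:

```latex
% Shannon entropy of a discrete source X that emits symbol x_i with
% probability p(x_i): the expected information, in bits, per symbol.
H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)
```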
This report describes the Licensing Support Network (LSN) Assistant--a set of tools for categorizing e-mail messages and documents and for investigating and correcting existing archives of categorized e-mail messages and documents. The two main tools in the LSN Assistant are the LSN Archive Assistant (LSNAA) tool, for recategorizing manually labeled e-mail messages and documents, and the LSN Realtime Assistant (LSNRA) tool, for categorizing new e-mail messages and documents. This report focuses on the LSNAA tool. There are two main components of the LSNAA tool. The first is the Sandia Categorization Framework, which is responsible for providing categorizations for documents in an archive and storing them in an appropriate Categorization Database. The second is the user interface, which primarily interacts with the Categorization Database, providing a way to find and correct categorization errors in the database. A procedure for applying the LSNAA tool and an example use case of the LSNAA tool applied to a set of e-mail messages are provided. Performance results of the categorization model designed for this example use case are presented.
An experiment was conducted comparing the effectiveness of individual versus group electronic brainstorming in addressing difficult, real-world challenges. While industrial reliance on electronic communications has become ubiquitous, empirical and theoretical understanding of the bounds of its effectiveness has been limited. Previous research using short-term laboratory experiments has engaged small groups of students in answering questions irrelevant to an industrial setting. The current experiment extends these findings beyond the laboratory to larger groups of real-world employees addressing organization-relevant challenges over the course of four days. Findings are twofold. First, the data demonstrate that (for this design) individuals perform at least as well as groups in producing quantity of electronic ideas, regardless of brainstorming duration. However, when judged with respect to quality along three dimensions (originality, feasibility, and effectiveness), the individuals significantly (p<0.05) outperformed the group working together. The theoretical and applied (e.g., cost effectiveness) implications of this finding are discussed. Second, the current experiment yielded several viable solutions to the wickedly difficult problem that was posed.
The large number of government and industry activities supporting the Unit of Action (UA), with attendant documents, reports, and briefings, can overwhelm decision-makers with an overabundance of information that hampers the ability to make quick decisions, often resulting in a form of gridlock. In particular, the large and rapidly increasing amounts of data and data formats stored on UA Advanced Collaborative Environment (ACE) servers have led to the realization that it has become impractical and even impossible to perform manual analysis leading to timely decisions. UA Program Management (PM UA) has recognized the need to implement a Decision Support System (DSS) on UA ACE. The objective of this document is to survey the commercial Knowledge Discovery and Data Mining (KDDM) market and publish the results. Furthermore, a ranking mechanism based on UA ACE-specific criteria has been developed and applied to a representative set of commercially available KDDM solutions. In addition, an overview of four R&D areas identified as critical to the implementation of DSS on ACE is provided. Finally, a comprehensive database containing detailed information on the surveyed KDDM tools has been developed and is available upon customer request.