Publications

69 Results

MalGen: Malware Generation with Specific Behaviors to Improve Machine Learning-based Detectors

Smith, Michael R.; Carbajal, Armida J.; Domschot, Eva D.; Johnson, Nicholas J.; Goyal, Akul A.; Lamb, Christopher L.; Lubars, Joseph L.; Kegelmeyer, William P.; Krishnakumar, Raga K.; Quynn, Sophie Q.; Ramyaa, Ramyaa R.; Verzi, Stephen J.; Zhou, Xin Z.

In recent years, infections and damage caused by malware have increased at exponential rates. At the same time, machine learning (ML) techniques have shown tremendous promise in many domains, often outperforming human efforts by learning from large amounts of data. Results in the open literature suggest that ML can provide similar results for malware detection, achieving greater than 99% classification accuracy [49]. However, the same detection rates have not been achieved in deployed settings. Malware is distinct from many other domains in which ML has shown success in that (1) it purposefully tries to hide, leading to noisy labels, and (2) its behavior is often similar to that of benign software, differing only in intent, among other complicating factors. This report details the reasons for the difficulty of detecting novel malware with ML methods and offers solutions to improve the detection of novel malware.

Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Malware Analysis

AISec 2020 - Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security

Smith, Michael R.; Johnson, Nicholas T.; Ingram, Joey; Carbajal, Armida J.; Haus, Bridget I.; Domschot, Eva; Ramyaa, Ramyaa; Lamb, Christopher L.; Verzi, Stephen J.; Kegelmeyer, William P.

Machine learning (ML) techniques are being used to detect increasing amounts of malware and variants. Despite successful applications of ML, we hypothesize that the full potential of ML is not realized in malware analysis (MA) due to a semantic gap between the ML and MA communities, as demonstrated in the data that is used. Due in part to the available data, ML has primarily focused on detection whereas MA is also interested in identifying behaviors. We review existing open-source malware datasets used in ML and find a lack of behavioral information that could facilitate stronger impact by ML in MA. As a first step in bridging this gap, we label existing data with behavioral information using open-source MA reports, thereby (1) altering the analysis from identifying malware to identifying behaviors, (2) aligning ML better with MA, and (3) allowing ML models to generalize to novel malware in a zero/few-shot learning manner. We classify the behavior of a malware family not seen during training using transfer learning from a state-of-the-art model for malware family classification and achieve 57%-84% accuracy on behavioral identification but fail to outperform the baseline set by a majority class predictor. This highlights opportunities for improvement on this task related to the data representation, the need for malware-specific ML techniques, and a larger training set of malware samples labeled with behaviors.

An Example of Counter-Adversarial Community Detection Analysis

Kegelmeyer, William P.; Wendt, Jeremy D.; Pinar, Ali P.

Community detection is often used to understand the nature of a network. However, there may exist an adversarial member of the network who wishes to evade that understanding. We analyze one such specific situation, quantifying the efficacy of certain attacks against a particular analytic use of community detection and providing a preliminary assessment of a possible defense.

Adverse Event Prediction Using Graph-Augmented Temporal Analysis: Final Report

Brost, Randolph B.; Carrier, Erin E.; Carroll, Michelle C.; Groth, Katrina M.; Kegelmeyer, William P.; Leung, Vitus J.; Link, Hamilton E.; Patterson, Andrew J.; Phillips, Cynthia A.; Richter, Samuel N.; Robinson, David G.; Staid, Andrea S.; Woodbridge, Diane M.-K.

This report summarizes the work performed under the Sandia LDRD project "Adverse Event Prediction Using Graph-Augmented Temporal Analysis." The goal of the project was to develop a method for analyzing multiple time-series data streams to identify precursors providing advance warning of the potential occurrence of events of interest. The proposed approach combined temporal analysis of each data stream with reasoning about relationships between data streams using a geospatial-temporal semantic graph. This class of problems is relevant to several important topics of national interest. In the course of this work we developed new temporal analysis techniques, including temporal analysis using Markov Chain Monte Carlo techniques, temporal shift algorithms to refine forecasts, and a version of Ripley's K-function extended to support temporal precursor identification. This report summarizes the project's major accomplishments, and gathers the abstracts and references for the publication submissions and reports that were prepared as part of this work. We then describe work in progress that is not yet ready for publication.

(Active) Learning on Groups of Data with Information-Theoretic Estimators

Sutherland, Dougal S.; Kegelmeyer, William P.; Hutchinson, Robert L.

A wide range of machine learning problems, including astronomical inference about galaxy clusters, scene classification, parametric statistical inference, and predictions of public opinion, can be well-modeled as learning a function on (samples from) distributions. This project explores problems in learning such functions via kernel methods, particularly for large-scale problems. When learning from large numbers of distributions, the computation of typical methods scales between quadratically and cubically, and so they are not amenable to large datasets. We investigate the approach of approximate embeddings into Euclidean spaces such that inner products in the embedding space approximate kernel values between the source distributions. We first improve the understanding of the workhorse methods of random Fourier features: we show that of the two approaches in common usage, one is strictly superior. We then present a new embedding for a class of information-theoretic distribution distances, and evaluate it and existing embeddings on several real-world applications.
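
For readers unfamiliar with random Fourier features, the following is a minimal Python sketch of one commonly used variant (paired cosine and sine features) that approximates a Gaussian kernel between ordinary vectors. It is illustrative only and is not the authors' code: the function names, the choice of Gaussian kernel, and the bandwidth parameter sigma are assumptions made here. The sketch does not indicate which of the two variants the abstract finds superior, and it does not cover the distribution embeddings the project develops.

```python
import math
import random


def make_rff(dim, n_freqs, sigma=1.0, seed=0):
    """Return a feature map z with z(x).z(y) ~= exp(-||x - y||^2 / (2 sigma^2)).

    Hypothetical helper: frequencies are drawn once from N(0, I / sigma^2),
    the spectral distribution of the Gaussian kernel.
    """
    rng = random.Random(seed)
    freqs = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
             for _ in range(n_freqs)]
    scale = 1.0 / math.sqrt(n_freqs)

    def features(x):
        # Project x onto every random frequency, then take cos and sin.
        proj = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in freqs]
        return ([scale * math.cos(p) for p in proj]
                + [scale * math.sin(p) for p in proj])

    return features


def dot(u, v):
    """Plain inner product in the embedding space."""
    return sum(a * b for a, b in zip(u, v))


def gaussian_kernel(x, y, sigma=1.0):
    """Exact kernel value, used here only for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


if __name__ == "__main__":
    z = make_rff(dim=2, n_freqs=2000)
    x, y = [0.3, -0.5], [1.0, 0.2]
    print("approximate:", dot(z(x), z(y)))
    print("exact:      ", gaussian_kernel(x, y))
```

The cost of the embedding is linear in the number of points, which is the motivation the abstract gives for replacing exact kernel computations on large datasets.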

PANTHER. Trajectory Analysis

Rintoul, Mark D.; Wilson, Andrew T.; Valicka, Christopher G.; Kegelmeyer, William P.; Shead, Timothy M.; Czuchlewski, Kristina R.; Newton, Benjamin D.

We want to organize a body of trajectories in order to identify, search for, classify and predict behavior among objects such as aircraft and ships. Existing comparison functions such as the Fréchet distance are computationally expensive and yield counterintuitive results in some cases. We propose an approach using feature vectors whose components represent succinctly the salient information in trajectories. These features incorporate basic information such as total distance traveled and distance between start/stop points as well as geometric features related to the properties of the convex hull, trajectory curvature and general distance geometry. Additionally, these features can generally be mapped easily to behaviors of interest to humans that are searching large databases. Most of these geometric features are invariant under rigid transformation. We demonstrate the use of different subsets of these features to identify trajectories similar to an exemplar, cluster a database of several hundred thousand trajectories, predict destination and apply unsupervised machine learning algorithms.
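
As a rough illustration of the kind of feature vector described above, the Python sketch below computes three of the named quantities for a planar trajectory: total distance traveled, distance between start and stop points, and the area of the convex hull. It is not the PANTHER implementation. The function names and the dictionary layout are hypothetical, and the sketch assumes points are (x, y) tuples in a flat plane rather than geographic coordinates.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula; zero for degenerate polygons."""
    if len(poly) < 3:
        return 0.0
    twice_area = sum(
        x0 * y1 - x1 * y0
        for (x0, y0), (x1, y1) in zip(poly, poly[1:] + poly[:1])
    )
    return abs(twice_area) / 2.0


def trajectory_features(points):
    """Hypothetical feature vector for a trajectory given as (x, y) tuples."""
    path_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return {
        "path_length": path_length,
        "endpoint_distance": math.dist(points[0], points[-1]),
        "hull_area": polygon_area(convex_hull(points)),
    }


if __name__ == "__main__":
    track = [(0, 0), (4, 0), (4, 3), (0, 3)]
    print(trajectory_features(track))
```

All three quantities are unchanged when the trajectory is translated or rotated, which matches the abstract's remark about invariance under rigid transformation.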

Streaming malware classification in the presence of concept drift and class imbalance

Proceedings - 2013 12th International Conference on Machine Learning and Applications, ICMLA 2013

Kegelmeyer, William P.; Chiang, Ken C.; Ingram, Joey

Malware, or malicious software, is capable of performing any action or command that can be expressed in code and is typically used for illicit activities, such as e-mail spamming, corporate espionage, and identity theft. Most organizations rely on anti-virus software to identify malware, which typically utilizes signatures that can only identify previously-seen malware instances. We consider the detection of malware executables that are downloaded in streaming network data as a supervised machine learning problem. Using malware data collected over multiple years, we characterize the effect of concept drift and class imbalance on batch and streaming decision tree ensembles. In particular, we illustrate a surprising vulnerability generated by precisely the aspect of streaming methods that seemed most likely to help them, when compared to batch methods. © 2013 IEEE.

COMET: A recipe for learning and using large ensembles on massive data

Proceedings - IEEE International Conference on Data Mining, ICDM

Basilico, Justin D.; Munson, M.A.; Kolda, Tamara G.; Dixon, Kevin R.; Kegelmeyer, William P.

COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably (in both accuracy and training time) to learning on a subsample of data using a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation which dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100X or more. © 2011 IEEE.

Multilingual sentiment analysis using Latent Semantic Indexing and machine learning

Proceedings - IEEE International Conference on Data Mining, ICDM

Bader, Brett W.; Kegelmeyer, William P.; Chew, Peter A.

We present a novel approach to predicting the sentiment of documents in multiple languages, without translation. The only prerequisite is a multilingual parallel corpus wherein a training sample of the documents, in a single language only, have been tagged with their overall sentiment. Latent Semantic Indexing (LSI) converts that multilingual corpus into a multilingual "concept space". Both training and test documents can be projected into that space, allowing crosslingual semantic comparisons between the documents without the need for translation. Accordingly, the training documents with known sentiment are used to build a machine learning model which can, because of the multilingual nature of the document projections, be used to predict sentiment in the other languages. We explain and evaluate the accuracy of this approach. We also design and conduct experiments to investigate the extent to which topic and sentiment separately contribute to that classification accuracy, and thereby shed some initial light on the question of whether topic and sentiment can be sensibly teased apart. © 2011 IEEE.

Network discovery, characterization, and prediction: a grand challenge LDRD final report

Kegelmeyer, William P.

This report is the final summation of Sandia's Grand Challenge LDRD project No. 119351, 'Network Discovery, Characterization and Prediction' (the 'NGC'), which ran from FY08 to FY10. The aim of the NGC, in a nutshell, was to research, develop, and evaluate relevant analysis capabilities that address adversarial networks. Unlike some Grand Challenge efforts, that ambition created cultural subgoals, as well as technical and programmatic ones, as the insistence on 'relevancy' required that the Sandia informatics research communities and the analyst user communities come to appreciate each other's needs and capabilities in a very deep and concrete way. The NGC generated a number of technical, programmatic, and cultural advances, detailed in this report. There were new algorithmic insights and research that resulted in fifty-three refereed publications and presentations; this report concludes with an abstract-annotated bibliography pointing to them all. The NGC generated three substantial prototypes that not only achieved their intended goals of testing our algorithmic integration, but which also served as vehicles for customer education and program development. The NGC, as intended, has catalyzed future work in this domain; by the end it had already brought in as much new funding as had been invested in it. Finally, the NGC knit together previously disparate research staff and user expertise in a fashion that not only addressed our immediate research goals, but which promises to have created an enduring cultural legacy of mutual understanding, in service of Sandia's national security responsibilities in cybersecurity and counter proliferation.

FCLib: The Feature Characterization Library

Gentile, Ann C.; Kegelmeyer, William P.; Ulmer, Craig D.

The Feature Characterization Library (FCLib) is a software library that simplifies the process of interrogating, analyzing, and understanding complex data sets generated by finite element applications. This document provides an overview of the library, a description of both the design philosophy and implementation of the library, and examples of how the library can be utilized to extract understanding from raw datasets.

Multilinear algebra for analyzing data with multiple linkages

Dunlavy, Daniel D.; Kolda, Tamara G.; Kegelmeyer, William P.

Link analysis typically focuses on a single type of connection, e.g., two journal papers are linked because they are written by the same author. However, often we want to analyze data that has multiple linkages between objects, e.g., two papers may have the same keywords and one may cite the other. The goal of this paper is to show that multilinear algebra provides a tool for multilink analysis. We analyze five years of publication data from journals published by the Society for Industrial and Applied Mathematics. We explore how papers can be grouped in the context of multiple link types using a tensor to represent all the links between them. A PARAFAC decomposition on the resulting tensor yields information similar to the SVD decomposition of a standard adjacency matrix. We show how the PARAFAC decomposition can be used to understand the structure of the document space and define paper-paper similarities based on multiple linkages. Examples are presented where the decomposed tensor data is used to find papers similar to a body of work (e.g., related by topic or similar to a particular author's papers), find related authors using linkages other than explicit co-authorship or citations, distinguish between papers written by different authors with the same name, and predict the journal in which a paper was published.

Multilinear operators for higher-order decompositions

Kolda, Tamara G.; Dunlavy, Daniel D.; Kegelmeyer, William P.

We propose two new multilinear operators for expressing the matrix compositions that are needed in the Tucker and PARAFAC (CANDECOMP) decompositions. The first operator, which we call the Tucker operator, is shorthand for performing an n-mode matrix multiplication for every mode of a given tensor and can be employed to concisely express the Tucker decomposition. The second operator, which we call the Kruskal operator, is shorthand for the sum of the outer-products of the columns of N matrices and allows a divorce from a matricized representation and a very concise expression of the PARAFAC decomposition. We explore the properties of the Tucker and Kruskal operators independently of the related decompositions. Additionally, we provide a review of the matrix and tensor operations that are frequently used in the context of tensor decompositions.
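
To make the two operators described above concrete, here is a small pure-Python sketch written from the abstract's verbal definitions. It is not the authors' notation or code: the function names, the dictionary-based tensor representation (index tuple to value), and the list-of-rows matrices are assumptions chosen to keep the example dependency-free.

```python
from itertools import product
from math import prod


def tucker_operator(core, factors):
    """Apply an n-mode matrix multiplication in every mode of a core tensor.

    core: dict mapping index tuples (j1, ..., jN) to values; missing entries are 0.
    factors: list of N matrices (lists of rows); matrix n has shape I_n x J_n.
    Returns a dict mapping (i1, ..., iN) to the entries of the result.
    """
    out_shape = [len(m) for m in factors]
    out = {}
    for idx in product(*(range(n) for n in out_shape)):
        out[idx] = sum(
            g * prod(m[i][j] for m, i, j in zip(factors, idx, core_idx))
            for core_idx, g in core.items()
        )
    return out


def kruskal_operator(factors):
    """Sum of the outer products of the columns of N matrices.

    factors: list of N matrices (lists of rows), all with the same column count R.
    """
    rank = len(factors[0][0])
    if any(len(row) != rank for m in factors for row in m):
        raise ValueError("all factor matrices need the same number of columns")
    out_shape = [len(m) for m in factors]
    out = {}
    for idx in product(*(range(n) for n in out_shape)):
        out[idx] = sum(
            prod(m[i][r] for m, i in zip(factors, idx)) for r in range(rank)
        )
    return out


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[1, 0], [0, 1]]
    C = [[1, 1]]
    x = kruskal_operator([A, B, C])
    print(x[(1, 1, 0)])
    # With a superdiagonal identity core, the Tucker form gives the same tensor.
    identity_core = {(r, r, r): 1 for r in range(2)}
    print(tucker_operator(identity_core, [A, B, C]) == x)
```

The brute-force loops are meant for reading, not performance; a practical implementation would use an array library and work mode by mode.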

FCLib: A library for building data analysis and data discovery tools

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Koegler, Wendy S.; Kegelmeyer, William P.

In this paper we describe a data analysis toolkit constructed to meet the needs of data discovery in large scale spatio-temporal data. The toolkit is a C library of building blocks that can be assembled into data analyses. Our goals were to build a toolkit which is easy to use, is applicable to a wide variety of science domains, supports feature-based analysis, and minimizes low-level processing. The discussion centers on the design of a data model and interface that best supports these goals and we present three usage examples. © Springer-Verlag Berlin Heidelberg 2005.

Creating and managing lookmarks in ParaView

Kegelmeyer, William P.

This paper describes the integration of lookmarks into the ParaView visualization tool. Lookmarks are pointers to views of specific parts of a dataset. They were so named because lookmarks are to a visualization tool and dataset as bookmarks are to a browser and the World Wide Web. A lookmark can be saved and organized among other lookmarks within the context of ParaView. Then at a later time, either in the same ParaView session or in a different one, it can be regenerated, displaying the exact view of the data that had previously been saved. This allows the user to pick up where they left off, to continue to adjust the view or otherwise manipulate the data. Lookmarks facilitate collaboration between users who wish to share views of a dataset. They enable more effective data comparison because they can be applied to other datasets. They also serve as a way of organizing a user's data. Ultimately, a lookmark is a time-saving tool that automates the recreation of a complex view of the data.
