Publications

Results 1–50 of 69
Skip to search filters

MalGen: Malware Generation with Specific Behaviors to Improve Machine Learning-based Detectors

Smith, Michael R.; Carbajal, Armida J.; Domschot, Eva D.; Johnson, Nicholas J.; Goyal, Akul A.; Lamb, Christopher L.; Lubars, Joseph L.; Kegelmeyer, William P.; Krishnakumar, Raga K.; Quynn, Sophie Q.; Ramyaa, Ramyaa R.; Verzi, Stephen J.; Zhou, Xin Z.

In recent years, infections and damage caused by malware have increased at exponential rates. At the same time, machine learning (ML) techniques have shown tremendous promise in many domains, often out performing human efforts by learning from large amounts of data. Results in the open literature suggest that ML is able to provide similar results for malware detection, achieving greater than 99% classifcation accuracy [49]. However, the same detection rates when applied in deployed settings have not been achieved. Malware is distinct from many other domains in which ML has shown success in that (1) it purposefully tries to hide, leading to noisy labels and (2) often its behavior is similar to benign software only differing in intent, among other complicating factors. This report details the reasons for the diffcultly of detecting novel malware by ML methods and offers solutions to improve the detection of novel malware.

More Details

Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Malware Analysis

AISec 2020 - Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security

Smith, Michael R.; Johnson, Nicholas T.; Ingram, Joey; Carbajal, Armida J.; Haus, Bridget I.; Domschot, Eva; Ramyaa, Ramyaa; Lamb, Christopher L.; Verzi, Stephen J.; Kegelmeyer, William P.

Machine learning (ML) techniques are being used to detect increasing amounts of malware and variants. Despite successful applications of ML, we hypothesize that the full potential of ML is not realized in malware analysis (MA) due to a semantic gap between the ML and MA communities-as demonstrated in the data that is used. Due in part to the available data, ML has primarily focused on detection whereas MA is also interested in identifying behaviors. We review existing open-source malware datasets used in ML and find a lack of behavioral information that could facilitate stronger impact by ML in MA. As a first step in bridging this gap, we label existing data with behavioral information using open-source MA reports-1) altering the analysis from identifying malware to identifying behaviors, 2)~aligning ML better with MA, and 3)~allowing ML models to generalize to novel malware in a zero/few-shot learning manner. We classify the behavior of a malware family not seen during training using transfer learning from a state-of-the-art model for malware family classification and achieve 57%-84% accuracy on behavioral identification but fail to outperform the baseline set by a majority class predictor. This highlights opportunities for improvement on this task related to the data representation, the need for malware specific ML techniques, and a larger training set of malware samples labeled with behaviors.

More Details

An Example of Counter-Adversarial Community Detection Analysis

Kegelmeyer, William P.; Wendt, Jeremy D.; Pinar, Ali P.

Community detection is often used to understand the nature of a network. However, there may exist an adversarial member of the network who wishes to evade that understanding. We analyze one such specific situation, quantifying the efficacy of certain attacks against a particular analytic use of community detection and providing a preliminary assessment of a possible defense.

More Details

Adverse Event Prediction Using Graph-Augmented Temporal Analysis: Final Report

Brost, Randolph B.; Carrier, Erin E.; Carroll, Michelle C.; Groth, Katrina M.; Kegelmeyer, William P.; Leung, Vitus J.; Link, Hamilton E.; Patterson, Andrew J.; Phillips, Cynthia A.; Richter, Samuel N.; Robinson, David G.; Staid, Andrea S.; Woodbridge, Diane M.-K.

This report summarizes the work performed under the Sandia LDRD project "Adverse Event Prediction Using Graph-Augmented Temporal Analysis." The goal of the project was to de- velop a method for analyzing multiple time-series data streams to identify precursors provid- ing advance warning of the potential occurrence of events of interest. The proposed approach combined temporal analysis of each data stream with reasoning about relationships between data streams using a geospatial-temporal semantic graph. This class of problems is relevant to several important topics of national interest. In the course of this work we developed new temporal analysis techniques, including temporal analysis using Markov Chain Monte Carlo techniques, temporal shift algorithms to refine forecasts, and a version of Ripley's K-function extended to support temporal precursor identification. This report summarizes the project's major accomplishments, and gathers the abstracts and references for the publication sub- missions and reports that were prepared as part of this work. We then describe work in progress that is not yet ready for publication.

More Details

(Active) Learning on Groups of Data with Information-Theoretic Estimators

Sutherland, Dougal S.; Kegelmeyer, William P.; Hutchinson, Robert L.

A wide range of machine learning problems, including astronomical inference about galaxy clusters, scene classification, parametric statistical inference, and predictions of public opinion, can be well-modeled as learning a function on (samples from) distributions. This project explores problems in learning such functions via kernel methods, particularly for large-scale problems. When learning from large numbers of distributions, the computation of typical methods scales between quadratically and cubically, and so they are not amenable to large datasets. We investigate the approach of approximate embeddings into Euclidean spaces such that inner products in the embedding space approximate kernel values between the source distributions. We first improve the understanding of the workhorse methods of random Fourier features: we show that of the two approaches in common usage, one is strictly superior. We then present a new embedding for a class of information-theoretic distribution distances, and evaluate it and existing embeddings on several real-world applications.

More Details

PANTHER. Trajectory Analysis

Rintoul, Mark D.; Wilson, Andrew T.; Valicka, Christopher G.; Kegelmeyer, William P.; Shead, Timothy M.; Czuchlewski, Kristina R.; Newton, Benjamin D.

We want to organize a body of trajectories in order to identify, search for, classify and predict behavior among objects such as aircraft and ships. Existing compari- son functions such as the Fr'echet distance are computationally expensive and yield counterintuitive results in some cases. We propose an approach using feature vectors whose components represent succinctly the salient information in trajectories. These features incorporate basic information such as total distance traveled and distance be- tween start/stop points as well as geometric features related to the properties of the convex hull, trajectory curvature and general distance geometry. Additionally, these features can generally be mapped easily to behaviors of interest to humans that are searching large databases. Most of these geometric features are invariant under rigid transformation. We demonstrate the use of different subsets of these features to iden- tify trajectories similar to an exemplar, cluster a database of several hundred thousand trajectories, predict destination and apply unsupervised machine learning algorithms.

More Details

Streaming malware classification in the presence of concept drift and class imbalance

Proceedings - 2013 12th International Conference on Machine Learning and Applications, ICMLA 2013

Kegelmeyer, William P.; Chiang, Ken C.; Ingram, Joey

Malware, or malicious software, is capable of performing any action or command that can be expressed in code and is typically used for illicit activities, such as e-mail spamming, corporate espionage, and identity theft. Most organizations rely on anti-virus software to identifymalware, which typically utilize signatures that can only identify previously-seen malware instances. We consider the detection ofmalware executables that are downloaded in streaming network data as a supervised machine learning problem. Using malwaredata collected over multiple years, we characterize the effect of concept drift and class imbalance on batch and streaming decision tree ensembles. In particular, we illustrate a surprising vulnerability generated by precisely the aspect of streaming methods that seemed most likely to help them, when compared to batch methods. © 2013 IEEE.

More Details

COMET: A recipe for learning and using large ensembles on massive data

Proceedings - IEEE International Conference on Data Mining, ICDM

Basilico, Justin D.; Munson, M.A.; Kolda, Tamara G.; Dixon, Kevin R.; Kegelmeyer, William P.

COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show that COMET compares favorably (in both accuracy and training time) to learning on a subsample of data using a serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble evaluation which dynamically decides how many ensemble members to evaluate per data point; this can reduce evaluation cost by 100X or more. © 2011 IEEE.

More Details

Multilingual sentiment analysis using Latent Semantic Indexing and machine learning

Proceedings - IEEE International Conference on Data Mining, ICDM

Bader, Brett W.; Kegelmeyer, William P.; Chew, Peter A.

We present a novel approach to predicting the sentiment of documents in multiple languages, without translation. The only prerequisite is a multilingual parallel corpus wherein a training sample of the documents, in a single language only, have been tagged with their overall sentiment. Latent Semantic Indexing (LSI) converts that multilingual corpus into a multilingual "concept space". Both training and test documents can be projected into that space, allowing crosslingual semantic comparisons between the documents without the need for translation. Accordingly, the training documents with known sentiment are used to build a machine learning model which can, because of the multilingual nature of the document projections, be used to predict sentiment in the other languages. We explain and evaluate the accuracy of this approach. We also design and conduct experiments to investigate the extent to which topic and sentiment separately contribute to that classification accuracy, and thereby shed some initial light on the question of whether topic and sentiment can be sensibly teased apart. © 2011 IEEE.

More Details
Results 1–50 of 69
Results 1–50 of 69