Publications / SAND Report

MalGen: Malware Generation with Specific Behaviors to Improve Machine Learning-based Detectors

Smith, Michael R.; Carbajal, Armida J.; Domschot, Eva D.; Johnson, Nicholas T.; Goyal, Akul A.; Lamb, Christopher L.; Lubars, Joseph L.; Kegelmeyer, William P.; Krishnakumar, Raga K.; Quynn, Sophie Q.; Ramyaa, Ramyaa; Verzi, Stephen J.; Zhou, Xin Z.

In recent years, infections and damage caused by malware have increased at exponential rates. At the same time, machine learning (ML) techniques have shown tremendous promise in many domains, often outperforming human efforts by learning from large amounts of data. Results in the open literature suggest that ML can provide similar results for malware detection, achieving greater than 99% classification accuracy [49]. However, the same detection rates have not been achieved in deployed settings. Among other complicating factors, malware is distinct from many other domains in which ML has shown success in that (1) it purposefully tries to hide, leading to noisy labels, and (2) its behavior is often similar to that of benign software, differing only in intent. This report details the reasons for the difficulty of detecting novel malware with ML methods and offers solutions to improve the detection of novel malware.