Publications

8 Results
Skip to search filters

Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Malware Analysis

AISec 2020 - Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security

Smith, Michael R.; Johnson, Nicholas T.; Ingram, Joey; Carbajal, Armida J.; Haus, Bridget I.; Domschot, Eva; Ramyaa, Ramyaa; Lamb, Christopher L.; Verzi, Stephen J.; Kegelmeyer, William P.

Machine learning (ML) techniques are being used to detect increasing amounts of malware and variants. Despite successful applications of ML, we hypothesize that the full potential of ML is not realized in malware analysis (MA) due to a semantic gap between the ML and MA communities-as demonstrated in the data that is used. Due in part to the available data, ML has primarily focused on detection whereas MA is also interested in identifying behaviors. We review existing open-source malware datasets used in ML and find a lack of behavioral information that could facilitate stronger impact by ML in MA. As a first step in bridging this gap, we label existing data with behavioral information using open-source MA reports-1) altering the analysis from identifying malware to identifying behaviors, 2)~aligning ML better with MA, and 3)~allowing ML models to generalize to novel malware in a zero/few-shot learning manner. We classify the behavior of a malware family not seen during training using transfer learning from a state-of-the-art model for malware family classification and achieve 57%-84% accuracy on behavioral identification but fail to outperform the baseline set by a majority class predictor. This highlights opportunities for improvement on this task related to the data representation, the need for malware specific ML techniques, and a larger training set of malware samples labeled with behaviors.

More Details

Detailed Statistical Models of Host-Based Data for Detection of Malicious Activity

Acquesta, Erin A.; Chen, Guenevere C.; Adams, Susan S.; Bryant, Ross D.; Haas, Jason J.; Johnson, Nicholas T.; Romanowich, Paul R.; Roy, Krishna C.; Shakamuri, Mayuri S.; Ting, Christina T.

The cybersecurity research community has focused primarily on the analysis and automation of intrusion detection systems by examining network traffic behaviors. Expanding on this expertise, advanced cyber defense analysis is turning to host-based data to use in research and development to produce the next generation network defense tools. The ability to perform deep packet inspection of network traffic is increasingly harder with most boundary network traffic moving to HTTPS. Additionally, network data alone does not provide a full picture of end-to-end activity. These are some of the reasons that necessitate looking at other data sources such as host data. We outline our investigation into the processing, formatting, and storing of the data along with the preliminary results from our exploratory data analysis. In writing this report, it is our goal to aid in guiding future research by providing foundational understanding for an area of cybersecurity that is rich with a variety of complex, categorical, and sparse data, with a strong human influence component. Including suggestions for guiding potential directions for future research.

More Details
8 Results
8 Results