In recent years, infections and damage caused by malware have increased at exponential rates. At the same time, machine learning (ML) techniques have shown tremendous promise in many domains, often outperforming human efforts by learning from large amounts of data. Results in the open literature suggest that ML can provide similar results for malware detection, achieving greater than 99% classification accuracy [49]. However, those detection rates have not been achieved in deployed settings. Malware is distinct from many other domains in which ML has shown success in that (1) it purposefully tries to hide, leading to noisy labels, and (2) its behavior is often similar to that of benign software, differing only in intent, among other complicating factors. This report details the reasons why novel malware is difficult to detect with ML methods and offers solutions to improve its detection.
Machine learning (ML) techniques are being used to detect increasing amounts of malware and malware variants. Despite successful applications of ML, we hypothesize that the full potential of ML is not realized in malware analysis (MA) due to a semantic gap between the ML and MA communities, as demonstrated in the data that is used. Due in part to the available data, ML has primarily focused on detection, whereas MA is also interested in identifying behaviors. We review existing open-source malware datasets used in ML and find a lack of behavioral information that could facilitate stronger impact by ML in MA. As a first step in bridging this gap, we label existing data with behavioral information drawn from open-source MA reports: (1) shifting the analysis from identifying malware to identifying behaviors, (2) aligning ML better with MA, and (3) allowing ML models to generalize to novel malware in a zero/few-shot learning manner. Using transfer learning from a state-of-the-art model for malware family classification, we classify the behavior of a malware family not seen during training and achieve 57%-84% accuracy on behavioral identification, but we fail to outperform the baseline set by a majority-class predictor. This highlights opportunities for improvement on this task related to the data representation, the need for malware-specific ML techniques, and a larger training set of malware samples labeled with behaviors.
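
The transfer-learning setup summarized above can be illustrated with a short sketch. The snippet below is a minimal PyTorch example, not the report's actual implementation: it freezes the feature extractor of a (stand-in) pretrained malware family classifier and trains a new multi-label head to predict behaviors; the architecture, feature dimensions, behavior label count, and toy data are illustrative assumptions.

# Minimal sketch (not the report's implementation): adapting a pretrained
# malware *family* classifier to multi-label *behavior* prediction via
# transfer learning. Dimensions and label counts are assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 256          # assumed embedding size of the pretrained model
NUM_FAMILIES = 20       # assumed number of families in the original task
NUM_BEHAVIORS = 11      # assumed number of behavior labels from MA reports

class FamilyClassifier(nn.Module):
    """Stand-in for a pretrained family classifier: features + family head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, FEAT_DIM), nn.ReLU(),
        )
        self.family_head = nn.Linear(FEAT_DIM, NUM_FAMILIES)

    def forward(self, x):
        return self.family_head(self.features(x))

pretrained = FamilyClassifier()  # in practice, load trained weights here

# Transfer learning: freeze the feature extractor, attach a new multi-label
# behavior head, and train only the head on behavior-labeled samples.
for p in pretrained.features.parameters():
    p.requires_grad = False

behavior_head = nn.Linear(FEAT_DIM, NUM_BEHAVIORS)
model = nn.Sequential(pretrained.features, behavior_head)

criterion = nn.BCEWithLogitsLoss()                      # multi-label behaviors
optimizer = torch.optim.Adam(behavior_head.parameters(), lr=1e-3)

# Toy training batch: static feature vectors x and binary behavior labels y.
x = torch.randn(32, 1024)
y = (torch.rand(32, NUM_BEHAVIORS) > 0.5).float()

for _ in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Evaluating on a family held out of training approximates the zero/few-shot
# setting: the model must generalize behavior labels to novel malware.
with torch.no_grad():
    preds = (torch.sigmoid(model(x)) > 0.5).float()

In this sketch only the behavior head is updated, which mirrors the idea of reusing representations learned for family classification; comparing its predictions against a majority-class baseline would reproduce the evaluation described above.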
In recognition of its mission and in response to continuously evolving cyber threats against nuclear facilities, the Department of Energy - Nuclear Energy (DOE-NE) is building the Nuclear Energy Cybersecurity Research, Development, and Demonstration (RD&D) Program, which includes a cyber risk management thrust. This report supports the cyber risk management thrust objective, which is to deliver "Standardized methodologies for credible risk-based identification, evaluation and prioritization of digital components." In a previous task, the Sandia National Laboratories (SNL) team presented evaluation criteria and a survey to review methods to determine the most suitable techniques [1]. In this task, we identify and evaluate a series of candidate methodologies; 10 distinct methodologies are evaluated in this report. The overall goal of this effort was to identify the range of risk analysis techniques currently available and how they could be applied, with a focus on industrial control systems (ICS). Overall, most of the techniques identified fell within accepted risk analysis practices, though they generally addressed only one step of the multi-step risk management process. A few addressed multiple steps, but their treatment was generally superficial. This study revealed that the current state of security risk analysis in digital control systems is not comprehensive and does not support a science-based evaluation. The papers surveyed did use mathematical formulations to describe the problems they addressed and tied the models to some kind of experimental or experiential evidence as support. Most of the papers, however, did not use a rigorous approach to experimentally support the proposed models, nor did they have enough evidence supporting the efficacy of the models to statistically analyze model impact. Both of these issues stem from the difficulty and expense associated with collecting experimental data in this domain.