In recent years, infections and damage caused by malware have increased at exponential rates. At the same time, machine learning (ML) techniques have shown tremendous promise in many domains, often outperforming human efforts by learning from large amounts of data. Results in the open literature suggest that ML can provide similar results for malware detection, achieving greater than 99% classification accuracy [49]. However, these detection rates have not been achieved in deployed settings. Malware differs from many other domains in which ML has succeeded in that (1) it purposefully tries to hide, leading to noisy labels, and (2) its behavior is often similar to that of benign software, differing only in intent, among other complicating factors. This report details the reasons novel malware is difficult to detect with ML methods and offers solutions to improve its detection.
The goal of this work was to pioneer a novel, low-overhead protocol for simultaneously assaying cell-surface markers and intracellular gene expression in a single mammalian cell. The purpose of developing such a method is to understand the mechanisms by which pathogens engage with individual mammalian cells, depending on their cell-surface proteins, and how both host and pathogen gene expression changes reflect these mechanisms. The knowledge gained from such single-cell analyses will ultimately lead to more robust pathogen detection and countermeasures. Our method aimed to streamline both the upstream cell sample preparation, using microfluidic methods, and the library-making protocol itself. Specifically, we wanted to implement random hexamer-based reverse transcription of all RNA within a single cell (as opposed to oligo(dT)-based transcription, which would capture only polyadenylated transcripts), and then use a CRISPR-based method called scDash to deplete ribosomal cDNAs (since ribosomal RNAs make up the majority of the RNA in a mammalian cell). After significant troubleshooting, we demonstrate that we can prepare cDNA from RNA using the random hexamer primer and perform the rDNA depletion. We also show that we can visualize individually stained cells, setting up the pipeline for connecting surface markers to RNA-sequencing profiles. Finally, we test a number of devices for various parts of the pipeline, including bead generation, optical barcoding, and cell dispensing, and demonstrate that while some of these have potential, more work is needed to optimize this part of the pipeline.
This project was broadly motivated by the need for new hardware that can process information such as images and sounds at the point where the information is sensed (i.e., edge computing). It was further motivated by recent discoveries by our group demonstrating that while certain organic polymer blends can be used to fabricate elements of such hardware, the need to mix ionic and electronic conducting phases imposed limits on performance, dimensional scalability, and the degree of fundamental understanding of how such devices operate. As an alternative to blended polymers containing distinct ionic and electronic conducting phases, in this LDRD project we discovered that a family of mixed-valence coordination compounds called Prussian blue analogues (PBAs), with an open framework structure and the ability to conduct both ionic and electronic charge, can be used for inkjet-printed flexible artificial synapses that reversibly switch conductance by more than four orders of magnitude based on an electrochemically tunable oxidation state. Retention of programmed states is improved by nearly two orders of magnitude compared to the extensively studied organic polymers, enabling in-memory compute and avoiding energy-costly off-chip access during training. We demonstrate dopamine detection using PBA synapses and biocompatibility with living neurons, evoking prospective applications in brain-computer interfacing. By applying electron transfer theory to in-situ spectroscopic probing of intervalence charge transfer, we elucidate a switching mechanism whereby the degree of mixed valency between N-coordinated Ru sites controls the carrier concentration and mobility, as supported by density functional theory (DFT).
This report describes research conducted to use data science and machine learning methods to distinguish targeted genome editing from natural mutation and sequencer machine noise. Genome editing capabilities have existed for more than 20 years, and the efficiencies of these techniques have improved dramatically in the last 5+ years, notably with the rise of CRISPR-Cas technology. Whether a specific genome has been the target of an edit is a concern for U.S. national security. The research detailed in this report provides first steps to address this concern. Because our research requires a large amount of data, we invested considerable time collecting and processing it. We use an ensemble of decision tree and deep neural network machine learning methods, as well as anomaly detection, to detect genome edits given either whole-exome or whole-genome DNA reads. The edit detection results obtained when our algorithms were tested against samples held out during training are significantly better than random guessing, achieving high F1, recall, and precision scores overall.
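For readers unfamiliar with the evaluation metrics named above, the following sketch computes precision, recall, and F1 for a hypothetical edit/no-edit classifier. This is a generic illustration of the metrics' definitions, not the report's actual evaluation code, and the function and variable names are illustrative.

```python
# Illustrative sketch: precision, recall, and F1 for binary edit detection.
# Label convention (assumed here): 1 = edited sample, 0 = unedited sample.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted edits, how many are real
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of real edits, how many are found
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

High recall matters most when missed edits are costly, while high precision limits false alarms; F1 balances the two as their harmonic mean.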
Operon prediction in prokaryotes is critical not only for understanding the regulation of endogenous gene expression, but also for exogenous targeting of genes using newly developed tools such as CRISPR-based gene modulation. A number of methods have used transcriptomics data to predict operons, based on the premise that contiguous genes in an operon will be expressed at similar levels. While promising results have been observed using these methods, most of them do not address uncertainty caused by technical variability between experiments, which is especially relevant when the amount of data available is small. In addition, many existing methods do not provide the flexibility to determine the stringency with which genes should be evaluated for being in an operon pair. We present OperonSEQer, a set of machine learning algorithms that uses the test statistic and p-value from a non-parametric analysis of variance (Kruskal-Wallis) test to determine the likelihood that two adjacent genes are expressed from the same RNA molecule. We implement a voting system that allows users to choose the stringency of operon calls depending on whether their priority is high recall or high specificity. In addition, we provide the code so that users can retrain the algorithm and re-establish hyperparameters based on any data they choose, allowing the method to be expanded as additional data are generated. We show that our approach detects operon pairs that are missed by current methods by comparing our predictions to publicly available long-read sequencing data. OperonSEQer therefore improves on existing methods in terms of accuracy, flexibility, and adaptability.
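The two key ideas above, a Kruskal-Wallis comparison of adjacent genes' expression and a stringency-controlling vote over multiple classifiers, can be sketched as follows. This is a minimal illustration under assumed inputs; the function names, argument names, and feature set are hypothetical and do not reflect OperonSEQer's actual interface.

```python
# Sketch (illustrative names): Kruskal-Wallis features for an adjacent gene
# pair, plus a simple stringency-tunable vote over ensemble predictions.
from scipy.stats import kruskal

def kw_features(gene_a_expr, gene_b_expr):
    """Compare expression values (e.g., read coverage across experiments)
    for two adjacent genes with the non-parametric Kruskal-Wallis test.
    A small H statistic (large p-value) means the distributions are similar,
    consistent with the genes being transcribed as one operon."""
    stat, pval = kruskal(gene_a_expr, gene_b_expr)
    return stat, pval

def call_operon_pair(votes, threshold):
    """votes: list of 0/1 predictions from the ensemble's classifiers.
    A higher threshold requires more agreement, favoring specificity;
    a lower threshold favors recall."""
    return sum(votes) >= threshold
```

In a full pipeline, the statistic and p-value would feed each trained classifier as features, and the user-chosen threshold would set how many of the classifiers must agree before an operon pair is called.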
Previous strain development efforts for cyanobacteria have failed to achieve the necessary productivities needed to support economic biofuel production. We proposed to develop CRISPR Engineering for Rapid Enhancement of Strains (CERES). We developed genetic and computational tools to enable future high-throughput screening of CRISPR interference (CRISPRi) libraries in the cyanobacterium Synechococcus sp. PCC 7002, including: (1) OperonSEQer: an ensemble of algorithms for predicting operon pairs using RNA-seq data, (2) experimental characterization and machine learning prediction of gRNA design rules for CRISPRi, and (3) a shuttle vector for gene expression. These tools lay the foundation for CRISPR library screening to develop cyanobacterial strains that are optimized for growth or metabolite production under a wide range of environmental conditions. The optimization of cyanobacterial strains will directly advance U.S. energy and climate security by enabling domestic biofuel production while simultaneously mitigating atmospheric greenhouse gases through photoautotrophic fixation of carbon dioxide.
Genome editing technologies, particularly those based on zinc-finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs), and CRISPR (clustered regularly interspaced short palindromic repeats)/Cas9, are rapidly progressing into clinical trials. Most clinical use of CRISPR to date has focused on ex vivo gene editing of cells followed by their reintroduction into the patient. The ex vivo editing approach is highly effective for many disease states, including cancers and sickle cell disease, but ideally genome editing would also be applied to diseases that require cell modification in vivo. However, in vivo use of CRISPR technologies can be confounded by problems such as off-target editing, inefficient or off-target delivery, and stimulation of counterproductive immune responses. Current research addressing these issues may provide new opportunities for use of CRISPR in the clinical space. In this review, we examine the current status and scientific basis of clinical trials featuring ZFNs, TALENs, and CRISPR-based genome editing, the known limitations of CRISPR use in humans, and the rapidly developing CRISPR engineering space that should lay the groundwork for further translation to clinical application.
We report a prototype system to automate the DNA library preparation of bacterial genomes for analysis with the Oxford MinION nanopore sequencer as a first step toward a universal bacterial pathogen identification and biosurveillance tool. The ASPIRE (Automated Sample Preparation by Indexed Rotary Exchange) platform incorporates a rotary hydrophobic substrate that provides sequential delivery of sample and reagent droplets to heater and magnetic bead trapping modules via a single capillary coupled to a syringe pump. We have applied ASPIRE-based library preparation to lambda-phage and E. coli genomic DNA (gDNA) and verified its ability to produce libraries with DNA yield, and with final sequenced-read size distribution, quality, and reference-mapping percentages, comparable to those obtained with benchtop preparation methods.