Publications

72 Results
Evaluating causal-based feature selection for fuel property prediction models

Statistical Analysis and Data Mining

Nguyen, Bernard; Whitmore, Leanne S.; George, Anthe G.; Hudson, Corey H.

In-silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench-scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as input features, but most of these features are largely irrelevant and training models on datasets with higher dimensionality than size tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation-based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits causal-based feature selection might have on both model performance and identification of key molecular substructures. We found that causal-based feature selection performed on par with alternative filtration methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.

Benchmarking blockchain-based gene-drug interaction data sharing methods: A case study from the iDASH 2019 secure genome analysis competition blockchain track

International Journal of Medical Informatics

Kuo, Tsung T.; Bath, Tyler; Ma, Shuaicheng; Pattengale, Nicholas D.; Yang, Meng; Cao, Yang; Hudson, Corey H.; Kim, Jihoon; Post, Kai; Xiong, Li; Ohno-Machado, Lucila

Background: Blockchain distributed ledger technology is just starting to be adopted in genomics and healthcare applications. Despite its increased prevalence in biomedical research applications, skepticism regarding the practicality of blockchain technology for real-world problems is still strong and there are few implementations beyond proof-of-concept. We focus on benchmarking blockchain strategies applied to distributed methods for sharing records of gene-drug interactions. We expect this type of sharing will expedite personalized medicine. Basic Procedures: We generated gene-drug interaction test datasets using the Clinical Pharmacogenetics Implementation Consortium (CPIC) resource. We developed three blockchain-based methods to share patient records on gene-drug interactions: Query Index, Index Everything, and Dual-Scenario Indexing. Main Findings: We achieved a runtime of about 60 s for importing 4,000 gene-drug interaction records from four sites, and about 0.5 s for a data retrieval query. Our results demonstrated that it is feasible to leverage blockchain as a new platform to share data among institutions. Principal Conclusions: We show the benchmarking results of novel blockchain-based methods for institutions to share patient outcomes related to gene-drug interactions. Our findings support blockchain utilization in healthcare, genomic and biomedical applications. The source code is publicly available at https://github.com/tsungtingkuo/genedrug.

RetSynth: Determining all optimal and sub-optimal synthetic pathways that facilitate synthesis of target compounds in chassis organisms

BMC Bioinformatics

Whitmore, Leanne S.; Nguyen, Bernard; Pinar, Ali P.; George, Anthe G.; Hudson, Corey H.

Background: The efficient biological production of industrially and economically important compounds is a challenging problem. Brute-force determination of the optimal pathways to efficient production of a target chemical in a chassis organism is computationally intractable. Many current methods provide a single solution to this problem, but fail to provide all optimal pathways, optional sub-optimal solutions or hybrid biological/non-biological solutions. Results: Here we present RetSynth, software with a novel algorithm for determining all optimal biological pathways given a starting biological chassis and target chemical. By dynamically selecting constraints, the number of potential pathways scales by the number of fully independent pathways and not by the number of overall reactions or size of the metabolic network. This feature allows all optimal pathways to be determined for a large number of chemicals and for a large corpus of potential chassis organisms. Additionally, this software contains other features including the ability to collect data from metabolic repositories, perform flux balance analysis, and to view optimal pathways identified by our algorithm using a built-in visualization module. This software also identifies sub-optimal pathways and allows incorporation of non-biological chemical reactions, which may be performed after metabolic production of precursor molecules. Conclusions: The novel algorithm designed for RetSynth streamlines an arduous and complex process in metabolic engineering. Our stand-alone software allows the identification of candidate optimal and additional sub-optimal pathways, and provides the user with necessary ranking criteria such as target yield to decide which route to select for target production. Furthermore, the ability to incorporate non-biological reactions into the final steps allows determination of pathways to production for targets that cannot be solely produced biologically. With this comprehensive suite of features RetSynth exceeds any open-source software or webservice currently available for identifying optimal pathways for target production.

Exploiting Time and Subject Locality for Fast, Efficient, and Understandable Alert Triage

2018 International Conference on Computing, Networking and Communications, ICNC 2018

Kavaler, David; Hudson, Corey H.; Bierma, Michael B.

In many organizations, intrusion detection and other related systems are tuned to generate security alerts, which are then manually inspected by cyber-security analysts. These analysts often devote a large portion of time to inspecting these alerts, most of which are innocuous. Thus, it would be greatly beneficial to reduce the number of innocuous alerts, allowing analysts to utilize their time and skills for other aspects of cyber defense. In this work, we devise several simple, fast, and easily understood models to cut back this manual inspection workload, while maintaining high true positive and true negative rates. We demonstrate their effectiveness on real data, and discuss their potential utility in application by others.

Experimental single-strain mobilomics reveals events that shape pathogen emergence

Nucleic Acids Research

Schoeniger, Joseph S.; Hudson, Corey H.; Bent, Zachary W.; Sinha, Anupama S.; Williams, Kelly P.

Virulence genes on mobile DNAs such as genomic islands (GIs) and plasmids promote bacterial pathogen emergence. Excision is an early step in GI mobilization, producing a circular GI and a deletion site in the chromosome; circular forms are also known for some bacterial insertion sequences (ISs). The recombinant sequence at the junctions of such circles and deletions can be detected sensitively in high-throughput sequencing data, using new computational methods that enable empirical discovery of mobile DNAs. For the rich mobilome of a hospital Klebsiella pneumoniae strain, circularization junctions (CJs) were detected for six GIs and seven IS types. Our methods revealed differential biology of multiple mobile DNAs, imprecision of integrases and transposases, and differential activity among identical IS copies for IS26, ISKpn18 and ISKpn21. Using the resistance of circular dsDNA molecules to exonuclease, internally calibrated with the native plasmids, showed that not all molecules bearing GI CJs were circular. Transpositions were also detected, revealing replicon preference (ISKpn18 prefers a conjugative IncA/C2 plasmid), local action (IS26), regional preferences, selection (against capsule synthesis) and IS polarity inversion. Efficient discovery and global characterization of numerous mobile elements per experiment improves accounting for the new gene combinations that arise in emerging pathogens.

The tmRNA website

Nucleic Acids Research

Hudson, Corey H.; Williams, Kelly P.

The transfer-messenger RNA (tmRNA) and its partner protein SmpB act together in resolving problems arising when translating bacterial ribosomes reach the end of mRNA with no stop codon. Their genes have been found in nearly all bacterial genomes and in some organelles. The tmRNA Website serves tmRNA sequences, alignments and feature annotations, and has recently moved to http://bioinformatics.sandia.gov/tmrna/. New features include software used to find the sequences, an update raising the number of unique tmRNA sequences from 492 to 1716, and a database of SmpB sequences which are served along with the tmRNA sequence from the same organism.

RNAcentral: an international database of ncRNA sequences

Nucleic Acids Research

Williams, Kelly P.; Hudson, Corey H.; and 34 other authors

The field of non-coding RNA biology has been hampered by the lack of availability of a comprehensive, up-to-date collection of accessioned RNA sequences. Here we present the first release of RNAcentral, a database that collates and integrates information from an international consortium of established RNA sequence databases. The initial release contains over 8.1 million sequences, including representatives of all major functional classes. A web portal (http://rnacentral.org) provides free access to data, search functionality, cross-references, source code and an integrated genome browser for selected species.

Ends of the line for tmRNA-SmpB

Frontiers in Microbiology

Hudson, Corey H.; Williams, Kelly P.

Genes for the RNA tmRNA and protein SmpB, partners in the trans-translation process that rescues stalled ribosomes, have previously been found in all bacteria and some organelles. We validate recent identification of tmRNA homologs in oomycete mitochondria by finding partner genes from oomycete nuclei that target SmpB to the mitochondrion. Exhaustive search now identifies a small number of complete, often highly derived, bacterial genomes that appear to lack a functional copy of one or the other partner gene (but not both). Three groups with reduced genomes have lost the central loop of SmpB, which is thought to improve alanylation and EF-Tu activation: Carsonella, Hodgkinia and the hemoplasmas (hemotropic Mycoplasma). Carsonella has also lost the SmpB C-terminal tail, thought to stimulate the decoding center of the ribosome. Carsonella moreover exhibits gene overlap such that tmRNA maturation should produce a non-stop smpB mRNA, and one isolate exhibits complete degradation of the tmRNA gene yet its smpB shows no evidence for relaxed selective constraint. After loss of the SmpB central loop in the hemoplasmas, a subclade apparently lost tmRNA. At least some of the tmRNA/SmpB-deficient strains appear to further lack the ArfA and ArfB backup systems for ribosome rescue. The most frequent neighbors of smpB are the tmRNA gene, a ratA/rnfH unit, and the gene for RNaseR, a known physical and functional partner of tmRNA-SmpB. The tmRNA Website has moved and been updated, adding an SmpB sequence database (http://bioinformatics.sandia.gov/tmrna).

Understanding and regulation of microbial lignolysis for renewable platform chemicals

Turner, Kevin T.; Hudson, Corey H.; Tran-Gyamfi, Mary B.; Powell, Amy J.; Williams, Kelly P.

Lignin is often overlooked in the valorization of lignocellulosic biomass, but lignin-based materials and chemicals represent potential value-added products that could significantly improve the economics of a biorefinery. Fluctuating crude oil prices and changing fuel specifications are some of the factors driving the development of new technologies that could convert polymeric lignin into low-molecular-weight lignin and/or monomeric aromatic feedstocks, helping to displace the current products associated with the conversion of a whole barrel of oil. Our project aimed to understand microbial and enzymatic lignolysis processes that break down lignin for conversion into commercially viable drop-in fuels. We developed novel lignin analytics to interrogate enzymatic and microbial lignolysis of native polymeric lignin and established a detailed understanding of lignolysis as a function of fungal enzymes, microbes and endophytes. A bioinformatics pipeline was developed for metatranscriptomic analysis of an aridland ecosystem to investigate the potential discovery of new lignolysis genes and gene products.
