A novel hyperspectral fluorescence microscope for high-resolution 3D optical sectioning of cells and other structures has been designed, constructed, and used to investigate a number of different problems. We have significantly extended new multivariate curve resolution (MCR) data analysis methods to deconvolve the hyperspectral image data and to rapidly extract quantitative 3D concentration distribution maps of all emitting species. The imaging system has many advantages over current confocal imaging systems including simultaneous monitoring of numerous highly overlapped fluorophores, immunity to autofluorescence or impurity fluorescence, enhanced sensitivity, and dramatically improved accuracy, reliability, and dynamic range. Efficient data compression in the spectral dimension has allowed personal computers to perform quantitative analysis of hyperspectral images of large size without loss of image quality. We have also developed and tested software to perform analysis of time resolved hyperspectral images using trilinear multivariate analysis methods. The new imaging system is an enabling technology for numerous applications including (1) 3D composition mapping analysis of multicomponent processes occurring during host-pathogen interactions, (2) monitoring microfluidic processes, (3) imaging of molecular motors and (4) understanding photosynthetic processes in wild type and mutant Synechocystis cyanobacteria.
We have developed a new, high performance, hyperspectral microscope for biological and other applications. For each voxel within a three-dimensional specimen, the microscope simultaneously records the emission spectrum from 500 nm to 800 nm, with better than 3 nm spectral resolution. The microscope features a fully confocal design to ensure high spatial resolution and high quality optical sectioning. Optical throughput and detection efficiency are maximized through the use of a custom prism spectrometer and a backside thinned electron multiplying charge coupled device (EMCCD) array. A custom readout mode and synchronization scheme enable 512-point spectra to be recorded at a rate of 8300 spectra per second. In addition, the EMCCD readout mode eliminates curvature and keystone artifacts that often plague spectral imaging systems. The architecture of the new microscope is described in detail, and hyperspectral images from several specimens are presented.
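As a rough check on the figures quoted above, the following sketch computes the mean spectral sampling interval and the time needed to record one image. The image dimensions, and the assumptions of one spectrum per voxel and no scan overhead, are illustrative and are not taken from the instrument description. Because a prism disperses nonlinearly, the per-point interval is only an average over the band.

```python
# Figures quoted in the abstract.
WAVELENGTH_MIN_NM = 500.0
WAVELENGTH_MAX_NM = 800.0
POINTS_PER_SPECTRUM = 512
SPECTRA_PER_SECOND = 8300.0


def mean_sampling_interval_nm():
    """Average wavelength step between adjacent spectral points."""
    return (WAVELENGTH_MAX_NM - WAVELENGTH_MIN_NM) / POINTS_PER_SPECTRUM


def acquisition_time_s(n_x, n_y, n_z=1):
    """Time to record n_x * n_y * n_z voxels.

    Assumes one spectrum per voxel and no stage or scan overhead
    (both are simplifying assumptions, not instrument specifications).
    """
    return n_x * n_y * n_z / SPECTRA_PER_SECOND


if __name__ == "__main__":
    print(f"mean sampling interval: {mean_sampling_interval_nm():.3f} nm")
    print(f"256 x 256 plane: {acquisition_time_s(256, 256):.1f} s")
```

Under these assumptions the mean sampling interval is about 0.59 nm per point, several times finer than the stated 3 nm resolution, and a hypothetical 256 x 256 single-plane image would take roughly 7.9 s to record.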
Hyperspectral imaging provides complex image data with spectral information from many fluorescent species contained within the sample, such as the fluorescent labels and cellular or pigment autofluorescence. To maximize the utility of this spectral imaging technique, it is necessary to couple hyperspectral imaging with sophisticated multivariate analysis methods to extract meaningful relationships from the overlapped spectra. Many commonly employed multivariate analysis techniques require either that the emission spectrum of each component be known or that the image contain pure-component pixels, conditions rarely met in biological samples. Multivariate curve resolution (MCR) has proven extremely useful for analyzing hyperspectral and multispectral images of biological specimens because it can operate with little or no a priori information about the emitting species, making it appropriate for interrogating samples containing autofluorescence and unanticipated contaminating fluorescence. To demonstrate the unique ability of our hyperspectral imaging system coupled with MCR analysis techniques, we will analyze hyperspectral images of four-color in-situ hybridized rat brain tissue containing 455 spectral pixels from 550 to 850 nm. Even though only four colors were imparted onto the tissue in this case, analysis revealed seven fluorescent species, including contributions from cellular autofluorescence and the tissue mounting media. Spectral image analysis will be presented along with a detailed discussion of the origin of the fluorescence and specific illustrations of the adverse effects of ignoring these additional fluorescent species, both in a traditional microscopy experiment and in a hyperspectral imaging system.
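The core idea of MCR can be illustrated with a minimal alternating least squares sketch, given here as a generic illustration and not as the authors' algorithm. It assumes the image has been unfolded into a matrix D (pixels x spectral channels) and models D as C S^T, with non-negative concentrations C and non-negative pure-component spectra S. Non-negativity is imposed by clipping an unconstrained least squares solution, which is a simple approximation to a rigorous non-negative least squares step; the function names, the random starting spectra, and the iteration count are all illustrative choices.

```python
import numpy as np


def _nonneg_lstsq(A, B):
    """Solve A @ X ~ B by least squares, then clip negatives to zero.

    Clipping only approximates a true non-negative least squares solve.
    """
    X = np.linalg.lstsq(A, B, rcond=None)[0]
    return np.clip(X, 0.0, None)


def mcr_als(D, n_components, n_iter=200, seed=0):
    """Factor D (pixels x channels) as C @ S.T with C >= 0 and S >= 0."""
    rng = np.random.default_rng(seed)
    n_pixels, n_channels = D.shape
    S = rng.random((n_channels, n_components))  # random non-negative start
    for _ in range(n_iter):
        # Concentration step: spectra fixed, solve for C.
        C = _nonneg_lstsq(S, D.T).T
        # Spectral step: concentrations fixed, solve for S.
        S = _nonneg_lstsq(C, D).T
        # Normalize each spectrum to unit length to fix the scale ambiguity.
        norms = np.linalg.norm(S, axis=0)
        S = S / np.where(norms > 0, norms, 1.0)
    # Final concentration step so that C matches the normalized spectra.
    C = _nonneg_lstsq(S, D.T).T
    return C, S
```

No reference spectra are supplied: only the number of components and the non-negativity constraint are given, which is why this family of methods suits samples with unanticipated emitters. The recovered factors are generally not unique, so additional constraints or careful initialization are usually needed in practice.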
Our aim is to determine the network of events, or the regulatory network, that defines an immune response to a bio-toxin. As a model system, we are studying the T cell regulatory network triggered through tyrosine kinase receptor activation using a combination of pathway stimulation and time-series microarray experiments. Our approach is composed of five steps: (1) microarray experiments and data error analysis, (2) data clustering, (3) data smoothing and discretization, (4) network reverse engineering, and (5) network dynamics analysis and fingerprint identification. The technological outcome of this study is a suite of experimental protocols and computational tools that reverse engineer regulatory networks from gene expression data. The practical biological outcome of this work is an immune response fingerprint in terms of gene expression levels. Inferring regulatory networks from microarray data is a new field of investigation that is no more than five years old. To the best of our knowledge, this work is the first attempt to integrate experiments, error analyses, data clustering, inference, and network analysis to solve a practical problem. Our systematic approach of counting, enumerating, and sampling networks that match experimental data is new to the field of network reverse engineering. The resulting mathematical analyses and computational tools lead to new results on their own and should be useful to others who analyze and infer networks.
While hyperspectral imaging systems are increasingly used in remote sensing and offer enhanced scene characterization relative to univariate and multispectral technologies, it has proven difficult in practice to extract all of the useful information from these systems due to overwhelming data volume, confounding atmospheric effects, and the limited a priori knowledge regarding the scene. The need exists for the ability to perform rapid and comprehensive data exploitation of remotely sensed hyperspectral imagery. To address this need, this paper describes the application of a fast and rigorous multivariate curve resolution (MCR) algorithm to remotely sensed thermal infrared hyperspectral images. Employing minimal a priori knowledge, notably non-negativity constraints on the extracted endmember profiles and a constant abundance constraint for the atmospheric upwelling component, it is demonstrated that MCR can successfully compensate thermal infrared hyperspectral images for atmospheric upwelling and, thereby, transmittance effects. We take a semi-synthetic approach to obtaining image data containing gas plumes by adding emission gas signals onto real hyperspectral images. MCR can accurately estimate the relative spectral absorption coefficients and thermal contrast distribution of an ammonia gas plume component added near the minimum detectable quantity.
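The constant abundance constraint mentioned above can be illustrated by fixing one column of the concentration matrix to a constant during the alternating least squares iterations. The sketch below is a generic illustration under stated assumptions and is not the algorithm of the paper: the data are assumed to be unfolded into a matrix D (pixels x channels) that already follows a linear mixing model, column 0 of C is held at 1 for every pixel, and the starting estimates (the per-channel minimum for the constant component, and the residual spectra of the most deviating pixels for the free components) are illustrative choices. Non-negativity is again imposed by clipping.

```python
import numpy as np


def mcr_als_constant(D, n_free, n_iter=100):
    """Fit D ~ C @ S.T with column 0 of C fixed at 1 for every pixel.

    D is pixels x channels; n_free is the number of unconstrained components.
    """
    n_pixels, _ = D.shape
    # Illustrative start: the constant component is the per-channel minimum
    # over pixels, and the free components are the residual spectra of the
    # pixels that deviate most from it.
    s_const = D.min(axis=0)
    R = D - s_const
    brightest = np.argsort(np.linalg.norm(R, axis=1))[::-1][:n_free]
    S = np.column_stack([s_const, R[brightest].T])
    C = np.ones((n_pixels, n_free + 1))
    for _ in range(n_iter):
        # Concentration step with the equality constraint: subtract the
        # fixed component, then solve only for the free columns.
        resid = D - np.outer(C[:, 0], S[:, 0])
        C_free = np.linalg.lstsq(S[:, 1:], resid.T, rcond=None)[0].T
        C[:, 1:] = np.clip(C_free, 0.0, None)
        # Spectral step: all spectra are re-estimated, with non-negativity.
        S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0.0, None)
    return C, S
```

Because the constrained column never changes, the spectrum paired with it absorbs whatever signal is common to every pixel, which is the role played by the upwelling component in the description above.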
A variety of multivariate calibration algorithms for quantitative spectral analyses were investigated and compared, and new algorithms were developed in the course of this Laboratory Directed Research and Development project. We were able to demonstrate the ability of the hybrid classical least squares/partial least squares (CLS/PLS) calibration algorithms to maintain calibrations in the presence of spectrometer drift and to transfer calibrations between spectrometers from the same or different manufacturers. These methods were found to be as good as or better than the commonly used partial least squares (PLS) method in prediction ability. We also present the theory for an entirely new class of algorithms labeled augmented classical least squares (ACLS) methods. New factor selection methods are developed and described for the ACLS algorithms. These factor selection methods are demonstrated using near-infrared spectra collected from a system of dilute aqueous solutions. The ACLS algorithm is also shown to provide improved ease of use and better prediction ability than PLS when transferring calibrations between near-infrared spectrometers from the same manufacturer. Finally, simulations incorporating either ideal or realistic errors in the spectra were used to compare the prediction abilities of the new ACLS algorithm with those of PLS. We found that in the presence of realistic errors with non-uniform spectral error variance across spectral channels or with spectral errors correlated between frequency channels, ACLS methods generally out-performed the more commonly used PLS method. These results demonstrate the need for realistic error structure in simulations when the prediction abilities of various algorithms are compared. The combination of equal or superior prediction ability and ease of use makes the new ACLS methods the preferred algorithms for multivariate spectral calibrations.
Molecular analysis of cancer, at the genomic level, could lead to individualized patient diagnostics and treatments. The developments to follow will signal a significant paradigm shift in the clinical management of human cancer. Despite our initial hopes, however, it seems that simple analysis of microarray data cannot elucidate clinically significant gene functions and mechanisms. Extracting biological information from microarray data requires a complicated path involving multidisciplinary teams of biomedical researchers, computer scientists, mathematicians, statisticians, and computational linguists. The integration of the diverse outputs of each team is the limiting factor in the progress to discover candidate genes and pathways associated with the molecular biology of cancer. Specifically, one must deal with sets of significant genes identified by each method and extract whatever useful information may be found by comparing these different gene lists. Here we present our experience with such comparisons, and share methods developed in the analysis of an infant leukemia cohort studied on Affymetrix HG-U95A arrays. In particular, spatial gene clustering, hyper-dimensional projections, and computational linguistics were used to compare different gene lists. In spatial gene clustering, different gene lists are grouped together and visualized on a three-dimensional expression map, where genes with similar expressions are co-located. In another approach, projections from gene expression space onto a sphere clarify how groups of genes can jointly have more predictive power than groups of individually selected genes. Finally, online literature is automatically rearranged to present information about genes common to multiple groups, or to contrast the differences between the lists. The combination of these methods has improved our understanding of infant leukemia. 
While the complicated reality of the biology dashed our initial, optimistic hopes for simple answers from microarrays, we have made progress by combining very different analytic approaches.
Multivariate curve resolution (MCR) using constrained alternating least squares algorithms represents a powerful analysis capability for the quantitative analysis of hyperspectral image data. We will demonstrate the application of MCR using data from a new hyperspectral fluorescence imaging microarray scanner for monitoring gene expression in cells from thousands of genes on the array. The new scanner collects the entire fluorescence spectrum from each pixel of the scanned microarray. Application of MCR with nonnegativity and equality constraints reveals several sources of undesired fluorescence that emit in the same wavelength range as the reporter fluorophores. MCR analysis of the hyperspectral images confirms that one of the sources of fluorescence is due to contaminant fluorescence under the printed DNA spots that is spot localized. Thus, traditional background subtraction methods used with data collected from the current commercial microarray scanners will lead to errors in determining the relative expression of low-expressed genes. With the new scanner and MCR analysis, we generate relative concentration maps of the background, impurity, and fluorescent labels over the entire image. Since the concentration maps of the fluorescent labels are relatively unaffected by the presence of background and impurity emissions, the accuracy and useful dynamic range of the gene expression data are both greatly improved over those obtained by commercial microarray scanners.
High throughput instruments and analysis techniques are required in order to make good use of the genomic sequences that have recently become available for many species, including humans. These instruments and methods must work with tens of thousands of genes simultaneously, and must be able to identify the small subsets of those genes that are implicated in the observed phenotypes, or, for instance, in responses to therapies. Microarrays represent one such high throughput method, which continues to find increasingly broad application. This project has improved microarray technology in several important areas. First, we developed the hyperspectral scanner, which has discovered and diagnosed numerous flaws in techniques broadly employed by microarray researchers. Second, we used a series of statistically designed experiments to identify and correct errors in our microarray data, dramatically improving the accuracy, precision, and repeatability of the microarray gene expression data. Third, our research developed new informatics techniques to identify genes with significantly different expression levels. Finally, natural language processing techniques were applied to improve our ability to make use of online literature annotating the important genes. In combination, this research has improved the reliability and precision of laboratory methods and instruments, while also enabling substantially faster analysis and discovery.
We describe the design, construction, and operation of a hyperspectral microarray scanner for functional genomic research. The hyperspectral instrument operates with spatial resolutions ranging from 3 to 30 µm and records the emission spectrum between 490 and 900 nm with a spectral resolution of 3 nm for each pixel of the microarray. This spectral information, when coupled with multivariate data analysis techniques, allows for identification and elimination of unwanted artifacts and greatly improves the accuracy of microarray experiments. Microarray results presented in this study clearly demonstrate the separation of fluorescent label emission from the spectrally overlapping emission due to the underlying glass substrate. We also demonstrate separation of the emission due to green fluorescent protein expressed by yeast cells from the spectrally overlapping autofluorescence of the yeast cells and the growth media.
A manuscript describing the work summarized below has been submitted to Applied Spectroscopy. Comparisons of prediction models from the new ACLS and PLS multivariate spectral analysis methods were conducted using simulated data with deviations from the idealized model. Simulated uncorrelated concentration errors, and uncorrelated and correlated spectral noise, were included to evaluate the methods in situations representative of experimental data. The simulations were based on pure spectral components derived from real near-infrared spectra of multicomponent dilute aqueous solutions containing glucose, urea, ethanol, and NaCl in the concentration range from 0-500 mg/dL. The statistical significance of differences was evaluated using the Wilcoxon signed rank test. The prediction abilities with nonlinearities present were similar for both calibration methods, although concentration noise, number of samples, and spectral noise distribution sometimes affected one method more than the other. In the case of ideal errors and in the presence of nonlinear spectral responses, the differences between the standard errors of prediction of the two methods were sometimes statistically significant, but the differences were always small in magnitude. Importantly, SRACLS was found to be competitive with PLS when component concentrations were only known for a single component. Thus, SRACLS has a distinct advantage over standard CLS methods that require that all spectral components be included in the model. In contrast to simulations with ideal error, SRACLS often generated models with superior prediction performance relative to PLS when the simulations were more realistic and included non-uniform errors, correlated errors, or both.
Since the generalized ACLS algorithm is compatible with the PACLS method that allows rapid updating of models during prediction, the powerful combination of PACLS with ACLS is very promising for rapidly maintaining and transferring models in the presence of system drift, spectrometer differences, and unmodeled components without the need for recalibration. The comparisons under different noise assumptions in the simulations obtained during this investigation emphasize the need to use realistic simulations when making comparisons between various multivariate calibration methods. Clearly, the conclusions about the relative performance of various methods were found to depend on how realistic the spectral errors were in the simulated data. Results demonstrating the simplicity and power of ACLS relative to PLS are presented in the following section.
The U.S. Department of Energy recently announced the first five grants for the Genomes to Life (GTL) Program. The goal of this program is to "achieve the most far-reaching of all biological goals: a fundamental, comprehensive, and systematic understanding of life." While more information about the program can be found at the GTL website (www.doegenomestolife.org), this paper provides an overview of one of the five GTL projects funded, "Carbon Sequestration in Synechococcus Sp.: From Molecular Machines to Hierarchical Modeling." This project is a combined experimental and computational effort emphasizing developing, prototyping, and applying new computational tools and methods to elucidate the biochemical mechanisms of carbon sequestration in Synechococcus Sp., an abundant marine cyanobacterium known to play an important role in the global carbon cycle. Understanding, predicting, and perhaps manipulating carbon fixation in the oceans has long been a major focus of biological oceanography and has more recently been of interest to a broader audience of scientists and policy makers. It is clear that the oceanic sinks and sources of CO2 are important terms in the global environmental response to anthropogenic atmospheric inputs of CO2 and that oceanic microorganisms play a key role in this response. However, the relationship between this global phenomenon and the biochemical mechanisms of carbon fixation in these microorganisms is poorly understood. The project includes five subprojects: an experimental investigation, three computational biology efforts, and a fifth that addresses computational infrastructure challenges of relevance to this project and the Genomes to Life program as a whole.
Our experimental effort is designed to provide biology and data to drive the computational efforts and includes significant investment in developing new experimental methods for uncovering protein partners, characterizing protein complexes, and identifying new binding domains. We will also develop and apply new data measurement and statistical methods for analyzing microarray experiments. Our computational efforts include coupling molecular simulation methods with knowledge discovery from diverse biological data sets for high-throughput discovery and characterization of protein-protein complexes, and developing a set of novel capabilities for inference of regulatory pathways in microbial genomes across multiple sources of information through the integration of computational and experimental technologies. These capabilities will be applied to Synechococcus regulatory pathways to characterize their interaction map and identify component proteins in these pathways. We will also investigate methods for combining experimental and computational results with visualization and natural language tools to accelerate discovery of regulatory pathways. Furthermore, given that the ultimate goal of this effort is to develop a systems-level understanding of how the Synechococcus genome affects carbon fixation at the global scale, we will develop and apply a set of tools for capturing the complex carbon fixation behavior of Synechococcus at different levels of resolution. Finally, because the explosion of data being produced by high-throughput experiments requires data analysis and models that are more computationally complex, more heterogeneous, and coupled to ever increasing amounts of experimentally obtained data in varying formats, we have also established a companion computational infrastructure to support this effort as well as the Genomes to Life program as a whole.
Hyperspectral Fourier transform infrared images have been obtained from a neoprene sample aged in air at elevated temperatures. The large number of spectra available from this heterogeneous sample provides the opportunity to perform quantitative analysis of the spectral data without the need for calibration standards. Multivariate curve resolution (MCR) methods with non-negativity constraints applied to the iterative alternating least squares analysis of the spectral data have been shown to achieve the goal of quantitative image analysis without the use of standards. However, the pure-component spectra and the relative concentration maps were heavily contaminated by the presence of system artifacts in the spectral data. We have demonstrated that the detrimental effects of these artifacts can be minimized by adding an estimate of the error covariance structure of the spectral image data to the MCR algorithm. The estimate is added by augmenting the concentration and pure-component spectra matrices with scores and eigenvectors obtained from the mean-centered repeat image differences of the sample. The augmentation is implemented by employing efficient equality constraints on the MCR analysis. Augmentation with the scores from the repeat images is found to primarily improve the pure-component spectral estimates, while augmentation with the corresponding eigenvectors primarily improves the concentration maps. Augmentation with both scores and eigenvectors yielded the best result by generating less noisy pure-component spectral estimates and relative concentration maps that were largely free from a striping artifact that is present due to system errors in the FT-IR images. The MCR methods presented are general and can also be applied productively to non-image spectral data.
Monitoring of dielectric thin-film production in the microelectronics industry is generally accomplished by depositing a representative film on a monitor wafer and determining the film properties off line. One of the most important dielectric thin films in the manufacture of integrated circuits is borophosphosilicate glass (BPSG). The critical properties of BPSG thin films are the boron content, phosphorus content, and film thickness. We have completed an experimental study that demonstrates that infrared emission spectroscopy coupled with multivariate analysis can be used to simultaneously determine these properties directly from the spectra of product wafers, thus eliminating the need to produce monitor wafers. In addition, infrared emission data can be used to simultaneously determine the film temperature, which is an important film production parameter. The infrared data required to make these determinations can be collected on a time scale that is much faster than the film deposition time; hence infrared emission is an ideal candidate for an in-situ process monitor for dielectric thin-film production.
A significant improvement to the classical least squares (CLS) multivariate analysis method has been developed. The new method, called prediction-augmented classical least squares (PACLS), removes the restriction for CLS that all interfering spectral species must be known and their concentrations included during the calibration. The authors demonstrate that PACLS can correct inadequate CLS models if spectral components left out of the calibration can be identified and if their spectral shapes can be derived and added during a PACLS prediction step. The new PACLS method is demonstrated for a system of dilute aqueous solutions containing urea, creatinine, and NaCl analytes with and without temperature variations. The authors demonstrate that if CLS calibrations are performed using only a single analyte's concentration, then there is little, if any, prediction ability. However, if pure-component spectra of analytes left out of the calibration are independently obtained and added during PACLS prediction, then the CLS prediction ability is corrected and predictions become comparable to those of a CLS calibration that contains all analyte concentrations. It is also demonstrated that constant-temperature CLS models can be used to predict variable-temperature data by employing the PACLS method augmented by the spectral shape of a temperature change of the water solvent. In this case, PACLS can also be used to predict sample temperature with a standard error of prediction of 0.07 °C even though the calibration data did not contain temperature variations. The PACLS method is also shown to be capable of modeling system drift to maintain a calibration in the presence of spectrometer drift.
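The prediction-augmentation idea can be illustrated with a minimal sketch, offered as an interpretation of the description above and not as the authors' implementation. It assumes the usual CLS model A = C K, where A holds the measured spectra (samples x channels), C the known concentrations, and K the pure-component spectra estimated by least squares. During prediction, independently obtained spectral shapes are appended to the rows of K before solving for the concentrations. The function and parameter names are hypothetical.

```python
import numpy as np


def cls_calibrate(A, C):
    """Estimate pure-component spectra K (components x channels) from A ~ C @ K."""
    return np.linalg.lstsq(C, A, rcond=None)[0]


def cls_predict(a, K, extra_shapes=None):
    """Predict concentrations for spectra a (one row per sample).

    With extra_shapes given (one spectral shape per row), the calibration
    spectra are augmented at prediction time, in the spirit of PACLS.
    Only the concentrations of the calibrated components are returned.
    """
    if extra_shapes is None:
        K_aug = K
    else:
        K_aug = np.vstack([K, np.atleast_2d(extra_shapes)])
    X = np.linalg.lstsq(K_aug.T, np.atleast_2d(a).T, rcond=None)[0].T
    return X[:, : K.shape[0]]
```

In a noise-free two-component example calibrated on one analyte only, the estimated spectrum for that analyte is contaminated by the omitted component, so plain CLS prediction is biased. Appending the omitted component's spectrum during prediction lets the fit assign that signal to the added shape, and the analyte's concentration is then recovered.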