Multilingual Sentiment Analysis Using Latent Semantic Indexing and Machine Learning
Proceedings - IEEE International Conference on Data Mining, ICDM
We present a novel approach to predicting the sentiment of documents in multiple languages, without translation. The only prerequisite is a multilingual parallel corpus wherein the documents of a training sample, in a single language only, have been tagged with their overall sentiment. Latent Semantic Indexing (LSI) converts that multilingual corpus into a multilingual "concept space". Both training and test documents can be projected into that space, allowing cross-lingual semantic comparisons between the documents without the need for translation. Accordingly, the training documents with known sentiment are used to build a machine learning model which can, because of the multilingual nature of the document projections, be used to predict sentiment in the other languages. We explain and evaluate the accuracy of this approach. We also design and conduct experiments to investigate the extent to which topic and sentiment separately contribute to that classification accuracy, and thereby shed some initial light on the question of whether topic and sentiment can be sensibly teased apart. © 2011 IEEE.
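A minimal sketch of this pipeline, assuming scikit-learn; the toy parallel corpus, labels, and model choices below are hypothetical stand-ins for the paper's actual setup:

```python
# Sketch of the LSI-based cross-lingual sentiment pipeline described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy parallel corpus: each training "document" concatenates its aligned
# translations so that the SVD learns one multilingual concept space.
parallel_docs = [
    "the movie was wonderful la pelicula fue maravillosa",
    "the movie was terrible la pelicula fue terrible",
]
labels = [1, 0]  # sentiment tags, available in one language only

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(parallel_docs)

# LSI: project term vectors into a low-rank multilingual concept space.
lsi = TruncatedSVD(n_components=2, random_state=0)
X_concept = lsi.fit_transform(X)

# Train on the tagged (single-language) projections ...
clf = LogisticRegression().fit(X_concept, labels)

# ... then predict sentiment for an untagged document in another language,
# projected into the same concept space without translation.
test = vectorizer.transform(["la pelicula fue maravillosa"])
print(clf.predict(lsi.transform(test)))
```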
This presentation will discuss two tensor decompositions that are not as well known as PARAFAC (parallel factors) and Tucker, but have proven useful in informatics applications. Three-way DEDICOM (decomposition into directional components) is an algebraic model for the analysis of 3-way arrays with nonsymmetric slices. PARAFAC2 is a related model that is less constrained than PARAFAC and allows for different objects in one mode. Applications of both models to informatics problems will be shown.
IEEE Aerospace Conference Proceedings
As the U.S. and the International Community come to grips with anthropogenic climate change, it will be necessary to develop accurate techniques with global span for remote measurement of emissions and uptake of greenhouse gases (GHGs), with special emphasis on carbon dioxide. Presently, techniques exist for in situ and local remote measurements. The first steps towards expansion of these techniques to span the world are only now being taken with the launch of satellites with the capability to accurately measure column abundances of selected GHGs, including carbon dioxide. These satellite sensors do not directly measure emissions and uptake. The satellite data, appropriately filtered and processed, provide only one necessary, but not sufficient, input for the determination of emission and uptake rates. Optimal filtering and processing is a challenge in itself. But these data must be further combined with output from data-assimilation models of atmospheric structure and flows in order to infer emission and uptake rates for relevant points and regions. In addition, it is likely that substantially more accurate determinations would be possible given the addition of data from a sparse network of in situ and/or upward-looking remote GHG sensors. We will present the most promising approaches we've found for combining satellite, in situ, fixed remote sensing, and other potentially available data with atmospheric data-assimilation and backward-dispersion models for the purpose of determination of point and regional GHG emission and uptake rates. We anticipate that the first application of these techniques will be to GHG management for the U.S. Subsequent application may be to confirmation of compliance of other nations with future international GHG agreements. ©2010 IEEE.
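A toy numerical sketch of the kind of inversion step alluded to above, assuming a known linear dispersion (footprint) operator; every matrix here is synthetic, not derived from any real assimilation model:

```python
# Inferring emission rates from column-abundance observations under a
# linear model y = H x + noise, via regularized (Bayesian) least squares.
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_obs = 5, 40
H = rng.random((n_obs, n_sources))                   # footprint/dispersion operator
x_true = np.array([2.0, 0.5, 1.0, 0.0, 3.0])         # unknown emission rates
y = H @ x_true + 0.01 * rng.standard_normal(n_obs)   # satellite observations

B_inv = np.eye(n_sources) / 10.0                     # weak prior on emissions
R_inv = np.eye(n_obs) / 0.01**2                      # observation-error weights
x_hat = np.linalg.solve(H.T @ R_inv @ H + B_inv, H.T @ R_inv @ y)
print(np.round(x_hat, 2))                            # close to x_true
```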
A standard and widespread approach to part-of-speech tagging is based on Hidden Markov Models (HMMs). An alternative approach, pioneered by Schütze (1993), induces parts of speech from scratch using singular value decomposition (SVD). We introduce DEDICOM as an alternative to SVD for part-of-speech induction. DEDICOM retains the advantages of SVD in that it is completely unsupervised: no prior knowledge is required to induce either the tagset or the associations of terms with tags. However, unlike SVD, it is also fully compatible with the HMM framework, in that it can be used to estimate emission- and transition-probability matrices which can then be used as the input for an HMM. We apply the DEDICOM method to the CoNLL corpus (CoNLL 2000) and compare the output of DEDICOM to the part-of-speech tags given in the corpus, and find that the correlation (almost 0.5) is quite high. Using DEDICOM, we also estimate part-of-speech ambiguity for each term, and find that these estimates correlate highly with part-of-speech ambiguity as measured in the original corpus (around 0.88). Finally, we show how the output of DEDICOM can be evaluated and compared against the more familiar output of supervised HMM-based tagging.
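The DEDICOM fit itself is beyond a short sketch, but the SVD-based induction idea the abstract starts from can be illustrated on a toy corpus; the cluster count and corpus are arbitrary choices, and scikit-learn's KMeans stands in for whatever clustering is actually used:

```python
# Schuetze-style unsupervised part-of-speech induction via SVD of a
# left/right-neighbor co-occurrence matrix, then clustering into tags.
import numpy as np
from sklearn.cluster import KMeans

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Context matrix: counts of each term's left and right neighbors.
C = np.zeros((len(vocab), 2 * len(vocab)))
for left, right in zip(corpus, corpus[1:]):
    C[idx[right], idx[left]] += 1                # left context
    C[idx[left], len(vocab) + idx[right]] += 1   # right context

# Low-rank context vectors via SVD; cluster terms into induced tags.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
context_vecs = U[:, :3] * s[:3]
tags = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(context_vecs)
print(dict(zip(vocab, tags)))
```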
Survey of Text Mining II: Clustering, Classification, and Retrieval
In this chapter, we apply a nonnegative tensor factorization algorithm to extract and detect meaningful discussions from electronic mail messages for a period of one year. For the publicly released Enron electronic mail collection, we encode a sparse term-author-month array for subsequent three-way factorization using the PARAllel FACtors (or PARAFAC) three-way decomposition first proposed by Harshman. Using nonnegative tensors, we preserve natural data nonnegativity and avoid subtractive basis vector and encoding interactions present in techniques such as principal component analysis. Results in thread detection and interpretation are discussed in the context of published Enron business practices and activities, and benchmarks addressing the computational complexity of our approach are provided. The resulting tensor factorizations can be used to produce Gantt-like charts that can be used to assess the duration, order, and dependencies of focused discussions against the progression of time. © 2008 Springer-Verlag London.
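A bare-bones nonnegative PARAFAC sketch using multiplicative updates on a small dense array; the chapter's algorithm and its large sparse term-author-month data are considerably more involved:

```python
# Nonnegative CP (PARAFAC) via multiplicative updates on a toy 3-way array.
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product: (m*n, r) from (m, r) and (n, r)."""
    r = P.shape[1]
    return np.einsum("ir,jr->ijr", P, Q).reshape(-1, r)

def unfold(T, mode):
    """Mode-n matricization (C-order)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def nonneg_parafac(T, rank, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    A, B, C = (rng.random((dim, rank)) for dim in T.shape)
    for _ in range(n_iter):
        A *= unfold(T, 0) @ khatri_rao(B, C) / (A @ ((B.T @ B) * (C.T @ C)) + eps)
        B *= unfold(T, 1) @ khatri_rao(A, C) / (B @ ((A.T @ A) * (C.T @ C)) + eps)
        C *= unfold(T, 2) @ khatri_rao(A, B) / (C @ ((A.T @ A) * (B.T @ B)) + eps)
    return A, B, C  # e.g. term, author, month factors, all nonnegative

T = np.random.default_rng(1).random((6, 4, 12))  # toy term-author-month array
A, B, C = nonneg_parafac(T, rank=2)
```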
Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams. © 2008 Licensed under the Creative Commons.
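A rough sketch of the underlying idea, assuming scikit-learn, with character n-grams standing in for morphemes (the paper finds that a true morpheme-like segmentation outperforms indiscriminate n-grams, so this is only the crudest approximation):

```python
# LSA over sub-word units rather than whole terms: vectorize documents by
# character n-grams, then project into a low-rank concept space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["kitab books", "kutub book", "maktaba library"]  # toy parallel snippets
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(docs)
concept_space = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
```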
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
In the relatively new field of visual analytics there is a great need for automated approaches to both verify and discover the intentions and schemes of primary actors through time. Data mining and knowledge discovery play critical roles in facilitating the ability to extract meaningful information from large and complex textual-based (digital) collections. In this study, we develop a mathematical strategy based on nonnegative tensor factorization (NTF) to extract and sequence important activities and specific events from sources such as news articles. The ability to automatically reconstruct a plot or confirm involvement in a questionable activity is greatly facilitated by our approach. As a variant of the PARAFAC multidimensional data model, we apply our NTF algorithm to the terrorism-based scenarios of the VAST 2007 Contest data set to demonstrate how term-by-entity associations can be used for scenario/plot discovery and evaluation. © 2008 Springer-Verlag Berlin Heidelberg.
Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Latent Semantic Analysis (LSA) is based on the Singular Value Decomposition (SVD) of a term-by-document matrix for identifying relationships among terms and documents from cooccurrence patterns. Among the multiple ways of computing the SVD of a rectangular matrix X, one approach is to compute the eigenvalue decomposition (EVD) of a square 2 × 2 block matrix with X and X^T in the off-diagonal blocks and zero matrices in the diagonal blocks. We point out that significant value can be added to LSA by filling in some of the values in the diagonal blocks (corresponding to explicit term-to-term or document-to-document associations) and computing a term-by-concept matrix from the EVD. For the case of multilingual LSA, we incorporate information on cross-language term alignments of the same sort used in Statistical Machine Translation (SMT). Since all elements of the proposed EVD-based approach can rely entirely on lexical statistics, hardly any price is paid for the improved empirical results. In particular, the approach, like LSA or SMT, can still be generalized to virtually any language(s); computation of the EVD takes similar resources to that of the SVD since all the blocks are sparse; and the results of EVD are just as economical as those of SVD. © 2008 Licensed under the Creative Commons.
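A small dense illustration of the block construction, assuming NumPy; the scaled identity in the term-term block is only a placeholder for the cross-language alignment statistics the paper actually uses:

```python
# Block-EVD construction B = [[D1, X], [X^T, D2]]: X is term-by-document,
# the diagonal blocks hold explicit term-term / document-document
# associations (zero blocks recover ordinary LSA).
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_docs, k = 8, 5, 3
X = rng.random((n_terms, n_docs))
D1 = 0.1 * np.eye(n_terms)        # placeholder term-to-term associations
D2 = np.zeros((n_docs, n_docs))   # no document-to-document associations

B = np.block([[D1, X], [X.T, D2]])
eigvals, eigvecs = np.linalg.eigh(B)          # EVD of the symmetric block matrix
order = np.argsort(eigvals)[::-1][:k]         # top-k concepts
term_by_concept = eigvecs[:n_terms, order]    # term rows of leading eigenvectors
```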
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
A standard approach to cross-language information retrieval (CLIR) uses Latent Semantic Analysis (LSA) in conjunction with a multilingual parallel aligned corpus. This approach has been shown to be successful in identifying similar documents across languages - or more precisely, retrieving the most similar document in one language to a query in another language. However, the approach has severe drawbacks when applied to a related task, that of clustering documents 'language independently', so that documents about similar topics end up closest to one another in the semantic space regardless of their language. The problem is that documents are generally more similar to other documents in the same language than they are to documents in a different language, but on the same topic. As a result, when using multilingual LSA, documents will in practice cluster by language, not by topic. We propose a novel application of PARAFAC2 (which is a variant of PARAFAC, a multi-way generalization of the singular value decomposition [SVD]) to overcome this problem. Instead of forming a single multilingual term-by-document matrix which, under LSA, is subjected to SVD, we form an irregular three-way array, each slice of which is a separate term-by-document matrix for a single language in the parallel corpus. The goal is to compute an SVD for each language such that V (the matrix of right singular vectors) is the same across all languages. Effectively, PARAFAC2 imposes the constraint, not present in standard LSA, that the 'concepts' in all documents in the parallel corpus are the same regardless of language. Intuitively, this constraint makes sense, since the whole purpose of using a parallel corpus is that exactly the same concepts are expressed in the translations. We tested this approach by comparing the performance of PARAFAC2 with standard LSA in solving a particular CLIR problem. From our results, we conclude that PARAFAC2 offers a very promising alternative to LSA not only for multilingual document clustering, but also for solving other problems in cross-language information retrieval. © 2007 ACM.
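A pure-NumPy sketch of PARAFAC2 direct fitting (an orthogonal-Procrustes step for the per-language bases plus one CP-ALS sweep per iteration), kept deliberately minimal; it is an illustration of the model, not the tuned implementation evaluated in the paper:

```python
# PARAFAC2 on irregular slices X_k ~= P_k @ F @ diag(C[k]) @ A.T, where the
# document factor A (the "V" of the abstract) is shared across languages.
import numpy as np

def khatri_rao(P, Q):
    r = P.shape[1]
    return np.einsum("ir,jr->ijr", P, Q).reshape(-1, r)

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def parafac2(slices, rank, n_iter=100):
    rng = np.random.default_rng(0)
    n_docs, K = slices[0].shape[1], len(slices)
    F = np.eye(rank)
    A = rng.random((n_docs, rank))   # shared right (document) factor
    C = np.ones((K, rank))           # per-language weights
    for _ in range(n_iter):
        # Orthogonal-Procrustes update of each per-language basis P_k.
        Ps = []
        for k, Xk in enumerate(slices):
            U, _, Wt = np.linalg.svd(Xk @ A @ np.diag(C[k]) @ F.T,
                                     full_matrices=False)
            Ps.append(U @ Wt)
        # Project slices to a regular rank x n_docs x K tensor and run one
        # CP-ALS sweep for F, A, C.
        Y = np.stack([P.T @ Xk for P, Xk in zip(Ps, slices)], axis=2)
        F = unfold(Y, 0) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        A = unfold(Y, 1) @ khatri_rao(F, C) @ np.linalg.pinv((F.T @ F) * (C.T @ C))
        C = unfold(Y, 2) @ khatri_rao(F, A) @ np.linalg.pinv((F.T @ F) * (A.T @ A))
    return Ps, F, C, A

# Toy corpus: two "languages" with different vocabularies, same 6 documents.
rng = np.random.default_rng(1)
slices = [rng.random((10, 6)), rng.random((14, 6))]
Ps, F, C, A = parafac2(slices, rank=2)
```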
SIAM Journal on Scientific Computing
This paper investigates the performance of tensor methods for solving small- to large-scale systems of nonlinear equations where the Jacobian matrix at the root is ill-conditioned or singular. This condition occurs on many classes of problems, such as identifying or approaching turning points in path-following problems. The singular case has been studied more than the highly ill-conditioned case, for both Newton and tensor methods. It is known that Newton-based methods do not work well with singular problems because they converge linearly to the solution and, in some cases, with poor accuracy. On the other hand, direct tensor methods have performed well on singular problems and have superlinear convergence on such problems under certain conditions. This behavior originates from the use of a special, restricted form of the second-order term included in the local tensor model that provides information lacking in a (nearly) singular Jacobian. With several implementations available for large-scale problems, tensor methods are now capable of solving larger problems. We compare the performance of tensor methods and Newton-based methods for small- to large-scale problems over a range of conditionings, from well-conditioned to ill-conditioned to singular. Previous studies with tensor methods concerned only the ends of this spectrum. Our results show that tensor methods are increasingly superior to Newton-based methods as the problem grows more ill-conditioned. © 2007 Society for Industrial and Applied Mathematics.
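A tiny demonstration of the linear-convergence claim for Newton's method at a singular root, using the scalar example f(x) = x², whose Jacobian (derivative) vanishes at the root:

```python
# Newton's method on f(x) = x**2: the iteration reduces to x_{k+1} = x_k / 2,
# so the error is halved each step (linear convergence) rather than squared.
f = lambda x: x**2
fprime = lambda x: 2 * x

x = 1.0
for _ in range(6):
    x = x - f(x) / fprime(x)   # Newton step
    print(f"error = {abs(x):.3e}")
```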
In this paper, the term tensor refers simply to a multidimensional or N-way array, and we consider how specially structured tensors allow for efficient storage and computation. First, we study sparse tensors, which have the property that the vast majority of the elements are zero. We propose storing sparse tensors using coordinate format and describe the computational efficiency of this scheme for various mathematical operations, including those typical to tensor decomposition algorithms. Second, we study factored tensors, which have the property that they can be assembled from more basic components. We consider two specific types: a Tucker tensor can be expressed as the product of a core tensor (which itself may be dense, sparse, or factored) and a matrix along each mode, and a Kruskal tensor can be expressed as the sum of rank-1 tensors. We are interested in the case where the storage of the components is less than the storage of the full tensor, and we demonstrate that many elementary operations can be computed using only the components. All of the efficiencies described in this paper are implemented in the Tensor Toolbox for MATLAB.
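A sketch of coordinate storage and one typical kernel, the tensor-times-vector contraction along a mode; the Tensor Toolbox's actual layout and API may differ:

```python
# Coordinate (COO) storage for a sparse tensor: keep only the nonzero
# coordinates and values, and compute contractions directly from them.
import numpy as np

class SparseTensor:
    def __init__(self, subs, vals, shape):
        self.subs = np.asarray(subs)   # (nnz, ndim) integer coordinates
        self.vals = np.asarray(vals)   # (nnz,) nonzero values
        self.shape = shape

    def ttv(self, v, mode):
        """Contract vector v along `mode`; returns a dense (ndim-1)-way array."""
        keep = [m for m in range(len(self.shape)) if m != mode]
        out = np.zeros(tuple(self.shape[m] for m in keep))
        contrib = self.vals * v[self.subs[:, mode]]
        np.add.at(out, tuple(self.subs[:, m] for m in keep), contrib)
        return out

# 3-way tensor with three nonzeros; contract a vector along mode 2.
T = SparseTensor(subs=[(0, 0, 1), (1, 2, 0), (1, 2, 1)],
                 vals=[1.0, 2.0, 3.0], shape=(2, 3, 2))
print(T.ttv(np.array([0.5, 0.25]), mode=2))
```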
DEDICOM is a linear algebra model for analyzing intrinsically asymmetric relationships, such as trade among nations or the exchange of emails among individuals. DEDICOM decomposes a complex pattern of observed relations among objects into a sum of simpler patterns of inferred relations among latent components of the objects. Three-way DEDICOM is a higher-order extension of the model that incorporates a third mode of the data, such as time, giving it stronger uniqueness properties and consequently enhancing interpretability of solutions. In this paper, we present algorithms for computing these decompositions on large, sparse data as well as a variant for computing an asymmetric nonnegative factorization. When we apply these techniques to adjacency arrays arising from directed graphs with edges labeled by time, we obtain a smaller graph on latent semantic dimensions and gain additional information about their changing relationships over time. We demonstrate these techniques on the Enron email corpus to learn about the social networks and their transient behavior. The mixture of roles assigned to individuals by DEDICOM showed strong correspondence with known job classifications and revealed the patterns of communication between these roles. Changes in the communication pattern over time, e.g., between top executives and the legal department, were also apparent in the solutions.
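A naive alternating least-squares sketch of the two-way model X ≈ A R A^T on a toy "who-emails-whom" matrix; this is a simplification for illustration only, not the ASALSAN algorithm in the paper, which handles the three-way, sparse, and nonnegative cases:

```python
# Naive two-way DEDICOM fit: exact LS update for R, and an A update that
# fixes A on one side at a time and averages the two solutions.
import numpy as np

def dedicom(X, rank, n_iter=200):
    rng = np.random.default_rng(0)
    A = rng.random((X.shape[0], rank))
    for _ in range(n_iter):
        pA = np.linalg.pinv(A)
        R = pA @ X @ pA.T                        # exact LS update for R
        A_left = X @ np.linalg.pinv(R @ A.T)     # min ||X - A_left (R A^T)||
        A_right = X.T @ np.linalg.pinv(A @ R).T  # min ||X - (A R) A_right^T||
        A = 0.5 * (A_left + A_right)
    return A, R

# Asymmetric toy exchange matrix (e.g., email counts between individuals).
X = np.array([[0., 5., 1.], [1., 0., 4.], [0., 2., 0.]])
A, R = dedicom(X, rank=2)
print(np.round(A @ R @ A.T, 1))  # approximate reconstruction of X
```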
Proceedings - IEEE International Conference on Data Mining, ICDM
Linear algebra is a powerful and proven tool in web search. Techniques such as the PageRank algorithm of Brin and Page and the HITS algorithm of Kleinberg score web pages based on the principal eigenvector (or singular vector) of a particular non-negative matrix that captures the hyperlink structure of the web graph. We propose and test a new methodology that uses multilinear algebra to elicit more information from a higher-order representation of the hyperlink graph. We start by labeling the edges in our graph with the anchor text of the hyperlinks so that the associated linear algebra representation is a sparse, three-way tensor. The first two dimensions of the tensor represent the web pages while the third dimension adds the anchor text. We then use the rank-1 factors of a multilinear PARAFAC tensor decomposition, which are akin to singular vectors of the SVD, to automatically identify topics in the collection along with the associated authoritative web pages. © 2005 IEEE.
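A sketch of the greedy rank-1 analogue (higher-order power iteration) on a toy page × page × anchor-term tensor; the paper computes a full PARAFAC decomposition, of which this extracts only a single factor:

```python
# One rank-1 factor of a 3-way adjacency tensor by alternating power
# iterations; u/v/w play roles akin to hub, authority, and topic vectors.
import numpy as np

rng = np.random.default_rng(0)
T = rng.random((6, 6, 4))   # toy tensor: source page, target page, anchor term

u, v, w = np.ones(6), np.ones(6), np.ones(4)
for _ in range(50):
    u = np.einsum("ijk,j,k->i", T, v, w); u /= np.linalg.norm(u)
    v = np.einsum("ijk,i,k->j", T, u, w); v /= np.linalg.norm(v)
    w = np.einsum("ijk,i,j->k", T, u, v); w /= np.linalg.norm(w)

print(np.round(v, 3))  # authority-like scores for the toy pages
```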
This report documents research to develop robust and efficient solution techniques for solving large-scale systems of nonlinear equations. The most widely used method for solving systems of nonlinear equations is Newton's method. While much research has been devoted to augmenting Newton-based solvers (usually with globalization techniques), little has been devoted to exploring the application of different models. Our research has been directed at evaluating techniques using different models than Newton's method: a lower order model, Broyden's method, and a higher order model, the tensor method. We have developed large-scale versions of each of these models and have demonstrated their use in important applications at Sandia. Broyden's method replaces the Jacobian with an approximation, allowing codes that cannot evaluate a Jacobian or have an inaccurate Jacobian to converge to a solution. Limited-memory methods, which have been successful in optimization, allow us to extend this approach to large-scale problems. We compare the robustness and efficiency of Newton's method, modified Newton's method, Jacobian-free Newton-Krylov method, and our limited-memory Broyden method. Comparisons are carried out for large-scale applications of fluid flow simulations and electronic circuit simulations. Results show that, in cases where the Jacobian was inaccurate or could not be computed, Broyden's method converged in some cases where Newton's method failed to converge. We identify conditions where Broyden's method can be more efficient than Newton's method. We also present modifications to a large-scale tensor method, originally proposed by Bouaricha, for greater efficiency, better robustness, and wider applicability. Tensor methods are an alternative to Newton-based methods and are based on computing a step from a local quadratic model rather than a linear model. The advantage of Bouaricha's method is that it can use any existing linear solver, which makes it simple to write and easily portable. However, the method usually takes twice as long to solve as Newton-GMRES on general problems because it solves two linear systems at each iteration. In this report, we discuss modifications to Bouaricha's method for a practical implementation, including a special globalization technique and other modifications for greater efficiency. We present numerical results showing computational advantages over Newton-GMRES on some realistic problems. We further discuss a new approach for dealing with singular (or ill-conditioned) matrices. In particular, we modify an algorithm for identifying a turning point so that an increasingly ill-conditioned Jacobian does not prevent convergence.
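A minimal sketch of Broyden's ("good") update on a small toy system, with a finite-difference Jacobian used only for initialization; the limited-memory variant described in the report would store the rank-one corrections instead of a dense matrix:

```python
# Broyden's method: after each step, correct the Jacobian approximation with
# a rank-one update so no analytic Jacobian is ever evaluated.
import numpy as np

def F(x):  # toy nonlinear system with a root at (1, 1)
    return np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])

def fd_jacobian(F, x, eps=1e-6):
    f0, n = F(x), len(x)
    J = np.empty((n, n))
    for i in range(n):
        xp = x.copy(); xp[i] += eps
        J[:, i] = (F(xp) - f0) / eps
    return J

x = np.array([2.0, 0.5])
B = fd_jacobian(F, x)               # initial Jacobian approximation
for _ in range(25):
    s = np.linalg.solve(B, -F(x))   # quasi-Newton step
    x_new = x + s
    y = F(x_new) - F(x)
    B += np.outer(y - B @ s, s) / (s @ s)   # Broyden rank-one update
    x = x_new
print(x)  # ~ [1, 1]
```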
Optimization-ready reduced-order models should target a particular output functional, span an applicable range of dynamic and parametric inputs, and respect the underlying governing equations of the system. To achieve this goal, we present an approach for determining a projection basis that uses a goal-oriented, model-based optimization framework. The mathematical framework permits consideration of general dynamical systems with general parametric variations. The methodology is applicable to both linear and nonlinear systems and to systems with many input parameters. This paper focuses on an initial presentation and demonstration of the methodology on a simple linear model problem of the two-dimensional, time-dependent heat equation with a small number of inputs. For this example, the reduced models determined by the new approach provide considerable improvement over those derived using the proper orthogonal decomposition.
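For contrast, a sketch of the proper orthogonal decomposition baseline mentioned above, on a toy 1-D heat equation; all discretization sizes and the time integrator are arbitrary illustrative choices:

```python
# POD: collect snapshots of the full model, take their SVD, and
# Galerkin-project the dynamics onto the leading left singular vectors.
import numpy as np

n, r, steps, dt = 50, 4, 200, 1e-4
# Finite-difference Laplacian with homogeneous Dirichlet boundaries.
A = (np.diag(-2 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / (1.0 / (n + 1)) ** 2

x = np.sin(np.pi * np.linspace(0, 1, n + 2)[1:-1])  # initial condition
snapshots = []
for _ in range(steps):                 # explicit Euler time stepping
    x = x + dt * (A @ x)
    snapshots.append(x.copy())

U, s, _ = np.linalg.svd(np.column_stack(snapshots), full_matrices=False)
Phi = U[:, :r]                         # POD basis: leading left vectors
A_r = Phi.T @ A @ Phi                  # Galerkin-projected reduced operator
print(A_r.shape)                       # (4, 4)
```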
Tensors (also known as multidimensional arrays or N-way arrays) are used in a variety of applications ranging from chemometrics to psychometrics. We describe four MATLAB classes for tensor manipulations that can be used for fast algorithm prototyping. The tensor class extends the functionality of MATLAB's multidimensional arrays by supporting additional operations such as tensor multiplication. The tensor_as_matrix class supports the 'matricization' of a tensor, i.e., the conversion of a tensor to a matrix (and vice versa), a commonly used operation in many algorithms. Two additional classes represent tensors stored in decomposed formats: cp_tensor and tucker_tensor. We describe all of these classes and then demonstrate their use by showing how to implement several tensor algorithms that have appeared in the literature.
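A NumPy sketch of the matricization operation the tensor_as_matrix class provides; the ordering convention here is illustrative and may differ from the toolbox's:

```python
# Mode-n matricization and its inverse: flatten a tensor along one mode,
# then restore the original shape.
import numpy as np

def matricize(T, mode):
    """Mode-n unfolding: rows index `mode`, columns the remaining modes."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def dematricize(M, mode, shape):
    """Inverse of matricize for a tensor of the given original shape."""
    full = (shape[mode],) + tuple(s for m, s in enumerate(shape) if m != mode)
    return np.moveaxis(M.reshape(full), 0, mode)

T = np.arange(24).reshape(2, 3, 4)
M = matricize(T, 1)                     # shape (3, 8)
assert np.array_equal(dematricize(M, 1, T.shape), T)
```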
We present the source code for three MATLAB classes for manipulating tensors in order to allow fast algorithm prototyping. A tensor is a multidimensional or N-way array. This is a supplementary report; details on using this code are provided separately in SAND-XXXX.
This paper develops and investigates iterative tensor methods for solving large-scale systems of nonlinear equations. Direct tensor methods for nonlinear equations have performed especially well on small, dense problems where the Jacobian matrix at the solution is singular or ill-conditioned, which may occur when approaching turning points, for example. This research extends direct tensor methods to large-scale problems by developing three tensor-Krylov methods that base each iteration upon a linear model augmented with a limited second-order term, which provides information lacking in a (nearly) singular Jacobian. The advantage of the new tensor-Krylov methods over existing large-scale tensor methods is their ability to solve the local tensor model to a specified accuracy, which produces a more accurate tensor step. The performance of these methods in comparison to Newton-GMRES and tensor-GMRES is explored on three Navier-Stokes fluid flow problems. The numerical results provide evidence that tensor-Krylov methods are generally more robust and more efficient than Newton-GMRES on some important and difficult problems. In addition, the results show that the new tensor-Krylov methods and tensor-GMRES each perform better in certain situations.
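A minimal Jacobian-free Newton-GMRES sketch, the baseline against which the tensor-Krylov methods are compared, assuming SciPy; the toy system and tolerances are hypothetical:

```python
# Newton-GMRES: each Newton step solves J s = -F(x) with GMRES, using
# finite differences for Jacobian-vector products so J is never formed.
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

def F(x):  # toy nonlinear system with a root at (1, 1)
    return np.array([x[0]**2 + x[1]**2 - 2.0, np.exp(x[0] - 1.0) - x[1]])

x = np.array([2.0, 0.5])
for _ in range(10):
    fx = F(x)
    Jv = lambda v: (F(x + 1e-7 * v) - fx) / 1e-7   # finite-difference J @ v
    J = LinearOperator((2, 2), matvec=Jv)
    s, _ = gmres(J, -fx)
    x = x + s
print(x)  # ~ [1, 1]
```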
We describe three MATLAB classes for manipulating tensors in order to allow fast algorithm prototyping. A tensor is a multidimensional or N-way array. We present a tensor class for manipulating tensors which allows for tensor multiplication and 'matricization.' We have further added two classes for representing tensors in decomposed format: cp_tensor and tucker_tensor. We demonstrate the use of these classes by implementing several algorithms that have appeared in the literature.