Multilingual Sentiment Analysis Using Latent Semantic Indexing and Machine Learning
Proceedings - IEEE International Conference on Data Mining, ICDM
We present a novel approach to predicting the sentiment of documents in multiple languages, without translation. The only prerequisite is a multilingual parallel corpus wherein a training sample of the documents, in a single language only, has been tagged with its overall sentiment. Latent Semantic Indexing (LSI) converts that multilingual corpus into a multilingual "concept space". Both training and test documents can be projected into that space, allowing cross-lingual semantic comparisons between the documents without the need for translation. Accordingly, the training documents with known sentiment are used to build a machine learning model which can, because of the multilingual nature of the document projections, be used to predict sentiment in the other languages. We explain and evaluate the accuracy of this approach. We also design and conduct experiments to investigate the extent to which topic and sentiment separately contribute to that classification accuracy, and thereby shed some initial light on the question of whether topic and sentiment can be sensibly teased apart. © 2011 IEEE.
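As a concrete illustration of the pipeline this abstract describes, the sketch below builds one LSI concept space from a toy parallel corpus and trains a classifier on projections of the labeled documents. The setup is an assumption for illustration only: it uses scikit-learn's TruncatedSVD and LogisticRegression, tiny invented English/Spanish texts, and the common trick of concatenating each document with its translation so that a single term space covers both languages; the paper's own corpus and model choices may differ.

```python
# Minimal sketch of LSI-based cross-lingual sentiment classification.
# All data and parameter choices are illustrative, not from the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy parallel corpus: each item concatenates an English text with its
# translation; sentiment labels are known for the English side only.
parallel_docs = [
    "great product excelente producto",
    "terrible service servicio terrible",
    "wonderful quality calidad maravillosa",
    "awful experience experiencia horrible",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# 1. Build the multilingual term space and reduce it to an LSI "concept space".
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(parallel_docs)
lsi = TruncatedSVD(n_components=2, random_state=0)
X_concepts = lsi.fit_transform(X)

# 2. Train a sentiment model on the concept-space projections.
clf = LogisticRegression().fit(X_concepts, labels)

# 3. Project a test document written in *another* language into the same
#    space and predict its sentiment -- no translation step required.
test = vectorizer.transform(["producto maravilloso"])
print(clf.predict(lsi.transform(test)))
```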
This presentation will discuss two tensor decompositions that are not as well known as PARAFAC (parallel factors) and Tucker, but have proven useful in informatics applications. Three-way DEDICOM (decomposition into directional components) is an algebraic model for the analysis of 3-way arrays with nonsymmetric slices. PARAFAC2 is a related model that is less constrained than PARAFAC and allows for different objects in one mode. Applications of both models to informatics problems will be shown.
IEEE Aerospace Conference Proceedings
As the U.S. and the International Community come to grips with anthropogenic climate change, it will be necessary to develop accurate techniques with global span for remote measurement of emissions and uptake of greenhouse gases (GHGs), with special emphasis on carbon dioxide. Presently, techniques exist for in situ and local remote measurements. The first steps towards expansion of these techniques to span the world are only now being taken with the launch of satellites with the capability to accurately measure column abundances of selected GHGs, including carbon dioxide. These satellite sensors do not directly measure emissions and uptake. The satellite data, appropriately filtered and processed, provide only one necessary, but not sufficient, input for the determination of emission and uptake rates. Optimal filtering and processing is a challenge in itself. But these data must be further combined with output from data-assimilation models of atmospheric structure and flows in order to infer emission and uptake rates for relevant points and regions. In addition, it is likely that substantially more accurate determinations would be possible given the addition of data from a sparse network of in situ and/or upward-looking remote GHG sensors. We will present the most promising approaches we've found for combining satellite, in situ, fixed remote sensing, and other potentially available data with atmospheric data-assimilation and backward-dispersion models for the purpose of determination of point and regional GHG emission and uptake rates. We anticipate that the first application of these techniques will be to GHG management for the U.S. Subsequent application may be to confirmation of compliance of other nations with future international GHG agreements. ©2010 IEEE.
A standard and widespread approach to part-of-speech tagging is based on Hidden Markov Models (HMMs). An alternative approach, pioneered by Schuetze (1993), induces parts of speech from scratch using singular value decomposition (SVD). We introduce DEDICOM as an alternative to SVD for part-of-speech induction. DEDICOM retains the advantages of SVD in that it is completely unsupervised: no prior knowledge is required to induce either the tagset or the associations of terms with tags. However, unlike SVD, it is also fully compatible with the HMM framework, in that it can be used to estimate emission- and transition-probability matrices which can then be used as the input for an HMM. We apply the DEDICOM method to the CONLL corpus (CONLL 2000) and compare the output of DEDICOM to the part-of-speech tags given in the corpus, and find that the correlation (almost 0.5) is quite high. Using DEDICOM, we also estimate part-of-speech ambiguity for each term, and find that these estimates correlate highly with part-of-speech ambiguity as measured in the original corpus (around 0.88). Finally, we show how the output of DEDICOM can be evaluated and compared against the more familiar output of supervised HMM-based tagging.
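To make the idea concrete, here is a toy sketch of two-way DEDICOM applied to a term-adjacency matrix, the kind of setup described above. Everything in it is illustrative: the corpus is invented, and the fitting loop is a plain alternating scheme (exact least-squares update for R, gradient step for A), not the specific optimization procedure used in the paper.

```python
# Toy sketch of DEDICOM-style part-of-speech induction.
# Two-way DEDICOM: X ~ A R A^T, where A (terms x tags) is shared on both
# sides and R captures asymmetric tag-to-tag transition structure.

import numpy as np

corpus = "the dog saw the cat the cat saw the dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
n = len(vocab)

# X[i, j] counts how often term i is immediately followed by term j.
X = np.zeros((n, n))
for a, b in zip(corpus, corpus[1:]):
    X[idx[a], idx[b]] += 1.0

rank, lr = 2, 0.005
rng = np.random.default_rng(0)
A = rng.random((n, rank))

for _ in range(3000):
    # Exact least-squares update for R given A.
    Ap = np.linalg.pinv(A)
    R = Ap @ X @ Ap.T
    # Gradient step on A for the loss ||X - A R A^T||_F^2.
    E = X - A @ R @ A.T
    grad = -2.0 * (E @ A @ R.T + E.T @ A @ R)
    A -= lr * grad
    A /= np.linalg.norm(A, axis=0, keepdims=True)  # scale is absorbed into R

# Each row of A scores a term against the induced "tags"; R is roughly
# analogous to an HMM tag-to-tag transition matrix.
for w in vocab:
    print(w, np.round(A[idx[w]], 2))
```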
Survey of Text Mining II: Clustering, Classification, and Retrieval
In this chapter, we apply a nonnegative tensor factorization algorithm to extract and detect meaningful discussions from electronic mail messages for a period of one year. For the publicly released Enron electronic mail collection, we encode a sparse term-author-month array for subsequent three-way factorization using the PARAllel FACtors (or PARAFAC) three-way decomposition first proposed by Harshman. Using nonnegative tensors, we preserve natural data nonnegativity and avoid subtractive basis vector and encoding interactions present in techniques such as principal component analysis. Results in thread detection and interpretation are discussed in the context of published Enron business practices and activities, and benchmarks addressing the computational complexity of our approach are provided. The resulting tensor factorizations can be used to produce Gantt-like charts for assessing the duration, order, and dependencies of focused discussions against the progression of time. © 2008 Springer-Verlag London.
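The following sketch shows a nonnegative PARAFAC (CP) decomposition of a small term × author × month count array using standard multiplicative updates in NumPy. The data are random stand-ins and the update rule is the generic one; it is not necessarily the algorithm or implementation used in this chapter.

```python
# Sketch: nonnegative PARAFAC (CP) of a 3-way term x author x month count
# array via multiplicative updates. Toy data, illustrative only.

import numpy as np

def unfold(T, mode):
    """Mode-n matricization, columns ordered Fortran-style (Kolda/Bader)."""
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order="F")

def khatri_rao(U, V):
    """Column-wise Kronecker product of U and V."""
    return np.einsum("ir,jr->ijr", U, V).reshape(-1, U.shape[1])

def nonneg_parafac(X, rank, n_iter=500, eps=1e-9, seed=0):
    """Nonnegative CP/PARAFAC fitted with multiplicative updates."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode][::-1]  # e.g. (C, B)
            kr = khatri_rao(*others)
            gram = np.ones((rank, rank))
            for f in others:
                gram *= f.T @ f
            num = unfold(X, mode) @ kr
            den = factors[mode] @ gram + eps
            factors[mode] *= num / den
    return factors

# Toy term x author x month count array (values are illustrative only).
rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(20, 6, 12)).astype(float)

A, B, C = nonneg_parafac(X, rank=3)
# Column r of A/B/C describes one latent "discussion": its characteristic
# terms, the authors involved, and its activity profile over the months --
# the month profiles in C are what yield the Gantt-like timing charts.
```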
Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams. © 2008 Licensed under the Creative Commons.
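A minimal sketch of the LMSA idea follows: replace whole-word terms with sub-word units, then build the usual LSA space over those units. The segmenter here is a crude fixed-length character chunker standing in for real morpheme induction, and the two-language "documents" are invented strings; both are illustrative assumptions, not the paper's actual segmentation or data.

```python
# Sketch of LMSA: index sub-word units rather than whole terms, then build
# the usual LSA space over those units.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def segment(word, k=3):
    """Placeholder segmentation: split a word into character chunks of length k.
    Any unsupervised morpheme inducer could be substituted here."""
    return [word[i:i + k] for i in range(0, len(word), k)] or [word]

def to_units(doc):
    return " ".join(u for w in doc.lower().split() for u in segment(w))

# Illustrative parallel snippets (invented); in practice these would be the
# aligned documents of a multilingual corpus.
docs = [
    "the books were translated",
    "kitoblar tarjima qilindi",
    "the book was translated",
    "kitob tarjima qilindi",
]
unit_docs = [to_units(d) for d in docs]

vec = TfidfVectorizer(token_pattern=r"\S+")
X = vec.fit_transform(unit_docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)
doc_vectors = lsa.transform(X)  # documents in the morpho-semantic space
```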
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
In the relatively new field of visual analytics there is a great need for automated approaches to both verify and discover the intentions and schemes of primary actors through time. Data mining and knowledge discovery play critical roles in facilitating the ability to extract meaningful information from large and complex textual-based (digital) collections. In this study, we develop a mathematical strategy based on nonnegative tensor factorization (NTF) to extract and sequence important activities and specific events from sources such as news articles. The ability to automatically reconstruct a plot or confirm involvement in a questionable activity is greatly facilitated by our approach. As a variant of the PARAFAC multidimensional data model, we apply our NTF algorithm to the terrorism-based scenarios of the VAST 2007 Contest data set to demonstrate how term-by-entity associations can be used for scenario/plot discovery and evaluation. © 2008 Springer-Verlag Berlin Heidelberg.
Proposed for publication in Computational Linguistics.
Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Latent Semantic Analysis (LSA) is based on the Singular Value Decomposition (SVD) of a term-by-document matrix for identifying relationships among terms and documents from cooccurrence patterns. Among the multiple ways of computing the SVD of a rectangular matrix X, one approach is to compute the eigenvalue decomposition (EVD) of a square 2 × 2 composite matrix consisting of four blocks with X and Xᵀ in the off-diagonal blocks and zero matrices in the diagonal blocks. We point out that significant value can be added to LSA by filling in some of the values in the diagonal blocks (corresponding to explicit term-to-term or document-to-document associations) and computing a term-by-concept matrix from the EVD. For the case of multilingual LSA, we incorporate information on cross-language term alignments of the same sort used in Statistical Machine Translation (SMT). Since all elements of the proposed EVD-based approach can rely entirely on lexical statistics, hardly any price is paid for the improved empirical results. In particular, the approach, like LSA or SMT, can still be generalized to virtually any language(s); computation of the EVD takes similar resources to that of the SVD since all the blocks are sparse; and the results of EVD are just as economical as those of SVD. © 2008 Licensed under the Creative Commons.
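To illustrate, the sketch below assembles the composite block matrix with a toy term-to-term association block on the diagonal and computes its leading eigenpairs with SciPy. The matrices and the 0.1-weighted identity association block are invented placeholders, not values from the paper; in the multilingual case the association block would carry SMT-style cross-language alignment weights.

```python
# Sketch of the augmented eigenvalue-decomposition view of LSA:
# composite symmetric matrix [[T, X], [X^T, 0]], where X is terms x documents
# and T adds explicit term-to-term associations. Toy values only.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n_terms, n_docs, k = 6, 4, 2
X = sp.random(n_terms, n_docs, density=0.5, random_state=0, format="csr")

# Term-to-term association block (placeholder: a lightly weighted identity).
T = sp.eye(n_terms, format="csr") * 0.1

# Composite matrix; the zero document-to-document block is left empty.
B = sp.bmat([[T, X], [X.T, None]], format="csr")

# Leading eigenpairs: the first n_terms rows of the eigenvectors form the
# term-by-concept matrix, the remaining rows the document-by-concept matrix.
vals, vecs = eigsh(B, k=k, which="LA")
term_by_concept = vecs[:n_terms]
doc_by_concept = vecs[n_terms:]
```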
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
A standard approach to cross-language information retrieval (CLIR) uses Latent Semantic Analysis (LSA) in conjunction with a multilingual parallel aligned corpus. This approach has been shown to be successful in identifying similar documents across languages - or more precisely, retrieving the most similar document in one language to a query in another language. However, the approach has severe drawbacks when applied to a related task, that of clustering documents 'language independently', so that documents about similar topics end up closest to one another in the semantic space regardless of their language. The problem is that documents are generally more similar to other documents in the same language than they are to same-topic documents in a different language. As a result, when using multilingual LSA, documents will in practice cluster by language, not by topic. We propose a novel application of PARAFAC2 (which is a variant of PARAFAC, a multi-way generalization of the singular value decomposition [SVD]) to overcome this problem. Instead of forming a single multilingual term-by-document matrix which, under LSA, is subjected to SVD, we form an irregular three-way array, each slice of which is a separate term-by-document matrix for a single language in the parallel corpus. The goal is to compute an SVD for each language such that V (the matrix of right singular vectors) is the same across all languages. Effectively, PARAFAC2 imposes the constraint, not present in standard LSA, that the 'concepts' in all documents in the parallel corpus are the same regardless of language. Intuitively, this constraint makes sense, since the whole purpose of using a parallel corpus is that exactly the same concepts are expressed in the translations. We tested this approach by comparing the performance of PARAFAC2 with standard LSA in solving a particular CLIR problem. From our results, we conclude that PARAFAC2 offers a very promising alternative to LSA not only for multilingual document clustering, but also for solving other problems in cross-language information retrieval. © 2007 ACM.
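A short sketch of this formulation, assuming TensorLy's parafac2 routine (an assumption; the paper used its own fitting code, and API details here may differ from what the authors ran): each slice is a term-by-document matrix for one language, with rows (terms) differing across slices and columns (the parallel documents) shared.

```python
# Sketch: PARAFAC2 over an irregular set of term-by-document slices, one per
# language of a parallel corpus. Random toy data; assumes TensorLy's API.

import numpy as np
from tensorly.decomposition import parafac2

rng = np.random.default_rng(0)
n_docs, rank = 8, 2
slices = [
    rng.random((40, n_docs)),  # e.g. language-1 terms x documents
    rng.random((55, n_docs)),  # e.g. language-2 terms x documents
]

decomp = parafac2(slices, rank=rank, n_iter_max=200, random_state=0)

# The factor over the shared (document) mode plays the role of V in the
# abstract: one set of 'concept' coordinates per document, identical across
# languages, so documents can be clustered by topic rather than language.
doc_by_concept = decomp.factors[2]   # documents x concepts
term_bases = decomp.projections      # per-language term-side bases
```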