Final Report: Weighted Neighbor Data Mining

Carlson, Jeffrey J.; Muguira, Maritza R.

Data mining involves the discovery and fusion of features from large databases to establish minimal probability of error (MPE) decision and estimation models. Our approach combines a weighted nearest neighbor (WNN) decision model for classification and estimation with genetic algorithms (GA) for feature discovery and model optimization. The WNN model provides a mathematical framework for adaptively discovering and fusing features into near-MPE decision algorithms. The GA discovers weighted features and selects decision points for the WNN decision model to achieve near-MPE decisions. The performance of the WNN fusion model is demonstrated on the first of two very different problems, chosen to show its robust and practical application to a wide variety of data-mining problems. The first problem involves isolating risk factors for hepatitis C virus (HCV) and requires evaluating large databases to establish the critical features that can detect, with minimal error, whether a person is at risk of having HCV. This requires discovering and extracting relevant information (features) from a questionnaire database and combining (fusing) them to achieve a minimal error decision rule. The primary objective of the research is to develop a practical basis for fusing information from questionnaires administered at hospitals to identify and verify the features important for isolating risk factors for HCV. The basic problem involves creating a feature database from the questionnaire information and discovering features that provide sufficient information to reliably identify when a person is at risk, despite uncertainties caused by recording errors and evasive tactics of people answering the questionnaire. The results of this study demonstrate the WNN fusion algorithm's ability to perform in supervised learning environments. The second phase of the research project is directed at the unsupervised learning environment, in which the feature data is presented without any classification. Clustering algorithms are developed to partition the feature data into clusters based upon similarity measure models. After the feature data is clustered and classified, the supervised WNN fusion algorithms are used to classify the data based upon the minimal probability of error decision rule.
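The pairing of a weighted nearest neighbor rule with a genetic search over feature weights can be illustrated with a minimal sketch. It is not the report's implementation: the abstract does not specify the distance measure, the GA operators, or the fitness function, so the weighted squared Euclidean distance, leave-one-out error fitness, truncation selection, uniform crossover, and Gaussian mutation below are all assumptions, as are the names `wnn_predict`, `loo_error`, and `ga_search`. The selection of decision points mentioned above is omitted.

```python
import math
import random


def wnn_predict(train_x, train_y, weights, query, skip=None):
    """Return the label of the training point nearest to `query`.

    Distance is a weighted squared Euclidean distance (an assumption).
    `skip` excludes one training index, for leave-one-out evaluation.
    """
    best_label, best_dist = None, math.inf
    for i, (x, y) in enumerate(zip(train_x, train_y)):
        if i == skip:
            continue
        d = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, query))
        if d < best_dist:
            best_label, best_dist = y, d
    return best_label


def loo_error(train_x, train_y, weights):
    """Leave-one-out error rate; used as the GA fitness to minimize."""
    wrong = sum(
        wnn_predict(train_x, train_y, weights, x, skip=i) != y
        for i, (x, y) in enumerate(zip(train_x, train_y))
    )
    return wrong / len(train_x)


def ga_search(train_x, train_y, pop_size=20, generations=30, seed=0):
    """Search for feature weights in [0, 1] with a small elitist GA."""
    rng = random.Random(seed)
    n = len(train_x[0])
    pop = [[rng.random() for _ in range(n)] for _ in range(pop_size)]
    pop[0] = [1.0] * n  # include the unweighted baseline as one candidate

    def fitness(w):
        return loo_error(train_x, train_y, w)

    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            j = rng.randrange(n)  # mutate one weight, clamped to [0, 1]
            child[j] = min(1.0, max(0.0, child[j] + rng.gauss(0, 0.2)))
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)
```

Because the best half of each generation survives unchanged and the all-ones vector is placed in the initial population, the returned weights never have a higher leave-one-out error than the unweighted nearest neighbor rule on the same training data. A weight near zero suggests a feature that contributes little to the decision, which is one way to read "feature discovery" in this setting.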