Project Title: Learning Critical Discriminants and Descriptors in Data Mining

Project Award Number: IIS-0221954

Principal Investigator

First Name: Li
Middle Initial: M.
Last Name: Fu
Department: CISE
Institution: University of Florida
Address line 1: 301 CSE
Address line 2: P.O. Box 116120
City: Gainesville
State: FL
Zip Code: 32611
Phone Number: (352)392-1485
Fax Number: (352)392-1220
Email: fu@cise.ufl.edu

URL: http://www.cise.ufl.edu/~fu

Keywords:

data mining, discriminant, feature selection, machine learning, bioinformatics, text mining.

Project Summary

The objective of this research is to develop general methods that automatically discover original and useful knowledge from historical or experimental data. Learning discriminants and descriptors associated with patterns extracted from the data is a central issue in data mining. The project will develop techniques to achieve this objective. The results will be experimentally evaluated. In addition, an integrated data mining performance system will be developed.

Publications and Products

Selected Peer-Reviewed Journal Publications

Li M. Fu and E. S. Youn, "Improving reliability of gene selection from microarray data", IEEE Transactions on Information Technology in Biomedicine, Vol. 7, No. 3, September, 2003.

Y. Yang and Li M. Fu, "TSGDB: A web-based database system for tumor suppressor genes", Bioinformatics, in press.

Software Products

GeneSelect is a general data mining tool for identifying critical data elements based on cross validation, with applications ranging from gene selection to text mining.

Project Impact

Machine learning research has resulted in many successful real-world applications and has become a crucial discipline in modern science and engineering. The interaction between machine learning and data mining has motivated new algorithms that accommodate growing data complexity. Advances in computer hardware and software now make it practical to develop algorithms with higher time and space complexity in exchange for better accuracy. In this context, a particularly important problem is to identify, among thousands of features or more, a critical subset that sufficiently captures or defines the basic nature of the domain for the purpose of understanding and solving problems.

The research results are being disseminated through publications in peer-reviewed journals. Some of these results have been added to the teaching materials of graduate courses on machine learning. The technology developed through this project is placed in the public domain and shared with the research community by providing freely downloadable software packages.

Through our research efforts, new methods with validated performance are being developed across the frontier of science and engineering, as confirmed by our work published in leading peer-reviewed journals. Successful real-world applications such as microarray data mining illustrate the practical benefit of our research outcomes to society. Through collaboration with scientists in other fields, we have established a multi-disciplinary research team that broadens the scope of our project in both theory and practice. Finally, this project trains graduate students in advanced mathematical and statistical techniques. It has contributed considerably to the training of E. S. Youn, a Ph.D. student of the PI, in conducting cutting-edge research. Because research in data mining and bioinformatics is highly competitive, identifying the right issue to address and the right approach to pursue is a real challenge. By working on the project, the student has developed skills for in-depth analysis of research data. Continued progress is anticipated.

Goals, Objectives and Targeted Activities

(1) Development of a new feature selection algorithm that uses cross validation for reliability assessment. Such an algorithm is essential for data mining under extremely high dimensionality, as in gene selection from microarray data, where the aim is to find biologically important genes with diagnostic and/or therapeutic value.

(2) Development of a text mining algorithm that can discover the source mechanism underlying data generation. This capability is very useful for recognizing keywords in the texts of classified documents and, in turn, for text indexing and inference, with a wide range of applications, e.g., epidemics, terrorism, and scientific research.
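
As a rough illustration of goal (1), the sketch below estimates how reliably each feature is selected by counting how often it ranks among the top features across cross-validation training splits. It is not the GeneSelect implementation: the signal-to-noise ranking score, the function names snr_score and selection_frequency, the parameters top_m and k, and the synthetic data are all assumptions made for this example. The held-out fold is only used to vary the training set here; a fuller version would also evaluate a classifier on it.

```python
import random
import statistics
from collections import Counter


def snr_score(values, labels):
    """Signal-to-noise score for one feature: |mean0 - mean1| / (sd0 + sd1)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = statistics.pstdev(a) + statistics.pstdev(b)
    return abs(statistics.mean(a) - statistics.mean(b)) / (spread + 1e-12)


def selection_frequency(X, y, top_m=2, k=5, seed=0):
    """Fraction of the k training splits in which each feature ranks in the top_m.

    X is a list of samples (each a list of feature values); y holds 0/1 labels.
    Every training split is assumed to contain samples of both classes.
    """
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of sample indices
    n_features = len(X[0])
    counts = Counter()
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        labels = [y[i] for i in train]
        # Score every feature on the training portion only.
        scores = [
            snr_score([X[i][j] for i in train], labels)
            for j in range(n_features)
        ]
        ranked = sorted(range(n_features), key=scores.__getitem__, reverse=True)
        counts.update(ranked[:top_m])
    return {j: counts[j] / k for j in range(n_features)}


# Synthetic example (hypothetical data): feature 0 separates the two classes,
# the remaining nine features are pure noise.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [
    [label * 10.0 + data_rng.gauss(0, 1)] + [data_rng.gauss(0, 1) for _ in range(9)]
    for label in y
]
freq = selection_frequency(X, y, top_m=1, k=5)
stable = [j for j, f in freq.items() if f >= 0.8]
print(stable)
```

Features whose selection frequency stays high across splits are the ones that do not depend on any particular subset of samples, which is one simple way to read "reliability" in this setting.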

Area Background

Human experts traditionally derive their knowledge from observation and experience. Automated knowledge discovery is an important research topic today, as computer technology continues to advance and real-world data become increasingly complex. In recent years, many business organizations have begun to collect data concerning their own operations, products, and even customers. Meanwhile, the science and engineering fields have generated huge amounts of data, such as functional brain imaging and human genomic data. Against this background, data mining has emerged as a growing discipline that addresses the question of how to discover a gold mine in historical or experimental data. The goal of data mining and knowledge discovery is to extract implicit, previously unknown, and nontrivial patterns, regularities, or knowledge from large data sets that can be used to improve strategic planning and decision making. The discovered knowledge, which captures the relations among the variables of interest, can be formulated as a function for prediction and classification or as a model for simulation and understanding of domain behavior. At issue are variable identification and system identification from empirical data. This is no longer a simple engineering problem in a domain involving thousands or more interacting variables, and it is the main challenge in this project.

Area References

[1] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proc Natl Acad Sci U S A, vol. 95, pp. 14863-14868, 1998.
[2] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation," Proc Natl Acad Sci U S A, vol. 96, pp. 2907-2912, 1999.
[3] M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, Jr., and D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proc Natl Acad Sci U S A, vol. 97, pp. 262-267, 2000.
[4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531-537, 1999.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389-422, 2002.
[6] L. M. Fu, Neural Networks in Computer Intelligence. New York: McGraw-Hill, 1994.
[7] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[8] T. Bo and I. Jonassen, "New feature subset selection procedures for classification of expression profiles," Genome Biol, vol. 3, pp. RESEARCH0017, 2002.
[9] L. Li, C. R. Weinberg, T. A. Darden, and L. G. Pedersen, "Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method," Bioinformatics, vol. 17, pp. 1131-1142, 2001.
[10] K. E. Lee, N. Sha, E. R. Dougherty, M. Vannucci, and B. K. Mallick, "Gene selection: a Bayesian variable selection approach," Bioinformatics, vol. 19, pp. 90-97, 2003.
[11] M. Xiong, W. Li, J. Zhao, L. Jin, and E. Boerwinkle, "Feature (gene) selection in gene expression-based tumor classification," Mol Genet Metab, vol. 73, pp. 239-247, 2001.
[12] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, pp. 906-914, 2000.
[13] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[14] C. Ambroise and G. J. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proc Natl Acad Sci U S A, vol. 99, pp. 6562-6566, 2002.

Potential Related Projects

Bioinformatics and Medical Informatics Projects

Project Websites

http://www.cise.ufl.edu/~fu/NSF/
An overview of the project, including the title, award number, PI, abstract, and hyperlinks to other project pages.

Illustrations

http://www.cise.ufl.edu/~fu/NSF/illustration.html
The PowerPoint presentation of the project.

Online Software

http://www.cise.ufl.edu/~fu/NSF/software.html
The software packages developed through this project and downloadable from the Internet.

Online Data

http://www.cise.ufl.edu/~fu/NSF/data.html
The research or experimental data used or produced, which can be downloaded from the Internet.

Other Resources

http://www.cise.ufl.edu/~fu/NSF/resouce.html
Links to other useful web sites.