Project Title: Learning Critical Discriminants and Descriptors in
Data Mining
Project Award Number: IIS-0221954
Principal Investigator
First Name: Li
Middle Initial: M.
Last Name: Fu
Department: CISE
Institution: University of Florida
Address line 1: 301 CSE
Address line 2: P.O. Box 116120
City: Gainesville
State: FL
Zip Code: 32611
Phone Number: (352)392-1485
Fax Number: (352)392-1220
Email: fu@cise.ufl.edu
URL: http://www.cise.ufl.edu/~fu
Keywords:
data mining, discriminant, feature selection, machine learning,
bioinformatics, text mining.
Project Summary
The objective of this research is to develop general methods that
automatically discover original and useful knowledge from historical
or experimental data. Learning discriminants and descriptors associated
with patterns extracted from the data is a central issue in data mining.
The project will develop techniques to achieve this objective.
The results will be experimentally evaluated.
In addition, an integrated data mining performance system will be developed.
Publications and Products
Selected Peer-Reviewed Journal Publications
Li M. Fu and E. S. Youn, "Improving reliability of gene selection
from microarray data", IEEE Transactions on Information Technology
in Biomedicine, Vol. 7, No. 3, September 2003.
Y. Yang and Li M. Fu, "TSGDB: A web-based database system
for tumor suppressor genes", Bioinformatics, in press.
Software Products
GeneSelect is a general data mining tool for identifying
critical data elements based on cross validation,
with applications ranging from gene selection to text mining.
Project Impact
Machine learning research has resulted in many successful real-world
applications and become a crucial discipline
in modern science and engineering.
The interaction between the disciplines of
machine learning and data mining has motivated new algorithms
to accommodate growing data complexity.
Advances in computer hardware and software make it increasingly
practical to adopt algorithms with higher time and space complexity
in exchange for better accuracy.
In this context, a particularly important problem is to identify,
among thousands of features or more, a critical subset
that sufficiently captures the basic nature of the domain
to support problem understanding and solving.
The research results are being disseminated through publications in
peer-reviewed journals.
Some of these results have been added to the teaching materials
of graduate courses on machine learning.
The technology developed through this project is placed in
the public domain and shared with the research community
by providing freely downloadable software packages.
Through our research efforts, new methods with validated performance
are being developed at the frontier of science and engineering,
as confirmed by our work published in leading peer-reviewed journals.
Successful real-world applications such as microarray data mining
illustrate the practical benefit of our research outcomes to
society. Through collaboration with scientists in other
fields, we have established a multi-disciplinary research team that
broadens the scope of our project in both theory and practice.
Finally, this project trains graduate students in advanced
mathematical and statistical techniques. It has contributed
considerably to the training of E. S. Youn, a Ph.D. student of the PI,
in conducting cutting-edge research. Because research in data mining
and bioinformatics is highly competitive, identifying the right issue
to address and the right approach to pursue is a considerable
challenge. By working on the project, the student has developed
skills for in-depth analysis of research data. Continued progress
is anticipated.
Goals, Objectives and Targeted Activities
(1) Development of a new feature selection algorithm that uses cross
validation for reliability assessment. Such an algorithm is essential
for data mining under extremely high dimensionality, such as gene
selection from microarray data, where the task is to find
biologically important genes with diagnostic and/or therapeutic
value.
(2) Development of a text mining algorithm that can discover the
source mechanism underlying data generation. This capability is very
useful for recognizing keywords in the texts of classified documents
and, in turn, for text indexing and inference, with a wide range of
applications such as epidemics, terrorism, and scientific research.
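
As an illustration of goal (1), the following is a minimal Python sketch of
one way cross validation can be used to assess the reliability of feature
selection. It is not the project's algorithm. The scoring rule (a simple
signal-to-noise ratio), the function names, and the synthetic data are all
assumptions made for this example. The idea shown is to repeat selection on
each training fold and to report how often each feature is selected, so that
features chosen consistently across folds can be regarded as more reliable.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def signal_to_noise(values, labels):
    """Score one feature: |difference of class means| / sum of class std devs."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    return abs(mean(pos) - mean(neg)) / (spread + 1e-12)


def top_k_features(rows, labels, k):
    """Return the indices of the k highest-scoring features."""
    n_features = len(rows[0])
    scores = [
        signal_to_noise([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def selection_frequency(rows, labels, k, n_folds, seed=0):
    """Fraction of cross-validation training folds in which each feature is selected."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    counts = Counter()
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        selected = top_k_features(
            [rows[i] for i in train], [labels[i] for i in train], k
        )
        counts.update(selected)
    return {j: counts[j] / n_folds for j in range(len(rows[0]))}


# Synthetic data (hypothetical): feature 0 separates the classes, the rest are noise.
rng = random.Random(1)
labels = [0] * 20 + [1] * 20
rows = [
    [y * 5.0 + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(9)]
    for y in labels
]
freq = selection_frequency(rows, labels, k=2, n_folds=5)
for j in sorted(freq, key=freq.get, reverse=True):
    print(f"feature {j}: selected in {freq[j]:.0%} of folds")
```

In this sketch a feature that is selected in every fold is a stronger
candidate than one selected only occasionally. Selection is repeated inside
each training fold rather than once on the full data set, which is the usual
way to avoid the selection bias discussed in reference [14] below.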
Area Background
Human experts traditionally derive
their knowledge from observation and experience.
Automated knowledge discovery is an important research
topic today, as computer technology continues to advance and
real-world data become increasingly complex.
In recent years, many business organizations have begun to collect
data concerning their own operations, products, and even customers.
At the same time,
the science and engineering fields have generated a huge amount of
data, such as functional brain imaging and human genomic data.
With this background,
data mining arises as a growing discipline that addresses the question of
how to discover a gold mine from historical or experimental data.
The goal of data mining and knowledge discovery
is to extract implicit and previously unknown nontrivial
patterns, regularities, or knowledge from large data sets that
can be used to improve strategic planning and decision making.
The discovered knowledge capturing the relations among
the variables of interest can be formulated as a function for
prediction and classification or as a model for simulation and
understanding of domain behavior. At issue is the identification
of variables and systems from empirical data. This is no longer
a simple engineering problem in a domain involving thousands
or more interacting variables, and it is the main challenge
addressed in this project.
Area References
[1] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, "Cluster analysis and display of genome-wide expression patterns," Proc Natl Acad Sci U S A, vol. 95, pp. 14863-14868, 1998.
[2] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub, "Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation," Proc Natl Acad Sci U S A, vol. 96, pp. 2907-2912, 1999.
[3] M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, Jr., and D. Haussler, "Knowledge-based analysis of microarray gene expression data by using support vector machines," Proc Natl Acad Sci U S A, vol. 97, pp. 262-267, 2000.
[4] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander, "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," Science, vol. 286, pp. 531-537, 1999.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389-422, 2002.
[6] L. M. Fu, Neural Networks in Computer Intelligence. New York: McGraw-Hill, 1994.
[7] A. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, pp. 245-271, 1997.
[8] T. Bo and I. Jonassen, "New feature subset selection procedures for classification of expression profiles," Genome Biol, vol. 3, pp. RESEARCH0017, 2002.
[9] L. Li, C. R. Weinberg, T. A. Darden, and L. G. Pedersen, "Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method," Bioinformatics, vol. 17, pp. 1131-1142, 2001.
[10] K. E. Lee, N. Sha, E. R. Dougherty, M. Vannucci, and B. K. Mallick, "Gene selection: a Bayesian variable selection approach," Bioinformatics, vol. 19, pp. 90-97, 2003.
[11] M. Xiong, W. Li, J. Zhao, L. Jin, and E. Boerwinkle, "Feature (gene) selection in gene expression-based tumor classification," Mol Genet Metab, vol. 73, pp. 239-247, 2001.
[12] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data," Bioinformatics, vol. 16, pp. 906-914, 2000.
[13] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[14] C. Ambroise and G. J. McLachlan, "Selection bias in gene extraction on the basis of microarray gene-expression data," Proc Natl Acad Sci U S A, vol. 99, pp. 6562-6566, 2002.
Potential Related Projects
Bioinformatics and Medical Informatics Projects
http://www.cise.ufl.edu/~fu/NSF/
An overview of the project, including the title, award number,
P.I., abstract, and hyperlinks to other project pages.
http://www.cise.ufl.edu/~fu/NSF/illustration.html
The PowerPoint presentation of the project.
http://www.cise.ufl.edu/~fu/NSF/software.html
The software packages developed through this project and downloadable
from the Internet.
http://www.cise.ufl.edu/~fu/NSF/data.html
The research or experimental data used or produced, which can
be downloaded from the Internet.
http://www.cise.ufl.edu/~fu/NSF/resouce.html
Links to other useful web sites.