Mining Comparative Genomic Hybridization (CGH) data
Numerical and structural chromosomal imbalances are among the most
prominent and pathogenetically relevant features of neoplastic
cells. Over the past decades, thousands of (molecular-)cytogenetic
studies of human neoplasias have searched for insights into the genetic
mechanisms of tumor development and for targets for
pharmacologic intervention. Recurring chromosomal aberration patterns
are assumed to reflect the cooperation of a multitude of
tumor-relevant genes in most malignant diseases.
One method for measuring genomic aberrations is Comparative Genomic
Hybridization (CGH). CGH is a molecular-cytogenetic analysis method
for detecting regions with genomic imbalances (gains or losses of DNA
segments). Raw data from CGH experiments are expressed as the ratio of
normalized fluorescence of tumor and reference DNA. Normalized CGH
ratios that exceed a predefined upper threshold or fall below a
predefined lower threshold are considered indicative
of genomic gains or losses, respectively. The chromosomal CGH results
are annotated in a reverse in-situ karyotype format describing
imbalanced genomic regions with reference to their chromosomal
location. CGH data of an individual tumor can be considered as an
ordered list of status values, where each value corresponds to a
genomic interval (e.g., a single chromosomal band). The status can be
expressed as a real number (positive, negative, or zero for gain,
loss, or no aberration, respectively). A large repository of CGH data
is available at Progenetix.
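The representation described above can be sketched in a few lines of Python. This is only an illustration: the threshold values 1.25 and 0.75 and the function name `ratios_to_status` are assumptions made for the example, since the text only says that the thresholds are predefined.

```python
from typing import List

# Assumed thresholds for illustration; the text only says "predefined".
GAIN_THRESHOLD = 1.25
LOSS_THRESHOLD = 0.75


def ratios_to_status(ratios: List[float],
                     gain: float = GAIN_THRESHOLD,
                     loss: float = LOSS_THRESHOLD) -> List[int]:
    """Map each interval's normalized tumor/reference ratio to a status.

    +1 = gain, -1 = loss, 0 = no aberration. The order of the input is
    preserved, so position i still refers to genomic interval i.
    """
    status = []
    for r in ratios:
        if r > gain:
            status.append(1)
        elif r < loss:
            status.append(-1)
        else:
            status.append(0)
    return status


# One hypothetical tumor sample with seven consecutive genomic intervals.
profile = ratios_to_status([1.02, 1.31, 1.40, 0.98, 0.70, 0.66, 1.00])
print(profile)  # [0, 1, 1, 0, -1, -1, 0]
```

Note how the gain and the loss each span two neighboring intervals, which is the kind of consecutive structure discussed next.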
A key feature of the CGH data is that consecutive values are highly
correlated. Here is a summary of our research in this area:
- We investigated novel distance measures for mining such
high-dimensional data. We developed three pairwise
distance/similarity measures, namely Raw, Cosine, and Sim
[Liu06a]. The first one ignores the correlation, while the
latter two can effectively leverage this correlation. We tested
our distance/similarity measures on cytogenetic (CGH) aberration
data. Our results show that Sim consistently performs better
than the other two measures since it effectively utilizes the
correlation between consecutive values in the underlying dataset.
- We also developed a dynamic programming algorithm to
identify a small set of important consecutive intervals called
markers. We used these markers to develop two novel clustering
strategies. Our results demonstrate that the
markers we found represent the aberration patterns of CGH data
very well and that they improve the quality of clustering
significantly [Liu06b].
- We considered the problem of feature selection for multiclass
CGH data. We developed several feature selection methods
that are based on support vector
machines. Our results show that the classification accuracy
achieved using only 15% of the features is
comparable to or better than that achieved using all the features.
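To make the first item above more concrete, the following sketch contrasts a similarity that treats every interval independently with one that works on runs of consecutive intervals. Both functions are hypothetical illustrations written for this page; they are not the definitions of Raw, Cosine, or Sim from [Liu06a], and the names `per_interval_similarity` and `segment_similarity` are assumptions.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (start index, end index inclusive, status)


def segments(profile: List[int]) -> List[Segment]:
    """Collapse a status profile into maximal runs of one nonzero status."""
    segs: List[Segment] = []
    start = None
    for i, s in enumerate(profile):
        # Close the open run when the status changes.
        if start is not None and s != profile[start]:
            segs.append((start, i - 1, profile[start]))
            start = None
        # Open a new run at an aberrant interval.
        if start is None and s != 0:
            start = i
    if start is not None:
        segs.append((start, len(profile) - 1, profile[start]))
    return segs


def per_interval_similarity(a: List[int], b: List[int]) -> int:
    """Count intervals where both samples carry the same aberration."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same intervals")
    return sum(1 for x, y in zip(a, b) if x == y and x != 0)


def segment_similarity(a: List[int], b: List[int]) -> int:
    """Count pairs of same-type segments that overlap on the genome."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same intervals")
    return sum(
        1
        for sa in segments(a)
        for sb in segments(b)
        if sa[2] == sb[2] and sa[0] <= sb[1] and sb[0] <= sa[1]
    )


long_gain = [1, 1, 1, 1, 0, 0]   # one gain spanning four intervals
short_gain = [1, 0, 0, 0, 0, 0]  # one gain spanning a single interval

print(per_interval_similarity(long_gain, long_gain))   # 4
print(segment_similarity(long_gain, long_gain))        # 1
print(per_interval_similarity(long_gain, short_gain))  # 1
print(segment_similarity(long_gain, short_gain))       # 1
```

The per-interval count scores one long shared gain as four independent matches, whereas the segment-based count treats it as a single shared event. This is one simple way in which a measure can take the correlation between consecutive values into account.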
Software
CGH Mining Tools
People
- Jun Liu
- Tamer Kahveci
- Sanjay Ranka
Publications
- Jun Liu, Sanjay Ranka, Tamer Kahveci, Classification and Feature Selection Algorithms for Multi-class CGH data, ISMB, 2008.
- Jun Liu, Sanjay Ranka, Tamer Kahveci, A web server for mining Comparative Genomic Hybridization (CGH) data, Data mining, systems analysis and optimization in biomedicine, 2007, pages 144-131. (PDF)
- Jun Liu, Sanjay Ranka, Tamer Kahveci, Markers improve clustering of CGH data, Bioinformatics, 23:4, pages 450-457, 2007. (PubMed) (Bioinformatics)
- Jun Liu, Jaaved Mohammed, James Carter, Sanjay Ranka, Tamer Kahveci, Michael Baudis, Distance-based Clustering of CGH Data, Bioinformatics, 22:16, pages 1971-1978, 2006. (PubMed) (Bioinformatics)
Tamer Kahveci
Last modified: Thu Apr 3 09:52:14 EDT 2008