Numerical and structural chromosomal imbalances are one of the most
prominent and pathogenetically relevant features of neoplastic cells.
Over the past decades, thousands of (molecular-) cytogenetic studies of
human neoplasias have searched for insights into genetic mechanisms of
tumor development and the detection of targets for pharmacologic
intervention. It is assumed that repetitive chromosomal aberration
patterns reflect the supposed cooperation of a multitude of
tumor-relevant genes in most malignant diseases.
One method for measuring genomic aberrations is Comparative Genomic
Hybridization (CGH). CGH is a molecular-cytogenetic analysis method for
detecting regions with genomic imbalances (gains or losses of DNA
segments). Raw data from CGH experiments is expressed as the ratio of
normalized fluorescence of tumor and reference DNA. Normalized CGH
ratio data surpassing predefined thresholds is considered indicative
of genomic gains or losses, respectively. The chromosomal CGH results
are annotated in a reverse in-situ karyotype format describing
imbalanced genomic regions with reference to their chromosomal
location. CGH data of an individual tumor can be considered as an
ordered list of status values, where each value corresponds to a
genomic interval (e.g., a single chromosomal band). The status can be
expressed as a real number (positive, negative, or zero for gain, loss,
or no aberration, respectively). A large repository of CGH data is
available at Progenetix.
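As a rough illustration of the representation described above, the following Python sketch maps a list of normalized tumor/reference ratios to status values. The threshold values (1.25 and 0.75), the function name ratios_to_status, and the sample ratios are assumptions chosen for illustration only; actual thresholds are predefined per experiment.

```python
from typing import List

# Hypothetical thresholds (assumption); real values are predefined per experiment.
GAIN_THRESHOLD = 1.25
LOSS_THRESHOLD = 0.75


def ratios_to_status(ratios: List[float],
                     gain: float = GAIN_THRESHOLD,
                     loss: float = LOSS_THRESHOLD) -> List[int]:
    """Map each tumor/reference ratio to 1 (gain), -1 (loss), or 0 (no aberration)."""
    status = []
    for r in ratios:
        if r > gain:
            status.append(1)
        elif r < loss:
            status.append(-1)
        else:
            status.append(0)
    return status


# Made-up ratios for six consecutive genomic intervals.
sample = ratios_to_status([1.02, 1.31, 1.40, 0.98, 0.60, 0.71])
print(sample)
```

Because neighboring intervals tend to share the same status, the resulting list typically consists of runs of identical values rather than isolated entries.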
A key feature of the CGH data is that consecutive values are highly
correlated. Here is a summary of our research in this area:
We investigated novel distance measures for mining such
high-dimensional data. We developed three pairwise
distance/similarity measures, namely raw, cosine, and sim [Liu06a]. The
first one ignores the correlation, while the latter two can effectively
leverage this correlation. We tested our distance/similarity measures
on cytogenetic (CGH) aberration data. Our results show that sim
consistently performs better than the other measures since it can
effectively utilize the consecutive correlations in the underlying
dataset.
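To make the role of consecutive correlation concrete, here is a minimal Python sketch that contrasts a position-wise count with a segment-aware count, alongside the standard cosine similarity. The functions positionwise_matches and segment_matches are illustrative assumptions and are not the definitions of raw and sim given in [Liu06a]; only the cosine formula is the textbook one. Samples are assumed to be lists of status values (1, -1, 0).

```python
import math
from typing import List, Tuple


def positionwise_matches(a: List[int], b: List[int]) -> int:
    """Count intervals where both samples carry the same nonzero status.
    Each position is treated independently (correlation is ignored)."""
    return sum(1 for x, y in zip(a, b) if x != 0 and x == y)


def cosine_similarity(a: List[int], b: List[int]) -> float:
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segments(a: List[int]) -> List[Tuple[int, int, int]]:
    """Return maximal runs of identical nonzero status as (start, end, status)."""
    runs = []
    start = 0
    for i in range(1, len(a) + 1):
        if i == len(a) or a[i] != a[start]:
            if a[start] != 0:
                runs.append((start, i - 1, a[start]))
            start = i
    return runs


def segment_matches(a: List[int], b: List[int]) -> int:
    """Count pairs of overlapping segments with the same status (assumed measure).
    A long shared aberration counts once, however many intervals it spans."""
    count = 0
    for s1, e1, v1 in segments(a):
        for s2, e2, v2 in segments(b):
            if v1 == v2 and s1 <= e2 and s2 <= e1:
                count += 1
    return count


# Made-up samples over eight consecutive intervals.
x = [0, 1, 1, 1, 1, 0, -1, 0]
y = [0, 0, 1, 1, 0, 0, -1, 0]
print(positionwise_matches(x, y), segment_matches(x, y), cosine_similarity(x, y))
```

In this toy example the shared gain spans two intervals, so the position-wise count credits it twice while the segment-aware count credits it once, which is the kind of difference a correlation-aware measure is meant to capture.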
We also developed a dynamic programming algorithm to identify a small
set of important consecutive intervals called markers, and used these
markers to develop two novel clustering strategies.
Our results demonstrate that the markers we found represent the
aberration patterns of CGH data very well and they improve the quality
of clustering significantly [Liu06b].
We also consider the problem of feature selection for multiclass CGH
data. We have developed several feature selection methods based on
support vector machines. Our results show that the classification
accuracy achieved using only 15% of the features is comparable to or
better than that achieved using all the features.
Figure 1. Plot of dissimilarDS, a CGH dataset containing 680 samples
that belong to four
cancer types. Samples are grouped by cancer types. Each sample contains
862 dimensions. The gain and loss of copy numbers are plotted in green
and red, respectively.
Figure 2. Plot of four markers found in 121 CGH samples of
Retinoblastoma, NOS (ICD-O 9510/3). Each sample
contains 862 dimensions. The gain and loss of copy numbers are plotted
in green and red, respectively.
Publications:
Jun Liu, Sanjay Ranka, Tamer Kahveci, Feature selection algorithms for CGH data, submitted to ISMB 2007.
Jun Liu, Sanjay Ranka, Tamer Kahveci, Markers improve clustering of CGH data, Bioinformatics, page btl624, 2006.
Jun Liu, Jaaved Mohammed, James Carter, Sanjay Ranka, Tamer Kahveci, Michael Baudis, Distance-based Clustering of CGH Data, Bioinformatics, 22:16, pages 1971-1978, 2006.
Sponsor:
This material is based upon work supported by the National Science
Foundation under Grant No. 0325459. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the National
Science Foundation.
Software:
We are currently in the process of developing a software package that
incorporates these algorithms. The software package will be available
as a web service starting Feb 28, 2007.