Data Mining Methods for
Comparative Genomic Hybridization (CGH) data

Team:

Sanjay Ranka
Tamer Kahveci
Michael Baudis
Jun Liu

Description:

Numerical and structural chromosomal imbalances are among the most prominent and pathogenetically relevant features of neoplastic cells. Over the past decades, thousands of (molecular-) cytogenetic studies of human neoplasias have searched for insights into the genetic mechanisms of tumor development and for targets for pharmacologic intervention. Recurrent chromosomal aberration patterns are assumed to reflect the cooperation of a multitude of tumor-relevant genes in most malignant diseases.

One method for measuring genomic aberrations is Comparative Genomic Hybridization (CGH). CGH is a molecular-cytogenetic analysis method for detecting regions with genomic imbalances (gains or losses of DNA segments). Raw data from CGH experiments are expressed as the ratio of normalized fluorescence of tumor and reference DNA. Normalized CGH ratios that exceed a predefined upper threshold or fall below a predefined lower threshold are considered indicative of genomic gains or losses, respectively. The chromosomal CGH results are annotated in a reverse in-situ karyotype format that describes imbalanced genomic regions with reference to their chromosomal location. The CGH data of an individual tumor can be considered an ordered list of status values, where each value corresponds to a genomic interval (e.g., a single chromosomal band). The status can be expressed as a real number (positive, negative, or zero for gain, loss, or no aberration, respectively). A large repository of CGH data is available at Progenetix.
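
The thresholding step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the Progenetix data: the function name, the sample ratios, and the threshold values 1.25 and 0.75 are all assumptions chosen for the example, since the text only states that the thresholds are predefined.

```python
from typing import List

# Hypothetical thresholds for illustration only; the text says "predefined
# thresholds" without giving values.
GAIN_THRESHOLD = 1.25
LOSS_THRESHOLD = 0.75


def ratios_to_status(ratios: List[float],
                     gain: float = GAIN_THRESHOLD,
                     loss: float = LOSS_THRESHOLD) -> List[int]:
    """Map each normalized tumor/reference ratio to a status value:
    1 for gain, -1 for loss, 0 for no aberration."""
    status = []
    for ratio in ratios:
        if ratio > gain:
            status.append(1)
        elif ratio < loss:
            status.append(-1)
        else:
            status.append(0)
    return status


# One made-up sample with seven consecutive genomic intervals.
sample_ratios = [1.00, 1.31, 1.40, 0.98, 0.70, 0.68, 1.02]
sample_status = ratios_to_status(sample_ratios)
print(sample_status)
```

The result is the ordered list of status values mentioned above, with one entry per genomic interval in chromosomal order.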

A key feature of the CGH data is that consecutive values are highly correlated. Here is a summary of our research in this area:

  1. We investigated novel distance measures for mining such high-dimensional data. We developed three pairwise distance/similarity measures, namely Raw, Cosine, and Sim [Liu06a]. The first ignores the correlation between consecutive values, while the latter two can effectively leverage it. We tested these measures on cytogenetic (CGH) aberration data. Our results show that Sim consistently performs better than the other two measures because it effectively uses the correlation between consecutive values in the underlying dataset.
  2. We also developed a dynamic programming algorithm to identify a small set of important consecutive intervals called markers. We used these markers to develop two novel clustering strategies. Our results demonstrate that the markers we found represent the aberration patterns of CGH data very well and that they improve the quality of clustering significantly [Liu06b].
  3. We considered the problem of feature selection for multiclass CGH data and developed several feature selection methods based on support vector machines. Our results show that the classification accuracy achieved using only 15% of the features is comparable to or better than that achieved using all the features.
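
To make the idea in item 1 concrete, the sketch below compares two status vectors in two ways: a standard cosine similarity over the vectors, and a run-based score that treats each maximal stretch of consecutive gains (or losses) as one segment and counts overlapping segments of the same type. The run-based score is only an illustration of how consecutive correlation can be used; it is an assumption made for this example and may differ from the definition of Sim in [Liu06a]. All function names and the sample vectors are hypothetical.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[int], b: List[int]) -> float:
    """Standard cosine similarity between two status vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sample without aberrations has no direction
    return dot / (norm_a * norm_b)


def segments(status: List[int]) -> List[Tuple[int, int, int]]:
    """Return maximal runs of equal nonzero status values as
    (start, end_exclusive, value) tuples."""
    runs = []
    start = 0
    for i in range(1, len(status) + 1):
        if i == len(status) or status[i] != status[start]:
            if status[start] != 0:
                runs.append((start, i, status[start]))
            start = i
    return runs


def segment_overlap_similarity(a: List[int], b: List[int]) -> int:
    """Count pairs of same-type segments (gain/gain or loss/loss)
    that overlap in at least one genomic interval."""
    count = 0
    for s1, e1, v1 in segments(a):
        for s2, e2, v2 in segments(b):
            if v1 == v2 and s1 < e2 and s2 < e1:
                count += 1
    return count


# Two made-up samples over seven genomic intervals.
sample_a = [0, 1, 1, 0, -1, -1, 0]
sample_b = [1, 1, 0, 0, -1, 0, 0]
print(cosine_similarity(sample_a, sample_b))
print(segment_overlap_similarity(sample_a, sample_b))
```

Counting segments rather than individual intervals means that a long aberration contributes once instead of once per interval, which is one simple way to account for the correlation between neighboring values.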

  Figure 1. Plot of dissimilarDS, a CGH dataset containing 680 samples that belong to four cancer types. Samples are grouped by cancer types. Each sample contains 862 dimensions. The gain and loss of copy numbers are plotted in green and red, respectively.  

 

Figure 2. Plot of four markers found in 121 CGH samples of Retinoblastoma, NOS (ICD-O 9510/3). Each sample contains 862 dimensions. The gain and loss of copy numbers are plotted in green and red, respectively.

 

Publications:

  • Jun Liu, Sanjay Ranka, Tamer Kahveci, Feature selection algorithms for CGH data, submitted to ISMB 2007. (PDF)
  • Jun Liu, Sanjay Ranka, Tamer Kahveci, Markers improve clustering of CGH data, Bioinformatics, page btl624, 2006. (PubMed) (Bioinformatics)
  • Jun Liu, Jaaved Mohammed, James Carter, Sanjay Ranka, Tamer Kahveci, Michael Baudis, Distance-based Clustering of CGH Data, Bioinformatics, 22:16, pages 1971-1978, 2006. (PubMed) (Bioinformatics)

Sponsor:

This material is based upon work supported by the National Science Foundation under Grant No. 0325459. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Software:

We are developing a software package that incorporates these algorithms. The software package will be available as a web service starting Feb 28, 2007.

Last updated: June 23, 2005 10:28 EDT