Mining Comparative Genomic Hybridization (CGH) data

Numerical and structural chromosomal imbalances are among the most prominent and pathogenetically relevant features of neoplastic cells. Over the past decades, thousands of (molecular) cytogenetic studies of human neoplasias have sought insights into the genetic mechanisms of tumor development and targets for pharmacologic intervention. Recurrent chromosomal aberration patterns are assumed to reflect the cooperation of a multitude of tumor-relevant genes in most malignant diseases.

One method for measuring genomic aberrations is Comparative Genomic Hybridization (CGH), a molecular-cytogenetic analysis method for detecting regions with genomic imbalances (gains or losses of DNA segments). Raw data from CGH experiments is expressed as the ratio of normalized fluorescence of tumor DNA to reference DNA. Normalized ratios above or below predefined thresholds are considered indicative of genomic gains or losses, respectively. Chromosomal CGH results are annotated in a reverse in-situ karyotype format that describes imbalanced genomic regions with reference to their chromosomal location.

The CGH data of an individual tumor can be considered an ordered list of status values, where each value corresponds to a genomic interval (e.g., a single chromosomal band). The status can be expressed as a real number (positive, negative, or zero for gain, loss, or no aberration, respectively). A large repository of CGH data is available at Progenetix. A key feature of CGH data is that consecutive values are highly correlated. Here is a summary of our research in this area:
  1. We investigated novel distance measures for mining such high-dimensional data. We developed three pairwise distance/similarity measures, namely Raw, Cosine, and Sim [Liu06a]. The first ignores the correlation between consecutive values, while the latter two leverage it. We tested these measures on cytogenetic (CGH) aberration data. Our results show that Sim consistently outperforms the other two measures because it effectively exploits the correlation between consecutive values in the underlying dataset.
  2. We also developed a dynamic programming algorithm to identify a small set of important consecutive intervals, called markers, and used these markers to develop two novel clustering strategies. Our results demonstrate that the markers represent the aberration patterns of CGH data very well and significantly improve the quality of clustering [Liu06b].
  3. We considered the problem of feature selection for multiclass CGH data and developed several feature selection methods based on support vector machines. Our results show that classification using only 15% of the features achieves accuracy comparable to or better than classification using all the features.
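
To make the data representation described above concrete, here is a minimal Python sketch that turns normalized ratios into an ordered list of status values and compares two tumors position by position. The function names, the thresholds (1.25 and 0.75), and the sample ratios are illustrative assumptions, not values taken from our papers or tools. The two comparison functions are generic stand-ins: they treat each interval independently and do not reproduce the Raw, Cosine, or Sim measures of [Liu06a], which are defined there.

```python
import math


def to_status(ratios, gain_threshold=1.25, loss_threshold=0.75):
    """Map normalized tumor/reference ratios to status values.

    Returns +1 for gain, -1 for loss, 0 for no aberration.
    The default thresholds are assumptions chosen for illustration.
    """
    status = []
    for ratio in ratios:
        if ratio > gain_threshold:
            status.append(1)
        elif ratio < loss_threshold:
            status.append(-1)
        else:
            status.append(0)
    return status


def shared_aberrations(a, b):
    """Count intervals where both samples carry the same nonzero status.

    Hypothetical position-wise measure; it ignores any correlation
    between consecutive intervals.
    """
    return sum(1 for x, y in zip(a, b) if x != 0 and x == y)


def cosine_similarity(a, b):
    """Plain cosine similarity between two status vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sample with no aberrations matches nothing
    return dot / (norm_a * norm_b)


# Made-up ratios for six consecutive genomic intervals of two tumors.
tumor_a = to_status([1.0, 1.4, 1.5, 1.0, 0.6, 0.5])
tumor_b = to_status([1.0, 1.3, 1.0, 1.0, 0.7, 1.0])

print(tumor_a)
print(tumor_b)
print(shared_aberrations(tumor_a, tumor_b))
print(round(cosine_similarity(tumor_a, tumor_b), 4))
```

Both functions would give the same result if the intervals of both samples were shuffled in the same way, so neither can take advantage of the fact that neighboring intervals tend to share a status.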

Software

CGH Mining Tools

People


Publications


Tamer Kahveci
Last modified: Thu Apr 3 09:52:14 EDT 2008