A suffix-array based algorithm for clustering expression data.

 

Abstract:

With the advent of Second generation sequencing technology like pyrosequencing, the research on expression data has been revived. For example, such kind of sequencing technology provides the opportunity to study the transcriptomes of organisms for which good quality genomes are not known.

Clustering such data remains a significant challenge and a much larger data sets and a whole lot of new different error profiles add upon it. The previously used all-versus-all comparison of sequence are not at all optimum for large data sets.  Therefore to eliminate this we need to develop a new filter which can reduce this complexity of operation.

Thus we introduce a new filter for string similarity based on multiple long exact matches between 2 strings. The additional constraint is that the matches should be apart by a sufficient distance. This alleviates the  drawbacks of computation encountered in the all-versus-all filter.

 

The main goal here is to find a partitional clustering such that sequences derived from the same gene are members of the same cluster. Given a set S of strings over {A,C,G,T} (sequences derived from expression data), we want to find a clustering, i.e. a partition, of S such that two sequences end up in the same cluster if and only if they have been derived from the same gene.

 

KABOOM Filter:

It is a (k, x, y) multiword filter. This filter defines the two sequences to be similar if they share x substrings of length k with the additional constraint that the distance between the first and last is at least y. This is the essence of the algorithm.

 

Our algorithm implements the following   :

 

1.Introducing new heuristic filter for sequence similarity.

2.Implementing this heuristic and eliminates the need for all-versus-all comparison.

 

 

If time permits we compare our algorithm to other current tools and show that it is very competitive both with respect to quality and run time.

 

Team Members:

Ameya Bondre, Anuj Shroff

 

Actions tasks:

Program would be implemented in JAVA.

1. Understanding the functioning of Suffix arrays and Suffix Trees.

2. In our approach we use the dissimilarity measure. Initial research of the project would go in understanding them and choosing the best out of them for example q-gram distance and windowed d^2.

3. Because edit distance or windowed d^2 is quadratic in the length of the two strings, often filters are used. To get an efficient filter we need to study the few filters H filter and k-word similarity filter so that we can develop a new filter which is a combination of both these filters.

 

4. Implementing the actual algorithm as mentioned in the paper and reference from the internet.

5. Print the clusters on standard output in compact form.

 

6. Try to print the above result as a histogram

 

7. Try to compare the above method with other existing tools for clustering expression data.

 

 

Assumptions:

 

Our algorithm makes a significant use of a modified suffix arrays.

 

We are focusing on Expression clustering for which a reference genome is not known (also called ab initio or de novo clustering).

 

Work Distribution:

 

The project will be a group effort and each project member will have equal share of their participation in all aspects of the project. Individual work distribution will be uploaded after each action is performed.

 

Work done:

 

1. Started implementing the suffix tree and array for a better understanding of the data structure.

2. Started looking at the dissimilarity measures and how they help in improving the algorithm.

3. Have downloaded various sequence example files for ready input.

 

References:

 

1. KABOOM! A new suffix-array based algorithm for clustering expression data Scott Hazelhurst, Zsuzsanna Lipt

2. Burkhardt, S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., and Vingron, M. (1999). q-gram based database searching using a suffix array (QUASAR).

3. Hazelhurst, S. (2008). Algorithms for clustering EST sequences: the wcd tool.

 

Will be updated as and when required.