A suffix-array
based algorithm for clustering expression data.
Abstract:
With the advent
of Second generation sequencing technology like pyrosequencing, the research on
expression data has been revived. For example, such kind of sequencing
technology provides the opportunity to study the transcriptomes of organisms
for which good quality genomes are not known.
Clustering such
data remains a significant challenge and a much larger data sets and a whole
lot of new different error profiles add upon it. The previously used
all-versus-all comparison of sequence are not at all optimum for large data
sets. Therefore to eliminate this we
need to develop a new filter which can reduce this complexity of operation.
Thus we
introduce a new filter for string similarity based on multiple long exact
matches between 2 strings. The additional constraint is that the matches should
be apart by a sufficient distance. This alleviates the drawbacks of computation encountered in the
all-versus-all filter.
The main goal here is to find a partitional
clustering such that sequences derived from the same gene are members of the
same cluster. Given a set S of strings over {A,C,G,T} (sequences
derived from expression data), we want to find a clustering, i.e. a partition,
of S such that two sequences end up in the same cluster if and only if they
have been derived from the same gene.
KABOOM Filter:
It is a (k, x, y)
multiword filter. This filter defines the two sequences to be similar if they
share x substrings of length k with the additional constraint that the distance
between the first and last is at least y. This is the essence of the algorithm.
Our algorithm
implements the following :
1.Introducing
new heuristic filter for sequence similarity.
2.Implementing
this heuristic and eliminates the need for all-versus-all comparison.
If time permits we
compare our algorithm to other current tools and show that it is very competitive
both with respect to quality and run time.
Team Members:
Ameya Bondre, Anuj
Shroff
Actions
tasks:
Program would be
implemented in JAVA.
1. Understanding the
functioning of Suffix arrays and Suffix Trees.
2. In our approach we
use the dissimilarity measure. Initial research of the project would go in
understanding them and choosing the best out of them for example q-gram
distance and windowed d^2.
3. Because edit
distance or windowed d^2 is quadratic in the length of the two strings, often
filters are used. To get an efficient filter we need to study the few filters H
filter and k-word similarity filter so that we can develop a new filter which
is a combination of both these filters.
4. Implementing the
actual algorithm as mentioned in the paper and reference from the internet.
5. Print the
clusters on standard output in compact form.
6. Try to print
the above result as a histogram
7. Try to
compare the above method with other existing tools for clustering expression
data.
Assumptions:
Our algorithm
makes a significant use of a modified suffix arrays.
We are focusing
on Expression clustering for which a reference genome is not known (also called
ab initio or de novo clustering).
Work Distribution:
The project will
be a group effort and each project member will have equal share of their
participation in all aspects of the project. Individual work distribution will
be uploaded after each action is performed.
Work done:
1. Started
implementing the suffix tree and array for a better understanding of the data
structure.
2. Started
looking at the dissimilarity measures and how they help in improving the
algorithm.
3. Have
downloaded various sequence example files for ready input.
References:
1. KABOOM! A new suffix-array based algorithm
for clustering expression data Scott Hazelhurst, Zsuzsanna Lipt
2. Burkhardt,
S., Crauser, A., Ferragina, P., Lenhof, H.-P., Rivals, E., and Vingron, M.
(1999). q-gram based database
searching using a suffix array (QUASAR).
3. Hazelhurst,
S. (2008). Algorithms for clustering EST sequences: the wcd tool.
Will be updated
as and when required.