Wednesday August 30th, 2006
CSE Room 305
12:00 - 1:00 PM
|
Reference-Based
Indexing of Sequence Databases |
|
Jayendra Venkateswaran |
|
We
consider the problem of similarity search in a very large sequence database
with edit distance as the similarity measure. Given limited main memory, our
goal is to develop a reference-based index that reduces the number of costly
edit distance computations in order to answer a query. The idea in reference-based indexing is to
select a small set of reference sequences that serve as a surrogate for the
other sequences in the database. We consider two novel strategies for
selecting references as well as a new strategy for assigning references to
database sequences. Our experimental results show that our selection and
assignment methods far outperform competitive methods. For example, our
methods prune up to 20 times as many sequences as the Omni method, and as
many as 30 times as many sequences as frequency vectors. Our methods also
scale nicely for databases containing many and/or very long sequences. |