We consider the problem of similarity search in a very large
sequence database with edit distance as the similarity measure.
Given limited main memory, our goal is to develop
a reference-based index that reduces the number of costly
edit distance computations in order to answer a query. The
idea in reference-based indexing is to select a small set of
reference sequences that serve as a surrogate for the other
sequences in the database. We consider two novel strategies
for selecting references as well as a new strategy for assigning
references to database sequences. Our experimental results
show that our selection and assignment methods far outperform
competitive methods. For example, our methods prune
up to 20 times as many sequences as the Omni method, and
as many as 30 times as many sequences as frequency vectors.
Our methods also scale nicely for databases containing
many and/or very long sequences.