Approximate Matching in Seqeunce Databases using Distance-based Indexing
JAYENDRA VENKATESWARAN

We consider the problem of similarity search in sequence
databases with edit distance as the similarity measure. The dynamic
programming solution to the problem of finding the best alignment between
two sequences of lengths m and n has O(m.n) time and space complexity.  In
this paper, we consider the problem of reducing the number of string
comparisons during database searches. We select a small fraction of the
dataset referred to as 'reference objects'. The distances of strings in
the database with the reference objects are pre-computed. This is an
one-time cost for the database. Given a query sequence, the search
algorithm first computes its distance from the reference objects. Then the
pre-computed distances are used to prune the strings that are not
in the resultant set without false dismissals. We propose two heuristics
in selection of the reference objects. In addition to representing the
part of the database containing similar database strings, they increase
the pruning of distance calculations.