MAP: Searching Large Genome Databases

Most of the important biological applications require comparison of large genome strings.Current techniques suffer from both disk I/O and computational cost because of extensive memory requirements and large candidate sets. We propose an efficient technique for alignment of large genome strings. Our technique precomputes the associations between the database strings and the query string. These associations are used to prune the database-query substring pairs that do not contain similar regions. We use a hash table to compare the unpruned regions of the query and database strings. The cost of the ensuing search is determined by how the hash table is constructed. We present a dynamic strategy that optimizes the random disk I/O needed for accessing the hash table. Our technique gives the user the freedom to select specific regions within the search space and assign different accuracy levels to these regions prior to search. It also provides the user a coarse grain visualization of the similarity pattern quickly before the actual search. The experimental results show that our technique reduces the cost of genome alignment significantly.