CAP5510 Bioinformatics, Fall 2009

(Project-Milestone)

 

Name: Jangsung Lee

ID: 4616-9181

 

Project Title

Sequence Compression.

 

Team Members

Individual Project.

 

Abstract

Since only four nucleotides exist, customized compression techniques for the DNA sequence may possess a higher compression ratio than those of the existing techniques. Loss compression methods, such as DCT, may get a higher ratio than lossless methods. However, because data should be consistent, I will use entropy compression method which is Huffman encoding, and several techniques, which are focused on the DNA sequence for the better performance.

 

A brief itemized plan of action

I will implement Huffman encoding for a compression of DNA sequences. If four nucleotides have the same portion in the sequences, the probability is 0.25% for each letter. By using Huffman encoding, each letter can be encoded as ¡®00¡¯,¡¯01¡¯,¡¯11¡¯,¡¯10¡¯. Because an alphabet represented as 7bits, Huffman encoding saves 5bits for each nucleotide. However, if one of the letters occupies more than 50% of the entire sequence, two same consecutive letters can be added in a Huffman code. For example, if A occurs more than 50% of the sequence, AA sequence can be added to reduce more bits. The proper number and length of sequence is selected dynamically by the probability table. The probability table will be achieved by experiments. Also, a balanced binary tree will be used. The cost of creating this tree is more expensive than an unbalanced binary tree. However, a balanced binary tree can save decoding time, if number of sequences(leaf nodes) is relatively large. Decision for using a balanced binary tree can be made by the result of experiment. Finally, I will check compression ratio as well as encoding&decoding time for each case by comparing other compress applications.

 

Planned workload distribution

It is an individual Project.

 

List of papers

1. A Simple Statistical Algorithm for Biological Sequence Compression

2. Compression of Strings with Approximate Repeats

3. An Efficient Homology Search Method for DNA Sequence Databases

4. DNACompress: fast and effective DNA sequence compression