CAP5510 Bioinformatics,
Fall 2009
(Project-Milestone)
Name: Jangsung Lee
ID: 4616-9181
Project Title
Sequence
Compression.
Team Members
Individual Project.
Abstract
Since
only four nucleotides exist, customized compression techniques for the DNA
sequence may possess a higher compression ratio than those of the existing techniques.
Loss compression methods, such as DCT, may get a higher ratio than lossless
methods. However, because data should be consistent, I will use entropy compression
method which is Huffman encoding, and several techniques, which are focused on the
DNA sequence for the better performance.
A
brief itemized plan of action
I will implement Huffman encoding for a
compression of DNA sequences. If four nucleotides have
the same
portion in the sequences, the probability is 0.25% for
each letter. By using Huffman encoding, each letter can be encoded as ¡®00¡¯,¡¯01¡¯,¡¯11¡¯,¡¯10¡¯.
Because an alphabet represented as 7bits, Huffman encoding saves 5bits for each
nucleotide. However, if one of the letters occupies more than 50% of the entire sequence, two same consecutive
letters can be added in a Huffman code. For example, if A
occurs more than 50% of the sequence, AA sequence can be added to reduce more
bits.
The proper number and length of sequence is selected dynamically by the
probability table. The probability table will be achieved by experiments. Also,
a balanced binary tree will be used. The cost of creating this tree is more
expensive than an unbalanced binary tree. However, a balanced binary tree can
save decoding time, if number of sequences(leaf nodes)
is relatively large. Decision for using a balanced binary tree can be made by the
result of experiment. Finally, I will check compression
ratio as well as encoding&decoding time for each
case by comparing other compress applications.
Planned
workload distribution
It is an individual Project.
List
of papers
1. A Simple Statistical
Algorithm for Biological Sequence Compression
2. Compression of Strings with Approximate Repeats
3.
An Efficient Homology Search Method for DNA Sequence
Databases
4. DNACompress: fast and effective
DNA sequence compression