Tuesday, August 10, 2010

Solutions for package dealing with large alignment sequences

I have several solutions in mind. Actually the fourth solution is my favorate.

1. There is already a package called Bio::DB::Fasta in bioperl by Licoln Stein. This package indexes fasta file, and save the index into local DB file using Any_DBM. It includes another package in the same file called Bio::PrimarySeq::Fasta, which gives a similar interface to Bio::LocatableSeq and reads the database to retrieve sequences. This package is easy to use and just needs a few minor improvement. But it can only work on Fasta file with strict formatting requirements. So, if you want to use this package for other alignment formats, you have to transform the alignment file to fasta format first, which is not very practical.

2. Any_DBM tends to load everything into a local database. We can load all the sequences (but not the annotation) in the alignment file into a local database using Any_DBM. However, this may depend on the capacity of the database package. As far as I know, Berkeley DB only supports database file smaller than 2GB. But as you know, human genome is actually larger than that size. So it may still collapse on large sequence file.

3. In next-gen sequencing, alignment sequences are transformed using Burrows–Wheeler transform(BWT), e.g. in SAM and BAM. I know the algorithm of BWT, but I don't quite understand how it happens after it, e.g. how to save into binary file. Since SAM is a annotation rich format, it is not very practical to request every other alignment file transforming to SAM first and then use the BioPerl wrapper to convert into BAM … It is better to make a new package in BioPerl using BWT, but it is not very practical either.

4. Then I came into the idea that. When we read the sequences, we can just load the annotation information into memory (e.g. at least NSE), then we save all the indexes for the sequences use the Any_DBM file (as in the Bio::DB::Fasta) or even just in the memory (since the indexes will be much smaller than the actual sequences). Then when we call $seq->seq, we first read the index from DBM file, then seek() and read() from alignment file. This is very memory efficient and easy to implement, and probably fast during reading. It is just we need to implement a new SeqIO for every sequence format. But in the SeqIO, the only thing need to change is to save $index into $seq->seq() instead of the actual sequence. All the new SeqIO packages will share the same interface and database reading/writing methods.

5. The last idea comes from my colleague. There is a package called File::Map in perl (not standard). It uses mmap to make some connections between address space and your file, instead of loading all the file into memory. It is said to be fast and memory efficient. So we may try to use this and/or combine with file indexing. But as it is not a standard Perl package, this may limit the use of the new package.

My favorate fourth idea actually comes from Licoln’s Bio::DB::Fasta, I can try to edit this package to make it work for .fas and .aln alignment files. Generally it will be based on the existing code in Bio::DB::Fasta, but I may need to make it into new packages.

Bio::DB::SeqIO.pm (interface for new DB based SeqIO)
Bio::DB::SeqIO::Fasta.pm (read and index fasta file, major code will come from Bio::DB::Fasta)
Bio::DB::SeqIO::Clustal.pm (read and index clustal file, similar interface with Bio::DB::SeqIO::Fasta, but may change a few IO methods)

Bio::DB::Seq (similar interface to Bio::LocatableSeq, Bio::AlignIO and/or Bio::SeqIO can load sequences into this package instead of Bio::LocatableSeq).

No comments:

Post a Comment