GSOC2010. BioPerl Alignment Subsystem Refactoring

Monday, September 6, 2010

The final code is here

http://code.google.com/p/google-summer-of-code-2010-obf/downloads/detail?name=Jun_Yin.tar.gz

Tuesday, August 24, 2010

The project is successfully finished. Online HOWTO is here:

http://www.bioperl.org/wiki/HOWTO:Multiple_Sequence_Alignment

For future plan, I will continue with the proposed idea of constructing a package of loading sequences in a memory efficient way.

Wednesday, August 18, 2010

Changes in the Bio::SimpleAlign

Here is a summary of the changes in Bio::SimpleAlign

https://spreadsheets.google.com/ccc?key=tOS-XQwKWdrw1geS4SvTg1A&hl=en&authkey=CP28jYYE#gid=0

Saturday, August 14, 2010

Bio::LocatableSeq end checking inconsistency (problem unsolved) conversation with Chris Fields on IRC

This end problem is still a tricky unsolved issue. Here is the discussion of Chris and me in the IRC channel. The ideas here may be used for further improvement.

01[17:22] Hi, pyrimidine. It seems now SimpleAlign has passed nearly all the old tests,except the end position test
01[17:23] The reason is in my email to the bioperl mail list... it is not easy to solve
[17:24] <@pyrimidine> my feeling is, truthfully, that the sequence should be immutable
[17:25] <@pyrimidine> and that any calls to modify the seq using $ls->seq('ATTTGA') beyond the constructor should ie
[17:25] <@pyrimidine> *die
[17:25] <@pyrimidine> this solves a LOT of problems
01[17:25] yep ... sounds cool :D
01[17:26] Then everytime when you modify a sequence, you can start a new sequence object. This may solve some problems
[17:26] <@pyrimidine> any methods that modify the sequence should return a new LocatableSeq with the modifications
[17:26] <@pyrimidine> yes
[17:26] <@pyrimidine> beat me to it
[17:27] <@pyrimidine> :)
[17:29] <@pyrimidine> I posed that question to Heikki L. at BOSC (switching to mainly immutable objects) and his initial though was 'isn't that a lot of work'
[17:29] <@pyrimidine> then, after thinking it over, he realized it solves a lot of state-based problems in bioperl
01[17:30] yep, that is true. It makes sense
[17:30] <@pyrimidine> one issue: setting the sequence and start or end is probably fine, but setting seq, start, and end is tricky
01[17:31] Because when you modify the sequence, the sequence is changed, and it is no sense to use the old information
[17:31] <@pyrimidine> right
05[17:31] -ChanServ- [#gsoc] Welcome to the Google Summer of Code channel! Please note that this channel is logged.
[17:32] <@pyrimidine> but,
[17:32] <@pyrimidine> $ls = Bio::LocatableSeq->new(-seq => 'ATGA-AG', -start => 2, -end => 7)
[17:32] <@pyrimidine> is a bit of a snare
[17:33] <@pyrimidine> end is defined based on the sequence itself
01[17:33] yep
[17:33] <@pyrimidine> so, I think passing in end should be more a validation step instead of a set
01[17:34] This is one is ok, it will still pass the check from $ls->end($ls->end)
[17:35] <@pyrimidine> calling end() and length() should probably just use the ungapped length (with end offset by the length)
[17:35] <@pyrimidine> oops
[17:35] <@pyrimidine> end offset by the start and the length
01[17:35] yep, this is the current check in end()
[17:36] <@pyrimidine> so, what do we do in this situtation:
[17:36] <@pyrimidine> $ls = Bio::LocatableSeq->new(-seq => '', -start => 2, -end => 7)
[17:36] <@pyrimidine> (a virtual sequence)
[17:36] <@pyrimidine> I tend to think, in this case, we don't use LocatableSeq
[17:37] <@pyrimidine> have something else
[17:37] <@pyrimidine> and keep the logic for that separate
01[17:38] ok
[17:38] <@pyrimidine> this popped up with MUMmer I think
01[17:39] Yes, we should have the location information stored in a seperate object in some case
01[17:40] Location for every residue in the current sequence from the original sequence
01[17:40] But thi will make things even more complicated
[17:40] <@pyrimidine> well, in that case you could switch to using features
01[17:41] But anyway, it is the basic idea I have for the new Bio::DB::Seq, a location based module to store alignment sequences
[17:41] <@pyrimidine> sounds good

Bio::LocatableSeq end checking inconsistency (problem unsolved)

The code (Bio::SimpleAlign) I re-implemented now has passed nearly all the test, except a few tests on seq/start-end testing. But here comes a problem. This may be an old issue, that the Bio::LocatableSeq end assignment and checking are inconsistent.

The current end checking method is based on:
$end=$seq->_ungapped_len+$seq->start-1
However, this checking may not fit the real world case.

The inconsistency usually happens when a few columns of the sequence are removed.

For example:
my $a = Bio::LocatableSeq->new(
-id => 'a',
-strand => 1,
-seq => '-tcgatc-atcgatcg',
-start => 30,
-end => 43
);

If we remove the 1st, 8th and the last columns

$a->seq() will be ‘tcgatcatcgatc’
$a->_ungapped_len==12

Actually, in the real world, the first residue will still be 30 (the old $seq->start), and the last residue is the residue before the 43 (the old $seq->end), thus 42.

But if you call a validation, the calculation is $a->_ungapped_len+$a->start-1=12+30-1=41
So the reassignment of the $seq->end will not pass the validation…

So unless you save the information to a new sequence object, the original position information will be lost anyway. But in some cases, we have to change the sequence in its original sequence object ….

What is your suggestion on this issue?
A. pass the test and lose the information #convenient in coding but the start-end annotation is not right any more
B. keep the information and forget the test #the object will still remember where the last residue was in the original sequence. But is it really meaningful at all? Because all the other residues may come from nowhere
C. Neither of above #any other suggestions?

Tuesday, August 10, 2010

Solutions for package dealing with large alignment sequences

I have several solutions in mind. Actually the fourth solution is my favorate.

1. There is already a package called Bio::DB::Fasta in bioperl by Licoln Stein. This package indexes fasta file, and save the index into local DB file using Any_DBM. It includes another package in the same file called Bio::PrimarySeq::Fasta, which gives a similar interface to Bio::LocatableSeq and reads the database to retrieve sequences. This package is easy to use and just needs a few minor improvement. But it can only work on Fasta file with strict formatting requirements. So, if you want to use this package for other alignment formats, you have to transform the alignment file to fasta format first, which is not very practical.

2. Any_DBM tends to load everything into a local database. We can load all the sequences (but not the annotation) in the alignment file into a local database using Any_DBM. However, this may depend on the capacity of the database package. As far as I know, Berkeley DB only supports database file smaller than 2GB. But as you know, human genome is actually larger than that size. So it may still collapse on large sequence file.

3. In next-gen sequencing, alignment sequences are transformed using Burrows–Wheeler transform(BWT), e.g. in SAM and BAM. I know the algorithm of BWT, but I don't quite understand how it happens after it, e.g. how to save into binary file. Since SAM is a annotation rich format, it is not very practical to request every other alignment file transforming to SAM first and then use the BioPerl wrapper to convert into BAM … It is better to make a new package in BioPerl using BWT, but it is not very practical either.

4. Then I came into the idea that. When we read the sequences, we can just load the annotation information into memory (e.g. at least NSE), then we save all the indexes for the sequences use the Any_DBM file (as in the Bio::DB::Fasta) or even just in the memory (since the indexes will be much smaller than the actual sequences). Then when we call $seq->seq, we first read the index from DBM file, then seek() and read() from alignment file. This is very memory efficient and easy to implement, and probably fast during reading. It is just we need to implement a new SeqIO for every sequence format. But in the SeqIO, the only thing need to change is to save $index into $seq->seq() instead of the actual sequence. All the new SeqIO packages will share the same interface and database reading/writing methods.

5. The last idea comes from my colleague. There is a package called File::Map in perl (not standard). It uses mmap to make some connections between address space and your file, instead of loading all the file into memory. It is said to be fast and memory efficient. So we may try to use this and/or combine with file indexing. But as it is not a standard Perl package, this may limit the use of the new package.

My favorate fourth idea actually comes from Licoln’s Bio::DB::Fasta, I can try to edit this package to make it work for .fas and .aln alignment files. Generally it will be based on the existing code in Bio::DB::Fasta, but I may need to make it into new packages.

Bio::DB::SeqIO.pm (interface for new DB based SeqIO)
Bio::DB::SeqIO::Fasta.pm (read and index fasta file, major code will come from Bio::DB::Fasta)
Bio::DB::SeqIO::Clustal.pm (read and index clustal file, similar interface with Bio::DB::SeqIO::Fasta, but may change a few IO methods)

Bio::DB::Seq (similar interface to Bio::LocatableSeq, Bio::AlignIO and/or Bio::SeqIO can load sequences into this package instead of Bio::LocatableSeq).

Friday, August 6, 2010

structure for packages retrieving online alignment sequences

The aim of this package is to provide convenient methods to retrieving online alignment sequences for the BioPerl users. The alignment sequences are converted into Bio::SimpleAlign object after the retrieval, which will be easy to manipulate and write to local disk.

Here is the structure of the packages:
Packages
Bio::DB::Align (interface, and calling other packages)
Bio::DB::Align::Pfam (retrieving alignment from Pfam)
Bio::DB::Align::Rfam (retrieving alignment from Rfam)
Bio::DB::Align:Prosite (retrieving alignment from Prosite)
Bio::DB::Align:ProtClustDB (retrieving alignment from Entrez Protein Clusters Database)

Usually four methods are provided for each package:
Methods
get_Aln_by_id (retrieving alignment by id and returns Bio::SimpleAlign object)
get_Aln_by_acc (retrieving alignment by acession and returns Bio::SimpleAlign object) (Rfam and Prosite only supports this method)
id2acc (id to accession conversion)
acc2id (accession to id conversion)

These packages are built dependent on LWP::UserAgent, HTTP::Request and Bio::DB::GenericWebAgent. Bio::DB::Align::ProtClustDB is dependent on Bio::DB::EUtilities.