GSOC2010. BioPerl Alignment Subsystem Refactoring: June 2010

Friday, June 18, 2010

Updated summary of Bio::Align 1806

Here is an updated summary of methods in Bio::Align.
https://spreadsheets.google.com/ccc?key=tOORpiSl-WZCdA_oGQMKD9w#gid=0

When defining a method in the package, several things need to be considered:
1. Arguments
2. Return value
3. Whether to change the object calling the method

And, sometimes,
4. Backward compatibility

The summary therefore lists these attributes for each method. It is not very human-readable at the moment, but it will be the basis of the new Bio::Align HOWTO.
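To illustrate the third point, here is a minimal sketch of two methods that take the same argument but differ in whether they change the calling object. Toy::Aln and both method names are hypothetical and are not part of BioPerl; they only show the distinction the summary records.

```perl
use strict;
use warnings;

package Toy::Aln;    # hypothetical class, not part of BioPerl

sub new {
    my ($class, @rows) = @_;
    return bless { rows => [@rows] }, $class;
}

sub rows { return @{ $_[0]{rows} } }

# Argument: a gap character. Return value: a NEW object.
# The calling object is left unchanged.
sub without_gaps {
    my ($self, $gap) = @_;
    my @stripped = map { my $r = $_; $r =~ s/\Q$gap\E//g; $r } $self->rows;
    return ref($self)->new(@stripped);
}

# Same argument, but this one modifies the calling object and returns it.
sub strip_gaps {
    my ($self, $gap) = @_;
    s/\Q$gap\E//g for @{ $self->{rows} };
    return $self;
}

package main;

my $aln  = Toy::Aln->new('MK-LV', 'M--LV');
my $copy = $aln->without_gaps('-');
print join(',', $aln->rows),  "\n";    # original rows, gaps still present
print join(',', $copy->rows), "\n";    # gap-free copy
$aln->strip_gaps('-');
print join(',', $aln->rows),  "\n";    # original is now gap-free too
```

Recording this distinction for every method matters for backward compatibility, because existing scripts may rely on one behaviour or the other.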

Most of the newly proposed methods are now finished. By 22/06, all the new methods should be finished, which will complete the first milestone.

Thursday, June 17, 2010

Naming conventions in BioPerl

I did not do much coding today. Instead, I reread some chapters on OOP in Perl Programming, and the programming conventions in BioPerl.
http://www.bioperl.org/wiki/Advanced_BioPerl

So there are some points that need to be considered in the BioPerl package, especially the naming conventions.

1. For methods returning objects, the object type in the name should be capitalized.
For example, each_Seq instead of each_seq.
2. For methods returning a list of results, the name should be in the plural form.
For example, select_Seqs instead of select.
3. Do not use "each" in method names, because it may cause confusion.
For example, next_LocatableSeq instead of each_seq.
(But each_seq seems easier to understand. I need to think about this point.)
4. The methods implemented in Bio::SimpleAlign should also be declared in Bio::Align::AlignI, because AlignI is the interface package.

Monday, June 14, 2010

Conventions in Bio::Align package (Updating 1606)

Several conventions should be set up to avoid potential conflicts among methods or packages.

1. The position of sequences/columns in the alignment should be 1 based.

2. The selection of sequences/columns should be list based, for example, (1,5,8..10,21..50).

3. Special characters should be read/written by methods in the Bio::Align package. For example:
$aln->gap_char;
$aln->match_char;
$aln->missing_char;
$aln->mask_char;
The selection of characters can only be made from [0-9A-Za-z\*\-\.=~\\/\?], as configured by Bio::PrimarySeq::seq.

4. The ordering of the sequences in the alignment should be based on $seq->{'_order'}

5. The length of a sequence should be the same in all methods. At the moment, it can be obtained from $aln->length, CORE::length, and $seq->length(), and each of them calculates it differently.

The length of a sequence should be defined as the number of alphabetic characters. The length of the alignment should be defined as the length of the longest sequence (including special characters, e.g. '-', '?').

6. Function names should be clearer. For example, each_seq should be each_Seq to show that it retrieves sequence objects instead of the sequences themselves.
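Conventions 1, 2 and 5 can be sketched on plain strings. This is only an illustration with hypothetical helper names (select_columns, residue_length, alignment_length); none of them are existing Bio::Align methods, and the real implementation would work on sequence objects.

```perl
use strict;
use warnings;

# Hypothetical helper: pick columns from a gapped sequence string
# using a list of 1-based positions, e.g. (1, 3..5).
sub select_columns {
    my ($seq_string, @positions) = @_;
    my $len = length($seq_string);
    for my $pos (@positions) {
        die "Position $pos is outside 1..$len\n" if $pos < 1 || $pos > $len;
    }
    # substr is 0-based, so shift every 1-based position by one
    return join '', map { substr($seq_string, $_ - 1, 1) } @positions;
}

# Sequence length as defined above: alphabetic characters only
sub residue_length {
    my ($seq_string) = @_;
    return ($seq_string =~ tr/A-Za-z//);
}

# Alignment length as defined above: the longest row,
# special characters included
sub alignment_length {
    my @rows = @_;
    my $max  = 0;
    for my $row (@rows) {
        $max = length($row) if length($row) > $max;
    }
    return $max;
}

my @rows = ('MK-LV?A', 'MKTLV-A', 'M--LV');
print select_columns($rows[0], 1, 3 .. 5), "\n";    # columns 1, 3, 4, 5
print residue_length($rows[0]), "\n";               # letters only
print alignment_length(@rows), "\n";                # longest row
```

Keeping the two length definitions in separate functions makes it explicit which one a caller is asking for.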

Another option for memory efficient way

I just happened to find the Bio::Root::Storable package in BioPerl. I have not really tried it yet, but it is worth a try. It may provide a way to read in sequences in a memory-efficient way, e.g. by storing the objects on the local disk instead of in memory.
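As a rough idea of what storing an object on disk looks like, here is a minimal sketch using the core Storable module directly. I have not tested Bio::Root::Storable itself, so its interface is not shown here; the record layout and the helper names to_disk and from_disk are assumptions made for illustration.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Hypothetical helper: serialize a data structure to a temporary file
# and return the file path.
sub to_disk {
    my ($data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    close $fh;
    store($data, $path);
    return $path;
}

# Hypothetical helper: read the structure back only when it is needed.
sub from_disk {
    my ($path) = @_;
    return retrieve($path);
}

# A plain hash standing in for a sequence object
my $path = to_disk({ id => 'seq1', seq => 'MK-LV?A' });

# The in-memory copy is gone at this point; only the path is kept.
my $record = from_disk($path);
print "$record->{id}: $record->{seq}\n";
```

The trade-off is extra disk I/O every time a sequence is needed, in exchange for keeping only file paths in memory.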

Example code of the new alignment method/package

I just found that BioPerl already has a package, "Bio::DB::Fasta", which loads a FASTA file through an index (AnyDBM_File). Using this package, we may implement a method in Bio::AlignIO that generates Bio::PrimarySeq objects for the FASTA sequences, and they may inherit most of the methods in Bio::SimpleAlign.

At the moment, this code still doesn't work, because the add_seq function in Bio::SimpleAlign does not support Bio::PrimarySeq objects. This may be solved in the future.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::AlignIO;
use Bio::DB::Fasta;
use Bio::SimpleAlign;

my $in = Bio::AlignIO->new(
    -file   => "clustalw2-pumpkin_aa_edi.fst",
    -format => 'fasta',
);

# next_locatable_aln is the proposed new method defined below
my $aln = $in->next_locatable_aln;

print $aln->num_sequences, "\n";
print $aln->percentage_identity, "\n";

foreach my $seq ($aln->each_seq()) {
    # $seq will only load its sequence when we call $seq->seq()
    # do something
}

#############
# New method in Bio::AlignIO::fasta
#############
use Bio::DB::Fasta;

sub next_locatable_aln {
    my $self = shift;

    my $aln = Bio::SimpleAlign->new();

    # Index the FASTA file, then walk through the indexed entries
    my $db     = Bio::DB::Fasta->new($self->{"_file"});
    my $stream = $db->get_PrimarySeq_stream();

    # next_seq returns one sequence per call, so loop until it is exhausted
    while (my $seq = $stream->next_seq()) {
        $aln->add_seq($seq);
    }

    # Return the alignment object itself, not the number of sequences
    return $aln;
}

Wednesday, June 9, 2010

Adjusted project plan

The project plan is adjusted according to the current working progress and understanding of the project.

07/06-23/06 (Milestone 1)
Implementation of the methods proposed for Bio::Align
24/06-04/07
Conference in Vision Camp Young Researcher Symposium, Germany
05/07-12/07 (Milestone 2)
Major refactor of Bio::Align package. Moving the methods to the right package.
12/07-16/07 (Milestone 3)
Implement methods retrieving online sequence file, e.g. pfam. These methods may fall into Bio::DB
19/07-30/07 (Milestone 4)
Implement new Bio::Seq package, storing sequences in local file instead of in the memory
02/08-06/08 (Documentation)
Write an online HOWTO giving an introduction of how to use Bio::Align
09/08-20/08 (Final)
Final testing, improvement of the documentation

Git working environment

After carefully reading the Git introductions listed in my previous post, I have created a Git working environment. The commands are:

$ git clone http://github.com/bioperl/bioperl-live.git
$ cd bioperl-live
$ git remote add upstream http://github.com/bioperl/bioperl-live.git
$ git remote add jyinrepo git@github.com:yinjun111/bioperl-live.git
$ git checkout -b workingbranch

After this, I will work on workingbranch.

The commands during the work can be:
$ git commit -m "xxx"
$ git push jyinrepo workingbranch
$ git pull upstream master #update the clone

Thursday, June 3, 2010

Version control using Git

A summary of Git tutorials

How to set up an ssh-key
http://help.github.com/msysgit-key-setup/

How to fork a project
http://help.github.com/forking/

Easy version control with Git
http://net.tutsplus.com/tutorials/other/easy-version-control-with-git/

BioPerl in GitHub
http://www.bioperl.org/wiki/Using_Git

Git tutorial
http://www.kernel.org/pub/software/scm/git/docs/gittutorial.html
http://www.kernel.org/pub/software/scm/git/docs/gittutorial-2.html

Pro Git
http://progit.org/book/

Wednesday, June 2, 2010

An open question of how to be memory efficient

One objective of the project is to write next_locatable_aln and next_locatable_seq functions that read in the sequences line by line, instead of reading in all the sequences at the same time. But it seems this cannot be done simply by reading one sequence at a time, because most of the methods in Bio::SimpleAlign need to know the attributes of all sequences at once. So an alternative is to read in each sequence and save it to a temporary file.

Here are a few solutions to implement that:

1. use DB_File;
2. use Storable;
3. use Tie::File;

It seems the Storable and Tie::File packages are the better choices, but I have not tried them yet. This is just a record for now. The implementation will be done later.
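For the record, here is a minimal sketch of the Tie::File option, assuming one aligned sequence per line in a temporary file. The helper names write_rows and fetch_row are hypothetical, and the row number is 1-based by my own choice. Tie::File presents the file as an array and fetches a line only when it is accessed, so the whole file does not have to sit in memory.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Hypothetical helper: write one aligned sequence per line to a
# temporary file and return its path.
sub write_rows {
    my @rows = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} "$_\n" for @rows;
    close $fh;
    return $path;
}

# Hypothetical helper: fetch a single row by its 1-based number.
# Only the requested line is read through the tie.
sub fetch_row {
    my ($path, $n) = @_;
    tie my @lines, 'Tie::File', $path or die "Cannot tie $path: $!";
    my $row = $lines[ $n - 1 ];    # the tied array itself is 0-based
    untie @lines;
    return $row;
}

my $path = write_rows('MK-LV?A', 'MKTLV-A', 'M--LV--');
print fetch_row($path, 2), "\n";    # second sequence in the file
```

Methods that need all sequences at once would still have to walk the whole file, so this helps mainly with peak memory use.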
