Monday, September 6, 2010

The final code is here

Tuesday, August 24, 2010

Online HOWTO

The project is successfully finished. Online HOWTO is here:

For future plan, I will continue with the proposed idea of constructing a package of loading sequences in a memory efficient way.

Wednesday, August 18, 2010

Changes in the Bio::SimpleAlign

Here is a summary of the changes in Bio::SimpleAlign

Saturday, August 14, 2010

Bio::LocatableSeq end checking inconsistency (problem unsolved) conversation with Chris Fields on IRC

This end problem is still a tricky unsolved issue. Here is the discussion of Chris and me in the IRC channel. The ideas here may be used for further improvement.

01[17:22] Hi, pyrimidine. It seems now SimpleAlign has passed nearly all the old tests,except the end position test
01[17:23] The reason is in my email to the bioperl mail list... it is not easy to solve
[17:24] <@pyrimidine> my feeling is, truthfully, that the sequence should be immutable
[17:25] <@pyrimidine> and that any calls to modify the seq using $ls->seq('ATTTGA') beyond the constructor should ie
[17:25] <@pyrimidine> *die
[17:25] <@pyrimidine> this solves a LOT of problems
01[17:25] yep ... sounds cool :D
01[17:26] Then everytime when you modify a sequence, you can start a new sequence object. This may solve some problems
[17:26] <@pyrimidine> any methods that modify the sequence should return a new LocatableSeq with the modifications
[17:26] <@pyrimidine> yes
[17:26] <@pyrimidine> beat me to it
[17:27] <@pyrimidine> :)
[17:29] <@pyrimidine> I posed that question to Heikki L. at BOSC (switching to mainly immutable objects) and his initial though was 'isn't that a lot of work'
[17:29] <@pyrimidine> then, after thinking it over, he realized it solves a lot of state-based problems in bioperl
01[17:30] yep, that is true. It makes sense
[17:30] <@pyrimidine> one issue: setting the sequence and start or end is probably fine, but setting seq, start, and end is tricky
01[17:31] Because when you modify the sequence, the sequence is changed, and it is no sense to use the old information
[17:31] <@pyrimidine> right
05[17:31] -ChanServ- [#gsoc] Welcome to the Google Summer of Code channel! Please note that this channel is logged.
[17:32] <@pyrimidine> but,
[17:32] <@pyrimidine> $ls = Bio::LocatableSeq->new(-seq => 'ATGA-AG', -start => 2, -end => 7)
[17:32] <@pyrimidine> is a bit of a snare
[17:33] <@pyrimidine> end is defined based on the sequence itself
01[17:33] yep
[17:33] <@pyrimidine> so, I think passing in end should be more a validation step instead of a set
01[17:34] This is one is ok, it will still pass the check from $ls->end($ls->end)
[17:35] <@pyrimidine> calling end() and length() should probably just use the ungapped length (with end offset by the length)
[17:35] <@pyrimidine> oops
[17:35] <@pyrimidine> end offset by the start and the length
01[17:35] yep, this is the current check in end()
[17:36] <@pyrimidine> so, what do we do in this situtation:
[17:36] <@pyrimidine> $ls = Bio::LocatableSeq->new(-seq => '', -start => 2, -end => 7)
[17:36] <@pyrimidine> (a virtual sequence)
[17:36] <@pyrimidine> I tend to think, in this case, we don't use LocatableSeq
[17:37] <@pyrimidine> have something else
[17:37] <@pyrimidine> and keep the logic for that separate
01[17:38] ok
[17:38] <@pyrimidine> this popped up with MUMmer I think
01[17:39] Yes, we should have the location information stored in a seperate object in some case
01[17:40] Location for every residue in the current sequence from the original sequence
01[17:40] But thi will make things even more complicated
[17:40] <@pyrimidine> well, in that case you could switch to using features
01[17:41] But anyway, it is the basic idea I have for the new Bio::DB::Seq, a location based module to store alignment sequences
[17:41] <@pyrimidine> sounds good

Bio::LocatableSeq end checking inconsistency (problem unsolved)

The code (Bio::SimpleAlign) I re-implemented now has passed nearly all the test, except a few tests on seq/start-end testing. But here comes a problem. This may be an old issue, that the Bio::LocatableSeq end assignment and checking are inconsistent.

The current end checking method is based on:
However, this checking may not fit the real world case.

The inconsistency usually happens when a few columns of the sequence are removed.

For example:
my $a = Bio::LocatableSeq->new(
-id => 'a',
-strand => 1,
-seq => '-tcgatc-atcgatcg',
-start => 30,
-end => 43

If we remove the 1st, 8th and the last columns

$a->seq() will be ‘tcgatcatcgatc’

Actually, in the real world, the first residue will still be 30 (the old $seq->start), and the last residue is the residue before the 43 (the old $seq->end), thus 42.

But if you call a validation, the calculation is $a->_ungapped_len+$a->start-1=12+30-1=41
So the reassignment of the $seq->end will not pass the validation…

So unless you save the information to a new sequence object, the original position information will be lost anyway. But in some cases, we have to change the sequence in its original sequence object ….

What is your suggestion on this issue?
A. pass the test and lose the information #convenient in coding but the start-end annotation is not right any more
B. keep the information and forget the test #the object will still remember where the last residue was in the original sequence. But is it really meaningful at all? Because all the other residues may come from nowhere
C. Neither of above #any other suggestions?

Tuesday, August 10, 2010

Solutions for package dealing with large alignment sequences

I have several solutions in mind. Actually the fourth solution is my favorate.

1. There is already a package called Bio::DB::Fasta in bioperl by Licoln Stein. This package indexes fasta file, and save the index into local DB file using Any_DBM. It includes another package in the same file called Bio::PrimarySeq::Fasta, which gives a similar interface to Bio::LocatableSeq and reads the database to retrieve sequences. This package is easy to use and just needs a few minor improvement. But it can only work on Fasta file with strict formatting requirements. So, if you want to use this package for other alignment formats, you have to transform the alignment file to fasta format first, which is not very practical.

2. Any_DBM tends to load everything into a local database. We can load all the sequences (but not the annotation) in the alignment file into a local database using Any_DBM. However, this may depend on the capacity of the database package. As far as I know, Berkeley DB only supports database file smaller than 2GB. But as you know, human genome is actually larger than that size. So it may still collapse on large sequence file.

3. In next-gen sequencing, alignment sequences are transformed using Burrows–Wheeler transform(BWT), e.g. in SAM and BAM. I know the algorithm of BWT, but I don't quite understand how it happens after it, e.g. how to save into binary file. Since SAM is a annotation rich format, it is not very practical to request every other alignment file transforming to SAM first and then use the BioPerl wrapper to convert into BAM … It is better to make a new package in BioPerl using BWT, but it is not very practical either.

4. Then I came into the idea that. When we read the sequences, we can just load the annotation information into memory (e.g. at least NSE), then we save all the indexes for the sequences use the Any_DBM file (as in the Bio::DB::Fasta) or even just in the memory (since the indexes will be much smaller than the actual sequences). Then when we call $seq->seq, we first read the index from DBM file, then seek() and read() from alignment file. This is very memory efficient and easy to implement, and probably fast during reading. It is just we need to implement a new SeqIO for every sequence format. But in the SeqIO, the only thing need to change is to save $index into $seq->seq() instead of the actual sequence. All the new SeqIO packages will share the same interface and database reading/writing methods.

5. The last idea comes from my colleague. There is a package called File::Map in perl (not standard). It uses mmap to make some connections between address space and your file, instead of loading all the file into memory. It is said to be fast and memory efficient. So we may try to use this and/or combine with file indexing. But as it is not a standard Perl package, this may limit the use of the new package.

My favorate fourth idea actually comes from Licoln’s Bio::DB::Fasta, I can try to edit this package to make it work for .fas and .aln alignment files. Generally it will be based on the existing code in Bio::DB::Fasta, but I may need to make it into new packages. (interface for new DB based SeqIO) (read and index fasta file, major code will come from Bio::DB::Fasta) (read and index clustal file, similar interface with Bio::DB::SeqIO::Fasta, but may change a few IO methods)

Bio::DB::Seq (similar interface to Bio::LocatableSeq, Bio::AlignIO and/or Bio::SeqIO can load sequences into this package instead of Bio::LocatableSeq).

Friday, August 6, 2010

structure for packages retrieving online alignment sequences

The aim of this package is to provide convenient methods to retrieving online alignment sequences for the BioPerl users. The alignment sequences are converted into Bio::SimpleAlign object after the retrieval, which will be easy to manipulate and write to local disk.

Here is the structure of the packages:
Bio::DB::Align (interface, and calling other packages)
Bio::DB::Align::Pfam (retrieving alignment from Pfam)
Bio::DB::Align::Rfam (retrieving alignment from Rfam)
Bio::DB::Align:Prosite (retrieving alignment from Prosite)
Bio::DB::Align:ProtClustDB (retrieving alignment from Entrez Protein Clusters Database)

Usually four methods are provided for each package:
get_Aln_by_id (retrieving alignment by id and returns Bio::SimpleAlign object)
get_Aln_by_acc (retrieving alignment by acession and returns Bio::SimpleAlign object) (Rfam and Prosite only supports this method)
id2acc (id to accession conversion)
acc2id (accession to id conversion)

These packages are built dependent on LWP::UserAgent, HTTP::Request and Bio::DB::GenericWebAgent. Bio::DB::Align::ProtClustDB is dependent on Bio::DB::EUtilities.

Thursday, August 5, 2010

Bio::DB::Align::ProtClustDB implemented

Bio::DB::Align::ProtClustDB has a similar interface with Bio::DB::Align::Pfam, which support retrieving alignment file from Prosite.

It supports:

get_Aln_by_id #retrieve the alignment using ID, and returns a Bio::SimpleAlign object
get_Aln_by_acc #retrieve the alignment using accession, and returns a Bio::SimpleAlign object
id2acc #conversion of id to accession
acc2id #conversion of accession to id

Tuesday, August 3, 2010

Bio::DB::Align::Prosite implemented

Bio::DB::Align::Prosite has a similar interface with Bio::DB::Align::Rfam, which support retrieving alignment file from Prosite.

Friday, July 30, 2010

Bio::DB::Align::Rfam implemented

Bio::DB::Align::Rfam is implemented. However, a test using pfam alignment format failed. It seems there is something wrong with Bio::AlignIO::pfam. It needs to be solved later.

Now Bio::DB::Align::Rfam provides the sequence retrieval method by accession.

get_Aln_by_acc #retrieve the alignment using accession, and returns a Bio::SimpleAlign object

An example

my $dbobj=Bio::DB::Align->new(-db=>"rfam");
my $aln=$dbobj->get_Aln_by_acc("RF0001");
my $aln2=$dbobj->get_Aln_by_acc(-accession=>"RF0001",-alignment=>"full");

print $aln->length();

foreach my $seq ($aln->each_Seq) {
#do something

Friday, July 23, 2010

Bio::DB::Align::Pfam implemented

Bio::DB::Align::Pfam is implemented. This package uses the Pfam RESTful service to retrieve alignment sequences from Pfam.

Now this packages include these methods

get_Aln_by_id #retrieve the alignment using ID, and returns a Bio::SimpleAlign object
get_Aln_by_acc #retrieve the alignment using accession, and returns a Bio::SimpleAlign object
id2acc #conversion of id to accession

An example

my $dbobj=Bio::DB::Align->new(-db=>"pfam");
my $aln=$dbobj->get_Aln_by_acc("PF0001");
my $aln2=$dbobj->get_Aln_by_acc(-accession=>"PF0001",-alignment=>"full");

print $aln->length();

foreach my $seq ($aln->each_Seq) {
#do something

Wednesday, July 21, 2010

How can I access resources on Uniprot programmatically

Here is the answer from Uniprot.It can be accessed through RESTful service.

Friday, July 16, 2010

Summary of Bio::SimpleAlign

A summary of what is done for Bio::SimpleAlign

I have listed which method is new, which method is renamed, which method is rewritten ...

Wednesday, July 14, 2010

First trial run for Bio::DB::Align::Pfam succeeded

I just got a trial run for some methods retrieving Pfam alignment sequences. It succeeded.

At the moment, this package (Bio::DB::Align::Pfam) aims at retrieving alignment sequences from Pfam using the accession number provided by the user. Calling the package can be:

my $obj = Bio::DB::Align::Pfam->new(); #or Bio::DB::Align->new(-database=>"Pfam") ???
my $aln = $obj->get_Aln_by_acc(-accession=>"PF00653",-alnType=>"seed",-format=>"fasta",-order=>"t",-case=>"l",-gap=>"default");

or simply as

my $aln = $obj->get_Aln_by_acc("PF00653");

The package will return a Bio::SimpleAlign object by simply one or two commands.

Tuesday, July 13, 2010

A possbile structure for Bio::DB::Align

The next task is to implement a package retrieving online alignment sequences and return a Bio::SimpleAlign object. The possible way of doing that is to implement these packages:

Bio::DB::Align (folder) (Interface showing methods need to be implemented) (Implementation of retrieving alignment data from Pfam) (Implementation of retrieving alignment data from Uniprot)

Probably, in AlignI, the methods are

In the implementation, e.g., the package will implement both Bio::DB::Align::AlignI for alignment retrieving methods and Bio::DB::GenericWebAgent for web related methods.

For simplicity, Bio::DB::Align::Pfam and Bio::DB::Align::Uniprot will only implement alignment related methods. The package retrieving other information from these two databases can be implmented later as Bio::DB::Pfam and Bio::DB::Uniprot

Friday, July 9, 2010

The structure of Bio::SimpleAlign

Alignment modifier methods
Alignment selection methods
Change sequences within the MSA
Consensus sequences
MSA attributes
Alignment descriptors
Sequence names
methods implementing Bio::FeatureHolderI
methods for Bio::AnnotatableI

Major improvements on Bio::SimpleAlign

The cleaning for Bio::SimpleAlign is 99% finished, though a few tests may be needed. Here are the major improvements on Bio::SimpleAlign

1. MSA modifying and selection methods are more consistent and easier to use. I have enabled multiple/reverse selections for all sequences/columns selection methods, and change the names to be more understandable.
2. gap chars/missing chars are more consistent in the package
3. Some redundant methods are removed. The methods are moved to more reasonable categories.
4. Some methods are renamed. Methods selecting/giving objects are capitalized, e.g. each_seq to each_Seq. (Though each_Seq and add_Seq may cause inconsistency in some packages related with Bio::SimpleAlign, because they are so widely used. I will find them and replace them).

Thursday, July 8, 2010

Retrieving online alignment sequences

Pfam database provides a web service to retrieve information. I may start from writing a wrapper for that.

This may request an implementation of a Bio::DB::Pfam package.

Tuesday, July 6, 2010

Enable multiple/toggle selection feature in Bio::SimpleAlign

One major new feature in Bio::SimpleAlign is to enable multiple and toggle selection for methods selecting sequences or columns. The new ways of doing can be :


Or you can toggle selection(reverse selection) using:

The coordinates of the sequences or columns are 1-based. This new feature will make the selection much easier for users.And, it will affect methods such as:


Monday, July 5, 2010

Back to work

Back from the conference and vacation.

Continue coding now~~ o.o

Friday, June 18, 2010

Updated summary of Bio::Align 1806

Here is an updated summary of methods in Bio::Align.

When defining a method in the package, several things need to be considered:
1. Arguments
2. Return value
3. Whether to change the object calling the method

And, sometimes,
4. backward compatibility

So the summary is given by these attributes. It is not really human readable at the moment, but it will be the base of the new HOWTO of Bio::Align

Now, most of the new proposed methods are finished. By 22/06, all the new methods will be finished, thus reach the first milestone.

Thursday, June 17, 2010

Naming conventions in BioPerl

I didnot do much coding today, instead, I read some chapters on OOP in Perl Programming again, and the programing conventions in BioPerl.

SO there are some facts need to be considered in the BioPerl package, especially naming conventions.

1. For methods returning objects, the name should be in Capital.
For example, each_Seq instead of each_seq
2. For methods returning a list of results, the name should be in plural form.
For example, select_Seqs instead of select
3. Not to use "each" in the name of methods, which may cause confusion
For example, next_LocatableSeq instead of each_seq.
(But it seems each_seq may be better to understand. I need to think on this point)
4. The methods implemented in the Bio::SimpleAlign should also be named in Bio::Align::AlignI, because AlignI is the interface package

Monday, June 14, 2010

Conventions in Bio::Align package (Updating 1606)

Several conventions should be set up in case of potential conflicts among methods or packages.

1. The position of sequences/columns in the alignment should be 1 based.

2. The selection of sequences/columns should be list based, for example, (1,5,8..10,21..50).

3. Special charectors should be read/written by methods in the Bio::Align package. For example:
The selection of characters can only be made from [0-9A-Za-z\*\-\.=~\\/\?], as configured by Bio::PrimarySeq::seq.

4. The ordering of the sequences in the alignment should be based on $seq->{'_order'}

5. The length of the sequences should be the same in all methods, at the moment, they can be calculated from $aln->length, CORE::length, and $seq->length(), each of them is calculated differently.

The length of the sequence should be defined as the number of alphabetic characters. And, the length of the alignment should be defined as the longest length of the sequences(including special characters, e.g. '-','?')

6. The name of the function should be clearer. For example, each_seq should be each_Seq to show it is retrieving sequence objects instead of sequences themselves.

Another option for memory efficient way

I just happen to find the Bio::Root::Storable package in BioPerl. I havenot really tried it, but it is worthwhile to try, which will possibly provide a solution to read in sequences in a memory efficient way, e.g. deposit the object in the local disk, not memory.

Example code of the new alignment method/package

I just found there is already a package "Bio::DB::Fasta" implemented in BioPerl to load the fasta file by indexing (Any_DBM). So, by using this package, we may implement a method in Bio::AlignIO to generate Bio::PrimarySeq objects for the fasta sequences, and they may inherit most of the methods in Bio::SimpleAlign.

At the moment, this code still doesn't work, because add_seq function in Bio::SimpleAlign does not support Bio::PrimarySeq object. This may be solved in future.

#!/usr/bin/perl -w
use strict;
use Bio::AlignIO;
use Bio::DB::Fasta;
use Bio::SimpleAlign;

my $in=Bio::AlignIO->new(-file=>"clustalw2-pumpkin_aa_edi.fst",

my $aln=$in->next_locatable_aln;

print $aln->num_sequences;
print $aln->percentage_identity;

foreach my $seq ($aln->each_seq()) {
#$seq will only load sequences, when we call $seq->seq()
#do something


#New method in the Bio::AlignIO::Fasta
use Bio::DB::Fasta;

sub next_locatable_aln {
my $self = shift;

my $aln = Bio::SimpleAlign->new();

my $db=Bio::DB::Fasta->new($self->{"_file"});
my $stream=$db->get_PrimarySeq_stream();
foreach my $seq ($stream->next_seq()) {
return $aln->num_sequences;

Wednesday, June 9, 2010

Adjusted project plan

The project plan is adjusted according to the current working progress and understanding of the project.

07/06-23/06 (Milestone 1)
Implementation of the methods proposed for Bio::Align
Conference in Vision Camp Young Researcher Symposium, Germany
05/07-12/07 (Milestone 2)
Major refactor of Bio::Align package. Moving the methods to the right package.
12/07-16/07 (Milestone 3)
Implement methods retrieving online sequence file, e.g. pfam. These methods may fall into Bio::DB
19/07-30/07 (Milestone 4)
Implement new Bio::Seq package, storing sequences in local file instead of in the memory
02/08-06/08 (Documentation)
Write an online HOWTO giving an introduction of how to use Bio::Align
09/08-20/08 (Final)
Final testing, improvement of the documentation

Git working environment

After carefully reading the above mentioned git introductions, I have created git working environment. The commands are:

$git clone
$git remote add upstream
$git remote add jyinrepo
$git checkout -b workingbranch

After this, I will work on workingbranch.

The commands during the work can be:
$git commit -m "xxx"
$git push
$git update
$git pull upstream master #update the clone

Thursday, June 3, 2010

Version control using Git

A summary of Git tutorials

How to set up an ssh-key

How to fork a project

Easy version control with Git

BioPerl in GitHub

Git tutorial

Pro Git

Wednesday, June 2, 2010

An open question of how to be memory efficient

One objective of the project is to write a next_locatable_aln and next_locatable_seq function to read in the sequence line-by-line, not to read in all the sequences at the same time. But it seems it is not possible to do it simply but read one sequence at one time, because most of the methods in Bio::SimpleAlign needs to know the attributes of all sequences at one time. So an alternative way is to read in the sequence and save it to a temporary file.

Here are a few solutions to implement that:

1. use DB_File;
2. use Storable;
3. use Tie::File;

It seems Storable and Tie::File packages are better choices. But I havenot tried it. Just keep a record here. The implementation will be done later.

Thursday, May 27, 2010

New structure of Bio::Align

A new summary of Bio::Align modules is in

Wednesday, May 19, 2010

New structure of Bio::SimpleAlign 2

See update in the google doc

Wednesday, May 12, 2010

New structure of Bio::SimpleAlign

I have finished summarizing the methods in Bio::SimpleAlign. The methods are classified into several categories. The classification borrows ideas from Bio::SimpleAlign's original classification, the menu in JalView and my own understanding.

The methods are listed in:

The next step is to sort out how the methods can be classified into different packages. Fucntions reading/writing internal alignment attributes can be transfered to Bio::Align::AlignI. Functions modifying, selecting, and calculating alignment sequences/features can be moved to Bio::Align::Utilities, or made into new packages.
Several new methods will be added.

A text based report will be given in the next couple of days.

General ideas of th project

The general ideas of the the project are in two folds:

1. Re-classify the Bio::SimpleAlign module into several new modules.

The current Bio::SimpleAlign is a very complicated module, comprising of more than 80 methods. The first aim of the project is to classify the methods into several small modules. The basic idea it to keep the "generic methods", which are methods reading/writing alignment features in the basic module. Then, move the other "outer methods", e.g. methods editing/calculating sequence residues and/or sub-alignment into seperate modules. Most of the methods in the current module will be kept, and a few new methods will be added.

Programing difficulty: Easy to Medium

2. Alignment-oriented Bio::Align::AlignI module

The new Bio::Align::AlignI module will be refactored to be alignment oriented. Generally, when we load the alignment file/DB (next_aln()), we only read the alignment feature into memory. The actual sequences will be read line by line when we need them (next_seq()).

Programing difficulty: Medium to Difficult

The two points above can be the starting points of the project. At the moment, it is limited to global alignment (e.g. clustalw output file). These two points are highly related. The second point deals with the general input protocol, and the first point deals with the methods manipulating the alignment.

Later in the project, or in the near future, assembly file (SAM/ACE) and local alignment(BLAST) file and may be considered. This may concern the refactor of Bio::Assembly and other packages.

The next post will be a summary of the new structure of Bio::SimpleAlign.

Friday, April 30, 2010

Documentation reading

In order to know the working process of previous GSOC student, I have read through Chase Miller's project plan. It is a really well structured plan. I learned quite a few things from it. If the new package is a wrapper based package, it may have a similar structure to NeXML package.

PhyloSoC:BioPerl integration of the NeXML exchange standard and Bio::Phylo toolkit/nexml module design

For the next few days, I will keep reading the documents.

At the moment (5 May to 15 May), I am reading PerlFaq

The other Perl book "Higher Order Perl" is on the way.

Tuesday, April 27, 2010

GSOC started!

GSOC started!

Project: (BioPerl) Alignment Subsystem Refactoring
Student: Jun Yin
Mentors: Chris Fields, Mark Jensen

Hopefully it will be a productive and joyful summer!