GSOC2010. BioPerl Alignment Subsystem Refactoring

Wednesday, June 9, 2010

Git working environment

After carefully reading the Git introductions listed in the previous post, I have set up my Git working environment. The commands are:

$ git clone http://github.com/bioperl/bioperl-live.git
$ cd bioperl-live
$ git remote add upstream http://github.com/bioperl/bioperl-live.git
$ git remote add jyinrepo git@github.com:yinjun111/bioperl-live.git
$ git checkout -b workingbranch

After this, I will work on workingbranch.

The commands I expect to use during the work are:
$ git add <changed files>
$ git commit -m "xxx"
$ git push jyinrepo workingbranch #publish the branch to my fork
$ git pull upstream master #update the clone
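
To try this workflow without touching GitHub, here is a minimal sketch that can be run offline. It is only an illustration under stated assumptions: the two bare repositories in a temporary directory are hypothetical stand-ins for bioperl-live and my fork, and the seed repository, the README file and the demo user name are made up for the example. The remote names (upstream, jyinrepo) and the branch name (workingbranch) are the same as above.

```shell
set -e

# Hypothetical sandbox: local bare repositories stand in for the GitHub
# upstream and the personal fork, so nothing here touches the network.
sandbox=$(mktemp -d)
git init -q --bare "$sandbox/upstream.git"
git init -q --bare "$sandbox/fork.git"
# Make sure the stand-in upstream uses "master" as its default branch.
git --git-dir="$sandbox/upstream.git" symbolic-ref HEAD refs/heads/master

# Seed the stand-in upstream with one commit (made-up content).
git init -q "$sandbox/seed"
cd "$sandbox/seed"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "initial" > README
git add README
git commit -q -m "initial commit"
git push -q "$sandbox/upstream.git" HEAD:refs/heads/master

# Same shape as the setup in this post: clone, add remotes, create a branch.
cd "$sandbox"
git clone -q "$sandbox/upstream.git" work
cd work
git config user.name "Demo User"
git config user.email "demo@example.com"
git remote add upstream "$sandbox/upstream.git"
git remote add jyinrepo "$sandbox/fork.git"
git checkout -q -b workingbranch

# Day-to-day cycle: stage, commit, publish to the fork, sync with upstream.
echo "a change" >> README
git add README
git commit -q -m "xxx"
git push -q jyinrepo workingbranch
git pull -q upstream master

# Record what ended up where, for inspection.
pushed=$(git --git-dir="$sandbox/fork.git" rev-parse workingbranch)
local_head=$(git rev-parse HEAD)
echo "local HEAD:  $local_head"
echo "fork branch: $pushed"
git remote -v
```

If everything worked, the two printed commit IDs are identical, which means the fork holds the same commit as the local workingbranch.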

Thursday, June 3, 2010

Version control using Git

A summary of Git tutorials

How to set up an ssh-key
http://help.github.com/msysgit-key-setup/

How to fork a project
http://help.github.com/forking/

Easy version control with Git
http://net.tutsplus.com/tutorials/other/easy-version-control-with-git/

BioPerl in GitHub
http://www.bioperl.org/wiki/Using_Git

Git tutorial
http://www.kernel.org/pub/software/scm/git/docs/gittutorial.html
http://www.kernel.org/pub/software/scm/git/docs/gittutorial-2.html

Pro Git
http://progit.org/book/

Wednesday, June 2, 2010

An open question of how to be memory efficient

One objective of the project is to write next_locatable_aln and next_locatable_seq functions that read the sequences line by line instead of loading all of them into memory at once. However, simply reading one sequence at a time does not seem to be possible, because most of the methods in Bio::SimpleAlign need to know the attributes of all the sequences at once. An alternative is to read in the sequences and save them to a temporary file.

Here are a few modules that could be used to implement that:

1. use DB_File;
2. use Storable;
3. use Tie::File;

It seems that the Storable and Tie::File packages are the better choices, but I have not tried them yet. This is just a record for now; the implementation will be done later.

Thursday, May 27, 2010

New structure of Bio::Align

A new summary of the Bio::Align modules is at:

http://spreadsheets.google.com/ccc?key=0AssLTcJFJMbXdERTd1VPeFhKM3JRM2x4UHpUVVVpNFE&hl=en

Wednesday, May 19, 2010

New structure of Bio::SimpleAlign 2

See the update in the Google doc:

http://spreadsheets.google.com/ccc?key=0AssLTcJFJMbXdFp3Smg1S3JaYzBKNUcxTmQ0STBNTXc&hl=en

Wednesday, May 12, 2010

New structure of Bio::SimpleAlign

I have finished summarizing the methods in Bio::SimpleAlign. The methods are classified into several categories. The classification borrows ideas from Bio::SimpleAlign's original classification, the menus in Jalview, and my own understanding.

The methods are listed in:

http://spreadsheets.google.com/ccc?key=0AssLTcJFJMbXdFp3Smg1S3JaYzBKNUcxTmQ0STBNTXc&hl=en

The next step is to sort out how the methods can be divided among different packages. Functions that read or write internal alignment attributes can be transferred to Bio::Align::AlignI. Functions that modify, select, or calculate alignment sequences/features can be moved to Bio::Align::Utilities, or made into new packages. Several new methods will be added.

A text-based report will be posted in the next couple of days.

General ideas of the project

The general idea of the project is twofold:

1. Re-classify the Bio::SimpleAlign module into several new modules.

The current Bio::SimpleAlign is a very complicated module, comprising more than 80 methods. The first aim of the project is to split the methods into several small modules. The basic idea is to keep the "generic methods", which are the methods that read or write alignment features, in the basic module, and then move the other "outer methods", e.g. methods that edit or calculate sequence residues and/or sub-alignments, into separate modules. Most of the methods in the current module will be kept, and a few new methods will be added.

Programming difficulty: Easy to Medium

2. Alignment-oriented Bio::Align::AlignI module

The new Bio::Align::AlignI module will be refactored to be alignment-oriented. Generally, when we load the alignment file/DB (next_aln()), we only read the alignment features into memory. The actual sequences will be read line by line when we need them (next_seq()).

Programming difficulty: Medium to Difficult

The two points above can be the starting points of the project. At the moment, it is limited to global alignments (e.g. ClustalW output files). These two points are closely related. The second point deals with the general input protocol, and the first point deals with the methods that manipulate the alignment.

Later in the project, or in the near future, assembly files (SAM/ACE) and local alignment files (BLAST) may also be considered. This may involve refactoring Bio::Assembly and other packages.

The next post will be a summary of the new structure of Bio::SimpleAlign.