Wednesday, June 2, 2010

An open question: how to be memory efficient

One objective of the project is to write next_locatable_aln and next_locatable_seq functions that read the sequences in line by line, rather than reading in all the sequences at the same time. But it does not seem possible to do this simply by reading one sequence at a time, because most of the methods in Bio::SimpleAlign need to know the attributes of all sequences at once. So an alternative is to read in each sequence and save it to a temporary file.

Here are a few modules that could be used to implement that:

1. use DB_File;
2. use Storable;
3. use Tie::File;

Storable and Tie::File seem to be the better choices, but I have not tried them yet. I am just keeping a record here; the implementation will be done later.
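
As a rough illustration of the temporary-file idea, here is a minimal sketch using Tie::File together with File::Temp. The one-record-per-line layout ("id<TAB>sequence") and the sample sequences are assumptions made up for this example, not a format used by Bio::SimpleAlign.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Tie::File;

# Write a few made-up records to a temporary file, one per line.
# The "id<TAB>sequence" layout is an assumption for this sketch.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "seq1\tACGT-ACGT\n", "seq2\tACGTTACGT\n", "seq3\tAC-TTACGT\n";
close $fh or die "close failed: $!";

# Tie the file to an array: each element is one line of the file,
# fetched from disk when it is accessed (newlines are stripped by default).
tie my @records, 'Tie::File', $path or die "cannot tie $path: $!";

my $count = scalar @records;                 # number of records in the file
my ($id, $seq) = split /\t/, $records[1];    # fetch only the second record

print "$count records; record 2 is $id ($seq)\n";

untie @records;
```

Whether this is fast enough for large alignments would still need to be measured.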

1 comment:

  1. So, this is something that will have to scale to possibly millions of sequences.

    First: the 'database' backend should be abstracted out of the equation and used just to grab the proper sequences and whatever else is needed for the work. That way you can implement the backend however you want, as long as you give it consistent methods (an interface). The default should probably be an in-memory store of some kind, whereas others can be tied to a simple DB via DBI, load lazily, and so on. Start with something that works (in-memory) and work out from there. You can wrap particular tools (Bio::Samtools, Bio::BigFile, etc.) as needed. Or use the already set up tools Mark Jensen and others have worked on if they fill the need.
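
A minimal sketch of what such an abstracted backend could look like, starting with the in-memory default. The package name My::SeqStore::Memory and the methods add_seq, get_seq and ids are hypothetical names chosen for illustration; they are not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical in-memory backend. A disk-backed backend would provide
# the same three methods, so calling code would not need to change.
package My::SeqStore::Memory;

sub new {
    my ($class) = @_;
    return bless { seqs => {}, order => [] }, $class;
}

# Store a sequence string under an id, remembering insertion order.
sub add_seq {
    my ($self, $id, $seq) = @_;
    push @{ $self->{order} }, $id unless exists $self->{seqs}{$id};
    $self->{seqs}{$id} = $seq;
    return;
}

# Return the sequence string for an id, or undef if it is unknown.
sub get_seq {
    my ($self, $id) = @_;
    return $self->{seqs}{$id};
}

# Return all ids in the order they were first added.
sub ids {
    my ($self) = @_;
    return @{ $self->{order} };
}

package main;

my $store = My::SeqStore::Memory->new;
$store->add_seq(seq1 => 'ACGT-ACGT');
$store->add_seq(seq2 => 'ACGTTACGT');

my @ids    = $store->ids;
my $second = $store->get_seq('seq2');
```

Code that works on the alignment would only call these methods, which is what lets the storage be swapped out later.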

    Storable will just serialize the data, correct? So one would use it in conjunction with DB_File or others. Tie::File is maybe a possibility for a lazy parser, though I am not sure how well it scales to very large files.
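
One way the Storable plus DB_File combination could look, as a sketch only: Storable's freeze turns a record into a byte string, and a DB_File-tied hash keeps that string on disk under the sequence id. The record fields (id, start, end, seq) are assumptions for illustration, and DB_File needs a Perl built with Berkeley DB support.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use DB_File;
use Storable qw(freeze thaw);

# Keep the database file in a temporary directory that is removed on exit.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/seqs.db";

# Tie a hash to an on-disk Berkeley DB hash file.
tie my %store, 'DB_File', $file, O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "cannot open $file: $!";

# Made-up record; the field names are assumptions for this sketch.
my %rec = (id => 'seq1', start => 1, end => 9, seq => 'ACGT-ACGT');

# DB_File values must be plain strings, so serialize the structure first.
$store{ $rec{id} } = freeze(\%rec);

# Later: fetch one record by id and rebuild the data structure.
my $copy = thaw($store{seq1});
print "$copy->{id}: $copy->{seq} ($copy->{start}-$copy->{end})\n";

untie %store;
```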
