Saturday, August 14, 2010

Bio::LocatableSeq end checking inconsistency (problem unsolved)

The code (Bio::SimpleAlign) I re-implemented now has passed nearly all the test, except a few tests on seq/start-end testing. But here comes a problem. This may be an old issue, that the Bio::LocatableSeq end assignment and checking are inconsistent.

The current end checking method is based on:
$end=$seq->_ungapped_len+$seq->start-1
However, this checking may not fit the real world case.

The inconsistency usually happens when a few columns of the sequence are removed.

For example:
my $a = Bio::LocatableSeq->new(
-id => 'a',
-strand => 1,
-seq => '-tcgatc-atcgatcg',
-start => 30,
-end => 43
);

If we remove the 1st, 8th and the last columns

$a->seq() will be ‘tcgatcatcgatc’
$a->_ungapped_len==12

Actually, in the real world, the first residue will still be 30 (the old $seq->start), and the last residue is the residue before the 43 (the old $seq->end), thus 42.

But if you call a validation, the calculation is $a->_ungapped_len+$a->start-1=12+30-1=41
So the reassignment of the $seq->end will not pass the validation…

So unless you save the information to a new sequence object, the original position information will be lost anyway. But in some cases, we have to change the sequence in its original sequence object ….

What is your suggestion on this issue?
A. pass the test and lose the information #convenient in coding but the start-end annotation is not right any more
B. keep the information and forget the test #the object will still remember where the last residue was in the original sequence. But is it really meaningful at all? Because all the other residues may come from nowhere
C. Neither of above #any other suggestions?

No comments:

Post a Comment