comparison variant_effect_predictor/Bio/Tools/Blast/Sbjct.pm @ 0:1f6dce3d34e0

Uploaded
author mahtabm
date Thu, 11 Apr 2013 02:01:53 -0400
parents
children
comparison
equal deleted inserted replaced
-1:000000000000 0:1f6dce3d34e0
1 #------------------------------------------------------------------------------
2 # PACKAGE : Bio::Tools::Blast::Sbjct
3 # AUTHOR : Steve Chervitz (sac@bioperl.org)
4 # CREATED : 7 October 1996
5 # STATUS : Alpha
6 # REVISION: $Id: Sbjct.pm,v 1.20 2002/10/22 07:38:48 lapp Exp $
7 #
8 # For the latest version and documentation, visit the distribution site:
9 # http://genome-www.stanford.edu/perlOOP/bioperl/blast/
10 #
11 # To generate documentation, run this module through pod2html
12 # (preferably from Perl v5.004 or better).
13 #
14 # Copyright (c) 1996-2000 Steve Chervitz. All Rights Reserved.
15 # This module is free software; you can redistribute it and/or
16 # modify it under the same terms as Perl itself.
17 #------------------------------------------------------------------------------
18
19 package Bio::Tools::Blast::Sbjct;
20
21 use Bio::Root::Global qw(:devel);
22 use Bio::Root::Object ();
23
24 @ISA = qw( Bio::Root::Object Exporter );
25
26 use strict;
27 use vars qw($ID %SUMMARY_OFFSET $Revision);
28 $ID = 'Bio::Tools::Blast::Sbjct';
29 $Revision = '$Id: Sbjct.pm,v 1.20 2002/10/22 07:38:48 lapp Exp $'; #'
30
31 my $_prog = '';
32 my $_signif_fmt = '';
33
34 ## POD Documentation:
35
36 =head1 NAME
37
38 Bio::Tools::Blast::Sbjct - Bioperl BLAST "Hit" object
39
40 =head1 SYNOPSIS
41
42 =head2 Object Creation
43
44 The construction of HSP objects is handled by B<Bio::Tools::Blast>.
45 You should not need to use this package directly. See L<_initialize()|_initialize>
46 for a description of constructor parameters.
47
48 require Bio::Tools::Blast::Sbjct;
49
50 $hit = new Bio::Tools::Blast::Sbjct (-DATA =>\@hitData,
51 -PARENT =>$self,
52 -NAME =>5,
53 -RANK =>5,
54 -RANK_BY =>'order',
55 -MAKE =>'query' (or 'sbjct'),
56 -OVERLAP =>2,
57 -PROGRAM =>'TBLASTN'
58 );
59
60 @hitData includes the summary line for the hit as element [0], plus
61 all lines from the HSP alignment section of the BLAST report for
62 the present hit.
63
64 =head1 INSTALLATION
65
66 This module is included with the central Bioperl distribution:
67
68 http://bio.perl.org/Core/Latest
69 ftp://bio.perl.org/pub/DIST
70
71 Follow the installation instructions included in the README file.
72
73 =head1 DESCRIPTION
74
75 The Bio::Tools::Blast::Sbjct.pm module encapsulates data and methods for
76 parsing and manipulating "hits" from a BLAST report.
77 This module is a utility module used by the Bio::Tools::Blast.pm
78 and is not intended for separate use.
79
80 In Blast lingo, the "sbjct" sequences are all the sequences
81 in a target database which were compared against a "query" sequence.
82 The terms "sbjct" and "hit" will be used interchangeably in this and related modules.
83
84 This module supports BLAST versions 1.x and 2.x, gapped and ungapped.
85
86
87 =head2 HSP Tiling and Ambiguous Alignments
88
89 If a Blast hit has more than one HSP, the Bio::Tools::Blast::Sbjct.pm
90 object has the ability to merge overlapping HSPs into contiguous
91 blocks. This permits the Sbjct object to sum data across all HSPs
92 without counting data in the overlapping regions multiple times, which
93 would happen if data from each overlapping HSP are simply summed. HSP
94 tiling is performed automatically when methods of the Sbjct object
95 that rely on tiled data are invoked. These include
96 L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<gaps()|gaps>,
97 L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>,
98 L<num_unaligned_query()|num_unaligned_query>, L<num_unaligned_hit()|num_unaligned_hit>.
99
100 It also permits the assessment of an "ambiguous alignment" if the
101 query (or sbjct) sequences from different HSPs overlap. The existence
102 of an overlap could indicate a biologically interesting region in the
103 sequence, such as a repeated domain. The Sbjct object uses the
104 -OVERLAP parameter to determine when two sequences overlap; if this is
105 set to 2 -- the default -- then any two sbjct or query HSP sequences
106 must overlap by more than two residues to get merged into the same
107 contig and counted as an overlap. See the L<BUGS | BUGS> section below for
108 "issues" with HSP tiling.
109
110
111 The results of the HSP tiling is reported with the following ambiguity codes:
112
113 'q' = Query sequence contains multiple sub-sequences matching
114 a single region in the sbjct sequence.
115
116 's' = Sbjct sequence contains multiple sub-sequences matching
117 a single region in the query sequence.
118
119 'qs' = Both query and sbjct sequences contain more than one
120 sub-sequence with similarity to the other sequence.
121
122
123 For addition information about ambiguous BLAST alignments, see
124 L<_tile_hsps()|_tile_hsps> and
125
126 http://www-genome.stanford.edu/Sacch3D/help/ambig_aln.html
127
128 =head1 DEPENDENCIES
129
130 Bio::Tools::Blast::Sbjct.pm is a concrete class that inherits from B<Bio::Root::Object>
131 and relies on two other modules:
132
133 =over 4
134
135 =item B<Bio::Tools::Blast::HSP>
136
137 Encapsulates a single high-scoring segment pair within a hit.
138
139 =item B<Bio::Tools::Blast>
140
141 Provides a container for Sbjct.pm objects.
142
143 =back
144
145
146 Bio::Tools::Blast::Sbjct.pm does not currently inherit from
147 Bio::Root::Vector.pm since Bio::Root::Vector.pm may be re-designed to
148 make it usable via delegation. Thus, a Blast.pm object would manage a
149 vector of Sbjct.pm objects. Stay tuned.
150
151
152 =head1 BUGS
153
154 One consequence of the HSP tiling is that methods that rely on HSP
155 tiling such as L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<gaps()|gaps>
156 etc. may report misleading numbers when C<-OVERLAP> is set to a large
157 number. For example, say we have two HSPs and the query sequence tile
158 as follows:
159
160 1 8 22 30 40 60
161 Full seq: ------------------------------------------------------------
162 * ** * **
163 HSP1: --------------- (6 identical matches)
164 ** ** **
165 HSP2: ------------- (6 identical matches)
166
167
168 If C<-OVERLAP> is set to some number over 4, HSP1 and HSP2 will not be
169 tiled into a single contig and their numbers of identical matches will
170 be added, giving a total of 12, not 10 if they had be combined into
171 one contig. This can lead to number greater than 1.0 for methods
172 L<frac_identical()|frac_identical> and L<frac_conserved()|frac_conserved>. This is less of an issue
173 with gapped Blast since it tends to combine HSPs that would be listed
174 separately without gapping. (Fractions E<gt>1.0 can be viewed as a
175 signal for an interesting alignment that warrants further inspection,
176 thus turning this bug into a feature).
177
178 Using large values for C<-OVERLAP> can lead to incorrect numbers
179 reported by methods that rely on HSP tiling but can be useful if you
180 care more about detecting ambiguous alignments. Setting C<-OVERLAP>
181 to zero will lead to the most accurate numbers for the
182 tiling-dependent methods but will be useless for detecting overlapping
183 HSPs since all HSPs will appear to overlap.
184
185
186 =head1 SEE ALSO
187
188 Bio::Tools::Blast::HSP.pm - Blast HSP object.
189 Bio::Tools::Blast.pm - Blast object.
190 Bio::Root::Object.pm - Proposed base class for all Bioperl objects.
191
192 Links:
193
194 http://bio.perl.org/Core/POD/Tools/Blast/HSP.pm.html
195
196 http://bio.perl.org/Projects/modules.html - Online module documentation
197 http://bio.perl.org/Projects/Blast/ - Bioperl Blast Project
198 http://bio.perl.org/ - Bioperl Project Homepage
199
200
201 =head1 FEEDBACK
202
203 =head2 Mailing Lists
204
205 User feedback is an integral part of the evolution of this and other
206 Bioperl modules. Send your comments and suggestions preferably to one
207 of the Bioperl mailing lists. Your participation is much appreciated.
208
209 bioperl-l@bioperl.org - General discussion
210 http://bio.perl.org/MailList.html - About the mailing lists
211
212 =head2 Reporting Bugs
213
214 Report bugs to the Bioperl bug tracking system to help us keep track
215 the bugs and their resolution. Bug reports can be submitted via email
216 or the web:
217
218 bioperl-bugs@bio.perl.org
219 http://bugzilla.bioperl.org/
220
221 =head1 AUTHOR
222
223 Steve Chervitz, sac@bioperl.org
224
225 See the L<FEEDBACK | FEEDBACK> section for where to send bug reports and comments.
226
227 =head1 COPYRIGHT
228
229 Copyright (c) 1996-2000 Steve Chervitz. All Rights Reserved.
230 This module is free software; you can redistribute it and/or
231 modify it under the same terms as Perl itself.
232
233 =cut
234
235
236
237 #
238 ##
239 ###
240 #### END of main POD documentation.
241 ###
242 ##
243 #'
244
245
246 =head1 APPENDIX
247
248 Methods beginning with a leading underscore are considered private
249 and are intended for internal use by this module. They are
250 B<not> considered part of the public interface and are described here
251 for documentation purposes only.
252
253 =cut
254
255 #####################################################################################
256 ## CONSTRUCTOR ##
257 #####################################################################################
258
259 =head2 _initialize
260
261 Usage : n/a; automatically called by Bio::Root::Object::new()
262 : Bio::Tools::Blast::Sbjct.pm objects are constructed
263 : automatically by Bio::Tools::Blast.pm, so there is no need
264 : for direct consumption.
265 Purpose : Initializes key varaiables and calls methods to parse a single Blast hit.
266 : Constructs Bio::Tools::Blast::HSP.pm objects for each
267 : high-scoring segment pair (HSP).
268 : Calls superclass constructor first (Bio::Root::Object.pm).
269 Returns : n/a
270 Argument : Named Parameters passed from new()
271 : ALL TAGS MUST BE UPPERCASE (does not call _rearrange()).
272 : -DATA => array reference holding all data for a single hit.
273 : The first element should hold the description
274 : line (from the desctiption section at the top of
275 : the report), remaining lines should hold all lines
276 : within the HSP alignment listing section of report.
277 : -PARENT => object reference to a Bio::Tools::Blast.pm object.
278 : -NAME => string (typically same as -RANK, just a temporary
279 : name to use until the actual name of hit is parsed),
280 : -RANK => integer,
281 : -RANK_BY => 'order',
282 : -OVERLAP => integer (maximum overlap between adjacent
283 : HSPs when tiling)
284 : -PROGRAM => string (type of Blast blastp, blastn, etc).
285
286 See Also : L<_set_id()|_set_id>, L<_set_hsps()|_set_hsps>, L<_tile_hsps()|_tile_hsps>, B<Bio::Root::Object::new()>, B<Bio::Tools::Blast::_set_hit_db()>
287
288 =cut
289
290 #-------------------
291 sub _initialize {
292 #-------------------
293 my( $self, %param ) = @_;
294
295 # $make not currently used.
296 my $make = $self->SUPER::_initialize( %param );
297
298 # Set various class data.
299 $_prog = $param{-PROGRAM} || '';
300 $_signif_fmt = $param{-SIGNIF_FMT};
301
302 $self->{'_rank'} = $param{-RANK} || '';
303 $self->_set_id( $param{-DATA}->[0]);
304 $self->_set_hsps( @{$param{-DATA}} );
305
306 $self->{'_overlap'} = $param{-OVERLAP} || 0;
307 }
308
309 #--------------
310 sub destroy {
311 #--------------
312 my $self=shift;
313 if($self->{'_hsps'}) {
314 foreach($self->hsps) {
315 $_->destroy;
316 undef $_;
317 }
318 undef $self->{'_hsps'};
319 }
320 $DEBUG==2 && print STDERR "DESTROYING $self ${\$self->name}";
321 $self->SUPER::destroy;
322 }
323
324 #####################################################################################
325 ## ACCESSORS ##
326 #####################################################################################
327
328
329 =head2 rank
330
331 Usage : $sbjct->rank( integer or string );
332 Purpose : Sets/Gets the rank of the current Sbjct object relative to
333 : other Sbjct objects managed by a given Blast object.
334 Example : $sbjct->rank(1);
335 Returns : The current rank value.
336 Argument : Integer or string to be used for ranking the hit
337 : relative to other hits.
338 Throws : n/a
339 Comments : The rank usually corresponds to the order the listing
340 : of hits in the BLAST report from lowest to highest p-value.
341 : Rank need not be restricted to this value.
342 : rank() may be provided by a delegated or inherited
343 : iterator class in the future (such as Bio::Root::Vector.pm).
344
345 =cut
346
347 #-----------
348 sub rank {
349 #-----------
350 my $self = shift;
351 if(@_) {$self->{'_rank'} = shift; }
352 $self->{'_rank'};
353 }
354
355
356
357 =head2 _set_id
358
359 Usage : n/a; automatically called by _initialize()
360 Purpose : Sets the name of the Sbjct sequence from the BLAST summary line.
361 : The identifier is assumed to be the first
362 : chunk of non-whitespace characters in the description line
363 : Does not assume any semantics in the structure of the identifier
364 : (Formerly, this method attempted to extract database name from
365 : the seq identifiers, but this was prone to break).
366 Returns : n/a
367 Argument : String containing description line of the hit from Blast report
368 : or first line of an alignment section.
369 Throws : Warning if cannot locate sequence ID.
370
371 See Also : L<_initialize()|_initialize>, B<Bio::Tools::Blast::database()>
372
373 =cut
374
375 #---------------
376 sub _set_id {
377 #---------------
378 my( $self, $desc ) = @_;
379 my ($seqID1, $seqID2, $dbID, @seqDat);
380
381 local $_ = $desc;
382 my @linedat = split();
383 my $data = $linedat[0];
384
385 # New strategy: Assume only that the ID is the first white space
386 # delimited chunk. Not attempting to extract database name.
387 # Clients will have to interpret it as necessary.
388 if($data =~ /^(\S+)\s*/) {
389 $self->name($1);
390 }
391 else {
392 $self->warn("Can't locate sequence identifier in summary line.", "Line = $data");
393 $data = 'Unknown sequence ID' if not $data;
394 $self->name($data);
395 }
396 $self->{'_db'} = '-';
397
398 # Old strategy: assumes semantics in the identifier
399 # and tries to separate out database and id components.
400 # Too fancy and fragile! SAC, 2000-02-18
401
402 # # Proceeding from more standard (NCBI-like) to less standard.
403 # if($data =~ /(\S+?)[:\|]+(\S+?)[:\|]+(\S*)/) {
404 # # matches: database|id1|id2 or database||id1||id2 or database:id1:id2
405 # $dbID = $1;
406 # $seqID1 = $2;
407 # $seqID2 = $3;
408 # if($seqID2 eq $seqID1) { undef($seqID2); }
409 #
410 # } elsif($data =~ /(\S+?)[:\|]+(\S+)/) {
411 # # matches: database|id1 or database:id1
412 # $dbID = $1;
413 # $seqID1 = $2;
414 #
415 # } elsif($data =~ /^(\S+)\s+([gb|emb|dbj|sp|pir])\s+(\S+)*/) {
416 # # matches: id1 database id2
417 # $seqID1 = $1;
418 # $dbID = $2;
419 # $seqID2 = $3;
420 #
421 # } elsif($data =~ /^(\S+)\s*/) {
422 # $seqID1 = $1;
423 # }
424 #
425 # ## Combine the multiple IDs.
426 # $seqID2 = scalar($seqID2) ? "/$seqID2" : '';
427 #
428 # if( !scalar $seqID1) {
429 # $self->warn("Can't locate sequence identifier in summary line.", "Line = $data");
430 # $self->name('Unknown sequence ID');
431 # } else {
432 # $self->name($seqID1.$seqID2);
433 # }
434 # $self->{'_db'} = $dbID || '-';
435 }
436
437
438 =head2 _set_hsps
439
440 Usage : n/a; called automatically during object construction.
441 Purpose : Creates HSP.pm objects for each HSP in a BLAST hit alignment.
442 : Also collects the full description of the hit from the
443 : HSP alignment section.
444 Returns : n/a
445 Argument : List of strings containing raw BLAST report data for
446 : a single hit's HSP alignment data.
447 Throws : Warnings for each HSP.pm object that fails to be constructed.
448 : Exception if no HSP.pm objects can be constructed.
449 : Exception if can't parse length data for hit sequence.
450 Comments : Requires Bio::Tools::Blast::HSP.pm.
451 : Sets the description using the full string present in
452 : the alignment data.
453 : Also sets Expect and P-values for the Sbjct object by
454 : copying from the HSP object.
455 : While this sacrifices some memory efficiency, it
456 : improves access speed for these critical data.
457
458 See Also : L<_initialize()|_initialize>, L<_set_desc()|_set_desc>
459
460 =cut
461
462 #--------------
463 sub _set_hsps {
464 #--------------
465
466 my( $self, @data ) = @_;
467 my $start = 0;
468 my $hspCount = 0;
469
470 require Bio::Tools::Blast::HSP;
471
472 # printf STDERR "$ID _set_hsps(). DATA (%d lines) =\n@data\n", scalar(@data); <STDIN>;
473
474 my( @hspData, @hspList, @errs, @bad_names );
475 my($line, $set_desc, @desc);
476 $set_desc = 0;
477
478 hit_loop:
479 foreach $line( @data ) {
480
481 if( $line =~ /^\s*Length = ([\d,]+)/ ) {
482 $self->_set_desc(@desc);
483 $set_desc = 1;
484 ($self->{'_length'} = $1) =~ s/,//g; # get rid of commas
485 next hit_loop;
486 } elsif( !$set_desc) {
487 $line =~ s/^\s+|\s+$//g;
488 push @desc, $line;
489 next hit_loop;
490 } elsif( $line =~ /^\s*Score/ ) {
491 ## This block is for setting multiple HSPs.
492
493 if( not scalar @hspData ) {
494 $start = 1;
495 push @hspData, $line;
496 next hit_loop;
497
498 } elsif( scalar @hspData) {
499 $hspCount++;
500 $DEBUG and do{ print STDERR +( $hspCount % 10 ? "+" : "+\n" ); };
501
502 # print STDERR "\n$ID: setting HSP: ${\$self->name}\n";
503 my $hspObj = eval { new Bio::Tools::Blast::HSP(-DATA =>\@hspData,
504 -PARENT =>$self,
505 -NAME =>$hspCount,
506 -PROGRAM =>$_prog,
507 );
508 };
509 if($@) {
510 # print "$ID: ERROR:\n$@";<STDIN>;
511 push @errs, $@;
512 push @bad_names, "#$hspCount";
513 $hspObj->destroy if ref $hspObj;
514 undef $hspObj;
515 } else {
516 push @hspList, $hspObj;
517 if (!defined($self->{'_expect'}) || $hspObj->expect() < $self->{'_expect'}) {
518 $self->{'_expect'} = $hspObj->expect();
519 }
520 if (!defined($self->{'_p'}) || $hspObj->p() < $self->{'_p'}) {
521 $self->{'_p'} = $hspObj->p();
522 }
523 }
524 @hspData = ();
525 push @hspData, $line;
526 next;
527 } else {
528 push @hspData, $line;
529 }
530 } elsif( $start ) {
531 ## This block is for setting the last HSP (which may be the first as well!).
532 if( $line =~ /^(end|>|Parameters|CPU|Database:)/ ) {
533 $hspCount++;
534 $DEBUG and do{ print STDERR +( $hspCount % 10 ? "+" : "+\n" ); };
535
536 # print STDERR "\n$ID: setting HSP: ${\$self->name}\n";
537
538 my $hspObj = eval { new Bio::Tools::Blast::HSP(-DATA =>\@hspData,
539 -PARENT =>$self,
540 -NAME =>$hspCount,
541 -PROGRAM =>$_prog,
542 );
543 };
544 if($@) {
545 # print "$ID: ERROR:\n$@";<STDIN>;
546 push @errs, $@;
547 push @bad_names, "#$hspCount";
548 $hspObj->destroy if ref $hspObj;
549 undef $hspObj;
550 } else {
551 push @hspList, $hspObj;
552 if (!defined($self->{'_expect'}) || $hspObj->expect() < $self->{'_expect'}) {
553 $self->{'_expect'} = $hspObj->expect();
554 }
555 if (!defined($self->{'_p'}) || $hspObj->p() < $self->{'_p'}) {
556 $self->{'_p'} = $hspObj->p();
557 }
558 }
559 } else {
560 push @hspData, $line;
561 }
562 }
563 }
564
565 # print STDERR "\n--------> Done building HSPs for ${\$self->name}\n";
566
567 $self->{'_length'} or $self->throw( "Can't determine hit sequence length.");
568
569 # Adjust logical length based on BLAST flavor.
570 if($_prog =~ /TBLAST[NX]/) {
571 $self->{'_logical_length'} = $self->{'_length'} / 3;
572 }
573
574 # Handling errors as done in Blast.pm. (as of version 0.073)
575
576 if(@errs) {
577 my ($str);
578 # When there are many errors, in most of the cases, they are
579 # caused by the same problem. Only need to see full data for
580 # the first one.
581 if(@errs > 2) {
582 $str = "SHOWING FIRST EXCEPTION ONLY:\n$errs[0]";
583 $self->clear_err(); # clearing the existing set of errors.
584 # Not necessary, unless the -RECORD_ERR =>1
585 # constructor option was used for Blast object.
586 } else {
587 $str = join("\n",@errs);
588 }
589
590 if( not scalar @hspList) {
591 $self->throw("Failed to create any HSP objects for $hspCount potential HSP(s).",
592 "\n\nTRAPPED EXCEPTION(S):\n$str\nEND TRAPPED EXCEPTION(S)\n"
593 );
594 } else {
595 $self->warn(sprintf("Could not create HSP objects for %d HSP(s): %s", scalar(@errs), join(', ',@bad_names)),
596 "\n\nTRAPPED EXCEPTION(S):\n$str\nEND TRAPPED EXCEPTION(S)\n"
597 );
598 }
599
600 } else {
601 $self->{'_hsps'} = \@hspList;
602 }
603 }
604
605 =head2 _set_desc
606
607 Usage : n/a; called automatically by _set_hsps()
608 Purpose : Sets the description of the hit sequence.
609 : For sequence without descriptions, sets description to "-".
610 Argument : Array containing description (multiple lines).
611 Comments : _set_hsps() calls this method with the data from the
612 : HSP alignment listing, which contains the complete description.
613 : (Formerly, this was called from the _set_desc_data() method initially.)
614
615 See Also : _set_hsps()
616
617 =cut
618
619 #--------------
620 sub _set_desc {
621 #--------------
622 my( $self, @desc ) = @_;
623 my( $desc);
624
625 # print "$ID: RAW DESC:\n@desc";<STDIN>;
626
627 $desc = join(" ", @desc);
628
629 if($desc) {
630 $desc =~ s/^\s*\S+\s+//; # remove the sequence ID(s)
631 $desc =~ s/^[\s!]+//;
632 $desc =~ s/ \d+$//;
633 $desc =~ s/\.+$//;
634 $self->{'_desc'} = $desc || '-';
635 } else {
636 $self->{'_desc'} = '-';
637 }
638
639 # print "$ID: _set_desc = $desc";<STDIN>;
640 }
641
642
643 =head2 _tile_hsps
644
645 Usage : n/a; called automatically during object construction or
646 : as needed by methods that rely on having tiled data.
647 Purpose : Collect statistics about the aligned sequences in a set of HSPs.
648 : Calculates the following data across all HSPs:
649 : -- total alignment length
650 : -- total identical residues
651 : -- total conserved residues
652 Returns : n/a
653 Argument : n/a
654 Throws : n/a
655 Status : Experimental
656 Comments :
657 : This method performs more careful summing of data across
658 : all HSPs in the Sbjct object. Simply summing the data from all HSPs
659 : will overestimate the actual length of the alignment if there is
660 : overlap between different HSPs (often the case).
661 : The strategy is to tile the HSPs and sum over the
662 : contigs, collecting data separately from overlapping and
663 : non-overlapping regions of each HSP. To facilitate this, the
664 : HSP.pm object now permits extraction of data from sub-sections
665 : of an HSP.
666 :
667 : Additional useful information is collected from the results
668 : of the tiling. It is possible that sub-sequences in
669 : different HSPs will overlap significantly. In this case, it
670 : is impossible to create a single unambiguous alignment by
671 : concatenating the HSPs. The ambiguity may indicate the
672 : presence of multiple, similar domains in one or both of the
673 : aligned sequences. This ambiguity is recorded using the
674 : ambiguous_aln() method.
675 :
676 : This method does not attempt to discern biologically
677 : significant vs. insignificant overlaps. The allowable amount of
678 : overlap can be set with the overlap() method or with the -OVERLAP
679 : parameter used when constructing the Blast & Sbjct objects.
680 :
681 : For a given hit, both the query and the sbjct sequences are
682 : tiled independently.
683 :
684 : -- If only query sequence HSPs overlap,
685 : this may suggest multiple domains in the sbjct.
686 : -- If only sbjct sequence HSPs overlap,
687 : this may suggest multiple domains in the query.
688 : -- If both query & sbjct sequence HSPs overlap,
689 : this suggests multiple domains in both.
690 : -- If neither query & sbjct sequence HSPs overlap,
691 : this suggests either no multiple domains in either
692 : sequence OR that both sequences have the same
693 : distribution of multiple similar domains.
694 :
695 : This method can deal with the special case of when multiple
696 : HSPs exactly overlap.
697 :
698 : Efficiency concerns:
699 : Speed will be an issue for sequences with numerous HSPs.
700 :
701 Bugs : Currently, _tile_hsps() does not properly account for
702 : the number of non-tiled but overlapping HSPs, which becomes a problem
703 : as overlap() grows. Large values overlap() may thus lead to
704 : incorrect statistics for some hits. For best results, keep overlap()
705 : below 5 (DEFAULT IS 2). For more about this, see the "HSP Tiling and
706 : Ambiguous Alignments" section.
707
708 See Also : L<_adjust_contigs()|_adjust_contigs>, L<ambiguous_aln()|ambiguous_aln>, L<overlap()|overlap>, L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>, L<num_unaligned_query()|num_unaligned_query>, L<num_unaligned_hit()|num_unaligned_hit>, L<HSP Tiling and Ambiguous Alignments>
709
710 =cut
711
712 #--------------
713 sub _tile_hsps {
714 #--------------
715 my $self = shift;
716 # my $gapped = $self->parent->gapped || 0; # no special treatment
717
718 $self->{'_tile_hsps'} = 1;
719 $self->{'_gaps_query'} = 0;
720 $self->{'_gaps_sbjct'} = 0;
721
722 ## Simple summation scheme. Valid if there is only one HSP.
723 if((defined($self->{'_n'}) and $self->{'_n'} == 1) or $self->num_hsps == 1) {
724 my $hsp = $self->hsp;
725 $self->{'_length_aln_query'} = $hsp->length('query');
726 $self->{'_length_aln_sbjct'} = $hsp->length('sbjct');
727 $self->{'_length_aln_total'} = $hsp->length('total');
728 ($self->{'_totalIdentical'},$self->{'_totalConserved'}) = $hsp->matches();
729 $self->{'_gaps_query'} = $hsp->gaps('query');
730 $self->{'_gaps_sbjct'} = $hsp->gaps('sbjct');
731
732 # print "_tile_hsps(): single HSP, easy stats.\n";
733 return;
734 } else {
735 # print STDERR "$ID: _tile_hsps: summing multiple HSPs\n";
736 $self->{'_length_aln_query'} = 0;
737 $self->{'_length_aln_sbjct'} = 0;
738 $self->{'_length_aln_total'} = 0;
739 $self->{'_totalIdentical'} = 0;
740 $self->{'_totalConserved'} = 0;
741 }
742
743 ## More than one HSP. Must tile HSPs.
744 # printf "\nTiling HSPs for %s (BLAST: %s)\n",$self->name, $self->parent->name;
745 my($hsp, $qstart, $qstop, $sstart, $sstop);
746 my(@qcontigs, @scontigs);
747 my $qoverlap = 0;
748 my $soverlap = 0;
749 my $max_overlap = $self->{'_overlap'};
750
751 foreach $hsp ($self->hsps()) {
752 # printf " HSP: %s\n%s\n",$hsp->name, $hsp->str('query');
753 # printf " Length = %d; Identical = %d; Conserved = %d; Conserved(1-10): %d",$hsp->length, $hsp->length(-TYPE=>'iden'), $hsp->length(-TYPE=>'cons'), $hsp->length(-TYPE=>'cons',-START=>0,-STOP=>10); <STDIN>;
754 ($qstart, $qstop) = $hsp->range('query');
755 ($sstart, $sstop) = $hsp->range('sbjct');
756
757 my ($qgaps, $sgaps) = $hsp->gaps();
758 $self->{'_gaps_query'} += $qgaps;
759 $self->{'_gaps_sbjct'} += $sgaps;
760
761 $self->{'_length_aln_total'} += $hsp->length;
762 ## Collect contigs in the query sequence.
763 $qoverlap = &_adjust_contigs('query', $hsp, $qstart, $qstop, \@qcontigs, $max_overlap);
764
765 ## Collect contigs in the sbjct sequence (needed for domain data and gapped Blast).
766 $soverlap = &_adjust_contigs('sbjct', $hsp, $sstart, $sstop, \@scontigs, $max_overlap);
767
768 ## Collect overall start and stop data for query and sbjct over all HSPs.
769 if(not defined $self->{'_queryStart'}) {
770 $self->{'_queryStart'} = $qstart;
771 $self->{'_queryStop'} = $qstop;
772 $self->{'_sbjctStart'} = $sstart;
773 $self->{'_sbjctStop'} = $sstop;
774 } else {
775 $self->{'_queryStart'} = ($qstart < $self->{'_queryStart'} ? $qstart : $self->{'_queryStart'});
776 $self->{'_queryStop'} = ($qstop > $self->{'_queryStop'} ? $qstop : $self->{'_queryStop'});
777 $self->{'_sbjctStart'} = ($sstart < $self->{'_sbjctStart'} ? $sstart : $self->{'_sbjctStart'});
778 $self->{'_sbjctStop'} = ($sstop > $self->{'_sbjctStop'} ? $sstop : $self->{'_sbjctStop'});
779 }
780 }
781
782 ## Collect data across the collected contigs.
783
784 # print "\nQUERY CONTIGS:\n";
785 # print " gaps = $self->{'_gaps_query'}\n";
786
787 foreach(@qcontigs) {
788 # print " query contig: $_->{'start'} - $_->{'stop'}\n";
789 # print " iden = $_->{'iden'}; cons = $_->{'cons'}\n";
790 $self->{'_length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1;
791 $self->{'_totalIdentical'} += $_->{'iden'};
792 $self->{'_totalConserved'} += $_->{'cons'};
793 }
794
795 ## Collect data for sbjct contigs. Important for gapped Blast.
796 ## The totalIdentical and totalConserved numbers will be the same
797 ## as determined for the query contigs.
798
799 # print "\nSBJCT CONTIGS:\n";
800 # print " gaps = $self->{'_gaps_sbjct'}\n";
801
802 foreach(@scontigs) {
803 # print " sbjct contig: $_->{'start'} - $_->{'stop'}\n";
804 # print " iden = $_->{'iden'}; cons = $_->{'cons'}\n";
805 $self->{'_length_aln_sbjct'} += $_->{'stop'} - $_->{'start'} + 1;
806 }
807 # <STDIN>;
808
809 if($qoverlap) {
810 if($soverlap) { $self->ambiguous_aln('qs');
811 # print "\n*** AMBIGUOUS ALIGNMENT: Query and Sbjct\n\n";
812 }
813 else { $self->ambiguous_aln('q');
814 # print "\n*** AMBIGUOUS ALIGNMENT: Query\n\n";
815 }
816 } elsif($soverlap) {
817 $self->ambiguous_aln('s');
818 # print "\n*** AMBIGUOUS ALIGNMENT: Sbjct\n\n";
819 }
820
821 # Adjust length based on BLAST flavor.
822 my $prog = $self->parent->program;
823 if($prog eq 'TBLASTN') {
824 $self->{'_length_aln_sbjct'} /= 3;
825 } elsif($prog eq 'BLASTX' ) {
826 $self->{'_length_aln_query'} /= 3;
827 } elsif($prog eq 'TBLASTX') {
828 $self->{'_length_aln_query'} /= 3;
829 $self->{'_length_aln_sbjct'} /= 3;
830 }
831 }
832
833
834
835 =head2 _adjust_contigs
836
837 Usage : n/a; called automatically during object construction.
838 Purpose : Builds HSP contigs for a given BLAST hit.
839 : Utility method called by _tile_hsps()
840 Returns :
841 Argument :
842 Throws : Exceptions propagated from Bio::Tools::Blast::HSP::matches()
843 : for invalid sub-sequence ranges.
844 Status : Experimental
845 Comments : This method does not currently support gapped alignments.
846 : Also, it does not keep track of the number of HSPs that
847 : overlap within the amount specified by overlap().
848 : This will lead to significant tracking errors for large
849 : overlap values.
850
851 See Also : L<overlap()|overlap>, L<_tile_hsps()|_tile_hsps>, B<Bio::Tools::Blast::HSP>::matches
852
853 =cut
854
855 #-------------------
856 sub _adjust_contigs {
857 #-------------------
858 my ($seqType, $hsp, $start, $stop, $contigs_ref, $max_overlap) = @_;
859
860 my $overlap = 0;
861 my ($numID, $numCons);
862
863 # print "Testing $seqType data: HSP (${\$hsp->name}); $start, $stop\n";
864 foreach(@$contigs_ref) {
865 # print " Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
866
867 ## Test special case of a nested HSP. Skip it.
868 if($start >= $_->{'start'} and $stop <= $_->{'stop'}) {
869 # print "----> Nested HSP. Skipping.\n";
870 $overlap = 1;
871 next;
872 }
873
874 ## Test for overlap at beginning of contig.
875 if($start < $_->{'start'} and $stop > ($_->{'start'} + $max_overlap)) {
876 # print "----> Overlaps beg: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";<STDIN>;
877 # Collect stats over the non-overlapping region.
878 eval {
879 ($numID, $numCons) = $hsp->matches(-SEQ =>$seqType,
880 -START =>$start,
881 -STOP =>$_->{'start'}-1);
882 };
883 if($@) { warn "\a\n$@\n"; }
884 else {
885 $_->{'start'} = $start; # Assign a new start coordinate to the contig
886 $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
887 $_->{'cons'} += $numCons;
888 $overlap = 1;
889 }
890 }
891
892 ## Test for overlap at end of contig.
893 if($stop > $_->{'stop'} and $start < ($_->{'stop'} - $max_overlap)) {
894 # print "----> Overlaps end: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n";<STDIN>;
895 # Collect stats over the non-overlapping region.
896 eval {
897 ($numID,$numCons) = $hsp->matches(-SEQ =>$seqType,
898 -START =>$_->{'stop'},
899 -STOP =>$stop);
900 };
901 if($@) { warn "\a\n$@\n"; }
902 else {
903 $_->{'stop'} = $stop; # Assign a new stop coordinate to the contig
904 $_->{'iden'} += $numID; # and add new data to #identical, #conserved.
905 $_->{'cons'} += $numCons;
906 $overlap = 1;
907 }
908 }
909 $overlap && do {
910 # print " New Contig data:\n";
911 # print " Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n";
912 last;
913 };
914 }
915 ## If there is no overlap, add the complete HSP data.
916 !$overlap && do {
917 # print "No overlap. Adding new contig.\n";
918 ($numID,$numCons) = $hsp->matches(-SEQ=>$seqType);
919 push @$contigs_ref, {'start'=>$start, 'stop'=>$stop,
920 'iden'=>$numID, 'cons'=>$numCons };
921 };
922 # <STDIN>;
923 $overlap;
924 }
925
926
927 =head2 ambiguous_aln
928
929 Usage : $ambig_code = $sbjct_object->ambiguous_aln();
930 Purpose : Sets/Gets ambiguity code data member.
931 Example : (see usage)
932 Returns : String = 'q', 's', 'qs', '-'
933 : 'q' = query sequence contains overlapping sub-sequences
934 : while sbjct does not.
935 : 's' = sbjct sequence contains overlapping sub-sequences
936 : while query does not.
937 : 'qs' = query and sbjct sequence contains overlapping sub-sequences
938 : relative to each other.
939 : '-' = query and sbjct sequence do not contains multiple domains
940 : relative to each other OR both contain the same distribution
941 : of similar domains.
942 Argument : n/a
943 Throws : n/a
944 Status : Experimental
945
946 See Also : L<_tile_hsps()|_tile_hsps>, L<HSP Tiling and Ambiguous Alignments>
947
948 =cut
949
950 #--------------------
951 sub ambiguous_aln {
952 #--------------------
953 my $self = shift;
954 if(@_) { $self->{'_ambiguous_aln'} = shift; }
955 $self->{'_ambiguous_aln'} || '-';
956 }
957
958
959
960 =head2 overlap
961
962 Usage : $blast_object->overlap( [integer] );
963 Purpose : Gets/Sets the allowable amount overlap between different HSP sequences.
964 Example : $blast_object->overlap(5);
965 : $overlap = $blast_object->overlap();
966 Returns : Integer.
967 Argument : integer.
968 Throws : n/a
969 Status : Experimental
970 Comments : Any two HSPs whose sequences overlap by less than or equal
971 : to the overlap() number of resides will be considered separate HSPs
972 : and will not get tiled by _adjust_contigs().
973
974 See Also : L<_adjust_contigs()|_adjust_contigs>, L<BUGS | BUGS>
975
976 =cut
977
978 #-------------
979 sub overlap {
980 #-------------
981 my $self = shift;
982 if(@_) { $self->{'_overlap'} = shift; }
983 defined $self->{'_overlap'} ? $self->{'_overlap'} : 0;
984 }
985
986
987
988
989 =head2 score
990
991 Usage : $sbjct_object->score();
992 Purpose : Gets the BLAST score of the best HSP for the current Blast hit.
993 Example : $score = $sbjct_object->score();
994 Returns : Integer
995 Argument : n/a
996 Throws : n/a
997
998 See Also : L<bits()|bits>
999
1000 =cut
1001
1002 #----------
1003 sub score {
1004 #----------
1005 my $self = shift;
1006
1007 # The check for $self->{'_score'} is a remnant from the 'query' mode days
1008 # in which the sbjct object would collect data from the description line only.
1009
1010 my ($score);
1011 if(not defined($self->{'_score'})) {
1012 $score = $self->hsp->score;
1013 } else {
1014 $score = $self->{'_score'};
1015 }
1016 return $score;
1017 }
1018
1019
1020
1021 =head2 bits
1022
1023 Usage : $sbjct_object->bits();
1024 Purpose : Gets the BLAST bit score of the best HSP for the current Blast hit.
1025 Example : $bits = $sbjct_object->bits();
1026 Returns : Integer
1027 Argument : n/a
1028 Throws : Exception if bit score is not set.
1029 Comments : For BLAST1, the non-bit score is listed in the summary line.
1030
1031 See Also : L<score()|score>
1032
1033 =cut
1034
1035 #---------
1036 sub bits {
1037 #---------
1038 my $self = shift;
1039
1040 # The check for $self->{'_bits'} is a remnant from the 'query' mode days
1041 # in which the sbjct object would collect data from the description line only.
1042
1043 my ($bits);
1044 if(not defined($self->{'_bits'})) {
1045 $bits = $self->hsp->bits;
1046 } else {
1047 $bits = $self->{'_bits'};
1048 }
1049 return $bits;
1050 }
1051
1052
1053
1054 =head2 n
1055
1056 Usage : $sbjct_object->n();
1057 Purpose : Gets the N number for the current Blast hit.
1058 : This is the number of HSPs in the set which was ascribed
1059 : the lowest P-value (listed on the description line).
1060 : This number is not the same as the total number of HSPs.
1061 : To get the total number of HSPs, use num_hsps().
1062 Example : $n = $sbjct_object->n();
1063 Returns : Integer
1064 Argument : n/a
1065 Throws : Exception if HSPs have not been set (BLAST2 reports).
1066 Comments : Note that the N parameter is not reported in gapped BLAST2.
1067 : Calling n() on such reports will result in a call to num_hsps().
1068 : The num_hsps() method will count the actual number of
1069 : HSPs in the alignment listing, which may exceed N in
1070 : some cases.
1071
1072 See Also : L<num_hsps()|num_hsps>
1073
1074 =cut
1075
1076 #-----
1077 sub n {
1078 #-----
1079 my $self = shift;
1080
1081 # The check for $self->{'_n'} is a remnant from the 'query' mode days
1082 # in which the sbjct object would collect data from the description line only.
1083
1084 my ($n);
1085 if(not defined($self->{'_n'})) {
1086 $n = $self->hsp->n;
1087 } else {
1088 $n = $self->{'_n'};
1089 }
1090 $n ||= $self->num_hsps;
1091
1092 return $n;
1093 }
1094
1095
1096
1097 =head2 frame
1098
1099 Usage : $sbjct_object->frame();
1100 Purpose : Gets the reading frame for the hit sequence (TBLASTN/X only).
1101 Example : $frame = $sbjct_object->frame();
1102 Returns : Integer (-3 .. +3).
1103 Argument : n/a
1104 Throws : Exception if HSPs have not been set (BLAST2 reports).
1105
1106 See Also : L<hsps()|hsps>
1107
1108 =cut
1109
1110 #----------
1111 sub frame {
1112 #----------
1113 my $self = shift;
1114
1115 # The check for $self->{'_frame'} is a remnant from the 'query' mode days
1116 # in which the sbjct object would collect data from the description line only.
1117
1118 my ($frame);
1119 if(not defined($self->{'_frame'})) {
1120 $frame = $self->hsp->frame;
1121 } else {
1122 $frame = $self->{'_frame'};
1123 }
1124 return $frame;
1125 }
1126
1127
1128
1129
1130
1131 =head2 p
1132
1133 Usage : $sbjct_object->p( [format] );
1134 Purpose : Get the P-value for the given BLAST hit.
1135 : (Note that P-values are not provided with NCBI Blast2 reports).
1136 Example : $p = $sbjct->p;
1137 : $p = $sbjct->p('exp'); # get exponent only.
1138 : ($num, $exp) = $sbjct->p('parts'); # split sci notation into parts
1139 Returns : Float or scientific notation number (the raw P-value, DEFAULT).
1140 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
1141 : 2-element list (float, int) if format == 'parts' and P-value
1142 : is in scientific notation (See Comments).
1143 Argument : format: string of 'raw' | 'exp' | 'parts'
1144 : 'raw' returns value given in report. Default. (1.2e-34)
1145 : 'exp' returns exponent value only (34)
1146 : 'parts' returns the decimal and exponent as a
1147 : 2-element list (1.2, -34) (See Comments).
1148 Throws : Exception if the P-value is not defined, which will occur
1149 : with any NCBI Blast2 report.
1150 Comments : Using the 'parts' argument is not recommended since it will not
1151 : work as expected if the P-value is not in scientific notation.
1152 : That is, floats are not converted into sci notation before
1153 : splitting into parts.
1154
1155 See Also : L<expect()|expect>, L<signif()|signif>, L<get_exponent()|get_exponent>
1156
1157 =cut
1158
1159 #--------
1160 sub p {
1161 #--------
1162 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
1163 my ($self, $fmt) = @_;
1164
1165 my ($val);
1166 $fmt ||= $_signif_fmt;
1167
1168 # $val can be zero.
1169 if(not defined($val = $self->{'_p'})) {
1170 ## P-value not defined, must be a NCBI Blast2 report.
1171 my $note = '';
1172 if($self->parent->_layout() == 2) {
1173 $note = "Blast2 does not report P-values. Use expect() instead.";
1174 }
1175 $self->throw("Can't get P-value: undefined.", $note);
1176 }
1177
1178 return $val if not $fmt or $fmt =~ /^raw/i;
1179 ## Special formats: exponent-only or as list.
1180 return &get_exponent($val) if $fmt =~ /^exp/i;
1181 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
1182
1183 ## Default: return the raw P-value.
1184 return $val;
1185 }
1186
1187
1188
1189 =head2 expect
1190
1191 Usage : $sbjct_object->expect( [format] );
1192 Purpose : Get the Expect value for the given BLAST hit.
1193 Example : $e = $sbjct->expect;
1194 : $e = $sbjct->expect('exp'); # get exponent only.
1195 : ($num, $exp) = $sbjct->expect('parts'); # split sci notation into parts
1196 Returns : Float or scientific notation number (the raw expect value, DEFAULT).
1197 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
1198 : 2-element list (float, int) if format == 'parts' and Expect
1199 : is in scientific notation (see Comments).
1200 Argument : format: string of 'raw' | 'exp' | 'parts'
1201 : 'raw' returns value given in report. Default. (1.2e-34)
1202 : 'exp' returns exponent value only (34)
1203 : 'parts' returns the decimal and exponent as a
1204 : 2-element list (1.2, -34) (see Comments).
1205 Throws : Exception if the Expect value is not defined.
1206 Comments : Using the 'parts' argument is not recommended since it will not
1207 : work as expected if the expect value is not in scientific notation.
1208 : That is, floats are not converted into sci notation before
1209 : splitting into parts.
1210
1211 See Also : L<p()|p>, L<signif()|signif>, L<get_exponent()|get_exponent>
1212
1213 =cut
1214
1215 #-----------
1216 sub expect {
1217 #-----------
1218 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
1219 my ($self, $fmt) = @_;
1220
1221 my $val = $self->{'_expect'};
1222 $fmt ||= $_signif_fmt;
1223
1224 # $val can be zero.
1225 defined($val) or $self->throw("Can't get Expect value: HSPs may not have been set.");
1226
1227 return $val if not $fmt or $fmt =~ /^raw/i;
1228 ## Special formats: exponent-only or as list.
1229 return &get_exponent($val) if $fmt =~ /^exp/i;
1230 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
1231
1232 ## Default: return the raw Expect-value.
1233 return $val;
1234 }
1235
1236
1237
1238 =head2 signif
1239
1240 Usage : $sbjct_object->signif( [format] );
1241 Purpose : Get the P or Expect value for the given BLAST hit.
1242 : The value returned is the one which is reported in the description
1243 : section of the Blast report. For Blast1 and WU-Blast2, this
1244 : is a P-value, for Blast2, it is an Expect value.
1245 Example : $obj->signif() # returns 1.3e-34
1246 : $obj->signif('exp') # returns -34
1247 : $obj->signif('parts') # returns (1.3, -34)
1248 Returns : Float or scientific notation number (the raw P/Expect value, DEFAULT).
1249 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
1250 : 2-element list (float, int) if format == 'parts' and P/Expect value
1251 : is in scientific notation (see Comments).
1252 Argument : format: string of 'raw' | 'exp' | 'parts'
1253 : 'raw' returns value given in report. Default. (1.2e-34)
1254 : 'exp' returns exponent value only (34)
1255 : 'parts' returns the decimal and exponent as a
1256 : 2-element list (1.2, -34) (see Comments).
1257 Throws : n/a
1258 Status : Deprecated. Use p() or expect().
1259 Comments : The signif() method provides a way to deal with the fact that
1260 : Blast1 and Blast2 formats differ in what is reported in the
1261 : description lines of each hit in the Blast report. The signif()
1262 : method frees any client code from having to know if this is a P-value
1263 : or an Expect value, making it easier to write code that can process
1264 : both Blast1 and Blast2 reports. This is not necessarily a good thing, since
1265 : one should always know when one is working with P-values or
1266 : Expect values (hence the deprecated status).
1267 : Use of expect() is recommended since all hits will have an Expect value.
1268 :
1269 : Using the 'parts' argument is not recommended since it will not
1270 : work as expected if the expect value is not in scientific notation.
1271 : That is, floats are not converted into sci notation before
1272 : splitting into parts.
1273
1274 See Also : L<p()|p>, L<expect()|expect>, L<get_exponent()|get_exponent>
1275
1276 =cut
1277
1278 #-------------
1279 sub signif {
1280 #-------------
1281 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
1282 my ($self, $fmt) = @_;
1283
1284 my $val = defined($self->{'_p'}) ? $self->{'_p'} : $self->{'_expect'};
1285 $fmt ||= $_signif_fmt;
1286
1287 # $val can be zero.
1288 defined($val) or $self->throw("Can't get P- or Expect value: HSPs may not have been set.");
1289
1290 return $val if not $fmt or $fmt =~ /^raw/i;
1291 ## Special formats: exponent-only or as list.
1292 return &get_exponent($val) if $fmt =~ /^exp/i;
1293 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
1294
1295 ## Default: return the raw P/Expect-value.
1296 return $val;
1297 }
1298
1299
1300
1301 =head2 desc
1302
1303 Usage : $sbjct_object->desc( [integer] );
1304 Purpose : Get the description for the given BLAST hit.
1305 Example : (see usage)
1306 Returns : String
1307 Argument : Integer (optional) indicating the desired length of the
1308 : description string to be returned.
1309 Throws : n/a
1310
1311 See Also : L<_set_desc()|_set_desc>
1312
1313 =cut
1314
1315 #---------
1316 sub desc {
1317 #---------
1318 my( $self, $len ) = @_;
1319 $len = (defined $len) ? $len : (CORE::length $self->{'_desc'});
1320 substr( $self->{'_desc'}, 0 ,$len );
1321 }
1322
1323
1324
1325 =head2 database
1326
1327 Usage : $sbjct_object->database();
1328 Purpose : Get the name of the database for the hit sequence.
1329 Example : (see usage)
1330 Returns : String
1331 Argument : n/a
1332 Throws : n/a
1333 Status : Deprecated. Use Bio::Tools::Blast::database()
1334 Extracting database name from the seq identifier is error prone.
1335 Comments : Database id should be the same for all hits in a given
1336 : BLAST report, however, they do not always have the same
1337 : name as the database name extraced by the Blast.pm object.
1338 : The Sbjct.pm database id is obtained from the summary line.
1339
1340 =cut
1341
1342 #--------------
1343 sub database {
1344 my $self = shift;
1345 $self->warn("Bio::Tools::Sbjct::database() is deprecated.\nNo useful information is provided by this method.\nUse Bio::Tools::Blast::database().\n");
1346 return $self->{'_db'};
1347 }
1348 #--------------
1349
1350
1351
1352
1353 =head2 hsps
1354
1355 Usage : $sbjct_object->hsps();
1356 Purpose : Get a list containing all HSP objects.
1357 : Get the numbers of HSPs for the current hit.
1358 Example : @hsps = $sbjct_object->hsps();
1359 : $num = $sbjct_object->hsps(); # alternatively, use num_hsps()
1360 Returns : Array context : list of Bio::Tools::Blast::HSP.pm objects.
1361 : Scalar context: integer (number of HSPs).
1362 : (Equivalent to num_hsps()).
1363 Argument : n/a. Relies on wantarray
1364 Throws : Exception if the HSPs have not been collected.
1365
1366 See Also : L<hsp()|hsp>, L<num_hsps()|num_hsps>, L<_set_hsps()|_set_hsps>
1367
1368 =cut
1369
1370 #---------
1371 sub hsps {
1372 #---------
1373 my $self = shift;
1374
1375 if (not ref $self->{'_hsps'}) {
1376 $self->throw("Can't get HSPs: data not collected.");
1377 }
1378
1379 return wantarray
1380 # returning list containing all HSPs.
1381 ? @{$self->{'_hsps'}}
1382 # returning number of HSPs.
1383 : scalar(@{$self->{'_hsps'}});
1384 }
1385
1386
1387
1388 =head2 hsp
1389
1390 Usage : $sbjct_object->hsp( [string] );
1391 Purpose : Get a single HSP.pm object for the present Sbjct.pm object.
1392 Example : $hspObj = $sbjct_object->hsp; # same as 'best'
1393 : $hspObj = $sbjct_object->hsp('best');
1394 : $hspObj = $sbjct_object->hsp('worst');
1395 Returns : Object reference for a Bio::Tools::Blast::HSP.pm object.
1396 Argument : String (or no argument).
1397 : No argument (default) = highest scoring HSP (same as 'best').
1398 : 'best' or 'first' = highest scoring HSP.
1399 : 'worst' or 'last' = lowest scoring HSP.
1400 Throws : Exception if the HSPs have not been collected.
1401 : Exception if an unrecognized argument is used.
1402
1403 See Also : L<hsps()|hsps>, L<num_hsps()|num_hsps>, L<_set_hsps()|_set_hsps>
1404
1405 =cut
1406
1407 #----------
1408 sub hsp {
1409 #----------
1410 my( $self, $option ) = @_;
1411 $option ||= 'best';
1412
1413 if (not ref $self->{'_hsps'}) {
1414 $self->throw("Can't get HSPs: data not collected.");
1415 }
1416
1417 my @hsps = @{$self->{'_hsps'}};
1418
1419 return $hsps[0] if $option =~ /best|first|1/i;
1420 return $hsps[$#hsps] if $option =~ /worst|last/i;
1421
1422 $self->throw("Can't get HSP for: $option",
1423 "Valid arguments: 'best', 'worst'");
1424 }
1425
1426
1427
1428 =head2 num_hsps
1429
1430 Usage : $sbjct_object->num_hsps();
1431 Purpose : Get the number of HSPs for the present Blast hit.
1432 Example : $nhsps = $sbjct_object->num_hsps();
1433 Returns : Integer
1434 Argument : n/a
1435 Throws : Exception if the HSPs have not been collected.
1436
1437 See Also : L<hsps()|hsps>
1438
1439 =cut
1440
1441 #-------------
1442 sub num_hsps {
1443 #-------------
1444 my $self = shift;
1445
1446 if (not defined $self->{'_hsps'}) {
1447 $self->throw("Can't get HSPs: data not collected.");
1448 }
1449
1450 return scalar(@{$self->{'_hsps'}});
1451 }
1452
1453
1454
1455 =head2 length
1456
1457 Usage : $sbjct_object->length();
1458 Purpose : Get the total length of the hit sequence.
1459 Example : $len = $sbjct_object->length();
1460 Returns : Integer
1461 Argument : n/a
1462 Throws : n/a
1463 Comments : Developer note: when using the built-in length function within
1464 : this module, call it as CORE::length().
1465
1466 See Also : L<logical_length()|logical_length>, L<length_aln()|length_aln>
1467
1468 =cut
1469
1470 #-----------
1471 sub length {
1472 #-----------
1473 my $self = shift;
1474 $self->{'_length'};
1475 }
1476
1477
1478 =head2 logical_length
1479
1480 Usage : $sbjct_object->logical_length( [seq_type] );
1481 : (mostly intended for internal use).
1482 Purpose : Get the logical length of the hit sequence.
1483 : If the Blast is a TBLASTN or TBLASTX, the returned length
1484 : is the length of the would-be amino acid sequence (length/3).
1485 : For all other BLAST flavors, this function is the same as length().
1486 Example : $len = $sbjct_object->logical_length();
1487 Returns : Integer
1488 Argument : seq_type = 'query' or 'sbjct' (default = 'query')
1489 Throws : n/a
1490 Comments : This is important for functions like frac_aligned_query()
1491 : which need to operate in amino acid coordinate space when dealing
1492 : with [T]BLAST[NX] type reports.
1493
1494 See Also : L<length()|length>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>
1495
1496 =cut
1497
1498 #--------------------
1499 sub logical_length {
1500 #--------------------
1501 my $self = shift;
1502 my $seqType = shift || 'query';
1503
1504 # Return logical sbjct length
1505 $seqType eq 'sbjct' and return
1506 $self->{'_logical_length'} || $self->{'_length'};
1507
1508 # Otherwise, return logical query length
1509 my $qlen = $self->parent->length;
1510
1511 # Adjust length based on BLAST flavor.
1512 my $prog = $self->parent->program;
1513 if($prog =~ /T?BLASTX/ ) {
1514 $qlen /= 3;
1515 }
1516 return $qlen;
1517 }
1518
1519
1520
1521 =head2 length_aln
1522
1523 Usage : $sbjct_object->length_aln( [seq_type] );
1524 Purpose : Get the total length of the aligned region for query or sbjct seq.
1525 : This number will include all HSPs
1526 Example : $len = $sbjct_object->length_aln(); # default = query
1527 : $lenAln = $sbjct_object->length_aln('query');
1528 Returns : Integer
1529 Argument : seq_Type = 'query' | 'sbjct' (Default = 'query')
1530 Throws : Exception if the argument is not recognized.
1531 Comments : This method will report the logical length of the alignment,
1532 : meaning that for TBLAST[NX] reports, the length is reported
1533 : using amino acid coordinate space (i.e., nucleotides / 3).
1534 :
1535 : This method requires that all HSPs be tiled. If they have not
1536 : already been tiled, they will be tiled first.
1537 : If you don't want the tiled data, iterate through each HSP
1538 : calling length() on each (use hsps() to get the HSPs).
1539
1540 See Also : L<length()|length>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>, L<gaps()|gaps>, L<_tile_hsps()|_tile_hsps>, B<Bio::Tools::Blast::HSP::length()>
1541
1542 =cut
1543
1544 #---------------'
1545 sub length_aln {
1546 #---------------
1547 my( $self, $type ) = @_;
1548
1549 $type ||= 'query';
1550
1551 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1552
1553 my $data = $self->{'_length_aln_'.$type};
1554
1555 ## If we don't have data, figure out what went wrong.
1556 if(!$data) {
1557 $self->throw("Can't get length aln for sequence type \"$type\"",
1558 "Valid types are 'query', 'sbjct'");
1559 }
1560 $data;
1561 }
1562
1563
1564 =head2 gaps
1565
1566 Usage : $sbjct_object->gaps( [seq_type] );
1567 Purpose : Get the number of gaps in the aligned query, sbjct, or both sequences.
1568 : Data is summed across all HSPs.
1569 Example : $qgaps = $sbjct_object->gaps('query');
1570 : $sgaps = $sbjct_object->gaps('sbjct');
1571 : $tgaps = $sbjct_object->gaps(); # default = total (query + sbjct)
1572 Returns : scalar context: integer
1573 : array context without args: two-element list of integers
1574 : (queryGaps, sbjctGaps)
1575 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1576 Argument : seq_type: 'query' | 'sbjct' | 'total' | 'list' (default = 'total')
1577 Throws : n/a
1578 Comments : If you need data for each HSP, use hsps() and then interate
1579 : through each HSP object.
1580 : This method requires that all HSPs be tiled. If they have not
1581 : already been tiled, they will be tiled first.
1582 : Not relying on wantarray since that will fail in situations
1583 : such as printf "%d", $hit->gaps() in which you might expect to
1584 : be printing the total gaps, but evaluates to array context.
1585
1586 See Also : L<length_aln()|length_aln>
1587
1588 =cut
1589
1590 #----------
1591 sub gaps {
1592 #----------
1593 my( $self, $seqType ) = @_;
1594
1595 $seqType ||= (wantarray ? 'list' : 'total');
1596
1597 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1598
1599 $seqType = lc($seqType);
1600
1601 if($seqType =~ /list|array/i) {
1602 return ($self->{'_gaps_query'}, $self->{'_gaps_sbjct'});
1603 }
1604
1605 if($seqType eq 'total') {
1606 return ($self->{'_gaps_query'} + $self->{'_gaps_sbjct'}) || 0;
1607 } else {
1608 return $self->{'_gaps_'.$seqType} || 0;
1609 }
1610 }
1611
1612
1613
1614 =head2 matches
1615
1616 Usage : $sbjct_object->matches( [class] );
1617 Purpose : Get the total number of identical or conserved matches
1618 : (or both) across all HSPs.
1619 : (Note: 'conservative' matches are indicated as 'positives'
1620 : in the Blast report.)
1621 Example : ($id,$cons) = $sbjct_object->matches(); # no argument
1622 : $id = $sbjct_object->matches('id');
1623 : $cons = $sbjct_object->matches('cons');
1624 Returns : Integer or a 2-element array of integers
1625 Argument : class = 'id' | 'cons' OR none.
1626 : If no argument is provided, both identical and conservative
1627 : numbers are returned in a two element list.
1628 : (Other terms can be used to refer to the conservative
1629 : matches, e.g., 'positive'. All that is checked is whether or
1630 : not the supplied string starts with 'id'. If not, the
1631 : conservative matches are returned.)
1632 Throws : Exception if the requested data cannot be obtained.
1633 Comments : If you need data for each HSP, use hsps() and then interate
1634 : through the HSP objects.
1635 : Does not rely on wantarray to return a list. Only checks for
1636 : the presence of an argument (no arg = return list).
1637
1638 See Also : B<Bio::Tools::Blast::HSP::matches()>, L<hsps()|hsps>
1639
1640 =cut
1641
1642 #---------------
1643 sub matches {
1644 #---------------
1645 my( $self, $arg) = @_;
1646 my(@data,$data);
1647
1648 if(!$arg) {
1649 @data = ($self->{'_totalIdentical'}, $self->{'_totalConserved'});
1650
1651 return @data if @data;
1652
1653 } else {
1654
1655 if($arg =~ /^id/i) {
1656 $data = $self->{'_totalIdentical'};
1657 } else {
1658 $data = $self->{'_totalConserved'};
1659 }
1660 return $data if $data;
1661 }
1662
1663 ## Something went wrong if we make it to here.
1664 $self->throw("Can't get identical or conserved data: no data.");
1665 }
1666
1667
1668 =head2 start
1669
1670 Usage : $sbjct->start( [seq_type] );
1671 Purpose : Gets the start coordinate for the query, sbjct, or both sequences
1672 : in the Sbjct object. If there is more than one HSP, the lowest start
1673 : value of all HSPs is returned.
1674 Example : $qbeg = $sbjct->start('query');
1675 : $sbeg = $sbjct->start('sbjct');
1676 : ($qbeg, $sbeg) = $sbjct->start();
1677 Returns : scalar context: integer
1678 : array context without args: list of two integers (queryStart, sbjctStart)
1679 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1680 Argument : In scalar context: seq_type = 'query' or 'sbjct'
1681 : (case insensitive). If not supplied, 'query' is used.
1682 Throws : n/a
1683 Comments : This method requires that all HSPs be tiled. If there is more than one
1684 : HSP and they have not already been tiled, they will be tiled first.
1685 : Remember that the start and end coordinates of all HSPs are
1686 : normalized so that start < end. Strand information can only be
1687 : obtained on an HSP-by-HSP basis by calling $hsp->strand().
1688
1689 See Also : L<end()|end>, L<range()|range>, L<HSP Tiling and Ambiguous Alignments>, B<Bio::Tools::Blast::HSP::start>()
1690
1691 =cut
1692
1693 #----------
1694 sub start {
1695 #----------
1696 my ($self, $seqType) = @_;
1697
1698 $seqType ||= (wantarray ? 'list' : 'query');
1699
1700 # If there is only one HSP, defer this call to the solitary HSP.
1701 if($self->num_hsps == 1) {
1702 return $self->hsp->start($seqType);
1703 } else {
1704 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1705 if($seqType =~ /list|array/i) {
1706 return ($self->{'_queryStart'}, $self->{'_sbjctStart'});
1707 } else {
1708 ## Sensitive to member name changes.
1709 $seqType = "_\L$seqType\E";
1710 return $self->{$seqType.'Start'};
1711 }
1712 }
1713 }
1714
1715
1716 =head2 end
1717
1718 Usage : $sbjct->end( [seq_type] );
1719 Purpose : Gets the end coordinate for the query, sbjct, or both sequences
1720 : in the Sbjct object. If there is more than one HSP, the largest end
1721 : value of all HSPs is returned.
1722 Example : $qend = $sbjct->end('query');
1723 : $send = $sbjct->end('sbjct');
1724 : ($qend, $send) = $sbjct->end();
1725 Returns : scalar context: integer
1726 : array context without args: list of two integers (queryEnd, sbjctEnd)
1727 : Array context can be "induced" by providing an argument of 'list' or 'array'.
1728 Argument : In scalar context: seq_type = 'query' or 'sbjct'
1729 : (case insensitive). If not supplied, 'query' is used.
1730 Throws : n/a
1731 Comments : This method requires that all HSPs be tiled. If there is more than one
1732 : HSP and they have not already been tiled, they will be tiled first.
1733 : Remember that the start and end coordinates of all HSPs are
1734 : normalized so that start < end. Strand information can only be
1735 : obtained on an HSP-by-HSP basis by calling $hsp->strand().
1736
1737 See Also : L<start()|start>, L<range()|range>, L<HSP Tiling and Ambiguous Alignments>, B<Bio::Tools::Blast::HSP::end>()
1738
1739 =cut
1740
1741 #----------
1742 sub end {
1743 #----------
1744 my ($self, $seqType) = @_;
1745
1746 $seqType ||= (wantarray ? 'list' : 'query');
1747
1748 # If there is only one HSP, defer this call to the solitary HSP.
1749 if($self->num_hsps == 1) {
1750 return $self->hsp->end($seqType);
1751 } else {
1752 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1753 if($seqType =~ /list|array/i) {
1754 return ($self->{'_queryStop'}, $self->{'_sbjctStop'});
1755 } else {
1756 ## Sensitive to member name changes.
1757 $seqType = "_\L$seqType\E";
1758 return $self->{$seqType.'Stop'};
1759 }
1760 }
1761 }
1762
1763 =head2 range
1764
1765 Usage : $sbjct->range( [seq_type] );
1766 Purpose : Gets the (start, end) coordinates for the query or sbjct sequence
1767 : in the HSP alignment.
1768 Example : ($qbeg, $qend) = $sbjct->range('query');
1769 : ($sbeg, $send) = $sbjct->range('sbjct');
1770 Returns : Two-element array of integers
1771 Argument : seq_type = string, 'query' or 'sbjct' (default = 'query')
1772 : (case insensitive).
1773 Throws : n/a
1774
1775 See Also : L<start()|start>, L<end()|end>
1776
1777 =cut
1778
1779 #----------
1780 sub range {
1781 #----------
1782 my ($self, $seqType) = @_;
1783 $seqType ||= 'query';
1784 return ($self->start($seqType), $self->end($seqType));
1785 }
1786
1787
1788 =head2 frac_identical
1789
1790 Usage : $sbjct_object->frac_identical( [seq_type] );
1791 Purpose : Get the overall fraction of identical positions across all HSPs.
1792 : The number refers to only the aligned regions and does not
1793 : account for unaligned regions in between the HSPs, if any.
1794 Example : $frac_iden = $sbjct_object->frac_identical('query');
1795 Returns : Float (2-decimal precision, e.g., 0.75).
1796 Argument : seq_type: 'query' | 'sbjct' | 'total'
1797 : default = 'total' (but see comments below).
1798 Throws : n/a
1799 Comments : Different versions of Blast report different values for the total
1800 : length of the alignment. This is the number reported in the
1801 : denominators in the stats section:
1802 : "Identical = 34/120 Positives = 67/120".
1803 : BLAST-GP uses the total length of the alignment (with gaps)
1804 : WU-BLAST uses the length of the query sequence (without gaps).
1805 : Therefore, when called without an argument or an argument of 'total',
1806 : this method will report different values depending on the
1807 : version of BLAST used.
1808 :
1809 : To get the fraction identical among only the aligned residues,
1810 : ignoring the gaps, call this method with an argument of 'query'
1811 : or 'sbjct'.
1812 :
1813 : If you need data for each HSP, use hsps() and then iterate
1814 : through the HSP objects.
1815 : This method requires that all HSPs be tiled. If they have not
1816 : already been tiled, they will be tiled first.
1817
1818 See Also : L<frac_conserved()|frac_conserved>, L<frac_aligned_query()|frac_aligned_query>, L<matches()|matches>, L<_tile_hsps()|_tile_hsps>
1819
1820 =cut
1821
1822 #------------------
1823 sub frac_identical {
1824 #------------------
1825 my ($self, $seqType) = @_;
1826 $seqType ||= 'total';
1827
1828 ## Sensitive to member name format.
1829 $seqType = lc($seqType);
1830
1831 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1832
1833 sprintf( "%.2f", $self->{'_totalIdentical'}/$self->{'_length_aln_'.$seqType});
1834 }
1835
1836
1837
1838 =head2 frac_conserved
1839
1840 Usage : $sbjct_object->frac_conserved( [seq_type] );
1841 Purpose : Get the overall fraction of conserved positions across all HSPs.
1842 : The number refers to only the aligned regions and does not
1843 : account for unaligned regions in between the HSPs, if any.
1844 Example : $frac_cons = $sbjct_object->frac_conserved('sbjct');
1845 Returns : Float (2-decimal precision, e.g., 0.75).
1846 Argument : seq_type: 'query' | 'sbjct' | 'total'
1847 : default = 'total' (but see comments below).
1848 Throws : n/a
1849 Comments : Different versions of Blast report different values for the total
1850 : length of the alignment. This is the number reported in the
1851 : denominators in the stats section:
1852 : "Identical = 34/120 Positives = 67/120".
1853 : BLAST-GP uses the total length of the alignment (with gaps)
1854 : WU-BLAST uses the length of the query sequence (without gaps).
1855 : Therefore, when called without an argument or an argument of 'total',
1856 : this method will report different values depending on the
1857 : version of BLAST used.
1858 :
1859 : To get the fraction conserved among only the aligned residues,
1860 : ignoring the gaps, call this method with an argument of 'query'
1861 : or 'sbjct'.
1862 :
1863 : If you need data for each HSP, use hsps() and then interate
1864 : through the HSP objects.
1865 : This method requires that all HSPs be tiled. If they have not
1866 : already been tiled, they will be tiled first.
1867
1868 See Also : L<frac_identical()|frac_identical>, L<matches()|matches>, L<_tile_hsps()|_tile_hsps>
1869
1870 =cut
1871
1872 #--------------------
1873 sub frac_conserved {
1874 #--------------------
1875 my ($self, $seqType) = @_;
1876 $seqType ||= 'total';
1877
1878 ## Sensitive to member name format.
1879 $seqType = lc($seqType);
1880
1881 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1882
1883 sprintf( "%.2f", $self->{'_totalConserved'}/$self->{'_length_aln_'.$seqType});
1884 }
1885
1886
1887
1888
1889 =head2 frac_aligned_query
1890
1891 Usage : $sbjct_object->frac_aligned_query();
1892 Purpose : Get the fraction of the query sequence which has been aligned
1893 : across all HSPs (not including intervals between non-overlapping
1894 : HSPs).
1895 Example : $frac_alnq = $sbjct_object->frac_aligned_query();
1896 Returns : Float (2-decimal precision, e.g., 0.75).
1897 Argument : n/a
1898 Throws : n/a
1899 Comments : If you need data for each HSP, use hsps() and then interate
1900 : through the HSP objects.
1901 : To compute the fraction aligned, the logical length of the query
1902 : sequence is used, meaning that for [T]BLASTX reports, the
1903 : full length of the query sequence is converted into amino acids
1904 : by dividing by 3. This is necessary because of the way
1905 : the lengths of aligned sequences are computed.
1906 : This method requires that all HSPs be tiled. If they have not
1907 : already been tiled, they will be tiled first.
1908
1909 See Also : L<frac_aligned_hit()|frac_aligned_hit>, L<_tile_hsps()|_tile_hsps>, L<logical_length()|logical_length>, L<length_aln()|length_aln>
1910
1911 =cut
1912
1913 #----------------------
1914 sub frac_aligned_query {
1915 #----------------------
1916 my $self = shift;
1917
1918 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1919
1920 sprintf( "%.2f", $self->{'_length_aln_query'}/$self->logical_length('query'));
1921 }
1922
1923
1924
1925 =head2 frac_aligned_hit
1926
1927 Usage : $sbjct_object->frac_aligned_hit();
1928 Purpose : Get the fraction of the hit (sbjct) sequence which has been aligned
1929 : across all HSPs (not including intervals between non-overlapping
1930 : HSPs).
1931 Example : $frac_alnq = $sbjct_object->frac_aligned_hit();
1932 Returns : Float (2-decimal precision, e.g., 0.75).
1933 Argument : n/a
1934 Throws : n/a
1935 Comments : If you need data for each HSP, use hsps() and then interate
1936 : through the HSP objects.
1937 : To compute the fraction aligned, the logical length of the sbjct
1938 : sequence is used, meaning that for TBLAST[NX] reports, the
1939 : full length of the sbjct sequence is converted into amino acids
1940 : by dividing by 3. This is necessary because of the way
1941 : the lengths of aligned sequences are computed.
1942 : This method requires that all HSPs be tiled. If they have not
1943 : already been tiled, they will be tiled first.
1944
1945 See Also : L<frac_aligned_query()|frac_aligned_query>, L<matches()|matches>, L<_tile_hsps()|_tile_hsps>, L<logical_length()|logical_length>, L<length_aln()|length_aln>
1946
1947 =cut
1948
1949 #--------------------
1950 sub frac_aligned_hit {
1951 #--------------------
1952 my $self = shift;
1953
1954 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1955
1956 sprintf( "%.2f", $self->{'_length_aln_sbjct'}/$self->logical_length('sbjct'));
1957 }
1958
1959 # Safety-net methods for those who try don't read or remember the API.
1960 # Redirecting to the proper method. These 'sbjct' versions may be more
1961 # consistent with the API of this module since there are numerous other
1962 # instances of using 'sbjct' in arguments. However, 'sbjct' is a bit tech-ee.
1963
1964 #-----------------------
1965 sub frac_aligned_sbjct { my $self=shift; $self->frac_aligned_hit(@_); }
1966 #-----------------------
1967 sub num_unaligned_sbjct { my $self=shift; $self->num_unaligned_hit(@_); }
1968 #-----------------------
1969
1970
1971
1972 =head2 num_unaligned_hit
1973
1974 Usage : $sbjct_object->num_unaligned_hit();
1975 Purpose : Get the number of the unaligned residues in the hit sequence.
1976 : Sums across all all HSPs.
1977 Example : $num_unaln = $sbjct_object->num_unaligned_hit();
1978 Returns : Integer
1979 Argument : n/a
1980 Throws : n/a
1981 Comments : See notes regarding logical lengths in the comments for frac_aligned_hit().
1982 : They apply here as well.
1983 : If you need data for each HSP, use hsps() and then interate
1984 : through the HSP objects.
1985 : This method requires that all HSPs be tiled. If they have not
1986 : already been tiled, they will be tiled first.
1987
1988 See Also : L<num_unaligned_query()|num_unaligned_query>, L<_tile_hsps()|_tile_hsps>, L<frac_aligned_hit()|frac_aligned_hit>
1989
1990 =cut
1991
1992 #---------------------
1993 sub num_unaligned_hit {
1994 #---------------------
1995 my $self = shift;
1996
1997 $self->_tile_hsps() if not $self->{'_tile_hsps'};
1998
1999 my $num = $self->logical_length('sbjct') - $self->{'_length_aln_sbjct'};
2000 ($num < 0 ? 0 : $num );
2001 }
2002
2003
2004 =head2 num_unaligned_query
2005
2006 Usage : $sbjct_object->num_unaligned_query();
2007 Purpose : Get the number of the unaligned residues in the query sequence.
2008 : Sums across all all HSPs.
2009 Example : $num_unaln = $sbjct_object->num_unaligned_query();
2010 Returns : Integer
2011 Argument : n/a
2012 Throws : n/a
2013 Comments : See notes regarding logical lengths in the comments for frac_aligned_query().
2014 : They apply here as well.
2015 : If you need data for each HSP, use hsps() and then interate
2016 : through the HSP objects.
2017 : This method requires that all HSPs be tiled. If they have not
2018 : already been tiled, they will be tiled first.
2019
2020 See Also : L<num_unaligned_hit()|num_unaligned_hit>, L<_tile_hsps()|_tile_hsps>, L<frac_aligned_query()|frac_aligned_query>
2021
2022 =cut
2023
2024 #-----------------------
2025 sub num_unaligned_query {
2026 #-----------------------
2027 my $self = shift;
2028
2029 $self->_tile_hsps() if not $self->{'_tile_hsps'};
2030
2031 my $num = $self->logical_length('query') - $self->{'_length_aln_query'};
2032 ($num < 0 ? 0 : $num );
2033 }
2034
2035
2036
2037 =head2 seq_inds
2038
2039 Usage : $hit->seq_inds( seq_type, class, collapse );
2040 Purpose : Get a list of residue positions (indices) across all HSPs
2041 : for identical or conserved residues in the query or sbjct sequence.
2042 Example : @ind = $hit->seq_inds('query', 'identical');
2043 : @ind = $hit->seq_inds('sbjct', 'conserved');
2044 : @ind = $hit->seq_inds('sbjct', 'conserved', 1);
2045 Returns : Array of integers
2046 : May include ranges if collapse is non-zero.
2047 Argument : seq_type = 'query' or 'sbjct' (default = query)
2048 : class = 'identical' or 'conserved' (default = identical)
2049 : (can be shortened to 'id' or 'cons')
2050 : (actually, anything not 'id' will evaluate to 'conserved').
2051 : collapse = boolean, if non-zero, consecutive positions are merged
2052 : using a range notation, e.g., "1 2 3 4 5 7 9 10 11"
2053 : collapses to "1-5 7 9-11". This is useful for
2054 : consolidating long lists. Default = no collapse.
2055 Throws : n/a.
2056
2057 See Also : B<Bio::Tools::Blast::HSP::seq_inds()>
2058
2059 =cut
2060
2061 #-------------
2062 sub seq_inds {
2063 #-------------
2064 my ($self, $seq, $class, $collapse) = @_;
2065
2066 $seq ||= 'query';
2067 $class ||= 'identical';
2068 $collapse ||= 0;
2069
2070 my (@inds, $hsp);
2071 foreach $hsp ($self->hsps) {
2072 # This will merge data for all HSPs together.
2073 push @inds, $hsp->seq_inds($seq, $class);
2074 }
2075
2076 # Need to remove duplicates and sort the merged positions.
2077 if(@inds) {
2078 my %tmp = map { $_, 1 } @inds;
2079 @inds = sort {$a <=> $b} keys %tmp;
2080 }
2081
2082 require Bio::Tools::Blast::HSP;
2083
2084 $collapse ? &Bio::Tools::Blast::HSP::collapse_nums(@inds) : @inds;
2085 }
2086
2087
2088 #####################################################################################
2089 ## INSTANCE METHODS ##
2090 #####################################################################################
2091
2092
2093 =head2 display
2094
2095 Usage : $sbjct_object->display( %named_parameters );
2096 Purpose : Display information about Bio::Tools::Blast::Sbjct.pm data members
2097 Example : $object->display(-SHOW=>'stats');
2098 Argument : Named parameters: -SHOW => 'hsp',
2099 : -WHERE => filehandle (default = STDOUT)
2100 Returns : n/a
2101 Status : Deprecated, Buggy.
2102 : Use Blast::table() or Blast::table_tiled() instead.
2103
2104 See Also : L<_display_stats()|_display_stats>, L<_display_hsps()|_display_hsps>, B<Bio::Root::Object>::display
2105
2106 =cut
2107
2108 #------------
2109 sub display {
2110 #------------
2111 my( $self, %param) = @_;
2112
2113 $param{-HEADER} = 0;
2114 $self->SUPER::display(%param);
2115
2116 $self->show =~ /hsp/i and $self->_display_hsps( %param);
2117 }
2118
2119
2120 =head2 _display_stats
2121
2122 Usage : n/a; called automatically by display()
2123 Purpose : Display information about Bio::Tools::Blast.pm data members.
2124 : Not tab-delimited.
2125 : Prints the rank, name, database, score, p, n, length
2126 : of the hit sequence, length of the aligned region,
2127 : fraction identical, fraction conserved, and the fraction aligned
2128 : for both the query and hit sequences.
2129 Example : n/a
2130 Argument : one argument = filehandle object.
2131 Returns : printf call.
2132 Status : Deprecated, Buggy.
2133 : Use Blast::table() or Blast::table_tiled() instead.
2134
2135 See Also : L<display()|display>
2136
2137 =cut
2138
2139 #-------------------
2140 sub _display_stats {
2141 #-------------------
2142 my( $self, $OUT) = @_;
2143 my $layout = $self->parent->_layout();
2144
2145 if($layout == 1) {
2146 printf( $OUT "%-3d %-20s %-11s %-5d %-5d %-9.1e %-9.1e %-4d %-3d %-5d %-5d %-5s %-6.2f %-6.2f %-4d(%.2f) %-4d(%.2f)\n",
2147 $self->rank(), $self->name(),
2148 ($self->database() || 'UNKNOWN DB') ,
2149 $self->score(),$self->bits(),$self->p(),$self->expect(),
2150 $self->gaps(), $self->n(),
2151 $self->length(), $self->length_aln('query'),
2152 $self->ambiguous_aln(),
2153 $self->frac_aligned_query, $self->frac_aligned_hit,
2154 $self->matches('iden'), $self->frac_identical('query'),
2155 $self->matches('cons'), $self->frac_conserved('query'));
2156 } else {
2157 printf( $OUT "%-3d %-20s %-11s %-5d %-5d %-9.1e %-4d %-3d %-5d %-5d %-5s %-6.2f %-6.2f %-4d(%.2f) %-4d(%.2f)\n",
2158 $self->rank(), $self->name(),
2159 ($self->database() || 'UNKNOWN DB'),
2160 $self->score(),$self->bits(),$self->expect(),
2161 $self->gaps(), $self->num_hsps,
2162 $self->length(), $self->length_aln('query'),
2163 $self->ambiguous_aln(),
2164 $self->frac_aligned_query, $self->frac_aligned_hit,
2165 $self->matches('iden'), $self->frac_identical('query'),
2166 $self->matches('cons'), $self->frac_conserved('query') );
2167 }
2168
2169 }
2170
2171
2172 =head2 _display_hsps
2173
2174 Usage : n/a; called automatically by display()
2175 Purpose : Display information about each HSP in the current BLAST hit.
2176 Example : n/a
2177 Argument : one argument = filehandle object.
2178 Returns : printf call.
2179 Status : Experimental
2180
2181 See Also : L<display()|display>, B<Bio::Tools::Blast::HSP::display()>
2182
2183 =cut
2184
2185 #----------------
2186 sub _display_hsps {
2187 #----------------
2188 my( $self, %param) = @_;
2189 my $OUT = $self->fh();
2190 my $hspCount = 0;
2191 my $reply = undef;
2192
2193 not defined $self->{'_hsps'} and do{ print $OUT "\nHSP data not loaded.\n\n"; return; };
2194
2195 # print $OUT "\n",$self->num_hsps, " HSPs\n\n";
2196
2197 my($hsp);
2198 foreach $hsp ( $self->hsps() ) {
2199 $hspCount++;
2200 print $OUT "\n ", '-'x25, "< HSP #$hspCount >", '-'x25, "\n";
2201
2202 $hsp->display( %param );
2203
2204 if( $hspCount < $self->num_hsps ) {
2205 print "\n\n\t--------> <RET> FOR NEXT HSP, q TO QUIT <--------\n";
2206 chop( $reply = <STDIN>);
2207 $reply =~ /^q/i and return;
2208 }
2209 }
2210 }
2211
2212
2213
2214 =head2 homol_data
2215
2216 Usage : $data = $sbjct_object->homo_data( %named_params );
2217 Purpose : Gets specific similarity data about all HSPs.
2218 Returns : String
2219 Argument : named parameters forwarded to Bio::Tools::Blast::HSP::homol_data().
2220 Throws : n/a
2221 Status : Experimental
2222 Comments : This is an experimental method used for obtaining an
2223 : indication of:
2224 : 1) how many HSPs are in a Blast alignment
2225 : 2) how strong the similarity is between sequences in the HSP
2226 : 3) the endpoints of the alignment (sequence monomer numbers)
2227 : "Homology data" for each HSP is in the format:
2228 : "<integer> <start> <stop>"
2229 : Data for different HSPs are tab-delimited.
2230
2231 See Also : B<Bio::Tools::Blast::homol_data()>, B<Bio::Tools::Blast::HSP::homol_data()>
2232
2233 =cut
2234
2235 #---------------
2236 sub homol_data {
2237 #---------------
2238 my ($self,%param) = @_;
2239 my $data = $self->name();
2240
2241 foreach ($self->hsps) {
2242 $data .="\t".$_->homol_data(%param);
2243 }
2244 ## Record ambiguous alignment status.
2245 $data .= "\t".$self->ambiguous_aln();
2246 $data;
2247 }
2248
2249
2250 =head2 is_signif
2251
2252 Usage : $sbjct_object->is_signif();
2253 Purpose : Determine if the given BLAST hit is significant.
2254 Example :
2255 Returns : Boolean
2256 Argument : n/a
2257 Throws : n/a
2258 Comments : Uses criteria defined in the parent Blast.pm object
2259 : to assess significance. Currently, only relies on
2260 : P-value and length criteria.
2261 : This mehtod is largely obsolete since are hits are now by
2262 : definition significant.
2263
2264 =cut
2265
2266 #---------------
2267 sub is_signif {
2268 #---------------
2269 my $self = shift;
2270 return ($self->{'_significance'} <= $self->parent->signif and
2271 $self->length > $self->parent->signif_len);
2272 }
2273
2274 #####################################################################################
2275 ## CLASS METHODS ##
2276 #####################################################################################
2277
2278 =head1 CLASS METHODS
2279
2280 =head2 get_exponent
2281
2282 Usage : &get_exponent( number );
2283 Purpose : Determines the power of 10 exponent of an integer, float,
2284 : or scientific notation number.
2285 Example : &get_exponent("4.0e-206");
2286 : &get_exponent("0.00032");
2287 : &get_exponent("10.");
2288 : &get_exponent("1000.0");
2289 : &get_exponent("e+83");
2290 Argument : Float, Integer, or scientific notation number
2291 Returns : Integer representing the exponent part of the number (+ or -).
2292 : If argument == 0 (zero), return value is "-999".
2293 Comments : Exponents are rounded up (less negative) if the mantissa is >= 5.
2294 : Exponents are rounded down (more negative) if the mantissa is <= -5.
2295 : This method probably belongs in a more general utility class.
2296
2297 =cut
2298
2299 #------------------
2300 sub get_exponent {
2301 #------------------
2302 my $data = shift;
2303
2304 my($num, $exp) = split /[eE]/, $data;
2305
2306 if( defined $exp) {
2307 $num = 1 if not $num;
2308 $num >= 5 and $exp++;
2309 $num <= -5 and $exp--;
2310 } elsif( $num == 0) {
2311 $exp = -999;
2312 } elsif( not $num =~ /\./) {
2313 $exp = CORE::length($num) -1;
2314 } else {
2315 $exp = 0;
2316 $num .= '0' if $num =~ /\.$/;
2317 my ($c);
2318 my $rev = 0;
2319 if($num !~ /^0/) {
2320 $num = reverse($num);
2321 $rev = 1;
2322 }
2323 do { $c = chop($num);
2324 $c == 0 && $exp++;
2325 } while( $c ne '.');
2326
2327 $exp = -$exp if $num == 0 and not $rev;
2328 $exp -= 1 if $rev;
2329 }
2330 return $exp;
2331 }
2332
2333 1;
2334 __END__
2335
2336 #####################################################################################
2337 # END OF CLASS #
2338 #####################################################################################
2339
2340
2341 =head1 FOR DEVELOPERS ONLY
2342
2343 =head2 Data Members
2344
2345 Information about the various data members of this module is provided for those
2346 wishing to modify or understand the code. Two things to bear in mind:
2347
2348 =over 4
2349
2350 =item 1 Do NOT rely on these in any code outside of this module.
2351
2352 All data members are prefixed with an underscore to signify that they are private.
2353 Always use accessor methods. If the accessor doesn't exist or is inadequate,
2354 create or modify an accessor (and let me know, too!). (An exception to this might
2355 be for HSP.pm which is more tightly coupled to Sbjct.pm and
2356 may access Sbjct data members directly for efficiency purposes, but probably
2357 should not).
2358
2359 =item 2 This documentation may be incomplete and out of date.
2360
2361 It is easy for these data member descriptions to become obsolete as
2362 this module is still evolving. Always double check this info and search
2363 for members not described here.
2364
2365 =back
2366
2367 An instance of Bio::Tools::Blast::Sbjct.pm is a blessed reference to a hash containing
2368 all or some of the following fields:
2369
2370 FIELD VALUE
2371 --------------------------------------------------------------
2372 _hsps : Array ref for a list of Bio::Tools::Blast::HSP.pm objects.
2373 :
2374 _db : Database identifier from the summary line.
2375 :
2376 _desc : Description data for the hit from the summary line.
2377 :
2378 _length : Total length of the hit sequence.
2379 :
2380 _score : BLAST score.
2381 :
2382 _bits : BLAST score (in bits). Matrix-independent.
2383 :
2384 _p : BLAST P value. Obtained from summary section. (Blast1/WU-Blast only)
2385 :
2386 _expect : BLAST Expect value. Obtained from summary section.
2387 :
2388 _n : BLAST N value (number of HSPs) (Blast1/WU-Blast2 only)
2389 :
2390 _frame : Reading frame for TBLASTN and TBLASTX analyses.
2391 :
2392 _totalIdentical: Total number of identical aligned monomers.
2393 :
2394 _totalConserved: Total number of conserved aligned monomers (a.k.a. "positives").
2395 :
2396 _overlap : Maximum number of overlapping residues between adjacent HSPs
2397 : before considering the alignment to be ambiguous.
2398 :
2399 _ambiguous_aln : Boolean. True if the alignment of all HSPs is ambiguous.
2400 :
2401 _length_aln_query : Length of the aligned region of the query sequence.
2402 :
2403 _length_aln_sbjct : Length of the aligned region of the sbjct sequence.
2404
2405
2406 INHERITED DATA MEMBERS
2407 ----------------------
2408 _name : From Bio::Root::Object.pm. String representing the name of the
2409 : sbjct sequence obtained from the BLAST report.
2410 :
2411 _parent : From Bio::Root::Object.pm. This member contains a reference to the
2412 : Bio::Tools::Blast.pm object to which this hit belongs.
2413
2414
2415 =cut
2416
2417 1;