0
|
1 #-----------------------------------------------------------------
|
|
2 # $Id: BlastHit.pm,v 1.13 2002/10/22 09:36:19 sac Exp $
|
|
3 #
|
|
4 # BioPerl module Bio::Search::Hit::BlastHit
|
|
5 #
|
|
6 # (This module was originally called Bio::Tools::Blast::Sbjct)
|
|
7 #
|
|
8 # Cared for by Steve Chervitz <sac@bioperl.org>
|
|
9 #
|
|
10 # You may distribute this module under the same terms as perl itself
|
|
11 #-----------------------------------------------------------------
|
|
12
|
|
13 ## POD Documentation:
|
|
14
|
|
15 =head1 NAME
|
|
16
|
|
17 Bio::Search::Hit::BlastHit - Bioperl BLAST Hit object
|
|
18
|
|
19 =head1 SYNOPSIS
|
|
20
|
|
21 The construction of BlastHit objects is performed by
|
|
22 Bio::SearchIO::blast::BlastHitFactory in a process that is
|
|
23 orchestrated by the Blast parser (B<Bio::SearchIO::blast::blast>).
|
|
24 The resulting BlastHits are then accessed via
|
|
25 B<Bio::Search::Result::BlastResult>). Therefore, you do not need to
|
|
26 use B<Bio::Search::Hit::BlastHit>) directly. If you need to construct
|
|
27 BlastHits directly, see the new() function for details.
|
|
28
|
|
29 For B<Bio::SearchIO> BLAST parsing usage examples, see the
|
|
30 B<examples/search-blast> directory of the Bioperl distribution.
|
|
31
|
|
32
|
|
33 =head1 DESCRIPTION
|
|
34
|
|
35 The Bio::Search::Hit::BlastHit.pm module encapsulates data and methods
|
|
36 for manipulating "hits" from a BLAST report. A BLAST hit is a
|
|
37 collection of HSPs along with other metadata such as sequence name
|
|
38 and score information. Hit objects are accessed via
|
|
39 B<Bio::Search::Result::BlastResult> objects after parsing a BLAST report using
|
|
40 the B<Bio::SearchIO> system.
|
|
41
|
|
42 In Blast lingo, the "sbjct" sequences are all the sequences
|
|
43 in a target database which were compared against a "query" sequence.
|
|
44 The terms "sbjct" and "hit" will be used interchangeably in this module.
|
|
45 All methods that take 'sbjct' as an argument also support 'hit' as a
|
|
46 synonym.
|
|
47
|
|
48 This module supports BLAST versions 1.x and 2.x, gapped and ungapped,
|
|
49 and PSI-BLAST.
|
|
50
|
|
51
|
|
52 =head2 HSP Tiling and Ambiguous Alignments
|
|
53
|
|
54 If a Blast hit has more than one HSP, the Bio::Search::Hit::BlastHit.pm
|
|
55 object has the ability to merge overlapping HSPs into contiguous
|
|
56 blocks. This permits the BlastHit object to sum data across all HSPs
|
|
57 without counting data in the overlapping regions multiple times, which
|
|
58 would happen if data from each overlapping HSP are simply summed. HSP
|
|
59 tiling is performed automatically when methods of the BlastHit object
|
|
60 that rely on tiled data are invoked. These include
|
|
61 L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<gaps()|gaps>,
|
|
62 L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>,
|
|
63 L<num_unaligned_query()|num_unaligned_query>, L<num_unaligned_hit()|num_unaligned_hit>.
|
|
64
|
|
65 It also permits the assessment of an "ambiguous alignment" if the
|
|
66 query (or sbjct) sequences from different HSPs overlap
|
|
67 (see L<ambiguous_aln()|ambiguous_aln>). The existence
|
|
68 of an overlap could indicate a biologically interesting region in the
|
|
69 sequence, such as a repeated domain. The BlastHit object uses the
|
|
70 C<-OVERLAP> parameter to determine when two sequences overlap; if this is
|
|
71 set to 2 -- the default -- then any two sbjct or query HSP sequences
|
|
72 must overlap by more than two residues to get merged into the same
|
|
73 contig and counted as an overlap. See the L<BUGS | BUGS> section below for
|
|
74 "issues" with HSP tiling.
|
|
75
|
|
76
|
|
77 The results of the HSP tiling is reported with the following ambiguity codes:
|
|
78
|
|
79 'q' = Query sequence contains multiple sub-sequences matching
|
|
80 a single region in the sbjct sequence.
|
|
81
|
|
82 's' = Subject (BlastHit) sequence contains multiple sub-sequences matching
|
|
83 a single region in the query sequence.
|
|
84
|
|
85 'qs' = Both query and sbjct sequences contain more than one
|
|
86 sub-sequence with similarity to the other sequence.
|
|
87
|
|
88
|
|
89 For addition information about ambiguous BLAST alignments, see
|
|
90 L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils> and
|
|
91
|
|
92 http://www-genome.stanford.edu/Sacch3D/help/ambig_aln.html
|
|
93
|
|
94 =head1 DEPENDENCIES
|
|
95
|
|
96 Bio::Search::Hit::BlastHit.pm is a concrete class that inherits from
|
|
97 B<Bio::Root::Root> and B<Bio::Search::Hit::HitI>. and relies on
|
|
98 B<Bio::Search::HSP::BlastHSP>.
|
|
99
|
|
100
|
|
101 =head1 BUGS
|
|
102
|
|
103 One consequence of the HSP tiling is that methods that rely on HSP
|
|
104 tiling such as L<frac_identical()|frac_identical>, L<frac_conserved()|frac_conserved>, L<gaps()|gaps>
|
|
105 etc. may report misleading numbers when C<-OVERLAP> is set to a large
|
|
106 number. For example, say we have two HSPs and the query sequence tile
|
|
107 as follows:
|
|
108
|
|
109 1 8 22 30 40 60
|
|
110 Full seq: ------------------------------------------------------------
|
|
111 * ** * **
|
|
112 HSP1: --------------- (6 identical matches)
|
|
113 ** ** **
|
|
114 HSP2: ------------- (6 identical matches)
|
|
115
|
|
116
|
|
117 If C<-OVERLAP> is set to some number over 4, HSP1 and HSP2 will not be
|
|
118 tiled into a single contig and their numbers of identical matches will
|
|
119 be added, giving a total of 12, not 10 if they had be combined into
|
|
120 one contig. This can lead to number greater than 1.0 for methods
|
|
121 L<frac_identical()|frac_identical> and L<frac_conserved()|frac_conserved>. This is less of an issue
|
|
122 with gapped Blast since it tends to combine HSPs that would be listed
|
|
123 separately without gapping. (Fractions E<gt>1.0 can be viewed as a
|
|
124 signal for an interesting alignment that warrants further inspection,
|
|
125 thus turning this bug into a feature :-).
|
|
126
|
|
127 Using large values for C<-OVERLAP> can lead to incorrect numbers
|
|
128 reported by methods that rely on HSP tiling but can be useful if you
|
|
129 care more about detecting ambiguous alignments. Setting C<-OVERLAP>
|
|
130 to zero will lead to the most accurate numbers for the
|
|
131 tiling-dependent methods but will be useless for detecting overlapping
|
|
132 HSPs since all HSPs will appear to overlap.
|
|
133
|
|
134
|
|
135 =head1 SEE ALSO
|
|
136
|
|
137 Bio::Search::HSP::BlastHSP.pm - Blast HSP object.
|
|
138 Bio::Search::Result::BlastResult.pm - Blast Result object.
|
|
139 Bio::Search::Hit::HitI.pm - Interface implemented by BlastHit.pm
|
|
140 Bio::Root::Root.pm - Base class for BlastHit.pm
|
|
141
|
|
142 Links:
|
|
143
|
|
144 http://bio.perl.org/Core/POD/Search/Hit/Blast/BlastHSP.pm.html
|
|
145
|
|
146 http://bio.perl.org/Projects/modules.html - Online module documentation
|
|
147 http://bio.perl.org/Projects/Blast/ - Bioperl Blast Project
|
|
148 http://bio.perl.org/ - Bioperl Project Homepage
|
|
149
|
|
150
|
|
151 =head1 FEEDBACK
|
|
152
|
|
153 =head2 Mailing Lists
|
|
154
|
|
155 User feedback is an integral part of the evolution of this and other
|
|
156 Bioperl modules. Send your comments and suggestions preferably to one
|
|
157 of the Bioperl mailing lists. Your participation is much appreciated.
|
|
158
|
|
159 bioperl-l@bioperl.org - General discussion
|
|
160 http://bio.perl.org/MailList.html - About the mailing lists
|
|
161
|
|
162 =head2 Reporting Bugs
|
|
163
|
|
164 Report bugs to the Bioperl bug tracking system to help us keep track
|
|
165 the bugs and their resolution. Bug reports can be submitted via email
|
|
166 or the web:
|
|
167
|
|
168 bioperl-bugs@bio.perl.org
|
|
169 http://bio.perl.org/bioperl-bugs/
|
|
170
|
|
171 =head1 AUTHOR
|
|
172
|
|
173 Steve Chervitz E<lt>sac@bioperl.orgE<gt>
|
|
174
|
|
175 See L<the FEEDBACK section | FEEDBACK> for where to send bug reports and comments.
|
|
176
|
|
177 =head1 ACKNOWLEDGEMENTS
|
|
178
|
|
179 This software was originally developed in the Department of Genetics
|
|
180 at Stanford University. I would also like to acknowledge my
|
|
181 colleagues at Affymetrix for useful feedback.
|
|
182
|
|
183 =head1 COPYRIGHT
|
|
184
|
|
185 Copyright (c) 1996-2001 Steve Chervitz. All Rights Reserved.
|
|
186
|
|
187 =head1 DISCLAIMER
|
|
188
|
|
189 This software is provided "as is" without warranty of any kind.
|
|
190
|
|
191 =head1 APPENDIX
|
|
192
|
|
193 The rest of the documentation details each of the object methods.
|
|
194 Internal methods are usually preceded with a _
|
|
195
|
|
196 =cut
|
|
197
|
|
198
|
|
199 # Let the code begin...
|
|
200
|
|
201 package Bio::Search::Hit::BlastHit;
|
|
202
|
|
203 use strict;
|
|
204 use Bio::Search::Hit::HitI;
|
|
205 use Bio::Root::Root;
|
|
206 require Bio::Search::BlastUtils;
|
|
207 use vars qw( @ISA %SUMMARY_OFFSET $Revision);
|
|
208
|
|
209 use overload
|
|
210 '""' => \&to_string;
|
|
211
|
|
212 @ISA = qw( Bio::Root::Root Bio::Search::Hit::HitI );
|
|
213
|
|
214 $Revision = '$Id: BlastHit.pm,v 1.13 2002/10/22 09:36:19 sac Exp $'; #'
|
|
215
|
|
216
|
|
217 =head2 new
|
|
218
|
|
219 Usage : $hit = Bio::Search::Hit::BlastHit->new( %named_params );
|
|
220 : Bio::Search::Hit::BlastHit.pm objects are constructed
|
|
221 : automatically by Bio::SearchIO::BlastHitFactory.pm,
|
|
222 : so there is no need for direct instantiation.
|
|
223 Purpose : Constructs a new BlastHit object and Initializes key variables
|
|
224 : for the hit.
|
|
225 Returns : A Bio::Search::Hit::BlastHit object
|
|
226 Argument : Named Parameters:
|
|
227 : Parameter keys are case-insensitive.
|
|
228 : -RAW_DATA => array reference holding raw BLAST report data
|
|
229 : for a single hit. This includes all lines
|
|
230 : within the HSP alignment listing section of a
|
|
231 : traditional BLAST or PSI-BLAST (non-XML) report,
|
|
232 : starting at (or just after) the leading '>'.
|
|
233 : -HOLD_RAW_DATA => boolean, should -RAW_DATA be saved within the object.
|
|
234 : -QUERY_LEN => Length of the query sequence
|
|
235 : -ITERATION => integer (PSI-BLAST iteration number in which hit was found)
|
|
236 : -OVERLAP => integer (maximum overlap between adjacent
|
|
237 : HSPs when tiling)
|
|
238 : -PROGRAM => string (type of Blast: BLASTP, BLASTN, etc)
|
|
239 : -SIGNIF => significance
|
|
240 : -IS_PVAL => boolean, true if -SIGNIF contains a P-value
|
|
241 : -SCORE => raw BLAST score
|
|
242 : -FOUND_AGAIN => boolean, true if this was a hit from the
|
|
243 : section of a PSI-BLAST with iteration > 1
|
|
244 : containing sequences that were also found
|
|
245 : in iteration 1.
|
|
246 Comments : This object accepts raw Blast report data not because it
|
|
247 : is required for parsing, but in order to retrieve it
|
|
248 : (only available if -HOLD_RAW_DATA is set to true).
|
|
249
|
|
250 See Also : L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<Bio::Root::Root::new()|Bio::Root::Root>
|
|
251
|
|
252 =cut
|
|
253
|
|
254 #-------------------
|
|
255 sub new {
|
|
256 #-------------------
|
|
257 my ($class, @args ) = @_;
|
|
258 my $self = $class->SUPER::new( @args );
|
|
259
|
|
260 my ($raw_data, $signif, $is_pval, $hold_raw);
|
|
261
|
|
262 ($self->{'_blast_program'}, $self->{'_query_length'}, $raw_data, $hold_raw,
|
|
263 $self->{'_overlap'}, $self->{'_iteration'}, $signif, $is_pval,
|
|
264 $self->{'_score'}, $self->{'_found_again'} ) =
|
|
265 $self->_rearrange( [qw(PROGRAM
|
|
266 QUERY_LEN
|
|
267 RAW_DATA
|
|
268 HOLD_RAW_DATA
|
|
269 OVERLAP
|
|
270 ITERATION
|
|
271 SIGNIF
|
|
272 IS_PVAL
|
|
273 SCORE
|
|
274 FOUND_AGAIN )], @args );
|
|
275
|
|
276 # TODO: Handle this in parser. Just pass in name parameter.
|
|
277 $self->_set_id( $raw_data->[0] );
|
|
278
|
|
279 if($is_pval) {
|
|
280 $self->{'_p'} = $signif;
|
|
281 } else {
|
|
282 $self->{'_expect'} = $signif;
|
|
283 }
|
|
284
|
|
285 if( $hold_raw ) {
|
|
286 $self->{'_hit_data'} = $raw_data;
|
|
287 }
|
|
288
|
|
289 return $self;
|
|
290 }
|
|
291
|
|
292 sub DESTROY {
|
|
293 my $self=shift;
|
|
294 #print STDERR "-->DESTROYING $self\n";
|
|
295 }
|
|
296
|
|
297
|
|
298 #=================================================
|
|
299 # Begin Bio::Search::Hit::HitI implementation
|
|
300 #=================================================
|
|
301
|
|
302 =head2 algorithm
|
|
303
|
|
304 Title : algorithm
|
|
305 Usage : $alg = $hit->algorithm();
|
|
306 Function: Gets the algorithm specification that was used to obtain the hit
|
|
307 For BLAST, the algorithm denotes what type of sequence was aligned
|
|
308 against what (BLASTN: dna-dna, BLASTP prt-prt, BLASTX translated
|
|
309 dna-prt, TBLASTN prt-translated dna, TBLASTX translated
|
|
310 dna-translated dna).
|
|
311 Returns : a scalar string
|
|
312 Args : none
|
|
313
|
|
314 =cut
|
|
315
|
|
316 #----------------
|
|
317 sub algorithm {
|
|
318 #----------------
|
|
319 my ($self,@args) = @_;
|
|
320 return $self->{'_blast_program'};
|
|
321 }
|
|
322
|
|
323 =head2 name
|
|
324
|
|
325 Usage : $hit->name([string]);
|
|
326 Purpose : Set/Get a string to identify the hit.
|
|
327 Example : $name = $hit->name;
|
|
328 : $hit->name('M81707');
|
|
329 Returns : String consisting of the hit's name or undef if not set.
|
|
330 Comments : The name is parsed out of the "Query=" line as the first chunk of
|
|
331 non-whitespace text. If you want the rest of the line, use
|
|
332 $hit->description().
|
|
333
|
|
334 See Also: L<accession()|accession>
|
|
335
|
|
336 =cut
|
|
337
|
|
338 #'
|
|
339
|
|
340 #----------------
|
|
341 sub name {
|
|
342 #----------------
|
|
343 my $self = shift;
|
|
344 if (@_) {
|
|
345 my $name = shift;
|
|
346 $name =~ s/^\s+|(\s+|,)$//g;
|
|
347 $self->{'_name'} = $name;
|
|
348 }
|
|
349 return $self->{'_name'};
|
|
350 }
|
|
351
|
|
352 =head2 description
|
|
353
|
|
354 Usage : $hit_object->description( [integer] );
|
|
355 Purpose : Set/Get a description string for the hit.
|
|
356 This is parsed out of the "Query=" line as everything after
|
|
357 the first chunk of non-whitespace text. Use $hit->name()
|
|
358 to get the first chunk (the ID of the sequence).
|
|
359 Example : $description = $hit->description;
|
|
360 : $desc_60char = $hit->description(60);
|
|
361 Argument : Integer (optional) indicating the desired length of the
|
|
362 : description string to be returned.
|
|
363 Returns : String consisting of the hit's description or undef if not set.
|
|
364
|
|
365 =cut
|
|
366
|
|
367 #'
|
|
368
|
|
369 #----------------
|
|
370 sub description {
|
|
371 #----------------
|
|
372 my( $self, $len ) = @_;
|
|
373 $len = (defined $len) ? $len : (CORE::length $self->{'_description'});
|
|
374 return substr( $self->{'_description'}, 0 ,$len );
|
|
375 }
|
|
376
|
|
377 =head2 accession
|
|
378
|
|
379 Title : accession
|
|
380 Usage : $acc = $hit->accession();
|
|
381 Function: Retrieve the accession (if available) for the hit
|
|
382 Returns : a scalar string (empty string if not set)
|
|
383 Args : none
|
|
384 Comments: Accession numbers are extracted based on the assumption that they
|
|
385 are delimited by | characters (NCBI-style). If this is not the case,
|
|
386 use the name() method and parse it as necessary.
|
|
387
|
|
388 See Also: L<name()|name>
|
|
389
|
|
390 =cut
|
|
391
|
|
392 #--------------------
|
|
393 sub accession {
|
|
394 #--------------------
|
|
395 my $self = shift;
|
|
396 if(@_) { $self->{'_accession'} = shift; }
|
|
397 $self->{'_accession'} || '';
|
|
398 }
|
|
399
|
|
400 =head2 raw_score
|
|
401
|
|
402 Usage : $hit_object->raw_score();
|
|
403 Purpose : Gets the BLAST score of the best HSP for the current Blast hit.
|
|
404 Example : $score = $hit_object->raw_score();
|
|
405 Returns : Integer
|
|
406 Argument : n/a
|
|
407 Throws : n/a
|
|
408
|
|
409 See Also : L<bits()|bits>
|
|
410
|
|
411 =cut
|
|
412
|
|
413 #----------
|
|
414 sub raw_score {
|
|
415 #----------
|
|
416 my $self = shift;
|
|
417
|
|
418 # The check for $self->{'_score'} is a remnant from the 'query' mode days
|
|
419 # in which the sbjct object would collect data from the description line only.
|
|
420
|
|
421 my ($score);
|
|
422 if(not defined($self->{'_score'})) {
|
|
423 $score = $self->hsp->score;
|
|
424 } else {
|
|
425 $score = $self->{'_score'};
|
|
426 }
|
|
427 return $score;
|
|
428 }
|
|
429
|
|
430
|
|
431 =head2 length
|
|
432
|
|
433 Usage : $hit_object->length();
|
|
434 Purpose : Get the total length of the hit sequence.
|
|
435 Example : $len = $hit_object->length();
|
|
436 Returns : Integer
|
|
437 Argument : n/a
|
|
438 Throws : n/a
|
|
439 Comments : Developer note: when using the built-in length function within
|
|
440 : this module, call it as CORE::length().
|
|
441
|
|
442 See Also : L<logical_length()|logical_length>, L<length_aln()|length_aln>
|
|
443
|
|
444 =cut
|
|
445
|
|
446 #-----------
|
|
447 sub length {
|
|
448 #-----------
|
|
449 my $self = shift;
|
|
450 return $self->{'_length'};
|
|
451 }
|
|
452
|
|
453 =head2 significance
|
|
454
|
|
455 Equivalent to L<signif()|signif>
|
|
456
|
|
457 =cut
|
|
458
|
|
459 #----------------
|
|
460 sub significance { shift->signif( @_ ); }
|
|
461 #----------------
|
|
462
|
|
463
|
|
464 =head2 next_hsp
|
|
465
|
|
466 Title : next_hsp
|
|
467 Usage : $hsp = $obj->next_hsp();
|
|
468 Function : returns the next available High Scoring Pair object
|
|
469 Example :
|
|
470 Returns : Bio::Search::HSP::BlastHSP or undef if finished
|
|
471 Args : none
|
|
472
|
|
473 =cut
|
|
474
|
|
475 #----------------
|
|
476 sub next_hsp {
|
|
477 #----------------
|
|
478 my $self = shift;
|
|
479
|
|
480 unless($self->{'_hsp_queue_started'}) {
|
|
481 $self->{'_hsp_queue'} = [$self->hsps()];
|
|
482 $self->{'_hsp_queue_started'} = 1;
|
|
483 }
|
|
484 pop @{$self->{'_hsp_queue'}};
|
|
485 }
|
|
486
|
|
487 #=================================================
|
|
488 # End Bio::Search::Hit::HitI implementation
|
|
489 #=================================================
|
|
490
|
|
491
|
|
492 # Providing a more explicit method for getting name of hit
|
|
493 # (corresponds with column name in HitTableWriter)
|
|
494 #----------------
|
|
495 sub hit_name {
|
|
496 #----------------
|
|
497 my $self = shift;
|
|
498 $self->name( @_ );
|
|
499 }
|
|
500
|
|
501 # Older method Delegates to description()
|
|
502 #----------------
|
|
503 sub desc {
|
|
504 #----------------
|
|
505 my $self = shift;
|
|
506 return $self->description( @_ );
|
|
507 }
|
|
508
|
|
509 # Providing a more explicit method for getting description of hit
|
|
510 # (corresponds with column name in HitTableWriter)
|
|
511 #----------------
|
|
512 sub hit_description {
|
|
513 #----------------
|
|
514 my $self = shift;
|
|
515 return $self->description( @_ );
|
|
516 }
|
|
517
|
|
518 =head2 score
|
|
519
|
|
520 Equivalent to L<raw_score()|raw_score>
|
|
521
|
|
522 =cut
|
|
523
|
|
524 #----------------
|
|
525 sub score { shift->raw_score( @_ ); }
|
|
526 #----------------
|
|
527
|
|
528
|
|
529 =head2 hit_length
|
|
530
|
|
531 Equivalent to L<length()|length>
|
|
532
|
|
533 =cut
|
|
534
|
|
535 # Providing a more explicit method for getting length of hit
|
|
536 #----------------
|
|
537 sub hit_length { shift->length( @_ ); }
|
|
538 #----------------
|
|
539
|
|
540
|
|
541 =head2 signif
|
|
542
|
|
543 Usage : $hit_object->signif( [format] );
|
|
544 Purpose : Get the P or Expect value for the best HSP of the given BLAST hit.
|
|
545 : The value returned is the one which is reported in the description
|
|
546 : section of the Blast report. For Blast1 and WU-Blast2, this
|
|
547 : is a P-value, for Blast2, it is an Expect value.
|
|
548 Example : $obj->signif() # returns 1.3e-34
|
|
549 : $obj->signif('exp') # returns -34
|
|
550 : $obj->signif('parts') # returns (1.3, -34)
|
|
551 Returns : Float or scientific notation number (the raw P/Expect value, DEFAULT).
|
|
552 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
|
|
553 : 2-element list (float, int) if format == 'parts' and P/Expect value
|
|
554 : is in scientific notation (see Comments).
|
|
555 Argument : format: string of 'raw' | 'exp' | 'parts'
|
|
556 : 'raw' returns value given in report. Default. (1.2e-34)
|
|
557 : 'exp' returns exponent value only (34)
|
|
558 : 'parts' returns the decimal and exponent as a
|
|
559 : 2-element list (1.2, -34) (see Comments).
|
|
560 Throws : n/a
|
|
561 Comments : The signif() method provides a way to deal with the fact that
|
|
562 : Blast1 and Blast2 formats (and WU- vs. NCBI-BLAST) differ in
|
|
563 : what is reported in the description lines of each hit in the
|
|
564 : Blast report. The signif() method frees any client code from
|
|
565 : having to know if this is a P-value or an Expect value,
|
|
566 : making it easier to write code that can process both
|
|
567 : Blast1 and Blast2 reports. This is not necessarily a good thing,
|
|
568 : since one should always know when one is working with P-values or
|
|
569 : Expect values (hence the deprecated status).
|
|
570 : Use of expect() is recommended since all hits will have an Expect value.
|
|
571 :
|
|
572 : Using the 'parts' argument is not recommended since it will not
|
|
573 : work as expected if the expect value is not in scientific notation.
|
|
574 : That is, floats are not converted into sci notation before
|
|
575 : splitting into parts.
|
|
576
|
|
577 See Also : L<p()|p>, L<expect()|expect>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
|
|
578
|
|
579 =cut
|
|
580
|
|
581 #-------------
|
|
582 sub signif {
|
|
583 #-------------
|
|
584 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
|
|
585 my ($self, $fmt) = @_;
|
|
586
|
|
587 my $val = defined($self->{'_p'}) ? $self->{'_p'} : $self->{'_expect'};
|
|
588
|
|
589 # $val can be zero.
|
|
590 defined($val) or $self->throw("Can't get P- or Expect value: HSPs may not have been set.");
|
|
591
|
|
592 return $val if not $fmt or $fmt =~ /^raw/i;
|
|
593 ## Special formats: exponent-only or as list.
|
|
594 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
|
|
595 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
|
|
596
|
|
597 ## Default: return the raw P/Expect-value.
|
|
598 return $val;
|
|
599 }
|
|
600
|
|
601 #----------------
|
|
602 sub raw_hit_data {
|
|
603 #----------------
|
|
604 my $self = shift;
|
|
605 my $data = '>';
|
|
606 # Need to add blank lines where we've removed them.
|
|
607 foreach( @{$self->{'_hit_data'}} ) {
|
|
608 if( $_ eq 'end') {
|
|
609 $data .= "\n";
|
|
610 }
|
|
611 else {
|
|
612 $data .= /^\s*(Score|Query)/ ? "\n$_" : $_;
|
|
613 }
|
|
614 }
|
|
615 return $data;
|
|
616 }
|
|
617
|
|
618
|
|
619 #=head2 _set_length
|
|
620 #
|
|
621 # Usage : $hit_object->_set_length( "233" );
|
|
622 # Purpose : Set the total length of the hit sequence.
|
|
623 # Example : $hit_object->_set_length( $len );
|
|
624 # Returns : n/a
|
|
625 # Argument : Integer (only when setting). Any commas will be stripped out.
|
|
626 # Throws : n/a
|
|
627 #
|
|
628 #=cut
|
|
629
|
|
630 #-----------
|
|
631 sub _set_length {
|
|
632 #-----------
|
|
633 my ($self, $len) = @_;
|
|
634 $len =~ s/,//g; # get rid of commas
|
|
635 $self->{'_length'} = $len;
|
|
636 }
|
|
637
|
|
638 #=head2 _set_description
|
|
639 #
|
|
640 # Usage : Private method; called automatically during construction
|
|
641 # Purpose : Sets the description of the hit sequence.
|
|
642 # : For sequence without descriptions, does not set any description.
|
|
643 # Argument : Array containing description (multiple lines).
|
|
644 # Comments : Processes the supplied description:
|
|
645 # 1. Join all lines into one string.
|
|
646 # 2. Remove sequence id at the beginning of description.
|
|
647 # 3. Removes junk charactes at begin and end of description.
|
|
648 #
|
|
649 #=cut
|
|
650
|
|
651 #--------------
|
|
652 sub _set_description {
|
|
653 #--------------
|
|
654 my( $self, @desc ) = @_;
|
|
655 my( $desc);
|
|
656
|
|
657 # print STDERR "BlastHit: RAW DESC:\n@desc\n";
|
|
658
|
|
659 $desc = join(" ", @desc);
|
|
660
|
|
661 my $name = $self->name;
|
|
662
|
|
663 if($desc) {
|
|
664 $desc =~ s/^\s*\S+\s+//; # remove the sequence ID(s)
|
|
665 # This won't work if there's no description.
|
|
666 $desc =~ s/^\s*$name//; # ...but this should.
|
|
667 $desc =~ s/^[\s!]+//;
|
|
668 $desc =~ s/ \d+$//;
|
|
669 $desc =~ s/\.+$//;
|
|
670 $self->{'_description'} = $desc;
|
|
671 }
|
|
672
|
|
673 # print STDERR "BlastHit: _set_description = $desc\n";
|
|
674 }
|
|
675
|
|
676 =head2 to_string
|
|
677
|
|
678 Title : to_string
|
|
679 Usage : print $hit->to_string;
|
|
680 Function: Returns a string representation for the Blast Hit.
|
|
681 Primarily intended for debugging purposes.
|
|
682 Example : see usage
|
|
683 Returns : A string of the form:
|
|
684 [BlastHit] <name> <description>
|
|
685 e.g.:
|
|
686 [BlastHit] emb|Z46660|SC9725 S.cerevisiae chromosome XIII cosmid
|
|
687 Args : None
|
|
688
|
|
689 =cut
|
|
690
|
|
691 #----------------
|
|
692 sub to_string {
|
|
693 #----------------
|
|
694 my $self = shift;
|
|
695 return "[BlastHit] " . $self->name . " " . $self->description;
|
|
696 }
|
|
697
|
|
698
|
|
699 #=head2 _set_id
|
|
700 #
|
|
701 # Usage : Private method; automatically called by new()
|
|
702 # Purpose : Sets the name of the BlastHit sequence from the BLAST summary line.
|
|
703 # : The identifier is assumed to be the first
|
|
704 # : chunk of non-whitespace characters in the description line
|
|
705 # : Does not assume any semantics in the structure of the identifier
|
|
706 # : (Formerly, this method attempted to extract database name from
|
|
707 # : the seq identifiers, but this was prone to break).
|
|
708 # Returns : n/a
|
|
709 # Argument : String containing description line of the hit from Blast report
|
|
710 # : or first line of an alignment section (with or without the leading '>').
|
|
711 # Throws : Warning if cannot locate sequence ID.
|
|
712 #
|
|
713 #See Also : L<new()|new>, L<accession()|accession>
|
|
714 #
|
|
715 #=cut
|
|
716
|
|
717 #---------------
|
|
718 sub _set_id {
|
|
719 #---------------
|
|
720 my( $self, $desc ) = @_;
|
|
721
|
|
722 # New strategy: Assume only that the ID is the first white space
|
|
723 # delimited chunk. Not attempting to extract accession & database name.
|
|
724 # Clients will have to interpret it as necessary.
|
|
725 if($desc =~ /^>?(\S+)\s*(.*)/) {
|
|
726 my ($name, $desc) = ($1, $2);
|
|
727 $self->name($name);
|
|
728 $self->{'_description'} = $desc;
|
|
729 # Note that this description comes from the summary section of the
|
|
730 # BLAST report and so may be truncated. The full description will be
|
|
731 # set from the alignment section. We're setting description here in case
|
|
732 # the alignment section isn't being parsed.
|
|
733
|
|
734 # Assuming accession is delimited with | symbols (NCBI-style)
|
|
735 my @pieces = split(/\|/,$name);
|
|
736 my $acc = pop @pieces;
|
|
737 $self->accession( $acc );
|
|
738 }
|
|
739 else {
|
|
740 $self->warn("Can't locate sequence identifier in summary line.", "Line = $desc");
|
|
741 $desc = 'Unknown sequence ID' if not $desc;
|
|
742 $self->name($desc);
|
|
743 }
|
|
744 }
|
|
745
|
|
746
|
|
747 =head2 ambiguous_aln
|
|
748
|
|
749 Usage : $ambig_code = $hit_object->ambiguous_aln();
|
|
750 Purpose : Sets/Gets ambiguity code data member.
|
|
751 Example : (see usage)
|
|
752 Returns : String = 'q', 's', 'qs', '-'
|
|
753 : 'q' = query sequence contains overlapping sub-sequences
|
|
754 : while sbjct does not.
|
|
755 : 's' = sbjct sequence contains overlapping sub-sequences
|
|
756 : while query does not.
|
|
757 : 'qs' = query and sbjct sequence contains overlapping sub-sequences
|
|
758 : relative to each other.
|
|
759 : '-' = query and sbjct sequence do not contains multiple domains
|
|
760 : relative to each other OR both contain the same distribution
|
|
761 : of similar domains.
|
|
762 Argument : n/a
|
|
763 Throws : n/a
|
|
764 Status : Experimental
|
|
765
|
|
766 See Also : L<Bio::Search::BlastUtils::tile_hsps>, L<HSP Tiling and Ambiguous Alignments>
|
|
767
|
|
768 =cut
|
|
769
|
|
770 #--------------------
|
|
771 sub ambiguous_aln {
|
|
772 #--------------------
|
|
773 my $self = shift;
|
|
774 if(@_) { $self->{'_ambiguous_aln'} = shift; }
|
|
775 $self->{'_ambiguous_aln'} || '-';
|
|
776 }
|
|
777
|
|
778
|
|
779
|
|
780 =head2 overlap
|
|
781
|
|
782 Usage : $blast_object->overlap( [integer] );
|
|
783 Purpose : Gets/Sets the allowable amount overlap between different HSP sequences.
|
|
784 Example : $blast_object->overlap(5);
|
|
785 : $overlap = $blast_object->overlap;
|
|
786 Returns : Integer.
|
|
787 Argument : integer.
|
|
788 Throws : n/a
|
|
789 Status : Experimental
|
|
790 Comments : Any two HSPs whose sequences overlap by less than or equal
|
|
791 : to the overlap() number of resides will be considered separate HSPs
|
|
792 : and will not get tiled by Bio::Search::BlastUtils::_adjust_contigs().
|
|
793
|
|
794 See Also : L<Bio::Search::BlastUtils::_adjust_contigs()|Bio::Search::BlastUtils>, L<BUGS | BUGS>
|
|
795
|
|
796 =cut
|
|
797
|
|
798 #-------------
|
|
799 sub overlap {
|
|
800 #-------------
|
|
801 my $self = shift;
|
|
802 if(@_) { $self->{'_overlap'} = shift; }
|
|
803 defined $self->{'_overlap'} ? $self->{'_overlap'} : 0;
|
|
804 }
|
|
805
|
|
806
|
|
807
|
|
808
|
|
809
|
|
810
|
|
811 =head2 bits
|
|
812
|
|
813 Usage : $hit_object->bits();
|
|
814 Purpose : Gets the BLAST bit score of the best HSP for the current Blast hit.
|
|
815 Example : $bits = $hit_object->bits();
|
|
816 Returns : Integer
|
|
817 Argument : n/a
|
|
818 Throws : Exception if bit score is not set.
|
|
819 Comments : For BLAST1, the non-bit score is listed in the summary line.
|
|
820
|
|
821 See Also : L<score()|score>
|
|
822
|
|
823 =cut
|
|
824
|
|
825 #---------
|
|
826 sub bits {
|
|
827 #---------
|
|
828 my $self = shift;
|
|
829
|
|
830 # The check for $self->{'_bits'} is a remnant from the 'query' mode days
|
|
831 # in which the sbjct object would collect data from the description line only.
|
|
832
|
|
833 my ($bits);
|
|
834 if(not defined($self->{'_bits'})) {
|
|
835 $bits = $self->hsp->bits;
|
|
836 } else {
|
|
837 $bits = $self->{'_bits'};
|
|
838 }
|
|
839 return $bits;
|
|
840 }
|
|
841
|
|
842
|
|
843
|
|
844 =head2 n
|
|
845
|
|
846 Usage : $hit_object->n();
|
|
847 Purpose : Gets the N number for the current Blast hit.
|
|
848 : This is the number of HSPs in the set which was ascribed
|
|
849 : the lowest P-value (listed on the description line).
|
|
850 : This number is not the same as the total number of HSPs.
|
|
851 : To get the total number of HSPs, use num_hsps().
|
|
852 Example : $n = $hit_object->n();
|
|
853 Returns : Integer
|
|
854 Argument : n/a
|
|
855 Throws : Exception if HSPs have not been set (BLAST2 reports).
|
|
856 Comments : Note that the N parameter is not reported in gapped BLAST2.
|
|
857 : Calling n() on such reports will result in a call to num_hsps().
|
|
858 : The num_hsps() method will count the actual number of
|
|
859 : HSPs in the alignment listing, which may exceed N in
|
|
860 : some cases.
|
|
861
|
|
862 See Also : L<num_hsps()|num_hsps>
|
|
863
|
|
864 =cut
|
|
865
|
|
866 #-----
|
|
867 sub n {
|
|
868 #-----
|
|
869 my $self = shift;
|
|
870
|
|
871 # The check for $self->{'_n'} is a remnant from the 'query' mode days
|
|
872 # in which the sbjct object would collect data from the description line only.
|
|
873
|
|
874 my ($n);
|
|
875 if(not defined($self->{'_n'})) {
|
|
876 $n = $self->hsp->n;
|
|
877 } else {
|
|
878 $n = $self->{'_n'};
|
|
879 }
|
|
880 $n ||= $self->num_hsps;
|
|
881
|
|
882 return $n;
|
|
883 }
|
|
884
|
|
885
|
|
886
|
|
887 =head2 frame
|
|
888
|
|
889 Usage : $hit_object->frame();
|
|
890 Purpose : Gets the reading frame for the best HSP after HSP tiling.
|
|
891 : This is only valid for BLASTX and TBLASTN/X reports.
|
|
892 Example : $frame = $hit_object->frame();
|
|
893 Returns : Integer (-2 .. +2)
|
|
894 Argument : n/a
|
|
895 Throws : Exception if HSPs have not been set (BLAST2 reports).
|
|
896 Comments : This method requires that all HSPs be tiled. If they have not
|
|
897 : already been tiled, they will be tiled first automatically..
|
|
898 : If you don't want the tiled data, iterate through each HSP
|
|
899 : calling frame() on each (use hsps() to get all HSPs).
|
|
900
|
|
901 See Also : L<hsps()|hsps>
|
|
902
|
|
903 =cut
|
|
904
|
|
905 #----------'
|
|
906 sub frame {
|
|
907 #----------
|
|
908 my $self = shift;
|
|
909
|
|
910 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
911
|
|
912 # The check for $self->{'_frame'} is a remnant from the 'query' mode days
|
|
913 # in which the sbjct object would collect data from the description line only.
|
|
914
|
|
915 my ($frame);
|
|
916 if(not defined($self->{'_frame'})) {
|
|
917 $frame = $self->hsp->frame;
|
|
918 } else {
|
|
919 $frame = $self->{'_frame'};
|
|
920 }
|
|
921 return $frame;
|
|
922 }
|
|
923
|
|
924
|
|
925
|
|
926
|
|
927
|
|
928 =head2 p
|
|
929
|
|
930 Usage : $hit_object->p( [format] );
|
|
931 Purpose : Get the P-value for the best HSP of the given BLAST hit.
|
|
932 : (Note that P-values are not provided with NCBI Blast2 reports).
|
|
933 Example : $p = $sbjct->p;
|
|
934 : $p = $sbjct->p('exp'); # get exponent only.
|
|
935 : ($num, $exp) = $sbjct->p('parts'); # split sci notation into parts
|
|
936 Returns : Float or scientific notation number (the raw P-value, DEFAULT).
|
|
937 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
|
|
938 : 2-element list (float, int) if format == 'parts' and P-value
|
|
939 : is in scientific notation (See Comments).
|
|
940 Argument : format: string of 'raw' | 'exp' | 'parts'
|
|
941 : 'raw' returns value given in report. Default. (1.2e-34)
|
|
942 : 'exp' returns exponent value only (34)
|
|
943 : 'parts' returns the decimal and exponent as a
|
|
944 : 2-element list (1.2, -34) (See Comments).
|
|
945 Throws : Warns if no P-value is defined. Uses expect instead.
|
|
946 Comments : Using the 'parts' argument is not recommended since it will not
|
|
947 : work as expected if the P-value is not in scientific notation.
|
|
948 : That is, floats are not converted into sci notation before
|
|
949 : splitting into parts.
|
|
950
|
|
951 See Also : L<expect()|expect>, L<signif()|signif>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
|
|
952
|
|
953 =cut
|
|
954
|
|
955 #--------
|
|
956 sub p {
|
|
957 #--------
|
|
958 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
|
|
959 my ($self, $fmt) = @_;
|
|
960
|
|
961 my $val = $self->{'_p'};
|
|
962
|
|
963 # $val can be zero.
|
|
964 if(not defined $val) {
|
|
965 # P-value not defined, must be a NCBI Blast2 report.
|
|
966 # Use expect instead.
|
|
967 $self->warn( "P-value not defined. Using expect() instead.");
|
|
968 $val = $self->{'_expect'};
|
|
969 }
|
|
970
|
|
971 return $val if not $fmt or $fmt =~ /^raw/i;
|
|
972 ## Special formats: exponent-only or as list.
|
|
973 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
|
|
974 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
|
|
975
|
|
976 ## Default: return the raw P-value.
|
|
977 return $val;
|
|
978 }
|
|
979
|
|
980
|
|
981
|
|
982 =head2 expect
|
|
983
|
|
984 Usage : $hit_object->expect( [format] );
|
|
985 Purpose : Get the Expect value for the best HSP of the given BLAST hit.
|
|
986 Example : $e = $sbjct->expect;
|
|
987 : $e = $sbjct->expect('exp'); # get exponent only.
|
|
988 : ($num, $exp) = $sbjct->expect('parts'); # split sci notation into parts
|
|
989 Returns : Float or scientific notation number (the raw expect value, DEFAULT).
|
|
990 : Integer if format == 'exp' (the magnitude of the base 10 exponent).
|
|
991 : 2-element list (float, int) if format == 'parts' and Expect
|
|
992 : is in scientific notation (see Comments).
|
|
993 Argument : format: string of 'raw' | 'exp' | 'parts'
|
|
994 : 'raw' returns value given in report. Default. (1.2e-34)
|
|
995 : 'exp' returns exponent value only (34)
|
|
996 : 'parts' returns the decimal and exponent as a
|
|
997 : 2-element list (1.2, -34) (see Comments).
|
|
998 Throws : Exception if the Expect value is not defined.
|
|
999 Comments : Using the 'parts' argument is not recommended since it will not
|
|
1000 : work as expected if the expect value is not in scientific notation.
|
|
1001 : That is, floats are not converted into sci notation before
|
|
1002 : splitting into parts.
|
|
1003
|
|
1004 See Also : L<p()|p>, L<signif()|signif>, L<Bio::Search::BlastUtils::get_exponent()|Bio::Search::BlastUtils>
|
|
1005
|
|
1006 =cut
|
|
1007
|
|
1008 #-----------
|
|
1009 sub expect {
|
|
1010 #-----------
|
|
1011 # Some duplication of logic for p(), expect() and signif() for the sake of performance.
|
|
1012 my ($self, $fmt) = @_;
|
|
1013
|
|
1014 my $val;
|
|
1015
|
|
1016 # For Blast reports that list the P value on the description line,
|
|
1017 # getting the expect value requires fully parsing the HSP data.
|
|
1018 # For NCBI blast, there's no problem.
|
|
1019 if(not defined($self->{'_expect'})) {
|
|
1020 if( defined $self->{'_hsps'}) {
|
|
1021 $self->{'_expect'} = $val = $self->hsp->expect;
|
|
1022 } else {
|
|
1023 # If _expect is not set and _hsps are not set,
|
|
1024 # then this must be a P-value-based report that was
|
|
1025 # run without setting the HSPs (shallow parsing).
|
|
1026 $self->throw("Can't get expect value. HSPs have not been set.");
|
|
1027 }
|
|
1028 } else {
|
|
1029 $val = $self->{'_expect'};
|
|
1030 }
|
|
1031
|
|
1032 # $val can be zero.
|
|
1033 defined($val) or $self->throw("Can't get Expect value.");
|
|
1034
|
|
1035 return $val if not $fmt or $fmt =~ /^raw/i;
|
|
1036 ## Special formats: exponent-only or as list.
|
|
1037 return &Bio::Search::BlastUtils::get_exponent($val) if $fmt =~ /^exp/i;
|
|
1038 return (split (/eE/, $val)) if $fmt =~ /^parts/i;
|
|
1039
|
|
1040 ## Default: return the raw Expect-value.
|
|
1041 return $val;
|
|
1042 }
|
|
1043
|
|
1044
|
|
1045 =head2 hsps
|
|
1046
|
|
1047 Usage : $hit_object->hsps();
|
|
1048 Purpose : Get a list containing all HSP objects.
|
|
1049 : Get the numbers of HSPs for the current hit.
|
|
1050 Example : @hsps = $hit_object->hsps();
|
|
1051 : $num = $hit_object->hsps(); # alternatively, use num_hsps()
|
|
1052 Returns : Array context : list of Bio::Search::HSP::BlastHSP.pm objects.
|
|
1053 : Scalar context: integer (number of HSPs).
|
|
1054 : (Equivalent to num_hsps()).
|
|
1055 Argument : n/a. Relies on wantarray
|
|
1056 Throws : Exception if the HSPs have not been collected.
|
|
1057
|
|
1058 See Also : L<hsp()|hsp>, L<num_hsps()|num_hsps>
|
|
1059
|
|
1060 =cut
|
|
1061
|
|
1062 #---------
|
|
1063 sub hsps {
|
|
1064 #---------
|
|
1065 my $self = shift;
|
|
1066
|
|
1067 if (not ref $self->{'_hsps'}) {
|
|
1068 $self->throw("Can't get HSPs: data not collected.");
|
|
1069 }
|
|
1070
|
|
1071 return wantarray
|
|
1072 # returning list containing all HSPs.
|
|
1073 ? @{$self->{'_hsps'}}
|
|
1074 # returning number of HSPs.
|
|
1075 : scalar(@{$self->{'_hsps'}});
|
|
1076 }
|
|
1077
|
|
1078
|
|
1079
|
|
1080 =head2 hsp
|
|
1081
|
|
1082 Usage : $hit_object->hsp( [string] );
|
|
1083 Purpose : Get a single BlastHSP.pm object for the present BlastHit.pm object.
|
|
1084 Example : $hspObj = $hit_object->hsp; # same as 'best'
|
|
1085 : $hspObj = $hit_object->hsp('best');
|
|
1086 : $hspObj = $hit_object->hsp('worst');
|
|
1087 Returns : Object reference for a Bio::Search::HSP::BlastHSP.pm object.
|
|
1088 Argument : String (or no argument).
|
|
1089 : No argument (default) = highest scoring HSP (same as 'best').
|
|
1090 : 'best' or 'first' = highest scoring HSP.
|
|
1091 : 'worst' or 'last' = lowest scoring HSP.
|
|
1092 Throws : Exception if the HSPs have not been collected.
|
|
1093 : Exception if an unrecognized argument is used.
|
|
1094
|
|
1095 See Also : L<hsps()|hsps>, L<num_hsps>()
|
|
1096
|
|
1097 =cut
|
|
1098
|
|
1099 #----------
|
|
1100 sub hsp {
|
|
1101 #----------
|
|
1102 my( $self, $option ) = @_;
|
|
1103 $option ||= 'best';
|
|
1104
|
|
1105 if (not ref $self->{'_hsps'}) {
|
|
1106 $self->throw("Can't get HSPs: data not collected.");
|
|
1107 }
|
|
1108
|
|
1109 my @hsps = @{$self->{'_hsps'}};
|
|
1110
|
|
1111 return $hsps[0] if $option =~ /best|first|1/i;
|
|
1112 return $hsps[$#hsps] if $option =~ /worst|last/i;
|
|
1113
|
|
1114 $self->throw("Can't get HSP for: $option\n" .
|
|
1115 "Valid arguments: 'best', 'worst'");
|
|
1116 }
|
|
1117
|
|
1118
|
|
1119
|
|
1120 =head2 num_hsps
|
|
1121
|
|
1122 Usage : $hit_object->num_hsps();
|
|
1123 Purpose : Get the number of HSPs for the present Blast hit.
|
|
1124 Example : $nhsps = $hit_object->num_hsps();
|
|
1125 Returns : Integer
|
|
1126 Argument : n/a
|
|
1127 Throws : Exception if the HSPs have not been collected.
|
|
1128
|
|
1129 See Also : L<hsps()|hsps>
|
|
1130
|
|
1131 =cut
|
|
1132
|
|
1133 #-------------
|
|
1134 sub num_hsps {
|
|
1135 #-------------
|
|
1136 my $self = shift;
|
|
1137
|
|
1138 if (not defined $self->{'_hsps'}) {
|
|
1139 $self->throw("Can't get HSPs: data not collected.");
|
|
1140 }
|
|
1141
|
|
1142 return scalar(@{$self->{'_hsps'}});
|
|
1143 }
|
|
1144
|
|
1145
|
|
1146
|
|
1147 =head2 logical_length
|
|
1148
|
|
1149 Usage : $hit_object->logical_length( [seq_type] );
|
|
1150 : (mostly intended for internal use).
|
|
1151 Purpose : Get the logical length of the hit sequence.
|
|
1152 : For query sequence of BLASTX and TBLASTX reports and the hit
|
|
1153 : sequence of TBLASTN and TBLASTX reports, the returned length
|
|
1154 : is the length of the would-be amino acid sequence (length/3).
|
|
1155 : For all other BLAST flavors, this function is the same as length().
|
|
1156 Example : $len = $hit_object->logical_length();
|
|
1157 Returns : Integer
|
|
1158 Argument : seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
|
|
1159 ('sbjct' is synonymous with 'hit')
|
|
1160 Throws : n/a
|
|
1161 Comments : This is important for functions like frac_aligned_query()
|
|
1162 : which need to operate in amino acid coordinate space when dealing
|
|
1163 : with T?BLASTX type reports.
|
|
1164
|
|
1165 See Also : L<length()|length>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>
|
|
1166
|
|
1167 =cut
|
|
1168
|
|
1169 #--------------------
|
|
1170 sub logical_length {
|
|
1171 #--------------------
|
|
1172 my $self = shift;
|
|
1173 my $seqType = shift || 'query';
|
|
1174 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1175
|
|
1176 my $length;
|
|
1177
|
|
1178 # For the sbjct, return logical sbjct length
|
|
1179 if( $seqType eq 'sbjct' ) {
|
|
1180 $length = $self->{'_logical_length'} || $self->{'_length'};
|
|
1181 }
|
|
1182 else {
|
|
1183 # Otherwise, return logical query length
|
|
1184 $length = $self->{'_query_length'};
|
|
1185
|
|
1186 # Adjust length based on BLAST flavor.
|
|
1187 if($self->{'_blast_program'} =~ /T?BLASTX/ ) {
|
|
1188 $length /= 3;
|
|
1189 }
|
|
1190 }
|
|
1191 return $length;
|
|
1192 }
|
|
1193
|
|
1194
|
|
1195 =head2 length_aln
|
|
1196
|
|
1197 Usage : $hit_object->length_aln( [seq_type] );
|
|
1198 Purpose : Get the total length of the aligned region for query or sbjct seq.
|
|
1199 : This number will include all HSPs
|
|
1200 Example : $len = $hit_object->length_aln(); # default = query
|
|
1201 : $lenAln = $hit_object->length_aln('query');
|
|
1202 Returns : Integer
|
|
1203 Argument : seq_Type = 'query' or 'hit' or 'sbjct' (Default = 'query')
|
|
1204 ('sbjct' is synonymous with 'hit')
|
|
1205 Throws : Exception if the argument is not recognized.
|
|
1206 Comments : This method will report the logical length of the alignment,
|
|
1207 : meaning that for TBLAST[NX] reports, the length is reported
|
|
1208 : using amino acid coordinate space (i.e., nucleotides / 3).
|
|
1209 :
|
|
1210 : This method requires that all HSPs be tiled. If they have not
|
|
1211 : already been tiled, they will be tiled first automatically..
|
|
1212 : If you don't want the tiled data, iterate through each HSP
|
|
1213 : calling length() on each (use hsps() to get all HSPs).
|
|
1214
|
|
1215 See Also : L<length()|length>, L<frac_aligned_query()|frac_aligned_query>, L<frac_aligned_hit()|frac_aligned_hit>, L<gaps()|gaps>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<Bio::Search::HSP::BlastHSP::length()|Bio::Search::HSP::BlastHSP>
|
|
1216
|
|
1217 =cut
|
|
1218
|
|
1219 #---------------'
|
|
1220 sub length_aln {
|
|
1221 #---------------
|
|
1222 my( $self, $seqType ) = @_;
|
|
1223
|
|
1224 $seqType ||= 'query';
|
|
1225 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1226
|
|
1227 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1228
|
|
1229 my $data = $self->{'_length_aln_'.$seqType};
|
|
1230
|
|
1231 ## If we don't have data, figure out what went wrong.
|
|
1232 if(!$data) {
|
|
1233 $self->throw("Can't get length aln for sequence type \"$seqType\"" .
|
|
1234 "Valid types are 'query', 'hit', 'sbjct' ('sbjct' = 'hit')");
|
|
1235 }
|
|
1236 $data;
|
|
1237 }
|
|
1238
|
|
1239
|
|
1240 =head2 gaps
|
|
1241
|
|
1242 Usage : $hit_object->gaps( [seq_type] );
|
|
1243 Purpose : Get the number of gaps in the aligned query, sbjct, or both sequences.
|
|
1244 : Data is summed across all HSPs.
|
|
1245 Example : $qgaps = $hit_object->gaps('query');
|
|
1246 : $hgaps = $hit_object->gaps('hit');
|
|
1247 : $tgaps = $hit_object->gaps(); # default = total (query + hit)
|
|
1248 Returns : scalar context: integer
|
|
1249 : array context without args: two-element list of integers
|
|
1250 : (queryGaps, sbjctGaps)
|
|
1251 : Array context can be forced by providing an argument of 'list' or 'array'.
|
|
1252 :
|
|
1253 : CAUTION: Calling this method within printf or sprintf is arrray context.
|
|
1254 : So this function may not give you what you expect. For example:
|
|
1255 : printf "Total gaps: %d", $hit->gaps();
|
|
1256 : Actually returns a two-element array, so what gets printed
|
|
1257 : is the number of gaps in the query, not the total
|
|
1258 :
|
|
1259 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total' | 'list' (default = 'total')
|
|
1260 ('sbjct' is synonymous with 'hit')
|
|
1261 Throws : n/a
|
|
1262 Comments : If you need data for each HSP, use hsps() and then interate
|
|
1263 : through each HSP object.
|
|
1264 : This method requires that all HSPs be tiled. If they have not
|
|
1265 : already been tiled, they will be tiled first automatically..
|
|
1266 : Not relying on wantarray since that will fail in situations
|
|
1267 : such as printf "%d", $hit->gaps() in which you might expect to
|
|
1268 : be printing the total gaps, but evaluates to array context.
|
|
1269
|
|
1270 See Also : L<length_aln()|length_aln>
|
|
1271
|
|
1272 =cut
|
|
1273
|
|
1274 #----------
|
|
1275 sub gaps {
|
|
1276 #----------
|
|
1277 my( $self, $seqType ) = @_;
|
|
1278
|
|
1279 $seqType ||= (wantarray ? 'list' : 'total');
|
|
1280 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1281
|
|
1282 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1283
|
|
1284 $seqType = lc($seqType);
|
|
1285
|
|
1286 if($seqType =~ /list|array/i) {
|
|
1287 return ($self->{'_gaps_query'}, $self->{'_gaps_sbjct'});
|
|
1288 }
|
|
1289
|
|
1290 if($seqType eq 'total') {
|
|
1291 return ($self->{'_gaps_query'} + $self->{'_gaps_sbjct'}) || 0;
|
|
1292 } else {
|
|
1293 return $self->{'_gaps_'.$seqType} || 0;
|
|
1294 }
|
|
1295 }
|
|
1296
|
|
1297
|
|
1298
|
|
1299 =head2 matches
|
|
1300
|
|
1301 Usage : $hit_object->matches( [class] );
|
|
1302 Purpose : Get the total number of identical or conserved matches
|
|
1303 : (or both) across all HSPs.
|
|
1304 : (Note: 'conservative' matches are indicated as 'positives'
|
|
1305 : in the Blast report.)
|
|
1306 Example : ($id,$cons) = $hit_object->matches(); # no argument
|
|
1307 : $id = $hit_object->matches('id');
|
|
1308 : $cons = $hit_object->matches('cons');
|
|
1309 Returns : Integer or a 2-element array of integers
|
|
1310 Argument : class = 'id' | 'cons' OR none.
|
|
1311 : If no argument is provided, both identical and conservative
|
|
1312 : numbers are returned in a two element list.
|
|
1313 : (Other terms can be used to refer to the conservative
|
|
1314 : matches, e.g., 'positive'. All that is checked is whether or
|
|
1315 : not the supplied string starts with 'id'. If not, the
|
|
1316 : conservative matches are returned.)
|
|
1317 Throws : Exception if the requested data cannot be obtained.
|
|
1318 Comments : If you need data for each HSP, use hsps() and then interate
|
|
1319 : through the HSP objects.
|
|
1320 : Does not rely on wantarray to return a list. Only checks for
|
|
1321 : the presence of an argument (no arg = return list).
|
|
1322
|
|
1323 See Also : L<Bio::Search::HSP::BlastHSP::matches()|Bio::Search::HSP::BlastHSP>, L<hsps()|hsps>
|
|
1324
|
|
1325 =cut
|
|
1326
|
|
1327 #---------------
|
|
1328 sub matches {
|
|
1329 #---------------
|
|
1330 my( $self, $arg) = @_;
|
|
1331 my(@data,$data);
|
|
1332
|
|
1333 if(!$arg) {
|
|
1334 @data = ($self->{'_totalIdentical'}, $self->{'_totalConserved'});
|
|
1335
|
|
1336 return @data if @data;
|
|
1337
|
|
1338 } else {
|
|
1339
|
|
1340 if($arg =~ /^id/i) {
|
|
1341 $data = $self->{'_totalIdentical'};
|
|
1342 } else {
|
|
1343 $data = $self->{'_totalConserved'};
|
|
1344 }
|
|
1345 return $data if $data;
|
|
1346 }
|
|
1347
|
|
1348 ## Something went wrong if we make it to here.
|
|
1349 $self->throw("Can't get identical or conserved data: no data.");
|
|
1350 }
|
|
1351
|
|
1352
|
|
1353 =head2 start
|
|
1354
|
|
1355 Usage : $sbjct->start( [seq_type] );
|
|
1356 Purpose : Gets the start coordinate for the query, sbjct, or both sequences
|
|
1357 : in the BlastHit object. If there is more than one HSP, the lowest start
|
|
1358 : value of all HSPs is returned.
|
|
1359 Example : $qbeg = $sbjct->start('query');
|
|
1360 : $sbeg = $sbjct->start('hit');
|
|
1361 : ($qbeg, $sbeg) = $sbjct->start();
|
|
1362 Returns : scalar context: integer
|
|
1363 : array context without args: list of two integers (queryStart, sbjctStart)
|
|
1364 : Array context can be "induced" by providing an argument of 'list' or 'array'.
|
|
1365 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
|
|
1366 ('sbjct' is synonymous with 'hit')
|
|
1367 Throws : n/a
|
|
1368 Comments : This method requires that all HSPs be tiled. If there is more than one
|
|
1369 : HSP and they have not already been tiled, they will be tiled first automatically..
|
|
1370 : Remember that the start and end coordinates of all HSPs are
|
|
1371 : normalized so that start < end. Strand information can be
|
|
1372 : obtained by calling $hit->strand().
|
|
1373
|
|
1374 See Also : L<end()|end>, L<range()|range>, L<strand()|strand>, L<HSP Tiling and Ambiguous Alignments>, L<Bio::Search::HSP::BlastHSP::start|Bio::Search::HSP::BlastHSP>
|
|
1375
|
|
1376 =cut
|
|
1377
|
|
1378 #----------
|
|
1379 sub start {
|
|
1380 #----------
|
|
1381 my ($self, $seqType) = @_;
|
|
1382
|
|
1383 $seqType ||= (wantarray ? 'list' : 'query');
|
|
1384 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1385
|
|
1386 # If there is only one HSP, defer this call to the solitary HSP.
|
|
1387 if($self->num_hsps == 1) {
|
|
1388 return $self->hsp->start($seqType);
|
|
1389 } else {
|
|
1390 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1391 if($seqType =~ /list|array/i) {
|
|
1392 return ($self->{'_queryStart'}, $self->{'_sbjctStart'});
|
|
1393 } else {
|
|
1394 ## Sensitive to member name changes.
|
|
1395 $seqType = "_\L$seqType\E";
|
|
1396 return $self->{$seqType.'Start'};
|
|
1397 }
|
|
1398 }
|
|
1399 }
|
|
1400
|
|
1401
|
|
1402 =head2 end
|
|
1403
|
|
1404 Usage : $sbjct->end( [seq_type] );
|
|
1405 Purpose : Gets the end coordinate for the query, sbjct, or both sequences
|
|
1406 : in the BlastHit object. If there is more than one HSP, the largest end
|
|
1407 : value of all HSPs is returned.
|
|
1408 Example : $qend = $sbjct->end('query');
|
|
1409 : $send = $sbjct->end('hit');
|
|
1410 : ($qend, $send) = $sbjct->end();
|
|
1411 Returns : scalar context: integer
|
|
1412 : array context without args: list of two integers (queryEnd, sbjctEnd)
|
|
1413 : Array context can be "induced" by providing an argument of 'list' or 'array'.
|
|
1414 Argument : In scalar context: seq_type = 'query' or 'sbjct'
|
|
1415 : (case insensitive). If not supplied, 'query' is used.
|
|
1416 Throws : n/a
|
|
1417 Comments : This method requires that all HSPs be tiled. If there is more than one
|
|
1418 : HSP and they have not already been tiled, they will be tiled first automatically..
|
|
1419 : Remember that the start and end coordinates of all HSPs are
|
|
1420 : normalized so that start < end. Strand information can be
|
|
1421 : obtained by calling $hit->strand().
|
|
1422
|
|
1423 See Also : L<start()|start>, L<range()|range>, L<strand()|strand>, L<HSP Tiling and Ambiguous Alignments>, L<Bio::Search::HSP::BlastHSP::end|Bio::Search::HSP::BlastHSP>
|
|
1424
|
|
1425 =cut
|
|
1426
|
|
1427 #----------
|
|
1428 sub end {
|
|
1429 #----------
|
|
1430 my ($self, $seqType) = @_;
|
|
1431
|
|
1432 $seqType ||= (wantarray ? 'list' : 'query');
|
|
1433 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1434
|
|
1435 # If there is only one HSP, defer this call to the solitary HSP.
|
|
1436 if($self->num_hsps == 1) {
|
|
1437 return $self->hsp->end($seqType);
|
|
1438 } else {
|
|
1439 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1440 if($seqType =~ /list|array/i) {
|
|
1441 return ($self->{'_queryStop'}, $self->{'_sbjctStop'});
|
|
1442 } else {
|
|
1443 ## Sensitive to member name changes.
|
|
1444 $seqType = "_\L$seqType\E";
|
|
1445 return $self->{$seqType.'Stop'};
|
|
1446 }
|
|
1447 }
|
|
1448 }
|
|
1449
|
|
1450 =head2 range
|
|
1451
|
|
1452 Usage : $sbjct->range( [seq_type] );
|
|
1453 Purpose : Gets the (start, end) coordinates for the query or sbjct sequence
|
|
1454 : in the HSP alignment.
|
|
1455 Example : ($qbeg, $qend) = $sbjct->range('query');
|
|
1456 : ($sbeg, $send) = $sbjct->range('hit');
|
|
1457 Returns : Two-element array of integers
|
|
1458 Argument : seq_type = string, 'query' or 'hit' or 'sbjct' (default = 'query')
|
|
1459 ('sbjct' is synonymous with 'hit')
|
|
1460 Throws : n/a
|
|
1461
|
|
1462 See Also : L<start()|start>, L<end()|end>
|
|
1463
|
|
1464 =cut
|
|
1465
|
|
1466 #----------
|
|
1467 sub range {
|
|
1468 #----------
|
|
1469 my ($self, $seqType) = @_;
|
|
1470 $seqType ||= 'query';
|
|
1471 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1472 return ($self->start($seqType), $self->end($seqType));
|
|
1473 }
|
|
1474
|
|
1475
|
|
1476 =head2 frac_identical
|
|
1477
|
|
1478 Usage : $hit_object->frac_identical( [seq_type] );
|
|
1479 Purpose : Get the overall fraction of identical positions across all HSPs.
|
|
1480 : The number refers to only the aligned regions and does not
|
|
1481 : account for unaligned regions in between the HSPs, if any.
|
|
1482 Example : $frac_iden = $hit_object->frac_identical('query');
|
|
1483 Returns : Float (2-decimal precision, e.g., 0.75).
|
|
1484 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total'
|
|
1485 : default = 'query' (but see comments below).
|
|
1486 : ('sbjct' is synonymous with 'hit')
|
|
1487 Throws : n/a
|
|
1488 Comments : Different versions of Blast report different values for the total
|
|
1489 : length of the alignment. This is the number reported in the
|
|
1490 : denominators in the stats section:
|
|
1491 : "Identical = 34/120 Positives = 67/120".
|
|
1492 : NCBI BLAST uses the total length of the alignment (with gaps)
|
|
1493 : WU-BLAST uses the length of the query sequence (without gaps).
|
|
1494 :
|
|
1495 : Therefore, when called with an argument of 'total',
|
|
1496 : this method will report different values depending on the
|
|
1497 : version of BLAST used. Total does NOT take into account HSP
|
|
1498 : tiling, so it should not be used.
|
|
1499 :
|
|
1500 : To get the fraction identical among only the aligned residues,
|
|
1501 : ignoring the gaps, call this method without an argument or
|
|
1502 : with an argument of 'query' or 'hit'.
|
|
1503 :
|
|
1504 : If you need data for each HSP, use hsps() and then iterate
|
|
1505 : through the HSP objects.
|
|
1506 : This method requires that all HSPs be tiled. If they have not
|
|
1507 : already been tiled, they will be tiled first automatically.
|
|
1508
|
|
1509 See Also : L<frac_conserved()|frac_conserved>, L<frac_aligned_query()|frac_aligned_query>, L<matches()|matches>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
|
|
1510
|
|
1511 =cut
|
|
1512
|
|
1513 #------------------
|
|
1514 sub frac_identical {
|
|
1515 #------------------
|
|
1516 my ($self, $seqType) = @_;
|
|
1517 $seqType ||= 'query';
|
|
1518 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1519
|
|
1520 ## Sensitive to member name format.
|
|
1521 $seqType = lc($seqType);
|
|
1522
|
|
1523 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1524
|
|
1525 sprintf( "%.2f", $self->{'_totalIdentical'}/$self->{'_length_aln_'.$seqType});
|
|
1526 }
|
|
1527
|
|
1528
|
|
1529
|
|
1530 =head2 frac_conserved
|
|
1531
|
|
1532 Usage : $hit_object->frac_conserved( [seq_type] );
|
|
1533 Purpose : Get the overall fraction of conserved positions across all HSPs.
|
|
1534 : The number refers to only the aligned regions and does not
|
|
1535 : account for unaligned regions in between the HSPs, if any.
|
|
1536 Example : $frac_cons = $hit_object->frac_conserved('hit');
|
|
1537 Returns : Float (2-decimal precision, e.g., 0.75).
|
|
1538 Argument : seq_type: 'query' | 'hit' or 'sbjct' | 'total'
|
|
1539 : default = 'query' (but see comments below).
|
|
1540 : ('sbjct' is synonymous with 'hit')
|
|
1541 Throws : n/a
|
|
1542 Comments : Different versions of Blast report different values for the total
|
|
1543 : length of the alignment. This is the number reported in the
|
|
1544 : denominators in the stats section:
|
|
1545 : "Positives = 34/120 Positives = 67/120".
|
|
1546 : NCBI BLAST uses the total length of the alignment (with gaps)
|
|
1547 : WU-BLAST uses the length of the query sequence (without gaps).
|
|
1548 :
|
|
1549 : Therefore, when called with an argument of 'total',
|
|
1550 : this method will report different values depending on the
|
|
1551 : version of BLAST used. Total does NOT take into account HSP
|
|
1552 : tiling, so it should not be used.
|
|
1553 :
|
|
1554 : To get the fraction conserved among only the aligned residues,
|
|
1555 : ignoring the gaps, call this method without an argument or
|
|
1556 : with an argument of 'query' or 'hit'.
|
|
1557 :
|
|
1558 : If you need data for each HSP, use hsps() and then interate
|
|
1559 : through the HSP objects.
|
|
1560 : This method requires that all HSPs be tiled. If they have not
|
|
1561 : already been tiled, they will be tiled first automatically.
|
|
1562
|
|
1563 See Also : L<frac_identical()|frac_identical>, L<matches()|matches>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
|
|
1564
|
|
1565 =cut
|
|
1566
|
|
1567 #--------------------
|
|
1568 sub frac_conserved {
|
|
1569 #--------------------
|
|
1570 my ($self, $seqType) = @_;
|
|
1571 $seqType ||= 'query';
|
|
1572 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1573
|
|
1574 ## Sensitive to member name format.
|
|
1575 $seqType = lc($seqType);
|
|
1576
|
|
1577 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1578
|
|
1579 sprintf( "%.2f", $self->{'_totalConserved'}/$self->{'_length_aln_'.$seqType});
|
|
1580 }
|
|
1581
|
|
1582
|
|
1583
|
|
1584
|
|
1585 =head2 frac_aligned_query
|
|
1586
|
|
1587 Usage : $hit_object->frac_aligned_query();
|
|
1588 Purpose : Get the fraction of the query sequence which has been aligned
|
|
1589 : across all HSPs (not including intervals between non-overlapping
|
|
1590 : HSPs).
|
|
1591 Example : $frac_alnq = $hit_object->frac_aligned_query();
|
|
1592 Returns : Float (2-decimal precision, e.g., 0.75).
|
|
1593 Argument : n/a
|
|
1594 Throws : n/a
|
|
1595 Comments : If you need data for each HSP, use hsps() and then interate
|
|
1596 : through the HSP objects.
|
|
1597 : To compute the fraction aligned, the logical length of the query
|
|
1598 : sequence is used, meaning that for [T]BLASTX reports, the
|
|
1599 : full length of the query sequence is converted into amino acids
|
|
1600 : by dividing by 3. This is necessary because of the way
|
|
1601 : the lengths of aligned sequences are computed.
|
|
1602 : This method requires that all HSPs be tiled. If they have not
|
|
1603 : already been tiled, they will be tiled first automatically.
|
|
1604
|
|
1605 See Also : L<frac_aligned_hit()|frac_aligned_hit>, L<logical_length()|logical_length>, L<length_aln()|length_aln>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
|
|
1606
|
|
1607 =cut
|
|
1608
|
|
1609 #----------------------
|
|
1610 sub frac_aligned_query {
|
|
1611 #----------------------
|
|
1612 my $self = shift;
|
|
1613
|
|
1614 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1615
|
|
1616 sprintf( "%.2f", $self->{'_length_aln_query'}/$self->logical_length('query'));
|
|
1617 }
|
|
1618
|
|
1619
|
|
1620
|
|
1621 =head2 frac_aligned_hit
|
|
1622
|
|
1623 Usage : $hit_object->frac_aligned_hit();
|
|
1624 Purpose : Get the fraction of the hit (sbjct) sequence which has been aligned
|
|
1625 : across all HSPs (not including intervals between non-overlapping
|
|
1626 : HSPs).
|
|
1627 Example : $frac_alnq = $hit_object->frac_aligned_hit();
|
|
1628 Returns : Float (2-decimal precision, e.g., 0.75).
|
|
1629 Argument : n/a
|
|
1630 Throws : n/a
|
|
1631 Comments : If you need data for each HSP, use hsps() and then interate
|
|
1632 : through the HSP objects.
|
|
1633 : To compute the fraction aligned, the logical length of the sbjct
|
|
1634 : sequence is used, meaning that for TBLAST[NX] reports, the
|
|
1635 : full length of the sbjct sequence is converted into amino acids
|
|
1636 : by dividing by 3. This is necessary because of the way
|
|
1637 : the lengths of aligned sequences are computed.
|
|
1638 : This method requires that all HSPs be tiled. If they have not
|
|
1639 : already been tiled, they will be tiled first automatically.
|
|
1640
|
|
1641 See Also : L<frac_aligned_query()|frac_aligned_query>, L<matches()|matches>, , L<logical_length()|logical_length>, L<length_aln()|length_aln>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
|
|
1642
|
|
1643 =cut
|
|
1644
|
|
1645 #--------------------
|
|
1646 sub frac_aligned_hit {
|
|
1647 #--------------------
|
|
1648 my $self = shift;
|
|
1649
|
|
1650 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1651
|
|
1652 sprintf( "%.2f", $self->{'_length_aln_sbjct'}/$self->logical_length('sbjct'));
|
|
1653 }
|
|
1654
|
|
1655
|
|
1656 ## These methods are being maintained for backward compatibility.
|
|
1657
|
|
1658 =head2 frac_aligned_sbjct
|
|
1659
|
|
1660 Same as L<frac_aligned_hit()|frac_aligned_hit>
|
|
1661
|
|
1662 =cut
|
|
1663
|
|
1664 #----------------
|
|
1665 sub frac_aligned_sbjct { my $self=shift; $self->frac_aligned_hit(@_); }
|
|
1666 #----------------
|
|
1667
|
|
1668 =head2 num_unaligned_sbjct
|
|
1669
|
|
1670 Same as L<num_unaligned_hit()|num_unaligned_hit>
|
|
1671
|
|
1672 =cut
|
|
1673
|
|
1674 #----------------
|
|
1675 sub num_unaligned_sbjct { my $self=shift; $self->num_unaligned_hit(@_); }
|
|
1676 #----------------
|
|
1677
|
|
1678
|
|
1679
|
|
1680 =head2 num_unaligned_hit
|
|
1681
|
|
1682 Usage : $hit_object->num_unaligned_hit();
|
|
1683 Purpose : Get the number of the unaligned residues in the hit sequence.
|
|
1684 : Sums across all all HSPs.
|
|
1685 Example : $num_unaln = $hit_object->num_unaligned_hit();
|
|
1686 Returns : Integer
|
|
1687 Argument : n/a
|
|
1688 Throws : n/a
|
|
1689 Comments : See notes regarding logical lengths in the comments for frac_aligned_hit().
|
|
1690 : They apply here as well.
|
|
1691 : If you need data for each HSP, use hsps() and then interate
|
|
1692 : through the HSP objects.
|
|
1693 : This method requires that all HSPs be tiled. If they have not
|
|
1694 : already been tiled, they will be tiled first automatically..
|
|
1695
|
|
1696 See Also : L<num_unaligned_query()|num_unaligned_query>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>, L<frac_aligned_hit()|frac_aligned_hit>
|
|
1697
|
|
1698 =cut
|
|
1699
|
|
1700 #---------------------
|
|
1701 sub num_unaligned_hit {
|
|
1702 #---------------------
|
|
1703 my $self = shift;
|
|
1704
|
|
1705 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1706
|
|
1707 my $num = $self->logical_length('sbjct') - $self->{'_length_aln_sbjct'};
|
|
1708 ($num < 0 ? 0 : $num );
|
|
1709 }
|
|
1710
|
|
1711
|
|
1712 =head2 num_unaligned_query
|
|
1713
|
|
1714 Usage : $hit_object->num_unaligned_query();
|
|
1715 Purpose : Get the number of the unaligned residues in the query sequence.
|
|
1716 : Sums across all all HSPs.
|
|
1717 Example : $num_unaln = $hit_object->num_unaligned_query();
|
|
1718 Returns : Integer
|
|
1719 Argument : n/a
|
|
1720 Throws : n/a
|
|
1721 Comments : See notes regarding logical lengths in the comments for frac_aligned_query().
|
|
1722 : They apply here as well.
|
|
1723 : If you need data for each HSP, use hsps() and then interate
|
|
1724 : through the HSP objects.
|
|
1725 : This method requires that all HSPs be tiled. If they have not
|
|
1726 : already been tiled, they will be tiled first automatically..
|
|
1727
|
|
1728 See Also : L<num_unaligned_hit()|num_unaligned_hit>, L<frac_aligned_query()|frac_aligned_query>, L<Bio::Search::BlastUtils::tile_hsps()|Bio::Search::BlastUtils>
|
|
1729
|
|
1730 =cut
|
|
1731
|
|
1732 #-----------------------
|
|
1733 sub num_unaligned_query {
|
|
1734 #-----------------------
|
|
1735 my $self = shift;
|
|
1736
|
|
1737 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1738
|
|
1739 my $num = $self->logical_length('query') - $self->{'_length_aln_query'};
|
|
1740 ($num < 0 ? 0 : $num );
|
|
1741 }
|
|
1742
|
|
1743
|
|
1744
|
|
1745 =head2 seq_inds
|
|
1746
|
|
1747 Usage : $hit->seq_inds( seq_type, class, collapse );
|
|
1748 Purpose : Get a list of residue positions (indices) across all HSPs
|
|
1749 : for identical or conserved residues in the query or sbjct sequence.
|
|
1750 Example : @s_ind = $hit->seq_inds('query', 'identical');
|
|
1751 : @h_ind = $hit->seq_inds('hit', 'conserved');
|
|
1752 : @h_ind = $hit->seq_inds('hit', 'conserved', 1);
|
|
1753 Returns : Array of integers
|
|
1754 : May include ranges if collapse is non-zero.
|
|
1755 Argument : [0] seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
|
|
1756 : ('sbjct' is synonymous with 'hit')
|
|
1757 : [1] class = 'identical' or 'conserved' (default = 'identical')
|
|
1758 : (can be shortened to 'id' or 'cons')
|
|
1759 : (actually, anything not 'id' will evaluate to 'conserved').
|
|
1760 : [2] collapse = boolean, if non-zero, consecutive positions are merged
|
|
1761 : using a range notation, e.g., "1 2 3 4 5 7 9 10 11"
|
|
1762 : collapses to "1-5 7 9-11". This is useful for
|
|
1763 : consolidating long lists. Default = no collapse.
|
|
1764 Throws : n/a.
|
|
1765 Comments : Note that HSPs are not tiled for this. This could be a problem
|
|
1766 : for hits containing mutually exclusive HSPs.
|
|
1767 : TODO: Consider tiling and then reporting seq_inds for the
|
|
1768 : best HSP contig.
|
|
1769
|
|
1770 See Also : L<Bio::Search::HSP::BlastHSP::seq_inds()|Bio::Search::HSP::BlastHSP>
|
|
1771
|
|
1772 =cut
|
|
1773
|
|
1774 #-------------
|
|
1775 sub seq_inds {
|
|
1776 #-------------
|
|
1777 my ($self, $seqType, $class, $collapse) = @_;
|
|
1778
|
|
1779 $seqType ||= 'query';
|
|
1780 $class ||= 'identical';
|
|
1781 $collapse ||= 0;
|
|
1782
|
|
1783 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1784
|
|
1785 my (@inds, $hsp);
|
|
1786 foreach $hsp ($self->hsps) {
|
|
1787 # This will merge data for all HSPs together.
|
|
1788 push @inds, $hsp->seq_inds($seqType, $class);
|
|
1789 }
|
|
1790
|
|
1791 # Need to remove duplicates and sort the merged positions.
|
|
1792 if(@inds) {
|
|
1793 my %tmp = map { $_, 1 } @inds;
|
|
1794 @inds = sort {$a <=> $b} keys %tmp;
|
|
1795 }
|
|
1796
|
|
1797 $collapse ? &Bio::Search::BlastUtils::collapse_nums(@inds) : @inds;
|
|
1798 }
|
|
1799
|
|
1800
|
|
1801 =head2 iteration
|
|
1802
|
|
1803 Usage : $sbjct->iteration( );
|
|
1804 Purpose : Gets the iteration number in which the Hit was found.
|
|
1805 Example : $iteration_num = $sbjct->iteration();
|
|
1806 Returns : Integer greater than or equal to 1
|
|
1807 Non-PSI-BLAST reports will report iteration as 1, but this number
|
|
1808 is only meaningful for PSI-BLAST reports.
|
|
1809 Argument : none
|
|
1810 Throws : none
|
|
1811
|
|
1812 See Also : L<found_again()|found_again>
|
|
1813
|
|
1814 =cut
|
|
1815
|
|
1816 #----------------
|
|
1817 sub iteration { shift->{'_iteration'} }
|
|
1818 #----------------
|
|
1819
|
|
1820
|
|
1821 =head2 found_again
|
|
1822
|
|
1823 Usage : $sbjct->found_again;
|
|
1824 Purpose : Gets a boolean indicator whether or not the hit has
|
|
1825 been found in a previous iteration.
|
|
1826 This is only applicable to PSI-BLAST reports.
|
|
1827
|
|
1828 This method indicates if the hit was reported in the
|
|
1829 "Sequences used in model and found again" section of the
|
|
1830 PSI-BLAST report or if it was reported in the
|
|
1831 "Sequences not found previously or not previously below threshold"
|
|
1832 section of the PSI-BLAST report. Only for hits in iteration > 1.
|
|
1833
|
|
1834 Example : if( $sbjct->found_again()) { ... };
|
|
1835 Returns : Boolean (1 or 0) for PSI-BLAST report iterations greater than 1.
|
|
1836 Returns undef for PSI-BLAST report iteration 1 and non PSI_BLAST
|
|
1837 reports.
|
|
1838 Argument : none
|
|
1839 Throws : none
|
|
1840
|
|
1841 See Also : L<found_again()|found_again>
|
|
1842
|
|
1843 =cut
|
|
1844
|
|
1845 #----------------
|
|
1846 sub found_again { shift->{'_found_again'} }
|
|
1847 #----------------
|
|
1848
|
|
1849
|
|
1850 =head2 strand
|
|
1851
|
|
1852 Usage : $sbjct->strand( [seq_type] );
|
|
1853 Purpose : Gets the strand(s) for the query, sbjct, or both sequences
|
|
1854 : in the best HSP of the BlastHit object after HSP tiling.
|
|
1855 : Only valid for BLASTN, TBLASTX, BLASTX-query, TBLASTN-hit.
|
|
1856 Example : $qstrand = $sbjct->strand('query');
|
|
1857 : $sstrand = $sbjct->strand('hit');
|
|
1858 : ($qstrand, $sstrand) = $sbjct->strand();
|
|
1859 Returns : scalar context: integer '1', '-1', or '0'
|
|
1860 : array context without args: list of two strings (queryStrand, sbjctStrand)
|
|
1861 : Array context can be "induced" by providing an argument of 'list' or 'array'.
|
|
1862 Argument : In scalar context: seq_type = 'query' or 'hit' or 'sbjct' (default = 'query')
|
|
1863 ('sbjct' is synonymous with 'hit')
|
|
1864 Throws : n/a
|
|
1865 Comments : This method requires that all HSPs be tiled. If they have not
|
|
1866 : already been tiled, they will be tiled first automatically..
|
|
1867 : If you don't want the tiled data, iterate through each HSP
|
|
1868 : calling strand() on each (use hsps() to get all HSPs).
|
|
1869 :
|
|
1870 : Formerly (prior to 10/21/02), this method would return the
|
|
1871 : string "-1/1" for hits with HSPs on both strands.
|
|
1872 : However, now that strand and frame is properly being accounted
|
|
1873 : for during HSP tiling, it makes more sense for strand()
|
|
1874 : to return the strand data for the best HSP after tiling.
|
|
1875 :
|
|
1876 : If you really want to know about hits on opposite strands,
|
|
1877 : you should be iterating through the HSPs using methods on the
|
|
1878 : HSP objects.
|
|
1879 :
|
|
1880 : A possible use case where knowing whether a hit has HSPs
|
|
1881 : on both strands would be when filtering via SearchIO for hits with
|
|
1882 : this property. However, in this case it would be better to have a
|
|
1883 : dedicated method such as $hit->hsps_on_both_strands(). Similarly
|
|
1884 : for frame. This could be provided if there is interest.
|
|
1885
|
|
1886 See Also : B<Bio::Search::HSP::BlastHSP::strand>()
|
|
1887
|
|
1888 =cut
|
|
1889
|
|
1890 #----------'
|
|
1891 sub strand {
|
|
1892 #----------
|
|
1893 my ($self, $seqType) = @_;
|
|
1894
|
|
1895 Bio::Search::BlastUtils::tile_hsps($self) if not $self->{'_tile_hsps'};
|
|
1896
|
|
1897 $seqType ||= (wantarray ? 'list' : 'query');
|
|
1898 $seqType = 'sbjct' if $seqType eq 'hit';
|
|
1899
|
|
1900 my ($qstr, $hstr);
|
|
1901 # If there is only one HSP, defer this call to the solitary HSP.
|
|
1902 if($self->num_hsps == 1) {
|
|
1903 return $self->hsp->strand($seqType);
|
|
1904 }
|
|
1905 elsif( defined $self->{'_qstrand'}) {
|
|
1906 # Get the data computed during hsp tiling.
|
|
1907 $qstr = $self->{'_qstrand'};
|
|
1908 $hstr = $self->{'_sstrand'};
|
|
1909 }
|
|
1910 else {
|
|
1911 # otherwise, iterate through all HSPs collecting strand info.
|
|
1912 # This will return the string "-1/1" if there are HSPs on different strands.
|
|
1913 # NOTE: This was the pre-10/21/02 procedure which will no longer be used,
|
|
1914 # (unless the above elsif{} is commented out).
|
|
1915 my (%qstr, %hstr);
|
|
1916 foreach my $hsp( $self->hsps ) {
|
|
1917 my ( $q, $h ) = $hsp->strand();
|
|
1918 $qstr{ $q }++;
|
|
1919 $hstr{ $h }++;
|
|
1920 }
|
|
1921 $qstr = join( '/', sort keys %qstr);
|
|
1922 $hstr = join( '/', sort keys %hstr);
|
|
1923 }
|
|
1924
|
|
1925 if($seqType =~ /list|array/i) {
|
|
1926 return ($qstr, $hstr);
|
|
1927 } elsif( $seqType eq 'query' ) {
|
|
1928 return $qstr;
|
|
1929 } else {
|
|
1930 return $hstr;
|
|
1931 }
|
|
1932 }
|
|
1933
|
|
1934
|
|
1935 1;
|
|
1936 __END__
|
|
1937
|
|
1938 #####################################################################################
|
|
1939 # END OF CLASS #
|
|
1940 #####################################################################################
|
|
1941
|
|
1942
|
|
1943 =head1 FOR DEVELOPERS ONLY
|
|
1944
|
|
1945 =head2 Data Members
|
|
1946
|
|
1947 Information about the various data members of this module is provided for those
|
|
1948 wishing to modify or understand the code. Two things to bear in mind:
|
|
1949
|
|
1950 =over 4
|
|
1951
|
|
1952 =item 1 Do NOT rely on these in any code outside of this module.
|
|
1953
|
|
1954 All data members are prefixed with an underscore to signify that they are private.
|
|
1955 Always use accessor methods. If the accessor doesn't exist or is inadequate,
|
|
1956 create or modify an accessor (and let me know, too!). (An exception to this might
|
|
1957 be for BlastHSP.pm which is more tightly coupled to BlastHit.pm and
|
|
1958 may access BlastHit data members directly for efficiency purposes, but probably
|
|
1959 should not).
|
|
1960
|
|
1961 =item 2 This documentation may be incomplete and out of date.
|
|
1962
|
|
1963 It is easy for these data member descriptions to become obsolete as
|
|
1964 this module is still evolving. Always double check this info and search
|
|
1965 for members not described here.
|
|
1966
|
|
1967 =back
|
|
1968
|
|
1969 An instance of Bio::Search::Hit::BlastHit.pm is a blessed reference to a hash containing
|
|
1970 all or some of the following fields:
|
|
1971
|
|
1972 FIELD VALUE
|
|
1973 --------------------------------------------------------------
|
|
1974 _hsps : Array ref for a list of Bio::Search::HSP::BlastHSP.pm objects.
|
|
1975 :
|
|
1976 _db : Database identifier from the summary line.
|
|
1977 :
|
|
1978 _desc : Description data for the hit from the summary line.
|
|
1979 :
|
|
1980 _length : Total length of the hit sequence.
|
|
1981 :
|
|
1982 _score : BLAST score.
|
|
1983 :
|
|
1984 _bits : BLAST score (in bits). Matrix-independent.
|
|
1985 :
|
|
1986 _p : BLAST P value. Obtained from summary section. (Blast1/WU-Blast only)
|
|
1987 :
|
|
1988 _expect : BLAST Expect value. Obtained from summary section.
|
|
1989 :
|
|
1990 _n : BLAST N value (number of HSPs) (Blast1/WU-Blast2 only)
|
|
1991 :
|
|
1992 _frame : Reading frame for TBLASTN and TBLASTX analyses.
|
|
1993 :
|
|
1994 _totalIdentical: Total number of identical aligned monomers.
|
|
1995 :
|
|
1996 _totalConserved: Total number of conserved aligned monomers (a.k.a. "positives").
|
|
1997 :
|
|
1998 _overlap : Maximum number of overlapping residues between adjacent HSPs
|
|
1999 : before considering the alignment to be ambiguous.
|
|
2000 :
|
|
2001 _ambiguous_aln : Boolean. True if the alignment of all HSPs is ambiguous.
|
|
2002 :
|
|
2003 _length_aln_query : Length of the aligned region of the query sequence.
|
|
2004 :
|
|
2005 _length_aln_sbjct : Length of the aligned region of the sbjct sequence.
|
|
2006
|
|
2007
|
|
2008 =cut
|
|
2009
|
|
2010 1;
|