Mercurial > repos > mahtabm > ensembl
comparison variant_effect_predictor/Bio/Search/BlastUtils.pm @ 0:1f6dce3d34e0
Uploaded
| author | mahtabm |
|---|---|
| date | Thu, 11 Apr 2013 02:01:53 -0400 |
| parents | |
| children |
comparison
equal
deleted
inserted
replaced
| -1:000000000000 | 0:1f6dce3d34e0 |
|---|---|
| 1 =head1 NAME | |
| 2 | |
| 3 Bio::Search::BlastUtils - Utility functions for Bio::Search:: BLAST objects | |
| 4 | |
| 5 =head1 SYNOPSIS | |
| 6 | |
| 7 This module is just a collection of subroutines, not an object. | |
| 8 | |
| 9 =head1 DESCRIPTION | |
| 10 | |
| 11 The BlastUtils.pm module is a collection of subroutines used primarily by | |
| 12 Bio::Search::Hit::BlastHit objects for some of the additional | |
| 13 functionality, such as HSP tiling. Right now, the BlastUtils is just a | |
| 14 collection of methods, not an object, and it's tightly coupled to | |
| 15 Bio::Search::Hit::BlastHit. A goal for the future is to generalize it | |
| 16 to work based on the Bio::Search interfaces, then it can work with any | |
| 17 objects that implements them. | |
| 18 | |
| 19 =head1 AUTHOR | |
| 20 | |
| 21 Steve Chervitz E<lt>sac@bioperl.orgE<gt> | |
| 22 | |
| 23 =cut | |
| 24 | |
| 25 #' | |
| 26 | |
| 27 package Bio::Search::BlastUtils; | |
| 28 | |
| 29 | |
| 30 | |
| 31 =head2 tile_hsps | |
| 32 | |
| 33 Usage : tile_hsps( $sbjct ); | |
| 34 : This is called automatically by Bio::Search::Hit::BlastHit | |
| 35 : during object construction or | |
| 36 : as needed by methods that rely on having tiled data. | |
| 37 Purpose : Collect statistics about the aligned sequences in a set of HSPs. | |
| 38 : Calculates the following data across all HSPs: | |
| 39 : -- total alignment length | |
| 40 : -- total identical residues | |
| 41 : -- total conserved residues | |
| 42 Returns : n/a | |
| 43 Argument : A Bio::Search::Hit::BlastHit object | |
| 44 Throws : n/a | |
| 45 Comments : | |
| 46 : This method is *strongly* coupled to Bio::Search::Hit::BlastHit | |
| 47 : (it accesses BlastHit data members directly). | |
| 48 : TODO: Re-write this to the Bio::Search::Hit::HitI interface. | |
| 49 : | |
| 50 : This method performs more careful summing of data across | |
| 51 : all HSPs in the Sbjct object. Only HSPs that are in the same strand | |
| 52 : and frame are tiled. Simply summing the data from all HSPs | |
| 53 : in the same strand and frame will overestimate the actual | |
| 54 : length of the alignment if there is overlap between different HSPs | |
| 55 : (often the case). | |
| 56 : | |
| 57 : The strategy is to tile the HSPs and sum over the | |
| 58 : contigs, collecting data separately from overlapping and | |
| 59 : non-overlapping regions of each HSP. To facilitate this, the | |
| 60 : HSP.pm object now permits extraction of data from sub-sections | |
| 61 : of an HSP. | |
| 62 : | |
| 63 : Additional useful information is collected from the results | |
| 64 : of the tiling. It is possible that sub-sequences in | |
| 65 : different HSPs will overlap significantly. In this case, it | |
| 66 : is impossible to create a single unambiguous alignment by | |
| 67 : concatenating the HSPs. The ambiguity may indicate the | |
| 68 : presence of multiple, similar domains in one or both of the | |
| 69 : aligned sequences. This ambiguity is recorded using the | |
| 70 : ambiguous_aln() method. | |
| 71 : | |
| 72 : This method does not attempt to discern biologically | |
| 73 : significant vs. insignificant overlaps. The allowable amount of | |
| 74 : overlap can be set with the overlap() method or with the -OVERLAP | |
| 75 : parameter used when constructing the Blast & Sbjct objects. | |
| 76 : | |
| 77 : For a given hit, both the query and the sbjct sequences are | |
| 78 : tiled independently. | |
| 79 : | |
| 80 : -- If only query sequence HSPs overlap, | |
| 81 : this may suggest multiple domains in the sbjct. | |
| 82 : -- If only sbjct sequence HSPs overlap, | |
| 83 : this may suggest multiple domains in the query. | |
| 84 : -- If both query & sbjct sequence HSPs overlap, | |
| 85 : this suggests multiple domains in both. | |
| 86 : -- If neither query & sbjct sequence HSPs overlap, | |
| 87 : this suggests either no multiple domains in either | |
| 88 : sequence OR that both sequences have the same | |
| 89 : distribution of multiple similar domains. | |
| 90 : | |
| 91 : This method can deal with the special case of when multiple | |
| 92 : HSPs exactly overlap. | |
| 93 : | |
| 94 : Efficiency concerns: | |
| 95 : Speed will be an issue for sequences with numerous HSPs. | |
| 96 : | |
| 97 Bugs : Currently, tile_hsps() does not properly account for | |
| 98 : the number of non-tiled but overlapping HSPs, which becomes a problem | |
| 99 : as overlap() grows. Large values overlap() may thus lead to | |
| 100 : incorrect statistics for some hits. For best results, keep overlap() | |
| 101 : below 5 (DEFAULT IS 2). For more about this, see the "HSP Tiling and | |
| 102 : Ambiguous Alignments" section in L<Bio::Search::Hit::BlastHit>. | |
| 103 | |
| 104 See Also : L<_adjust_contigs>(), L<Bio::Search::Hit::BlastHit|Bio::Search::Hit::BlastHit> | |
| 105 | |
| 106 =cut | |
| 107 | |
| 108 #-------------- | |
| 109 sub tile_hsps { | |
| 110 #-------------- | |
| 111 my $sbjct = shift; | |
| 112 | |
| 113 $sbjct->{'_tile_hsps'} = 1; | |
| 114 $sbjct->{'_gaps_query'} = 0; | |
| 115 $sbjct->{'_gaps_sbjct'} = 0; | |
| 116 | |
| 117 ## Simple summation scheme. Valid if there is only one HSP. | |
| 118 if((defined($sbjct->{'_n'}) and $sbjct->{'_n'} == 1) or $sbjct->num_hsps == 1) { | |
| 119 my $hsp = $sbjct->hsp; | |
| 120 $sbjct->{'_length_aln_query'} = $hsp->length('query'); | |
| 121 $sbjct->{'_length_aln_sbjct'} = $hsp->length('sbjct'); | |
| 122 $sbjct->{'_length_aln_total'} = $hsp->length('total'); | |
| 123 ($sbjct->{'_totalIdentical'},$sbjct->{'_totalConserved'}) = $hsp->matches(); | |
| 124 $sbjct->{'_gaps_query'} = $hsp->gaps('query'); | |
| 125 $sbjct->{'_gaps_sbjct'} = $hsp->gaps('sbjct'); | |
| 126 | |
| 127 # print "_tile_hsps(): single HSP, easy stats.\n"; | |
| 128 return; | |
| 129 } else { | |
| 130 # print STDERR "Sbjct: _tile_hsps: summing multiple HSPs\n"; | |
| 131 $sbjct->{'_length_aln_query'} = 0; | |
| 132 $sbjct->{'_length_aln_sbjct'} = 0; | |
| 133 $sbjct->{'_length_aln_total'} = 0; | |
| 134 $sbjct->{'_totalIdentical'} = 0; | |
| 135 $sbjct->{'_totalConserved'} = 0; | |
| 136 } | |
| 137 | |
| 138 ## More than one HSP. Must tile HSPs. | |
| 139 # print "\nTiling HSPs for $sbjct\n"; | |
| 140 my($hsp, $qstart, $qstop, $sstart, $sstop); | |
| 141 my($frame, $strand, $qstrand, $sstrand); | |
| 142 my(@qcontigs, @scontigs); | |
| 143 my $qoverlap = 0; | |
| 144 my $soverlap = 0; | |
| 145 my $max_overlap = $sbjct->{'_overlap'}; | |
| 146 | |
| 147 foreach $hsp ($sbjct->hsps()) { | |
| 148 # printf " HSP: %s\n%s\n",$hsp->name, $hsp->str('query'); | |
| 149 # printf " Length = %d; Identical = %d; Conserved = %d; Conserved(1-10): %d",$hsp->length, $hsp->length(-TYPE=>'iden'), $hsp->length(-TYPE=>'cons'), $hsp->length(-TYPE=>'cons',-START=>0,-STOP=>10); | |
| 150 ($qstart, $qstop) = $hsp->range('query'); | |
| 151 ($sstart, $sstop) = $hsp->range('sbjct'); | |
| 152 $frame = $hsp->frame; | |
| 153 $frame = -1 unless defined $frame; | |
| 154 ($qstrand, $sstrand) = $hsp->strand; | |
| 155 | |
| 156 my ($qgaps, $sgaps) = $hsp->gaps(); | |
| 157 $sbjct->{'_gaps_query'} += $qgaps; | |
| 158 $sbjct->{'_gaps_sbjct'} += $sgaps; | |
| 159 | |
| 160 $sbjct->{'_length_aln_total'} += $hsp->length; | |
| 161 ## Collect contigs in the query sequence. | |
| 162 $qoverlap = &_adjust_contigs('query', $hsp, $qstart, $qstop, \@qcontigs, $max_overlap, $frame, $qstrand); | |
| 163 | |
| 164 ## Collect contigs in the sbjct sequence (needed for domain data and gapped Blast). | |
| 165 $soverlap = &_adjust_contigs('sbjct', $hsp, $sstart, $sstop, \@scontigs, $max_overlap, $frame, $sstrand); | |
| 166 | |
| 167 ## Collect overall start and stop data for query and sbjct over all HSPs. | |
| 168 if(not defined $sbjct->{'_queryStart'}) { | |
| 169 $sbjct->{'_queryStart'} = $qstart; | |
| 170 $sbjct->{'_queryStop'} = $qstop; | |
| 171 $sbjct->{'_sbjctStart'} = $sstart; | |
| 172 $sbjct->{'_sbjctStop'} = $sstop; | |
| 173 } else { | |
| 174 $sbjct->{'_queryStart'} = ($qstart < $sbjct->{'_queryStart'} ? $qstart : $sbjct->{'_queryStart'}); | |
| 175 $sbjct->{'_queryStop'} = ($qstop > $sbjct->{'_queryStop'} ? $qstop : $sbjct->{'_queryStop'}); | |
| 176 $sbjct->{'_sbjctStart'} = ($sstart < $sbjct->{'_sbjctStart'} ? $sstart : $sbjct->{'_sbjctStart'}); | |
| 177 $sbjct->{'_sbjctStop'} = ($sstop > $sbjct->{'_sbjctStop'} ? $sstop : $sbjct->{'_sbjctStop'}); | |
| 178 } | |
| 179 } | |
| 180 | |
| 181 ## Collect data across the collected contigs. | |
| 182 | |
| 183 # print "\nQUERY CONTIGS:\n"; | |
| 184 # print " gaps = $sbjct->{'_gaps_query'}\n"; | |
| 185 | |
| 186 # TODO: Account for strand/frame issue! | |
| 187 # Strategy: collect data on a per strand+frame basis and save the most significant one. | |
| 188 my (%qctg_dat); | |
| 189 foreach(@qcontigs) { | |
| 190 # print " query contig: $_->{'start'} - $_->{'stop'}\n"; | |
| 191 # print " iden = $_->{'iden'}; cons = $_->{'cons'}\n"; | |
| 192 ($frame, $strand) = ($_->{'frame'}, $_->{'strand'}); | |
| 193 $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1; | |
| 194 $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'}; | |
| 195 $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'}; | |
| 196 $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; | |
| 197 } | |
| 198 | |
| 199 # Find longest contig. | |
| 200 my @sortedkeys = reverse sort { $qctg_dat{ $a }->{'length_aln_query'} <=> $qctg_dat{ $b }->{'length_aln_query'} } keys %qctg_dat; | |
| 201 | |
| 202 # Save the largest to the sbjct: | |
| 203 my $longest = $sortedkeys[0]; | |
| 204 $sbjct->{'_length_aln_query'} = $qctg_dat{ $longest }->{'length_aln_query'}; | |
| 205 $sbjct->{'_totalIdentical'} = $qctg_dat{ $longest }->{'totalIdentical'}; | |
| 206 $sbjct->{'_totalConserved'} = $qctg_dat{ $longest }->{'totalConserved'}; | |
| 207 $sbjct->{'_qstrand'} = $qctg_dat{ $longest }->{'qstrand'}; | |
| 208 | |
| 209 ## Collect data for sbjct contigs. Important for gapped Blast. | |
| 210 ## The totalIdentical and totalConserved numbers will be the same | |
| 211 ## as determined for the query contigs. | |
| 212 | |
| 213 # print "\nSBJCT CONTIGS:\n"; | |
| 214 # print " gaps = $sbjct->{'_gaps_sbjct'}\n"; | |
| 215 | |
| 216 my (%sctg_dat); | |
| 217 foreach(@scontigs) { | |
| 218 # print " sbjct contig: $_->{'start'} - $_->{'stop'}\n"; | |
| 219 # print " iden = $_->{'iden'}; cons = $_->{'cons'}\n"; | |
| 220 ($frame, $strand) = ($_->{'frame'}, $_->{'strand'}); | |
| 221 $sctg_dat{ "$frame$strand" }->{'length_aln_sbjct'} += $_->{'stop'} - $_->{'start'} + 1; | |
| 222 $sctg_dat{ "$frame$strand" }->{'frame'} = $frame; | |
| 223 $sctg_dat{ "$frame$strand" }->{'sstrand'} = $strand; | |
| 224 } | |
| 225 | |
| 226 @sortedkeys = reverse sort { $sctg_dat{ $a }->{'length_aln_sbjct'} <=> $sctg_dat{ $b }->{'length_aln_sbjct'} } keys %sctg_dat; | |
| 227 | |
| 228 # Save the largest to the sbjct: | |
| 229 $longest = $sortedkeys[0]; | |
| 230 | |
| 231 $sbjct->{'_length_aln_sbjct'} = $sctg_dat{ $longest }->{'length_aln_sbjct'}; | |
| 232 $sbjct->{'_frame'} = $sctg_dat{ $longest }->{'frame'}; | |
| 233 $sbjct->{'_sstrand'} = $sctg_dat{ $longest }->{'sstrand'}; | |
| 234 | |
| 235 if($qoverlap) { | |
| 236 if($soverlap) { $sbjct->ambiguous_aln('qs'); | |
| 237 # print "\n*** AMBIGUOUS ALIGNMENT: Query and Sbjct\n\n"; | |
| 238 } | |
| 239 else { $sbjct->ambiguous_aln('q'); | |
| 240 # print "\n*** AMBIGUOUS ALIGNMENT: Query\n\n"; | |
| 241 } | |
| 242 } elsif($soverlap) { | |
| 243 $sbjct->ambiguous_aln('s'); | |
| 244 # print "\n*** AMBIGUOUS ALIGNMENT: Sbjct\n\n"; | |
| 245 } | |
| 246 | |
| 247 # Adjust length based on BLAST flavor. | |
| 248 my $prog = $sbjct->algorithm; | |
| 249 if($prog eq 'TBLASTN') { | |
| 250 $sbjct->{'_length_aln_sbjct'} /= 3; | |
| 251 } elsif($prog eq 'BLASTX' ) { | |
| 252 $sbjct->{'_length_aln_query'} /= 3; | |
| 253 } elsif($prog eq 'TBLASTX') { | |
| 254 $sbjct->{'_length_aln_query'} /= 3; | |
| 255 $sbjct->{'_length_aln_sbjct'} /= 3; | |
| 256 } | |
| 257 } | |
| 258 | |
| 259 | |
| 260 | |
| 261 =head2 _adjust_contigs | |
| 262 | |
| 263 Usage : n/a; called automatically during object construction. | |
| 264 Purpose : Builds HSP contigs for a given BLAST hit. | |
| 265 : Utility method called by _tile_hsps() | |
| 266 Returns : | |
| 267 Argument : | |
| 268 Throws : Exceptions propagated from Bio::Search::Hit::BlastHSP::matches() | |
| 269 : for invalid sub-sequence ranges. | |
| 270 Status : Experimental | |
| 271 Comments : This method does not currently support gapped alignments. | |
| 272 : Also, it does not keep track of the number of HSPs that | |
| 273 : overlap within the amount specified by overlap(). | |
| 274 : This will lead to significant tracking errors for large | |
| 275 : overlap values. | |
| 276 | |
| 277 See Also : L<tile_hsps>(), L<Bio::Search::Hit::BlastHSP::matches|Bio::Search::Hit::BlastHSP> | |
| 278 | |
| 279 =cut | |
| 280 | |
| 281 #------------------- | |
| 282 sub _adjust_contigs { | |
| 283 #------------------- | |
| 284 my ($seqType, $hsp, $start, $stop, $contigs_ref, $max_overlap, $frame, $strand) = @_; | |
| 285 | |
| 286 my $overlap = 0; | |
| 287 my ($numID, $numCons); | |
| 288 | |
| 289 # print STDERR "Testing $seqType data: HSP (${\$hsp->name}); $start, $stop, strand=$strand, frame=$frame\n"; | |
| 290 foreach(@$contigs_ref) { | |
| 291 # print STDERR " Contig: $_->{'start'} - $_->{'stop'}, strand=$_->{'strand'}, frame=$_->{'frame'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n"; | |
| 292 | |
| 293 # Don't merge things unless they have matching strand/frame. | |
| 294 next unless ($_->{'frame'} == $frame and $_->{'strand'} == $strand); | |
| 295 | |
| 296 ## Test special case of a nested HSP. Skip it. | |
| 297 if($start >= $_->{'start'} and $stop <= $_->{'stop'}) { | |
| 298 # print STDERR "----> Nested HSP. Skipping.\n"; | |
| 299 $overlap = 1; | |
| 300 next; | |
| 301 } | |
| 302 | |
| 303 ## Test for overlap at beginning of contig. | |
| 304 if($start < $_->{'start'} and $stop > ($_->{'start'} + $max_overlap)) { | |
| 305 # print STDERR "----> Overlaps beg: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n"; | |
| 306 # Collect stats over the non-overlapping region. | |
| 307 eval { | |
| 308 ($numID, $numCons) = $hsp->matches(-SEQ =>$seqType, | |
| 309 -START =>$start, | |
| 310 -STOP =>$_->{'start'}-1); | |
| 311 }; | |
| 312 if($@) { warn "\a\n$@\n"; } | |
| 313 else { | |
| 314 $_->{'start'} = $start; # Assign a new start coordinate to the contig | |
| 315 $_->{'iden'} += $numID; # and add new data to #identical, #conserved. | |
| 316 $_->{'cons'} += $numCons; | |
| 317 $overlap = 1; | |
| 318 } | |
| 319 } | |
| 320 | |
| 321 ## Test for overlap at end of contig. | |
| 322 if($stop > $_->{'stop'} and $start < ($_->{'stop'} - $max_overlap)) { | |
| 323 # print STDERR "----> Overlaps end: existing beg,end: $_->{'start'},$_->{'stop'}, new beg,end: $start,$stop\n"; | |
| 324 # Collect stats over the non-overlapping region. | |
| 325 eval { | |
| 326 ($numID,$numCons) = $hsp->matches(-SEQ =>$seqType, | |
| 327 -START =>$_->{'stop'}, | |
| 328 -STOP =>$stop); | |
| 329 }; | |
| 330 if($@) { warn "\a\n$@\n"; } | |
| 331 else { | |
| 332 $_->{'stop'} = $stop; # Assign a new stop coordinate to the contig | |
| 333 $_->{'iden'} += $numID; # and add new data to #identical, #conserved. | |
| 334 $_->{'cons'} += $numCons; | |
| 335 $overlap = 1; | |
| 336 } | |
| 337 } | |
| 338 $overlap && do { | |
| 339 # print STDERR " New Contig data:\n"; | |
| 340 # print STDERR " Contig: $_->{'start'} - $_->{'stop'}, iden= $_->{'iden'}, cons= $_->{'cons'}\n"; | |
| 341 last; | |
| 342 }; | |
| 343 } | |
| 344 ## If there is no overlap, add the complete HSP data. | |
| 345 !$overlap && do { | |
| 346 # print STDERR "No overlap. Adding new contig.\n"; | |
| 347 ($numID,$numCons) = $hsp->matches(-SEQ=>$seqType); | |
| 348 push @$contigs_ref, {'start'=>$start, 'stop'=>$stop, | |
| 349 'iden'=>$numID, 'cons'=>$numCons, | |
| 350 'strand'=>$strand, 'frame'=>$frame}; | |
| 351 }; | |
| 352 $overlap; | |
| 353 } | |
| 354 | |
| 355 =head2 get_exponent | |
| 356 | |
| 357 Usage : &get_exponent( number ); | |
| 358 Purpose : Determines the power of 10 exponent of an integer, float, | |
| 359 : or scientific notation number. | |
| 360 Example : &get_exponent("4.0e-206"); | |
| 361 : &get_exponent("0.00032"); | |
| 362 : &get_exponent("10."); | |
| 363 : &get_exponent("1000.0"); | |
| 364 : &get_exponent("e+83"); | |
| 365 Argument : Float, Integer, or scientific notation number | |
| 366 Returns : Integer representing the exponent part of the number (+ or -). | |
| 367 : If argument == 0 (zero), return value is "-999". | |
| 368 Comments : Exponents are rounded up (less negative) if the mantissa is >= 5. | |
| 369 : Exponents are rounded down (more negative) if the mantissa is <= -5. | |
| 370 | |
| 371 =cut | |
| 372 | |
| 373 #------------------ | |
| 374 sub get_exponent { | |
| 375 #------------------ | |
| 376 my $data = shift; | |
| 377 | |
| 378 my($num, $exp) = split /[eE]/, $data; | |
| 379 | |
| 380 if( defined $exp) { | |
| 381 $num = 1 if not $num; | |
| 382 $num >= 5 and $exp++; | |
| 383 $num <= -5 and $exp--; | |
| 384 } elsif( $num == 0) { | |
| 385 $exp = -999; | |
| 386 } elsif( not $num =~ /\./) { | |
| 387 $exp = CORE::length($num) -1; | |
| 388 } else { | |
| 389 $exp = 0; | |
| 390 $num .= '0' if $num =~ /\.$/; | |
| 391 my ($c); | |
| 392 my $rev = 0; | |
| 393 if($num !~ /^0/) { | |
| 394 $num = reverse($num); | |
| 395 $rev = 1; | |
| 396 } | |
| 397 do { $c = chop($num); | |
| 398 $c == 0 && $exp++; | |
| 399 } while( $c ne '.'); | |
| 400 | |
| 401 $exp = -$exp if $num == 0 and not $rev; | |
| 402 $exp -= 1 if $rev; | |
| 403 } | |
| 404 return $exp; | |
| 405 } | |
| 406 | |
| 407 =head2 collapse_nums | |
| 408 | |
| 409 Usage : @cnums = collapse_nums( @numbers ); | |
| 410 Purpose : Collapses a list of numbers into a set of ranges of consecutive terms: | |
| 411 : Useful for condensing long lists of consecutive numbers. | |
| 412 : EXPANDED: | |
| 413 : 1 2 3 4 5 6 10 12 13 14 15 17 18 20 21 22 24 26 30 31 32 | |
| 414 : COLLAPSED: | |
| 415 : 1-6 10 12-15 17 18 20-22 24 26 30-32 | |
| 416 Argument : List of numbers sorted numerically. | |
| 417 Returns : List of numbers mixed with ranges of numbers (see above). | |
| 418 Throws : n/a | |
| 419 | |
| 420 See Also : L<Bio::Search::Hit::BlastHit::seq_inds()|Bio::Search::Hit::BlastHit> | |
| 421 | |
| 422 =cut | |
| 423 | |
| 424 #------------------ | |
| 425 sub collapse_nums { | |
| 426 #------------------ | |
| 427 # This is probably not the slickest connectivity algorithm, but will do for now. | |
| 428 my @a = @_; | |
| 429 my ($from, $to, $i, @ca, $consec); | |
| 430 | |
| 431 $consec = 0; | |
| 432 for($i=0; $i < @a; $i++) { | |
| 433 not $from and do{ $from = $a[$i]; next; }; | |
| 434 if($a[$i] == $a[$i-1]+1) { | |
| 435 $to = $a[$i]; | |
| 436 $consec++; | |
| 437 } else { | |
| 438 if($consec == 1) { $from .= ",$to"; } | |
| 439 else { $from .= $consec>1 ? "\-$to" : ""; } | |
| 440 push @ca, split(',', $from); | |
| 441 $from = $a[$i]; | |
| 442 $consec = 0; | |
| 443 $to = undef; | |
| 444 } | |
| 445 } | |
| 446 if(defined $to) { | |
| 447 if($consec == 1) { $from .= ",$to"; } | |
| 448 else { $from .= $consec>1 ? "\-$to" : ""; } | |
| 449 } | |
| 450 push @ca, split(',', $from) if $from; | |
| 451 | |
| 452 @ca; | |
| 453 } | |
| 454 | |
| 455 | |
| 456 =head2 strip_blast_html | |
| 457 | |
| 458 Usage : $boolean = &strip_blast_html( string_ref ); | |
| 459 : This method is exported. | |
| 460 Purpose : Removes HTML formatting from a supplied string. | |
| 461 : Attempts to restore the Blast report to enable | |
| 462 : parsing by Bio::SearchIO::blast.pm | |
| 463 Returns : Boolean: true if string was stripped, false if not. | |
| 464 Argument : string_ref = reference to a string containing the whole Blast | |
| 465 : report containing HTML formatting. | |
| 466 Throws : Croaks if the argument is not a scalar reference. | |
| 467 Comments : Based on code originally written by Alex Dong Li | |
| 468 : (ali@genet.sickkids.on.ca). | |
| 469 : This method does some Blast-specific stripping | |
| 470 : (adds back a '>' character in front of each HSP | |
| 471 : alignment listing). | |
| 472 : | |
| 473 : THIS METHOD IS VERY SENSITIVE TO BLAST FORMATTING CHANGES! | |
| 474 : | |
| 475 : Removal of the HTML tags and accurate reconstitution of the | |
| 476 : non-HTML-formatted report is highly dependent on structure of | |
| 477 : the HTML-formatted version. For example, it assumes that first | |
| 478 : line of each alignment section (HSP listing) starts with a | |
| 479 : <a name=..> anchor tag. This permits the reconstruction of the | |
| 480 : original report in which these lines begin with a ">". | |
| 481 : This is required for parsing. | |
| 482 : | |
| 483 : If the structure of the Blast report itself is not intended to | |
| 484 : be a standard, the structure of the HTML-formatted version | |
| 485 : is even less so. Therefore, the use of this method to | |
| 486 : reconstitute parsable Blast reports from HTML-format versions | |
| 487 : should be considered a temorary solution. | |
| 488 | |
| 489 See Also : B<Bio::Search::Processor::BlastIO::new()> | |
| 490 | |
| 491 =cut | |
| 492 | |
| 493 #-------------------- | |
| 494 sub strip_blast_html { | |
| 495 #-------------------- | |
| 496 # This may not best way to remove html tags. However, it is simple. | |
| 497 # it won't work under following conditions: | |
| 498 # 1) if quoted > appears in a tag (does this ever happen?) | |
| 499 # 2) if a tag is split over multiple lines and this method is | |
| 500 # used to process one line at a time. | |
| 501 | |
| 502 my ($string_ref) = shift; | |
| 503 | |
| 504 ref $string_ref eq 'SCALAR' or | |
| 505 croak ("Can't strip HTML: ". | |
| 506 "Argument is should be a SCALAR reference not a ${\ref $string_ref}\n"); | |
| 507 | |
| 508 my $str = $$string_ref; | |
| 509 my $stripped = 0; | |
| 510 | |
| 511 # Removing "<a name =...>" and adding the '>' character for | |
| 512 # HSP alignment listings. | |
| 513 $str =~ s/(\A|\n)<a name ?=[^>]+> ?/>/sgi and $stripped = 1; | |
| 514 | |
| 515 # Removing all "<>" tags. | |
| 516 $str =~ s/<[^>]+>| //sgi and $stripped = 1; | |
| 517 | |
| 518 # Re-uniting any lone '>' characters. | |
| 519 $str =~ s/(\A|\n)>\s+/\n\n>/sgi and $stripped = 1; | |
| 520 | |
| 521 $$string_ref = $str; | |
| 522 $stripped; | |
| 523 } | |
| 524 | |
| 525 | |
| 526 1; | |
| 527 | |
| 528 |
