0
|
1 # $Id: SeqPattern.pm,v 1.14 2002/10/22 07:38:46 lapp Exp $
|
|
2 #
|
|
3 # bioperl module for Bio::Tools::SeqPattern
|
|
4 #
|
|
5 # Cared for by Steve Chervitz (sac@bioperl.org)
|
|
6 #
|
|
7 # Copyright Steve Chervitz
|
|
8 #
|
|
9 # You may distribute this module under the same terms as perl itself
|
|
10
|
|
11 # POD documentation - main docs before the code
|
|
12
|
|
13 =head1 NAME
|
|
14
|
|
15 Bio::Tools::SeqPattern - Bioperl object for a sequence pattern or motif
|
|
16
|
|
17 =head1 SYNOPSIS
|
|
18
|
|
19 =head2 Object Creation
|
|
20
|
|
21 use Bio::Tools::SeqPattern ();
|
|
22
|
|
23 $pat1 = 'T[GA]AA...TAAT';
|
|
24 $pattern1 = new Bio::Tools::SeqPattern(-SEQ =>$pattern, -TYPE =>'Dna');
|
|
25
|
|
26 $pat2 = '[VILM]R(GXX){3,2}...[^PG]';
|
|
27 $pattern2 = new Bio::Tools::SeqPattern(-SEQ =>$pattern, -TYPE =>'Amino');
|
|
28
|
|
29 =head1 DESCRIPTION
|
|
30
|
|
31 The Bio::Tools::SeqPattern.pm module encapsulates generic data and
|
|
32 methods for manipulating regular expressions describing nucleic or
|
|
33 amino acid sequence patterns (a.k.a, "motifs").
|
|
34
|
|
35 Bio::Tools::SeqPattern.pm is a concrete class that inherits from
|
|
36 B<Bio::Seq.pm>.
|
|
37
|
|
38 This class grew out of a need to have a standard module for doing routine
|
|
39 tasks with sequence patterns such as:
|
|
40
|
|
41 -- Forming a reverse-complement version of a nucleotide sequence pattern
|
|
42 -- Expanding patterns containing ambiguity codes
|
|
43 -- Checking for invalid regexp characters
|
|
44 -- Untainting yet preserving special characters in the pattern
|
|
45
|
|
46 Other features to look for in the future:
|
|
47
|
|
48 -- Full pattern syntax checking
|
|
49 -- Conversion between expanded and ondensed forms of the pattern
|
|
50
|
|
51 =head1 MOTIVATIONS
|
|
52
|
|
53 A key motivation for Bio::Tools::SeqPattern.pm is to have a way to
|
|
54 generate a reverse complement of a nucleotide sequence pattern.
|
|
55 This makes possible simultaneous pattern matching on both sense and
|
|
56 anti-sense strands of a query sequence.
|
|
57
|
|
58 In principle, one could do such a search more inefficiently by testing
|
|
59 against both sense and anti-sense versions of a sequence.
|
|
60 It is entirely equivalent to test a regexp containing both sense and
|
|
61 anti-sense versions of the *pattern* against one copy of the sequence.
|
|
62 The latter approach is much more efficient since:
|
|
63
|
|
64 1) You need only one copy of the sequence.
|
|
65 2) Only one regexp is executed.
|
|
66 3) Regexp patterns are typically much smaller than sequences.
|
|
67
|
|
68 Patterns can be quite complex and it is often difficult to
|
|
69 generate the reverse complement pattern. The Bioperl SeqPattern.pm
|
|
70 addresses this problem, providing a convenient set of tools
|
|
71 for working with biological sequence regular expressions.
|
|
72
|
|
73 Not all patterns have been tested. If you discover a pattern that
|
|
74 is not handled properly by Bio::Tools::SeqPattern.pm, please
|
|
75 send me some email (sac@bioperl.org). Thanks.
|
|
76
|
|
77 =head1 OTHER FEATURES
|
|
78
|
|
79 =head2 Extended Alphabet Support
|
|
80
|
|
81 This module supports the same set of ambiguity codes for nucleotide
|
|
82 sequences as supported by B<Bio::Seq.pm>. These ambiguity codes
|
|
83 define the behavior or the expand() method.
|
|
84
|
|
85 ------------------------------------------
|
|
86 Symbol Meaning Nucleic Acid
|
|
87 ------------------------------------------
|
|
88 A A Adenine
|
|
89 C C Cytosine
|
|
90 G G Guanine
|
|
91 T T Thymine
|
|
92 U U Uracil
|
|
93 M A or C
|
|
94 R A or G Any purine
|
|
95 W A or T
|
|
96 S C or G
|
|
97 Y C or T Any pyrimidine
|
|
98 K G or T
|
|
99 V A or C or G
|
|
100 H A or C or T
|
|
101 D A or G or T
|
|
102 B C or G or T
|
|
103 X G or A or T or C
|
|
104 N G or A or T or C
|
|
105 . G or A or T or C
|
|
106
|
|
107
|
|
108
|
|
109 ------------------------------------------
|
|
110 Symbol Meaning
|
|
111 ------------------------------------------
|
|
112 A Alanine
|
|
113 C Cysteine
|
|
114 D Aspartic Acid
|
|
115 E Glutamic Acid
|
|
116 F Phenylalanine
|
|
117 G Glycine
|
|
118 H Histidine
|
|
119 I Isoleucine
|
|
120 K Lysine
|
|
121 L Leucine
|
|
122 M Methionine
|
|
123 N Asparagine
|
|
124 P Proline
|
|
125 Q Glutamine
|
|
126 R Arginine
|
|
127 S Serine
|
|
128 T Threonine
|
|
129 V Valine
|
|
130 W Tryptophan
|
|
131 Y Tyrosine
|
|
132
|
|
133 B Aspartic Acid, Asparagine
|
|
134 Z Glutamic Acid, Glutamine
|
|
135 X Any amino acid
|
|
136 . Any amino acid
|
|
137
|
|
138
|
|
139 =head2 Multiple Format Support
|
|
140
|
|
141 Ultimately, this module should be able to build SeqPattern.pm objects
|
|
142 using a variety of pattern formats such as ProSite, Blocks, Prints, GCG, etc.
|
|
143 Currently, this module only supports patterns using a grep-like syntax.
|
|
144
|
|
145 =head1 USAGE
|
|
146
|
|
147 A simple demo script called seq_pattern.pl is included in the examples/
|
|
148 directory of the central Bioperl distribution.
|
|
149
|
|
150 =head1 SEE ALSO
|
|
151
|
|
152 L<Bio::Root::Object> - Base class.
|
|
153 L<Bio::Seq> - Lightweight sequence object.
|
|
154
|
|
155 http://bio.perl.org/Projects/modules.html - Online module documentation
|
|
156 http://bio.perl.org/ - Bioperl Project Homepage
|
|
157
|
|
158 =head1 FEEDBACK
|
|
159
|
|
160 =head2 Mailing Lists
|
|
161
|
|
162 User feedback is an integral part of the evolution of this and other
|
|
163 Bioperl modules. Send your comments and suggestions preferably to one
|
|
164 of the Bioperl mailing lists. Your participation is much appreciated.
|
|
165
|
|
166 bioperl-l@bioperl.org - General discussion
|
|
167 http://bio.perl.org/MailList.html - About the mailing lists
|
|
168
|
|
169 =head2 Reporting Bugs
|
|
170
|
|
171 Report bugs to the Bioperl bug tracking system to help us keep track
|
|
172 the bugs and their resolution. Bug reports can be submitted via email
|
|
173 or the web:
|
|
174
|
|
175 bioperl-bugs@bio.perl.org
|
|
176 http://bugzilla.bioperl.org/
|
|
177
|
|
178 =head1 AUTHOR
|
|
179
|
|
180 Steve Chervitz, sac@bioperl.org
|
|
181
|
|
182 =head1 VERSION
|
|
183
|
|
184 Bio::Tools::SeqPattern.pm, 0.011
|
|
185
|
|
186 =head1 COPYRIGHT
|
|
187
|
|
188 Copyright (c) 1997-8 Steve Chervitz. All Rights Reserved.
|
|
189 This module is free software; you can redistribute it and/or
|
|
190 modify it under the same terms as Perl itself.
|
|
191
|
|
192 =cut
|
|
193
|
|
194 #
|
|
195 ##
|
|
196 ###
|
|
197 #### END of main POD documentation.
|
|
198 ###
|
|
199 ##
|
|
200 #'
|
|
201 # CREATED : 28 Aug 1997
|
|
202
|
|
203
|
|
204 package Bio::Tools::SeqPattern;
|
|
205
|
|
206 use Bio::Root::Root;
|
|
207 @ISA = qw(Bio::Root::Root);
|
|
208 use strict;
|
|
209 use vars qw ($ID $VERSION);
|
|
210 $ID = 'Bio::Tools::SeqPattern';
|
|
211 $VERSION = 0.011;
|
|
212
|
|
213 ## These constants may be more appropriate in a Bio::Dictionary.pm
|
|
214 ## type of class.
|
|
215 my $PURINES = 'AG';
|
|
216 my $PYRIMIDINES = 'CT';
|
|
217 my $BEE = 'DN';
|
|
218 my $ZED = 'EQ';
|
|
219 my $Regexp_chars = '\w,.\*()\[\]<>\{\}^\$'; # quoted for use in regexps
|
|
220
|
|
221 ## Package variables used in reverse complementing.
|
|
222 my (%Processed_braces, %Processed_asterics);
|
|
223
|
|
224 #####################################################################################
|
|
225 ## CONSTRUCTOR ##
|
|
226 #####################################################################################
|
|
227
|
|
228
|
|
229 =head1 new
|
|
230
|
|
231 Title : new
|
|
232 Usage : my $seqpat = new Bio::Tools::SeqPattern();
|
|
233 Purpose : Verifies that the type is correct for superclass (Bio::Seq.pm)
|
|
234 : and calls superclass constructor last.
|
|
235 Returns : n/a
|
|
236 Argument : Parameters passed to new()
|
|
237 Throws : Exception if the pattern string (seq) is empty.
|
|
238 Comments : The process of creating a new SeqPattern.pm object
|
|
239 : ensures that the pattern string is untained.
|
|
240
|
|
241 See Also : B<Bio::Root::Root::new()>,
|
|
242 B<Bio::Seq::_initialize()>
|
|
243
|
|
244 =cut
|
|
245
|
|
246 #----------------
|
|
247 sub new {
|
|
248 #----------------
|
|
249 my($class, %param) = @_;
|
|
250
|
|
251 my $self = $class->SUPER::new(%param);
|
|
252 my ($seq,$type) = $self->_rearrange([qw(SEQ TYPE)], %param);
|
|
253
|
|
254 $seq || $self->throw("Empty pattern.");
|
|
255 my $t;
|
|
256 # Get the type ready for Bio::Seq.pm
|
|
257 if ($type =~ /nuc|[dr]na/i) {
|
|
258 $t = 'Dna';
|
|
259 } elsif ($type =~ /amino|pep|prot/i) {
|
|
260 $t = 'Amino';
|
|
261 }
|
|
262 $seq =~ tr/a-z/A-Z/; #ps 8/8/00 Canonicalize to upper case
|
|
263 $self->str($seq);
|
|
264 $self->type($t);
|
|
265
|
|
266 return $self;
|
|
267 }
|
|
268
|
|
269
|
|
270 =head1 alphabet_ok
|
|
271
|
|
272 Title : alphabet_ok
|
|
273 Usage : $mypat->alphabet_ok;
|
|
274 Purpose : Checks for invalid regexp characters.
|
|
275 : Overrides Bio::Seq::alphabet_ok() to allow
|
|
276 : additional regexp characters ,.*()[]<>{}^$
|
|
277 : in addition to the standard genetic alphabet.
|
|
278 : Also untaints the pattern and sets the sequence
|
|
279 : object's sequence to the untained string.
|
|
280 Returns : Boolean (1 | 0)
|
|
281 Argument : n/a
|
|
282 Throws : Exception if the pattern contains invalid characters.
|
|
283 Comments : Does not call the superclass method.
|
|
284 : Actually permits any alphanumeric, not just the
|
|
285 : standard genetic alphabet.
|
|
286
|
|
287 =cut
|
|
288
|
|
289 #----------------'
|
|
290 sub alphabet_ok {
|
|
291 #----------------
|
|
292 my( $self) = @_;
|
|
293
|
|
294 return 1 if $self->{'_alphabet_checked'};
|
|
295
|
|
296 $self->{'_alphabet_checked'} = 1;
|
|
297
|
|
298 my $pat = $self->seq();
|
|
299
|
|
300 if($pat =~ /[^$Regexp_chars]/io) {
|
|
301 $self->throw("Pattern contains invalid characters: $pat",
|
|
302 'Legal characters: a-z,A-Z,0-9,,.*()[]<>{}^$ ');
|
|
303 }
|
|
304
|
|
305 # Untaint pattern (makes code taint-safe).
|
|
306 $pat =~ /[$Regexp_chars]+/io;
|
|
307 $self->setseq(uc($&));
|
|
308 # print STDERR "\npattern ok: $pat\n";
|
|
309 1;
|
|
310 }
|
|
311
|
|
312 =head1 expand
|
|
313
|
|
314 Title : expand
|
|
315 Usage : $seqpat_object->expand();
|
|
316 Purpose : Expands the sequence pattern using special ambiguity codes.
|
|
317 Example : $pat = $seq_pat->expand();
|
|
318 Returns : String containing fully expanded sequence pattern
|
|
319 Argument : n/a
|
|
320 Throws : Exception if sequence type is not recognized
|
|
321 : (i.e., is not one of [DR]NA, Amino)
|
|
322
|
|
323 See Also : B<Extended Alphabet Support>, L<_expand_pep>(), L<_expand_nuc>()
|
|
324
|
|
325 =cut
|
|
326
|
|
327 #----------
|
|
328 sub expand {
|
|
329 #----------
|
|
330 my $self = shift;
|
|
331
|
|
332 if($self->type =~ /[DR]na/i) { $self->_expand_nuc(); }
|
|
333 elsif($self->type =~ /Amino/i) { $self->_expand_pep(); }
|
|
334 else{
|
|
335 $self->throw("Don't know how to expand ${\$self->type} patterns.\n");
|
|
336 }
|
|
337 }
|
|
338
|
|
339
|
|
340 =head1 _expand_pep
|
|
341
|
|
342 Title : _expand_pep
|
|
343 Usage : n/a; automatically called by expand()
|
|
344 Purpose : Expands peptide patterns
|
|
345 Returns : String (the expanded pattern)
|
|
346 Argument : String (the unexpanded pattern)
|
|
347 Throws : n/a
|
|
348
|
|
349 See Also : L<expand>(), L<_expand_nuc>()
|
|
350
|
|
351 =cut
|
|
352
|
|
353 #----------------
|
|
354 sub _expand_pep {
|
|
355 #----------------
|
|
356 my ($self,$pat) = @_;
|
|
357 $pat ||= $self->str;
|
|
358 $pat =~ s/X/./g;
|
|
359 $pat =~ s/^</\^/;
|
|
360 $pat =~ s/>$/\$/;
|
|
361
|
|
362 ## Avoid nested situations: [bmnq] --/--> [[$ZED]mnq]
|
|
363 ## Yet correctly deal with: fze[bmnq] ---> f[$BEE]e[$ZEDmnq]
|
|
364 if($pat =~ /\[\w*[BZ]\w*\]/) {
|
|
365 $pat =~ s/\[(\w*)B(\w*)\]/\[$1$ZED$2\]/g;
|
|
366 $pat =~ s/\[(\w*)Z(\w*)\]/\[$1$BEE$2\]/g;
|
|
367 $pat =~ s/B/\[$ZED\]/g;
|
|
368 $pat =~ s/Z/\[$BEE\]/g;
|
|
369 } else {
|
|
370 $pat =~ s/B/\[$ZED\]/g;
|
|
371 $pat =~ s/Z/\[$BEE\]/g;
|
|
372 }
|
|
373 $pat =~ s/\((.)\)/$1/g; ## Doing these last since:
|
|
374 $pat =~ s/\[(.)\]/$1/g; ## Pattern could contain [B] (for example)
|
|
375
|
|
376 return $pat;
|
|
377 }
|
|
378
|
|
379
|
|
380
|
|
381 =head1 _expand_nuc
|
|
382
|
|
383 Title : _expand_nuc
|
|
384 Purpose : Expands nucleotide patterns
|
|
385 Returns : String (the expanded pattern)
|
|
386 Argument : String (the unexpanded pattern)
|
|
387 Throws : n/a
|
|
388
|
|
389 See Also : L<expand>(), L<_expand_pep>()
|
|
390
|
|
391 =cut
|
|
392
|
|
393 #---------------
|
|
394 sub _expand_nuc {
|
|
395 #---------------
|
|
396 my ($self,$pat) = @_;
|
|
397
|
|
398 $pat ||= $self->str;
|
|
399 $pat =~ s/N|X/./g;
|
|
400 $pat =~ s/pu/R/ig;
|
|
401 $pat =~ s/py/Y/ig;
|
|
402 $pat =~ s/U/T/g;
|
|
403 $pat =~ s/^</\^/;
|
|
404 $pat =~ s/>$/\$/;
|
|
405
|
|
406 ## Avoid nested situations: [ya] --/--> [[ct]a]
|
|
407 ## Yet correctly deal with: sg[ya] ---> [gc]g[cta]
|
|
408 if($pat =~ /\[\w*[RYSWMK]\w*\]/) {
|
|
409 $pat =~ s/\[(\w*)R(\w*)\]/\[$1$PURINES$2\]/g;
|
|
410 $pat =~ s/\[(\w*)Y(\w*)\]/\[$1$PYRIMIDINES$2\]/g;
|
|
411 $pat =~ s/\[(\w*)S(\w*)\]/\[$1GC$2\]/g;
|
|
412 $pat =~ s/\[(\w*)W(\w*)\]/\[$1AT$2\]/g;
|
|
413 $pat =~ s/\[(\w*)M(\w*)\]/\[$1AC$2\]/g;
|
|
414 $pat =~ s/\[(\w*)K(\w*)\]/\[$1GT$2\]/g;
|
|
415 $pat =~ s/\[(\w*)V(\w*)\]/\[$1ACG$2\]/g;
|
|
416 $pat =~ s/\[(\w*)H(\w*)\]/\[$1ACT$2\]/g;
|
|
417 $pat =~ s/\[(\w*)D(\w*)\]/\[$1AGT$2\]/g;
|
|
418 $pat =~ s/\[(\w*)B(\w*)\]/\[$1CGT$2\]/g;
|
|
419 $pat =~ s/R/\[$PURINES\]/g;
|
|
420 $pat =~ s/Y/\[$PYRIMIDINES\]/g;
|
|
421 $pat =~ s/S/\[GC\]/g;
|
|
422 $pat =~ s/W/\[AT\]/g;
|
|
423 $pat =~ s/M/\[AC\]/g;
|
|
424 $pat =~ s/K/\[GT\]/g;
|
|
425 $pat =~ s/V/\[ACG\]/g;
|
|
426 $pat =~ s/H/\[ACT\]/g;
|
|
427 $pat =~ s/D/\[AGT\]/g;
|
|
428 $pat =~ s/B/\[CGT\]/g;
|
|
429 } else {
|
|
430 $pat =~ s/R/\[$PURINES\]/g;
|
|
431 $pat =~ s/Y/\[$PYRIMIDINES\]/g;
|
|
432 $pat =~ s/S/\[GC\]/g;
|
|
433 $pat =~ s/W/\[AT\]/g;
|
|
434 $pat =~ s/M/\[AC\]/g;
|
|
435 $pat =~ s/K/\[GT\]/g;
|
|
436 $pat =~ s/V/\[ACG\]/g;
|
|
437 $pat =~ s/H/\[ACT\]/g;
|
|
438 $pat =~ s/D/\[AGT\]/g;
|
|
439 $pat =~ s/B/\[CGT\]/g;
|
|
440 }
|
|
441 $pat =~ s/\((.)\)/$1/g; ## Doing thses last since:
|
|
442 $pat =~ s/\[(.)\]/$1/g; ## Pattern could contain [y] (for example)
|
|
443
|
|
444 return $pat;
|
|
445 }
|
|
446
|
|
447
|
|
448
|
|
449 =head1 revcom
|
|
450
|
|
451 Title : revcom
|
|
452 Usage : revcom([1]);
|
|
453 Purpose : Forms a pattern capable of recognizing the reverse complement
|
|
454 : version of a nucleotide sequence pattern.
|
|
455 Example : $pattern_object->revcom();
|
|
456 : $pattern_object->revcom(1); ## returns expanded rev complement pattern.
|
|
457 Returns : Object reference for a new Bio::Tools::SeqPattern containing
|
|
458 : the revcom of the current pattern as its sequence.
|
|
459 Argument : (1) boolean (optional) (default= false)
|
|
460 : true : expand the pattern before rev-complementing.
|
|
461 : false: don't expand pattern before or after rev-complementing.
|
|
462 Throws : Exception if called for amino acid sequence pattern.
|
|
463 Comments : This method permits the simultaneous searching of both
|
|
464 : sense and anti-sense versions of a nucleotide pattern
|
|
465 : by means of a grep-type of functionality in which any
|
|
466 : number of patterns may be or-ed into the recognition
|
|
467 : pattern.
|
|
468 : Overrides Bio::Seq::revcom() and calls it first thing.
|
|
469 : The order of _fixpat() calls is critical.
|
|
470
|
|
471 See Also : B<Bio::Seq::revcom()>, L<_fixpat_1>(), L<_fixpat_2>(), L<_fixpat_3>(), L<_fixpat_4>(), L<_fixpat_5>()
|
|
472
|
|
473 =cut
|
|
474
|
|
475 #-----------'
|
|
476 sub revcom {
|
|
477 #-----------
|
|
478 my($self,$expand) = @_;
|
|
479
|
|
480 if ($self->type !~ /Dna|Rna/i) {
|
|
481 $self->throw("Can't get revcom for ${\$self->type} sequence types.\n");
|
|
482 }
|
|
483 # return $self->{'_rev'} if defined $self->{'_rev'};
|
|
484
|
|
485 $expand ||= 0;
|
|
486 my $str = $self->str;
|
|
487 $str =~ tr/acgtrymkswhbvdnxACGTRYMKSWHBVDNX/tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX/;
|
|
488 my $rev = CORE::reverse $str;
|
|
489 $rev =~ tr/[](){}<>/][)(}{></;
|
|
490
|
|
491 if($expand) {
|
|
492 $rev = $self->_expand_nuc($rev);
|
|
493 # print "\nExpanded: $rev\n";
|
|
494 }
|
|
495
|
|
496 %Processed_braces = ();
|
|
497 %Processed_asterics = ();
|
|
498
|
|
499 my $fixrev = _fixpat_1($rev);
|
|
500 # print "FIX 1: $fixrev";<STDIN>;
|
|
501
|
|
502 $fixrev = _fixpat_2($fixrev);
|
|
503 # print "FIX 2: $fixrev";<STDIN>;
|
|
504
|
|
505 $fixrev = _fixpat_3($fixrev);
|
|
506 # print "FIX 3: $fixrev";<STDIN>;
|
|
507
|
|
508 $fixrev = _fixpat_4($fixrev);
|
|
509 # print "FIX 4: $fixrev";<STDIN>;
|
|
510
|
|
511 $fixrev = _fixpat_5($fixrev);
|
|
512 # print "FIX 5: $fixrev";<STDIN>;
|
|
513
|
|
514 ##### Added by ps 8/7/00 to allow non-greedy matching
|
|
515 $fixrev = _fixpat_6($fixrev);
|
|
516 # print "FIX 6: $fixrev";<STDIN>;
|
|
517
|
|
518 # $self->{'_rev'} = $fixrev;
|
|
519
|
|
520 return new Bio::Tools::SeqPattern(-seq =>$fixrev, -type =>$self->type);
|
|
521 }
|
|
522
|
|
523
|
|
524
|
|
525 =head1 _fixpat_1
|
|
526
|
|
527 Title : _fixpat_1
|
|
528 Usage : n/a; called automatically by revcom()
|
|
529 Purpose : Utility method for revcom()
|
|
530 : Converts all {7,5} --> {5,7} (Part I)
|
|
531 : and [T^] --> [^T] (Part II)
|
|
532 : and *N --> N* (Part III)
|
|
533 Returns : String (the new, partially reversed pattern)
|
|
534 Argument : String (the expanded pattern)
|
|
535 Throws : n/a
|
|
536
|
|
537 See Also : L<revcom>()
|
|
538
|
|
539 =cut
|
|
540
|
|
541 #--------------
|
|
542 sub _fixpat_1 {
|
|
543 #--------------
|
|
544 my $pat = shift;
|
|
545
|
|
546 ## Part I:
|
|
547 my (@done,@parts);
|
|
548 while(1) {
|
|
549 $pat =~ /(.*)\{(\S+?)\}(.*)/ or do{ push @done, $pat; last; };
|
|
550 $pat = $1.'#{'.reverse($2).'}'.$3;
|
|
551 # print "1: $1\n2: $2\n3: $3\n";
|
|
552 # print "modified pat: $pat";<STDIN>;
|
|
553 @parts = split '#', $pat;
|
|
554 push @done, $parts[1];
|
|
555 $pat = $parts[0];
|
|
556 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
|
|
557 last if not $pat;
|
|
558 }
|
|
559 $pat = join('', reverse @done);
|
|
560
|
|
561 ## Part II:
|
|
562 @done = ();
|
|
563 while(1) {
|
|
564 $pat =~ /(.*)\[(\S+?)\](.*)/ or do{ push @done, $pat; last; };
|
|
565 $pat = $1.'#['.reverse($2).']'.$3;
|
|
566 # print "1: $1\n2: $2\n3: $3\n";
|
|
567 # print "modified pat: $pat";<STDIN>;
|
|
568 @parts = split '#', $pat;
|
|
569 push @done, $parts[1];
|
|
570 $pat = $parts[0];
|
|
571 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
|
|
572 last if not $pat;
|
|
573 }
|
|
574 $pat = join('', reverse @done);
|
|
575
|
|
576 ## Part III:
|
|
577 @done = ();
|
|
578 while(1) {
|
|
579 $pat =~ /(.*)\*([\w.])(.*)/ or do{ push @done, $pat; last; };
|
|
580 $pat = $1.'#'.$2.'*'.$3;
|
|
581 $Processed_asterics{$2}++;
|
|
582 # print "1: $1\n2: $2\n3: $3\n";
|
|
583 # print "modified pat: $pat";<STDIN>;
|
|
584 @parts = split '#', $pat;
|
|
585 push @done, $parts[1];
|
|
586 $pat = $parts[0];
|
|
587 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
|
|
588 last if not $pat;
|
|
589 }
|
|
590 return join('', reverse @done);
|
|
591
|
|
592 }
|
|
593
|
|
594
|
|
595 =head1 _fixpat_2
|
|
596
|
|
597 Title : _fixpat_2
|
|
598 Usage : n/a; called automatically by revcom()
|
|
599 Purpose : Utility method for revcom()
|
|
600 : Converts all {5,7}Y ---> Y{5,7}
|
|
601 : and {10,}. ---> .{10,}
|
|
602 Returns : String (the new, partially reversed pattern)
|
|
603 Argument : String (the expanded, partially reversed pattern)
|
|
604 Throws : n/a
|
|
605
|
|
606 See Also : L<revcom>()
|
|
607
|
|
608 =cut
|
|
609
|
|
610 #--------------
|
|
611 sub _fixpat_2 {
|
|
612 #--------------
|
|
613 my $pat = shift;
|
|
614
|
|
615 local($^W) = 0;
|
|
616 my (@done,@parts,$braces);
|
|
617 while(1) {
|
|
618 # $pat =~ s/(.*)([^])])(\{\S+?\})([\w.])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
|
|
619 $pat =~ s/(.*)(\{\S+?\})([\w.])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
|
|
620 $braces = $2;
|
|
621 $braces =~ s/[{}]//g;
|
|
622 $Processed_braces{"$3$braces"}++;
|
|
623 # print "modified pat: $pat";<STDIN>;
|
|
624 @parts = split '#', $pat;
|
|
625 push @done, $parts[1];
|
|
626 $pat = $parts[0];
|
|
627 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
|
|
628 last if not $pat;
|
|
629 }
|
|
630 return join('', reverse @done);
|
|
631 }
|
|
632
|
|
633
|
|
634 =head1 _fixpat_3
|
|
635
|
|
636 Title : _fixpat_3
|
|
637 Usage : n/a; called automatically by revcom()
|
|
638 Purpose : Utility method for revcom()
|
|
639 : Converts all {5,7}(XXX) ---> (XXX){5,7}
|
|
640 Returns : String (the new, partially reversed pattern)
|
|
641 Argument : String (the expanded, partially reversed pattern)
|
|
642 Throws : n/a
|
|
643
|
|
644 See Also : L<revcom>()
|
|
645
|
|
646 =cut
|
|
647
|
|
648 #-------------
|
|
649 sub _fixpat_3 {
|
|
650 #-------------
|
|
651 my $pat = shift;
|
|
652
|
|
653 my (@done,@parts,$braces,$newpat,$oldpat);
|
|
654 while(1) {
|
|
655 # $pat =~ s/(.+)(\{\S+\})(\(\w+\))(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
|
|
656 if( $pat =~ /(.*)(.)(\{\S+\})(\(\w+\))(.*)/) {
|
|
657 $newpat = "$1#$2$4$3$5";
|
|
658 ##ps $oldpat = "$1#$2$3$4$5";
|
|
659 # print "1: $1\n2: $2\n3: $3\n4: $4\n5: $5\n";
|
|
660 ##ps $braces = $3;
|
|
661 ##ps $braces =~ s/[{}]//g;
|
|
662 ##ps if( exists $Processed_braces{"$2$braces"} || exists $Processed_asterics{$2}) {
|
|
663 ##ps $pat = $oldpat; # Don't change it. Already processed.
|
|
664 # print "saved pat: $pat";<STDIN>;
|
|
665 ##ps } else {
|
|
666 # print "new pat: $newpat";<STDIN>;
|
|
667 $pat = $newpat; # Change it.
|
|
668 ##ps }
|
|
669 } elsif( $pat =~ /^(\{\S+\})(\(\w+\))(.*)/) {
|
|
670 $pat = "#$2$1$3";
|
|
671 } else {
|
|
672 push @done, $pat; last;
|
|
673 }
|
|
674 @parts = split '#', $pat;
|
|
675 push @done, $parts[1];
|
|
676 $pat = $parts[0];
|
|
677 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
|
|
678 last if not $pat;
|
|
679 }
|
|
680 return join('', reverse @done);
|
|
681 }
|
|
682
|
|
683
|
|
684 =head1 _fixpat_4
|
|
685
|
|
686 Title : _fixpat_4
|
|
687 Usage : n/a; called automatically by revcom()
|
|
688 Purpose : Utility method for revcom()
|
|
689 : Converts all {5,7}[XXX] ---> [XXX]{5,7}
|
|
690 Returns : String (the new, partially reversed pattern)
|
|
691 Argument : String (the expanded, partially reversed pattern)
|
|
692 Throws : n/a
|
|
693
|
|
694 See Also : L<revcom>()
|
|
695
|
|
696 =cut
|
|
697
|
|
698 #---------------
|
|
699 sub _fixpat_4 {
|
|
700 #---------------
|
|
701 my $pat = shift;
|
|
702
|
|
703 my (@done,@parts,$braces,$newpat,$oldpat);
|
|
704 while(1) {
|
|
705 # $pat =~ s/(.*)(\{\S+\})(\[\w+\])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
|
|
706 # $pat =~ s/(.*)([^\w.])(\{\S+\})(\[\w+\])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
|
|
707 if( $pat =~ /(.*)(.)(\{\S+\})(\[\w+\])(.*)/) {
|
|
708 $newpat = "$1#$2$4$3$5";
|
|
709 $oldpat = "$1#$2$3$4$5";
|
|
710 # print "1: $1\n2: $2\n3: $3\n4: $4\n5: $5\n";
|
|
711 $braces = $3;
|
|
712 $braces =~ s/[{}]//g;
|
|
713 if( (defined $braces and defined $2) and
|
|
714 exists $Processed_braces{"$2$braces"} || exists $Processed_asterics{$2}) {
|
|
715 $pat = $oldpat; # Don't change it. Already processed.
|
|
716 # print "saved pat: $pat";<STDIN>;
|
|
717 } else {
|
|
718 $pat = $newpat; # Change it.
|
|
719 # print "new pat: $pat";<STDIN>;
|
|
720 }
|
|
721 } elsif( $pat =~ /^(\{\S+\})(\[\w+\])(.*)/) {
|
|
722 $pat = "#$2$1$3";
|
|
723 } else {
|
|
724 push @done, $pat; last;
|
|
725 }
|
|
726
|
|
727 @parts = split '#', $pat;
|
|
728 push @done, $parts[1];
|
|
729 $pat = $parts[0];
|
|
730 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
|
|
731 last if not $pat;
|
|
732 }
|
|
733 return join('', reverse @done);
|
|
734 }
|
|
735
|
|
736
|
|
737 =head1 _fixpat_5
|
|
738
|
|
739 Title : _fixpat_5
|
|
740 Usage : n/a; called automatically by revcom()
|
|
741 Purpose : Utility method for revcom()
|
|
742 : Converts all *[XXX] ---> [XXX]*
|
|
743 : and *(XXX) ---> (XXX)*
|
|
744 Returns : String (the new, partially reversed pattern)
|
|
745 Argument : String (the expanded, partially reversed pattern)
|
|
746 Throws : n/a
|
|
747
|
|
748 See Also : L<revcom>()
|
|
749
|
|
750 =cut
|
|
751
|
|
752 #--------------
|
|
753 sub _fixpat_5 {
|
|
754 #--------------
|
|
755 my $pat = shift;
|
|
756
|
|
757 my (@done,@parts,$newpat,$oldpat);
|
|
758 while(1) {
|
|
759 # $pat =~ s/(.*)(\{\S+\})(\[\w+\])(.*)/$1#$3$2$4/ or do{ push @done, $pat; last; };
|
|
760 # $pat =~ s/(.*)([^\w.])(\{\S+\})(\[\w+\])(.*)/$1$2#$4$3$5/ or do{ push @done, $pat; last; };
|
|
761 if( $pat =~ /(.*)(.)\*(\[\w+\]|\(\w+\))(.*)/) {
|
|
762 $newpat = "$1#$2$3*$4";
|
|
763 $oldpat = "$1#$2*$3$4";
|
|
764 # print "1: $1\n2: $2\n3: $3\n4: $4\n";
|
|
765 if( exists $Processed_asterics{$2}) {
|
|
766 $pat = $oldpat; # Don't change it. Already processed.
|
|
767 # print "saved pat: $pat";<STDIN>;
|
|
768 } else {
|
|
769 $pat = $newpat; # Change it.
|
|
770 # print "new pat: $pat";<STDIN>;
|
|
771 }
|
|
772 } elsif( $pat =~ /^\*(\[\w+\]|\(\w+\))(.*)/) {
|
|
773 $pat = "#$1*$3";
|
|
774 } else {
|
|
775 push @done, $pat; last;
|
|
776 }
|
|
777
|
|
778 @parts = split '#', $pat;
|
|
779 push @done, $parts[1];
|
|
780 $pat = $parts[0];
|
|
781 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
|
|
782 last if not $pat;
|
|
783 }
|
|
784 return join('', reverse @done);
|
|
785 }
|
|
786
|
|
787
|
|
788
|
|
789
|
|
790
|
|
791 ############################
|
|
792 #
|
|
793 # PS: Added 8/7/00 to allow non-greedy matching patterns
|
|
794 #
|
|
795 ######################################
|
|
796
|
|
797 =head1 _fixpat_6
|
|
798
|
|
799 Title : _fixpat_6
|
|
800 Usage : n/a; called automatically by revcom()
|
|
801 Purpose : Utility method for revcom()
|
|
802 : Converts all ?Y{5,7} ---> Y{5,7}?
|
|
803 : and ?(XXX){5,7} ---> (XXX){5,7}?
|
|
804 : and ?[XYZ]{5,7} ---> [XYZ]{5,7}?
|
|
805 Returns : String (the new, partially reversed pattern)
|
|
806 Argument : String (the expanded, partially reversed pattern)
|
|
807 Throws : n/a
|
|
808
|
|
809 See Also : L<revcom>()
|
|
810
|
|
811 =cut
|
|
812
|
|
813 #--------------
|
|
814 sub _fixpat_6 {
|
|
815 #--------------
|
|
816 my $pat = shift;
|
|
817 my (@done,@parts);
|
|
818
|
|
819 @done = ();
|
|
820 while(1) {
|
|
821 $pat =~ /(.*)\?(\[\w+\]|\(\w+\)|\w)(\{\S+?\})?(.*)/ or do{ push @done, $pat; last; };
|
|
822 my $quantifier = $3 ? $3 : ""; # Shut up warning if no explicit quantifier
|
|
823 $pat = $1.'#'.$2.$quantifier.'?'.$4;
|
|
824 # $pat = $1.'#'.$2.$3.'?'.$4;
|
|
825
|
|
826 # print "1: $1\n2: $2\n3: $3\n";
|
|
827 # print "modified pat: $pat";<STDIN>;
|
|
828 @parts = split '#', $pat;
|
|
829 push @done, $parts[1];
|
|
830 $pat = $parts[0];
|
|
831 # print "done: $parts[1]<---\nnew pat: $pat<---";<STDIN>;
|
|
832 last if not $pat;
|
|
833 }
|
|
834 return join('', reverse @done);
|
|
835
|
|
836 }
|
|
837
|
|
838 =head2 str
|
|
839
|
|
840 Title : str
|
|
841 Usage : $obj->str($newval)
|
|
842 Function:
|
|
843 Returns : value of str
|
|
844 Args : newvalue (optional)
|
|
845
|
|
846
|
|
847 =cut
|
|
848
|
|
849 sub str{
|
|
850 my $obj = shift;
|
|
851 if( @_ ) {
|
|
852 my $value = shift;
|
|
853 $obj->{'str'} = $value;
|
|
854 }
|
|
855 return $obj->{'str'};
|
|
856
|
|
857 }
|
|
858
|
|
859 =head2 type
|
|
860
|
|
861 Title : type
|
|
862 Usage : $obj->type($newval)
|
|
863 Function:
|
|
864 Returns : value of type
|
|
865 Args : newvalue (optional)
|
|
866
|
|
867
|
|
868 =cut
|
|
869
|
|
870 sub type{
|
|
871 my $obj = shift;
|
|
872 if( @_ ) {
|
|
873 my $value = shift;
|
|
874 $obj->{'type'} = $value;
|
|
875 }
|
|
876 return $obj->{'type'};
|
|
877
|
|
878 }
|
|
879
|
|
880 1;
|
|
881
|
|
882 __END__
|
|
883
|
|
884 #########################################################################
|
|
885 # End of class
|
|
886 #########################################################################
|
|
887
|
|
888 =head1 FOR DEVELOPERS ONLY
|
|
889
|
|
890 =head2 Data Members
|
|
891
|
|
892 Information about the various data members of this module is provided
|
|
893 for those wishing to modify or understand the code. Two things to bear
|
|
894 in mind:
|
|
895
|
|
896 =over 2
|
|
897
|
|
898 =item 1 Do NOT rely on these in any code outside of this module.
|
|
899
|
|
900 All data members are prefixed with an underscore to signify that they
|
|
901 are private. Always use accessor methods. If the accessor doesn't
|
|
902 exist or is inadequate, create or modify an accessor (and let me know,
|
|
903 too!).
|
|
904
|
|
905 =item 2 This documentation may be incomplete and out of date.
|
|
906
|
|
907 It is easy for this documentation to become obsolete as this module is
|
|
908 still evolving. Always double check this info and search for members
|
|
909 not described here.
|
|
910
|
|
911 =back
|
|
912
|
|
913 An instance of Bio::Tools::RestrictionEnzyme.pm is a blessed reference
|
|
914 to a hash containing all or some of the following fields:
|
|
915
|
|
916 FIELD VALUE
|
|
917 ------------------------------------------------------------------------
|
|
918 _rev : The corrected reverse complement of the fully expanded pattern.
|
|
919
|
|
920 INHERITED DATA MEMBERS:
|
|
921
|
|
922 _seq : (From Bio::Seq.pm) The original, unexpanded input sequence after untainting.
|
|
923 _type : (From Bio::Seq.pm) 'Dna' or 'Amino'
|
|
924
|
|
925
|
|
926 =cut
|