annotate logo.pm @ 3:a0b27058dcac draft

Uploaded
author davidvanzessen
date Wed, 17 Sep 2014 07:25:17 -0400
parents 2f4298673519
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
2
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
1 #!/usr/local/bin/perl -w
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
2
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
3 =head1 NAME
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
4
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
5 logo.pm - organizes data in FASTA and CLUSTAL formats into height data.
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
6
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
7 =head1 SYNOPSIS
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
8
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
9 Perl module
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
10
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
11 =head1 DESCRIPTION
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
12
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
13 logo.pm: Takes in strings of aligned sequences and sorts them vertically
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
14 based on height as assigned by the following equations found in
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
15 Schneider and Stephens paper "Sequence Logos: A New Way to Display
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
16 Consensus Sequences":
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
17
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
18 height = f(b,l) * R(l) (1)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
19
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
20 where f(b,l) is the frequency of base or amino acid "b" at position
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
21 "l". R(l) is amount of information present at position "l" and can
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
22 be quantified as follows:
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
23
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
24 R(l) for amino acids = log(20) - (H(l) + e(n)) (2a)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
25 R(l) for nucleic acids = 2 - (H(l) + e(n)) (2b)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
26
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
27 where log is taken base 2, H(l) is the uncertainty at position "l",
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
28 and e(n) is the error correction factor for small "n". H(l) is
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
29 computed as follows:
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
30
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
31 H(l) = - (Sum f(b,l) * log[ f(b,l) ]) (3)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
32
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
33 where again, log is taken base 2. f(b,l) is the frequency of base
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
34 "b" at position "l". The sum is taken over all amino acids or
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
35 bases, depending on which the data is.
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
36
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
37 Currently, logo.pm uses an approximation for e(n), given by:
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
38
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
39 e(n) = (s-1) / (2 * ln2 * n) (4)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
40
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
41 Where s is 4 for nucleotides, 20 for amino acids ; n is the number
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
42 of sequences in the alignment. e(n) also gives the height of error
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
43 bars.
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
44
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
45 =cut
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
46
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
47 package logo;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
48
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
49 use strict;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
50 use Carp;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
51
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
52 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
53 ###### SOME VARIABLES ######
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
54 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
55
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
56 my $DEBUG = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
57
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
58 my $AA = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
59 my $NA = 1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
60
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
61 my %BASES = ("a" => "adenine",
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
62 "t" => "thymine",
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
63 "g" => "guanine",
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
64 "c" => "cytosine",
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
65 "u" => "uracil");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
66
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
67 # does not include B or Z
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
68 my %AMINOACIDS = ("a" => "", "c" => "", "d" => "", "e" => "", "f" => "",
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
69 "g" => "", "h" => "", "i" => "", "k" => "", "l" => "",
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
70 "m" => "", "n" => "", "p" => "", "q" => "", "r" => "",
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
71 "s" => "", "t" => "", "v" => "", "w" => "", "y" => "");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
72
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
73 my @data;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
74 my $kind;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
75 my ($seqs_r, $desc_r);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
76
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
77 my $CONFIDENCE_LIMIT = 0.90;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
78
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
79 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
80 ###### SOME FUNCTIONS ######
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
81 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
82
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
83 =head1 APPENDIX
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
84
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
85 =cut
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
86
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
87 =head2 getHeightData()
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
88
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
89 Usage : my ($height_data_r, $description_data_r, $kind) =
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
90 logo::getHeightData($input_data_r, $params);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
91 Returns : * REFERENCE TO array of height data
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
92 * REFERENCE TO array of input description strings
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
93 * $AA if the data is amino acid, $NA otherise
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
94 Args : $input_data_r : input data in CLUSTAL or FASTA formats
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
95 : $params : hash of parameters
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
96
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
97 getHeightData is the entry point into the logo.pm module. $input_data_r is a
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
98 reference to an array of strings containing FASTA or CLUSTAL data, where all
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
99 lines whose first character is "#" is considered a comment line.
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
100
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
101 $params is a hash of parameters with the following keys:
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
102 * smallsampletoggle : 0 to turn off small sample correction, otherwise
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
103 small sample correction is on
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
104 * input_kind : 0 for amino acids, 1 for nucleic acids; if undefined,
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
105 logo.pm will attempt to automatically detect whether the
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
106 input consists of amino or nucleic acid data. If
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
107 $input_kind is defined, only those residues defined by
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
108 $input_kind will be in the output -- all other residues are
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
109 considered as spaces. For example, if $input_kind is $NA,
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
110 the residue "I" or "i" are considered spaces, since "I" and
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
111 "i" are not nucleic acid residues.
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
112 * stretch : stretch all characters so they are flush at the maximum number
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
113 of bits allowed
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
114
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
115 Sample use:
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
116
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
117 # get FASTA data
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
118 open (FASTA, "$fastafile");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
119 my @inputdata = <FASTA>;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
120 close (FASTA);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
121
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
122 my %heightparams = (
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
123 smallsamplecorrection => 0,
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
124 input_kind => 0,
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
125 stretch => 0
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
126 );
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
127
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
128 # get height data
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
129 my ($heightdata_r, $desc_r, $kind) = logo::getHeightData(\@inputdata, \%heightparams);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
130
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
131 =cut
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
132
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
133 # entry point into module
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
134 sub getHeightData {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
135
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
136 # $smallsampletoggle is toggle to turn off small sample correction (1 to turn off)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
137 # $input_kind can be $AA or $NA or undef
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
138 my ($input_data_r, $params) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
139
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
140 # gary 040119: adjust for formats (Unix is \n, Mac is \r, Windows is \r\n)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
141 $input_data_r = normalizeData($input_data_r);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
142
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
143 # gets sequences, sets $kind temporarily
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
144 my ($goodlength, $maxreslength, $badline, $validformat);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
145 ($seqs_r, $desc_r, $maxreslength, $goodlength, $badline, $validformat) =
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
146 getSeqs($input_data_r, $params->{input_kind});
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
147
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
148 # for(my $i = 0; $i < scalar @$seqs_r ; $i++) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
149 # print STDERR ($desc_r->[$i] . "\n" . $seqs_r->[$i] . "\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
150 # }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
151 # print STDERR "maxreslength = $maxreslength\n";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
152 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
153 # exit(0);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
154
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
155 if ($DEBUG) { print STDERR ("point 1\n");}
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
156
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
157 # check for valid format
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
158 if ((defined $validformat) && ($validformat == 1)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
159 # print("returning\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
160 return (undef, undef, undef, undef, undef, 1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
161 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
162
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
163 if ($DEBUG) { print STDERR ("point 2\n");}
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
164
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
165 # check for bad length
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
166 if (!$goodlength) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
167 return (undef, undef, undef, $goodlength, $badline);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
168 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
169
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
170 # reset $kind if in $input_kind
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
171 if (defined $params->{input_kind} && isLegalKind($params->{input_kind}) ) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
172 $kind = $params->{input_kind};
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
173 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
174
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
175 # build data
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
176 buildData(@$seqs_r, $params->{smallsampletoggle}, $params->{stretch}, $maxreslength);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
177
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
178 if ($DEBUG) { print STDERR ("point 3\n");}
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
179
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
180 # print STDERR ("data size = ", scalar @data, "\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
181 # foreach (@data) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
182 # print STDERR ("$_\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
183 # }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
184 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
185 # exit(0);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
186 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
187 # print STDERR ("return at 2\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
188 return (\@data, $desc_r, $kind, $goodlength, $badline);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
189 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
190
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
191 sub isLegalKind {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
192 return ($_[0] =~ /^[01]$/);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
193 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
194
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
195 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
196 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
197 # sub normalizeData($data_r) returns $data_r, with Mac/Unix/Windows newline
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
198 # style normalized to standard Unix-style newline style
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
199 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
200 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
201 sub normalizeData {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
202 my ($data_r) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
203
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
204 # check args
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
205 if (not defined $data_r) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
206 die "data_r must be defined\n";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
207 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
208
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
209 my @normalized = ();
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
210 foreach my $pseudo_line (@$data_r) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
211 my @split_line = split(/[\r\n]+/, $pseudo_line);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
212 push(@normalized, @split_line);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
213 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
214
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
215 return \@normalized;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
216 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
217
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
218
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
219 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
220 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
221 # sub getSeqs($data_r, $kind) returns 5 values:
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
222 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
223 # * array reference to sequence strings
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
224 # * array reference to sequence names
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
225 # * length of sequence
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
226 # * 1 if all sequences have the same length, 0 else
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
227 # * line number L where sequenceLength(L) != sequenceLength(other lines), else
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
228 # undef
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
229 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
230 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
231 sub getSeqs {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
232 my ($input_data_r, $kind) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
233
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
234 unless( $input_data_r->[0] ){
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
235 return (undef, undef, undef, undef, undef, 1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
236 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
237
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
238 # skip all comment chars and lines of all spaces
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
239 while ( ($input_data_r->[0] =~ /^\s*\#/) || ($input_data_r->[0] =~ /^\s*$/) ) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
240 shift @$input_data_r;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
241 if( !defined $input_data_r->[0]) {return (undef, undef, undef, undef, undef, 1);}
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
242 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
243
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
244 if (isFormat_FASTA($input_data_r)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
245 return getSeqs_FASTA($input_data_r, $kind);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
246
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
247 } elsif (isFormat_CLUSTAL($input_data_r)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
248 return getSeqs_CLUSTAL($input_data_r, $kind);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
249
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
250 } elsif (isFormat_FLAT($input_data_r)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
251 return getSeqs_FLAT($input_data_r, $kind);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
252
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
253 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
254 if ($DEBUG) {print STDERR ("format nothing\n");}
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
255 return (undef, undef, undef, undef, undef, 1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
256 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
257
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
258 # if ($_[0] =~ />/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
259 # return getSeqs_FASTA(@_);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
260 # } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
261 # return getSeqs_CLUSTAL(@_);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
262 # }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
263 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
264
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
265 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
266 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
267 # sub isFormat_FASTA($data_r) returns 1 if $data_r is in FASTA format
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
268 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
269 ################################################################################s
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
270 sub isFormat_FASTA {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
271 my ($input_data_r) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
272
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
273 # check args
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
274 if (not defined $input_data_r) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
275 Carp::confess("logo::isFormat_FASTA : input_data_r must be defined\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
276 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
277
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
278 if ($input_data_r->[0] =~ />/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
279 return 1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
280 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
281 return 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
282 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
283 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
284
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
285 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
286 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
287 # sub isFormat_CLUSTAL($data_r) returns 1 if $data_r is in CLUSTAL format
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
288 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
289 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
290 sub isFormat_CLUSTAL {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
291 my ($input_data_r) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
292
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
293 # check args
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
294 if (not defined $input_data_r) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
295 Carp::confess("logo::isFormat_CLUSTAL : input_data_r must be " .
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
296 "defined\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
297 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
298
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
299 my $i=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
300
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
301 # # skip spaces or just "*" and "." and ":"
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
302 # while ($input_data_r->[$i] =~ /^[\*\:\s]*$/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
303 # $i++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
304 # }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
305
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
306 # if it looks like CLUSTAL W (version) ... , then it must be clustal
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
307 if ($input_data_r->[$i] =~ /^\s*CLUSTAL/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
308 return 1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
309 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
310
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
311 # CLUSTAL looks like: "name seq"
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
312 if ($input_data_r->[$i] =~ /^\s*(\S+)\s+(\S+)\s*$/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
313 return 1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
314 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
315 return 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
316 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
317 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
318
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
319 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
320 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
321 # sub isFormat_FLAT($data_r) returns 1 if $data_r is in FLAT format
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
322 #
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
323 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
324 sub isFormat_FLAT {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
325 my ($input_data_r) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
326
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
327 # check args
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
328 if (not defined $input_data_r) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
329 Carp::confess("logo::isFormat_FLAT : input_data_r must be defined\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
330 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
331
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
332 # print("checking flat\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
333 # print("first element = -->", $input_data_r->[0], "<--\n");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
334
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
335 if ($input_data_r->[0] =~ /^[a-zA-Z\-]+\s*$/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
336 return 1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
337 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
338 return 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
339 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
340 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
341
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
342 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
343 ###### FORMATTING FUNCTIONS ######
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
344 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
345
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
346 # the flat sequence format is as follows:
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
347 # sequence1
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
348 # sequence2
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
349 # sequence3
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
350 # ...
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
351 # sequenceN
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
352 sub getSeqs_FLAT {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
353
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
354 if ($DEBUG) {print STDERR "DOING FLAT\n";}
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
355
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
356 my ($input_data_r, $input_kind) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
357
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
358 my $linelength = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
359 my $seqCount = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
360 my $total_residues = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
361 my (@returnVal, @desc) = ();
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
362 my $prevlinelength = undef;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
363 my $NA_count = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
364
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
365 foreach my $seq (@$input_data_r) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
366 # chomp $seq;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
367 $seq =~ s/\s+$//;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
368
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
369 my @chars = split(//,$seq);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
370
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
371 my $char;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
372 foreach (@chars) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
373 $total_residues++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
374 $linelength++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
375
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
376 # set $char
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
377 if (defined $input_kind) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
378 if ($input_kind == $AA) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
379 $char = (isAA($_)) ? $_ : "-";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
380 } else { # == $NA
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
381 $char = (isNA($_)) ? $_ : "-";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
382 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
383 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
384 $char = $_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
385 if (isNA($char)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
386 $NA_count++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
387 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
388 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
389
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
390 $returnVal[$seqCount] .= $char;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
391 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
392 $desc[$seqCount] = "no name";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
393
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
394 if ($seqCount == 0) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
395 $prevlinelength = $linelength;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
396 } elsif ($prevlinelength != $linelength) { # different number of residues, so complain
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
397 return (undef, undef, undef, 0, $seq); # 0 for not same length, $seq is name
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
398 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
399 $linelength=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
400
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
401 $seqCount++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
402 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
403
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
404 # determine whether to use $NA or $AA
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
405 if (!defined $input_kind) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
406 if ($NA_count / ($total_residues+1) >= $CONFIDENCE_LIMIT) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
407 $kind = $NA;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
408 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
409 $kind = $AA;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
410 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
411 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
412
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
413 return (\@returnVal, \@desc, $prevlinelength, 1, undef);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
414 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
415
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
416 sub getSeqs_CLUSTAL {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
417
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
418 if ($DEBUG) {print STDERR "DOING CLUSTAL\n";}
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
419
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
420 my ($input_data_r, $input_kind) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
421
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
422 my @returnVal;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
423 my @desc;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
424 my $seqCount=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
425 my $reslength = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
426 my ($name, $seq);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
427
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
428 # my $input_kind = pop @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
429 # my $CONFIDENCE_LIMIT = 0.90;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
430 my $NA_count = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
431 my $total_residues = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
432 my ($prevlinelength, $linelength) = (0,0);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
433
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
434 # foreach (@_) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
435 foreach (@$input_data_r) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
436 # chomp;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
437
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
438 if ($DEBUG) {print STDERR ("line = $_\n")};
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
439
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
440 $_ =~ s/\s+$//;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
441
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
442 # skip if it is a comment character -- first character is "#"
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
443 next if (/^\s*\#/);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
444
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
445 # skil if it is a CLUSTAL W header line
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
446 next if (/^\s*CLUSTAL/);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
447
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
448 # if spaces or just "*" and "." and ":"
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
449 if (/^[\*\.\:\s]*$/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
450 $seqCount=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
451 $prevlinelength=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
452 next;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
453 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
454
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
455 ($name,$seq) = (/^\s*(\S+)\s+(\S+)\s*$/);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
456
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
457 if ($DEBUG) { print STDERR ("name, seq = $name, $seq\n"); }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
458
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
459 # add new entry
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
460 if (!defined $desc[$seqCount]) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
461 $desc[$seqCount] = $name;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
462 $returnVal[$seqCount] = "";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
463 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
464
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
465 if(!defined $seq) {return (undef, undef, undef, undef, undef, 1);} # Something has gone terrible wrong
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
466
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
467 my @chars = split(//,$seq);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
468 my $char;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
469 foreach (@chars) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
470 if ($seqCount == 0) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
471 $reslength++; # all sequences have same residue length, so only count first one
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
472 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
473
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
474 $total_residues++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
475 $linelength++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
476
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
477 # set $char
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
478 if (defined $input_kind) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
479 if ($input_kind == $AA) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
480 $char = (isAA($_)) ? $_ : "-";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
481 } else { # == $NA
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
482 $char = (isNA($_)) ? $_ : "-";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
483 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
484 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
485 $char = $_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
486 if (isNA($char)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
487 $NA_count++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
488 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
489 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
490
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
491 $returnVal[$seqCount] .= $char;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
492 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
493
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
494 if ($seqCount == 0) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
495 $prevlinelength = $linelength;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
496 } elsif ($prevlinelength != $linelength) { # different number of residues, so complain
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
497 return (undef, undef, undef, 0, $name);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
498 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
499 $linelength=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
500
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
501 $seqCount++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
502 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
503
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
504 # determine whether to use $NA or $AA
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
505 if (!defined $input_kind ) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
506 if ($NA_count / ($total_residues+1) >= $CONFIDENCE_LIMIT) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
507 $kind = $NA;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
508 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
509 $kind = $AA;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
510 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
511 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
512
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
513 return (\@returnVal, \@desc, $reslength, 1, undef);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
514 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
515
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
516 # if $input_kind is defined, residues that are not defined are set to space
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
517 sub getSeqs_FASTA {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
518
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
519 if ($DEBUG) {print STDERR "DOING FASTA\n";}
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
520
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
521 my ($input_data_r, $input_kind) = @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
522
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
523 my @returnVal;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
524 my @desc;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
525 my $count=-1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
526 my $newElem=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
527
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
528 # my $input_kind = pop @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
529
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
530 # my $CONFIDENCE_LIMIT = 0.90;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
531 my $NA_count = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
532 my $total_residues = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
533 my $reslength = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
534 my $maxreslength = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
535
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
536 my ($goodlength, $currline, $prevline);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
537
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
538
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
539 # # skip all lines that are all spaces
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
540 # while ($_[0] =~ /^\s*$/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
541 # shift @_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
542 # }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
543
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
544 # foreach (@_) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
545 foreach (@$input_data_r) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
546
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
547 $_ =~ s/\s+$//;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
548
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
549 # skip if it is a comment character -- first character is "#"
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
550 next if (/^\s*\#/);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
551
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
552 # skip all lines that are all spaces
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
553 next if (/^\s*$/);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
554
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
555 $_ =~ s/\s+$//; # cut trailing white space
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
556 $_ =~ s/^\s+//; # cut leading white space
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
557 if (/>/) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
558 $currline = $_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
559 ($desc[scalar @desc]) = ($_ =~ />\s*(.+)$/);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
560
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
561 if (not $newElem) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
562 $count++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
563 $newElem = 1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
564 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
565 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
566 if ($newElem) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
567 $maxreslength = $reslength if $maxreslength == 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
568 if (($maxreslength != 0) && ($maxreslength != $reslength)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
569 return (undef, undef, undef, 0, $prevline);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
570 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
571
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
572 $maxreslength = $reslength;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
573 $reslength = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
574 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
575
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
576 my @chars = split(//,$_);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
577 my $char;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
578 foreach (@chars) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
579 $reslength++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
580 $total_residues++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
581
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
582 # set $char
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
583 if (defined $input_kind) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
584 if ($input_kind == $AA) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
585 $char = (isAA($_)) ? $_ : "-";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
586 } else { # == $NA
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
587 $char = (isNA($_)) ? $_ : "-";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
588 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
589 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
590 $char = ($_ =~ /[a-zA-Z]/) ? $_ : "-"; # if not alpha char, use space
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
591 if (isNA($char) && !isSpace($char)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
592 $NA_count++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
593 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
594 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
595
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
596 if ($newElem) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
597 $returnVal[$count] = $char;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
598 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
599 $returnVal[$count] .= $char;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
600 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
601 $newElem = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
602 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
603 $prevline = $currline if $currline =~ />/;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
604 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
605 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
606
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
607 # check if last is biggest
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
608 if (($maxreslength != 0) && ($maxreslength != $reslength)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
609 return (undef, undef, undef, 0, $prevline);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
610 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
611 # $maxreslength = ($reslength > $maxreslength) ? $reslength : $maxreslength;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
612
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
613 # determine whether to use $NA or $AA
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
614 if (!defined $input_kind) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
615 if ($NA_count / ($total_residues+1) >= $CONFIDENCE_LIMIT) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
616 $kind = $NA;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
617 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
618 $kind = $AA;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
619 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
620 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
621
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
622 return (\@returnVal, \@desc, $maxreslength || $reslength, 1, undef); # 1 for good lengths
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
623 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
624
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
625 sub isSpace {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
626 return $_[0] =~ /[ \-]/;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
627 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
628
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
629 sub isAA {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
630 return (defined $AMINOACIDS{lc $_[0]});
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
631 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
632
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
633 sub isNA {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
634 return (defined $BASES{lc $_[0]});
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
635 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
636
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
637 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
638 ###### DATA BUILDING FUNCTIONS ######
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
639 ################################################################################
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
640
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
641
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
642 # arguments: takes reference to array and lines aligned sequences of bases or
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
643 # amino acids
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
644 # returns: updated reference array to reflect contents of sequences sorted
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
645 # vertically by height as described by (1)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
646 sub buildData {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
647
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
648 my $currentx = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
649 my $h;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
650 my $count=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
651 my $maxreslength = pop (@_);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
652 my $stretch = pop(@_);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
653 my $smallsampletoggle = pop (@_);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
654 my $totalsize = $#_+1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
655
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
656 while ($currentx < $maxreslength) { #(length $_[0])) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
657 my $allspaces = 1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
658 my $spaceCount=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
659
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
660 # get vertical sequence
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
661 my @vert=();
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
662 foreach (@_) { # foreach sequence
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
663 my $currentchar;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
664
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
665 # set currentchar, set to " " if $_ is not long enough
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
666 if ($currentx >= (length $_)) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
667 $currentchar = " ";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
668 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
669 $currentchar = substr($_,$currentx,1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
670 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
671
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
672 # in all sequences, "-" is considered as a space
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
673 # don't count " " and "-"
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
674 if (($currentchar ne "-") && ($currentchar ne " ")) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
675 $vert[scalar @vert] = uc substr($_,$currentx,1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
676 $allspaces = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
677 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
678 $spaceCount++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
679 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
680 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
681
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
682 if ($allspaces) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
683 # build @vert
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
684 @vert = (" 0", ">0");
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
685
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
686 # place in @data
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
687 $data[scalar @data] = \@vert;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
688
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
689 $currentx++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
690 next;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
691 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
692
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
693 my $error;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
694 if ($smallsampletoggle) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
695 $error=getError($kind,$totalsize - $spaceCount);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
696 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
697 $error = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
698 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
699
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
700 # sort vertical sequence by amino acid or base
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
701 @vert = sort(@vert);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
702 my $total = $#vert + 1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
703
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
704 # find H(l) -- must be done before collapsing
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
705 $h = getH(@vert);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
706
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
707 # collect like terms
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
708 @vert = collapse(@vert);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
709
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
710 # get R(l)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
711 my $r;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
712 if (!defined $stretch || !$stretch) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
713 $r= getR($kind, $h, $error);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
714 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
715 $r = ($kind == $NA) ? 2 : (log(20) / log(2));
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
716 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
717
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
718 # place heights
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
719 my $count=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
720 my $height;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
721 my $elem;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
722 foreach $elem (@vert) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
723 $height = getHeight(substr($elem, 2) / $total,$r);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
724 $vert[$count] = substr($elem,0,1) . $height;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
725 $count++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
726 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
727
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
728 # sort by height
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
729 @vert = sort height_sort @vert;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
730
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
731 # put in error bar size
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
732 $vert[$count] = ">$error";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
733
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
734 # place in @data
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
735 $data[scalar @data] = \@vert;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
736
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
737 $currentx++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
738 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
739 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
740
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
741 # uses error approximation given by:
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
742 # e := (s-1) / (2 * ln2 * ntrue);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
743 sub getError {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
744 return ((($_[0] == $NA) ? 4 : 20) - 1) / (2 * log(2) * $_[1]);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
745 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
746
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
747 sub height_sort {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
748 my ($lettera, $heighta) = ($a =~ /^(.{1})(\S+)$/); #substr($a,1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
749 my ($letterb, $heightb) = ($b =~ /^(.{1})(\S+)$/); #substr($b,1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
750
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
751 # compare by height first, then letter
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
752 if ($heighta <=> $heightb) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
753 return $heighta <=> $heightb;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
754 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
755 return $letterb cmp $lettera; #reversed for some reason...
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
756 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
757 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
758
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
759 sub collapse {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
760 my @returnVal;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
761 my $current = $_[0];
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
762 my $count=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
763 my $freq;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
764
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
765 foreach (@_) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
766 if ($current eq $_) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
767 $count++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
768 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
769 $returnVal[scalar @returnVal] = "$current $count";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
770
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
771 $current = $_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
772 $count=1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
773 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
774 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
775
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
776 # add last element
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
777 $returnVal[scalar @returnVal] = "$current $count";
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
778
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
779 return @returnVal;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
780 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
781
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
782 # arguments : $_[0] : list of bases or amino acids
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
783 sub getH {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
784 my $h = 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
785 my (@vert) = @_; # vertical sequence (comparing multiple sequences)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
786
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
787 my $current = $vert[0];
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
788 my $count=0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
789 my $freq;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
790
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
791 foreach (@vert) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
792 if ($current eq $_) {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
793 $count++;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
794 } else {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
795 $freq = $count / ($#vert + 1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
796 $h += $freq * (log ($freq) / log(2));
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
797
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
798 $current = $_;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
799 $count=1;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
800 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
801 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
802
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
803 # add last element
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
804 $freq = $count / ($#vert + 1);
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
805 $h += $freq * (log ($freq) / log(2));
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
806
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
807 return -$h;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
808 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
809
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
810 # arguments : $_[0] : $AA or $NA
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
811 # $_[1] : uncertainty = H(l)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
812 # $_[2] : error correction factor for small number of sequences
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
813 # = e(n) ; assummed to be 0 if not given
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
814 sub getR {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
815 my $max = ($_[0] == $NA) ? 2 : (log(20) / log(2));
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
816 my $e = (defined $_[2]) ? $_[2] : 0;
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
817 return ($max - ($_[1] + $e));
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
818 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
819
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
820 # arguments: $_[0] : frequency = f(b,l)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
821 # $_[1] : R(l)
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
822 sub getHeight {
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
823 return $_[0] * $_[1]
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
824 }
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
825
2f4298673519 Uploaded
davidvanzessen
parents:
diff changeset
826 1;