annotate variant_effect_predictor/Bio/EnsEMBL/Utils/CigarString.pm @ 0:21066c0abaf5 draft

Uploaded
author willmclaren
date Fri, 03 Aug 2012 10:04:48 -0400
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
1 =head1 LICENSE
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
2
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
3 Copyright (c) 1999-2012 The European Bioinformatics Institute and
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
4 Genome Research Limited. All rights reserved.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
5
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
6 This software is distributed under a modified Apache license.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
7 For license details, please see
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
8
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
9 http://www.ensembl.org/info/about/code_licence.html
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
10
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
11 =head1 CONTACT
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
12
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
13 Please email comments or questions to the public Ensembl
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
14 developers list at <dev@ensembl.org>.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
15
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
16 Questions may also be sent to the Ensembl help desk at
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
17 <helpdesk@ensembl.org>.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
18
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
19 =head1 AUTHOR
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
20
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
21 Juguang Xiao <juguang@tll.org.sg>
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
22
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
23 =cut
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
24
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
25 =head1 NAME
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
26
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
27 Bio::EnsEMBL::Utils::CigarString, a utilites module to generate cigar
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
28 strings
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
29
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
30 =head1 DESCRIPTION
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
31
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
32 Sequence alignment hits were previously stored within the core database
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
33 as ungapped alignments. This imposed 2 major constraints on alignments:
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
34
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
35 a) alignments for a single hit record would require multiple rows in the
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
36 database, and
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
37 b) it was not possible to accurately retrieve the exact original
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
38 alignment.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
39
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
40 Therefore, in the new branch sequence alignments are now stored as
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
41 ungapped alignments in the cigar line format (where CIGAR stands for
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
42 Concise Idiosyncratic Gapped Alignment Report).
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
43
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
44 In the cigar line format alignments are sotred as follows:
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
45
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
46 M: Match
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
47 D: Deletino
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
48 I: Insertion
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
49
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
50 An example of an alignment for a hypthetical protein match is shown
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
51 below:
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
52
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
53
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
54 Query: 42 PGPAGLP----GSVGLQGPRGLRGPLP-GPLGPPL...
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
55 PG P G GP R PLGP
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
56 Sbjct: 1672 PGTP*TPLVPLGPWVPLGPSSPR--LPSGPLGPTD...
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
57
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
58 protein_align_feature table as the following cigar line:
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
59
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
60 7M4D12M2I2MD7M
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
61
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
62 =cut
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
63
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
64 package Bio::EnsEMBL::Utils::CigarString;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
65
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
66 use strict;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
67 use vars qw(@ISA);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
68 use Bio::EnsEMBL::Root;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
69
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
70 @ISA = qw(Bio::EnsEMBL::Root);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
71
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
72 =head2 split_hsp
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
73
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
74 Name : split_hsp (this name is derived from the original sub in BlastWorn)
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
75 Usage : my $hsp; # a ready Bio::Search::HSP::GenericHSP object.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
76 my $factory = new Bio::EnsEMBL::Utils::CigarString;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
77 my $cigar_string = $factory->split_hsp($hsp);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
78
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
79 Function: generate cigar string.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
80 Argument: a HSP object.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
81 Returns : a text string.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
82
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
83 =cut
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
84
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
85 sub split_hsp {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
86 my ($self, $hsp) = @_;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
87
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
88 $self->throw("a defined object needed") unless($hsp && defined($hsp));
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
89 unless(ref($hsp) && $hsp->isa('Bio::Search::HSP::GenericHSP')){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
90 $self->throw("a HSP object needed");
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
91 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
92
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
93 my ($qtype, $htype) = $self->_findTypes($hsp);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
94 my ($qstrand, $hstrand) = $self->_findStrands($hsp);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
95 my ($qinc, $hinc) = $self->_findIncrements($qstrand,$hstrand,$qtype,$htype);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
96
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
97 my @gaps = ();
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
98 my @qchars = split(//, $hsp->query_string);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
99 my @hchars = split(//, $hsp->hit_string);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
100 my $qstart;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
101 if($qstrand == 1){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
102 $qstart = $hsp->query->start;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
103 }elsif($qstart == -1){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
104 $qstart = $hsp->query->end;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
105 }else{
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
106 $self->warn("[$qstart], invalid strand value on query");
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
107 $qstart = $hsp->query->start;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
108 # Is this a SearchIO's bug???
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
109 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
110
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
111 my $hstart;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
112 if($hstrand == 1){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
113 $hstart = $hsp->subject->start;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
114 }elsif($hstrand != -1){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
115 $hstart = $hsp->subject->end;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
116 }else{
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
117 $self->throw("[$hstart], invalid strand value on subject");
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
118 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
119
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
120 my $qend = $qstart;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
121 my $hend = $hstart;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
122 my $count = 0;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
123 my $found = 0;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
124
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
125 my @align_coordinates = ();
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
126 while($count <= $#qchars){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
127 if($qchars[$count] ne '-' && $hchars[$count] ne '-') {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
128 $qend += $qinc;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
129 $hend += $hinc;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
130 $found = 1;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
131 }else{ # gapped region
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
132 push(@align_coordinates, [$qstart, $hstart]) if($found == 1);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
133
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
134 $qstart = $qend;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
135 $qstart += $qinc if($qchars[$count] ne '-');
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
136
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
137 $hstart = $hend;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
138 $hstart += $hinc if($hchars[$count] ne '-');
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
139
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
140 $qend = $qstart;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
141 $hend = $hstart;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
142 $found = 0;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
143 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
144 $count++;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
145 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
146
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
147 if($found){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
148 push(@align_coordinates, [$qstart, $hstart]);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
149 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
150
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
151 my $cigar_string = "";
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
152 my $last = $#align_coordinates;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
153 if($last >= 0){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
154 for(my $i=0; $i<$last; $i++){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
155 my $q_this_start = $align_coordinates[$i]->[0];
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
156 my $q_next_start = $align_coordinates[$i+1]->[0];
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
157 my $q_length = ($q_next_start-$q_this_start-1)*$qinc;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
158 $q_length = abs($q_length);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
159 my $h_this_start = $align_coordinates[$i]->[1];
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
160 my $h_next_start = $align_coordinates[$i+1]->[1];
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
161 my $h_length = ($h_next_start-$h_this_start-1)*$hinc;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
162 $h_length = abs($h_length);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
163
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
164 my $diff = $q_length - $h_length;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
165 if($diff > 0){ # Insertion
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
166 $cigar_string .= $diff unless($diff == 1);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
167 $cigar_string .= 'I';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
168 }elsif($diff < 0){ # Deletion
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
169 $cigar_string .= -$diff unless($diff == -1);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
170 $cigar_string .= 'D';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
171 }else{ # e.g $diff == 0, Match
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
172 $cigar_string .= $q_length unless($q_length == 1);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
173 $cigar_string .= 'M';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
174 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
175
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
176 } # for
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
177 } # if
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
178
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
179 return $cigar_string;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
180 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
181
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
182
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
183 sub _findStrands {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
184 my ($self,$hsp) = @_;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
185
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
186 my $qstrand = $hsp->query->strand;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
187 unless($qstrand == 1 || $qstrand == -1){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
188 $self->warn("query's strand value is neither 1 or -1");
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
189 $qstrand = 1;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
190 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
191
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
192 my $hstrand = $hsp->subject->strand;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
193 unless($hstrand == 1 || $hstrand == -1){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
194 $self->warn("subject's strand value is neither 1 or -1");
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
195 $hstrand = 1;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
196 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
197
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
198 return ( $qstrand, $hstrand);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
199 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
200
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
201 sub _findTypes {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
202 my ($self,$hsp) = @_;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
203
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
204 my $type1;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
205 my $type2;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
206 my $len1 = $hsp->query->length();
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
207 my $len2 = $hsp->subject->length();
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
208
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
209 if ($len1/$len2 > 2) {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
210 $type1 = 'dna';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
211 $type2 = 'pep';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
212 } elsif ($len2/$len1 > 2) {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
213 $type1 = 'pep';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
214 $type2 = 'dna';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
215 } else {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
216 $type1 = 'dna';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
217 $type2 = 'dna';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
218 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
219
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
220 return ($type1,$type2);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
221 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
222
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
223 sub _findIncrements {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
224 my ($self,$qstrand,$hstrand,$qtype,$htype) = @_;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
225
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
226 my $qinc = 1 * $qstrand;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
227 my $hinc = 1 * $hstrand;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
228
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
229 if ($qtype eq 'dna' && $htype eq 'pep') {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
230 $qinc = 3 * $qinc;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
231 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
232 if ($qtype eq 'pep' && $htype eq 'dna') {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
233 $hinc = 3 * $hinc;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
234 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
235
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
236 return ($qinc,$hinc);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
237 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
238
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
239 # This is a core logic of cigar string. The finite state machine theory is
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
240 # apply. See the below table, x-axis represents the input, with 3 options:
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
241 # (+/+) -- Both current query and subject bases are non-gap. Match
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
242 # (-/+) -- The current query base is gap, but subject not. Deletion
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
243 # (+/-) -- The current subject base is gap, but query not. Insertion
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
244 # While the y-axis means the current state with letter 'M', 'D', 'I'
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
245 #
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
246 # The content of this table is the action taken in response of the input and
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
247 # the current state.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
248 # R remain the state, counter increment.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
249 # G;X generate the cigar line based on the current state and counter;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
250 # clear the counter to zero and change to the state X
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
251 #
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
252 # || +/+ | -/+ | +/- |
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
253 # -------+----------------------+
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
254 # M || R | G;D | G;I |
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
255 # ------------------------------+
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
256 # D || G;M | R | G;I |
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
257 # ------------------------------+
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
258 # I || G;M | G;D | R |
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
259 #
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
260
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
261 =head2 generate_cigar_string
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
262
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
263 Name : generate_cigar_string
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
264 Usage: $cigar_string = $self->generate_cigar_string(\@qchars, \@hchars);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
265 Function: generate the cigar string for a piece of alignment.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
266 Args: 2 array references. The lengths of 2 arrays are the same
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
267 Return: a cigar string
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
268
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
269 =cut
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
270
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
271 # Developer's Note: The method is originally abstracted from the concept of
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
272 # cigar string. It only asks the essential information of 2 sequence characters
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
273 # of the alignment, while the BlastWorn::split_HSP asks more unused information
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
274 # for cigar string, which is useful to form align_coordinates. - Juguang
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
275
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
276 my ($count, $state); # strictly only used in the following 2 subs
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
277
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
278 sub generate_cigar_string {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
279
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
280 # my ($self, $qstart, $hstart, $qinc, $hinc, $qchars_ref, $hchars_ref) = @_;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
281
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
282 my ($self, $qchars_ref, $hchars_ref) = @_;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
283
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
284 my @qchars = @{$qchars_ref};
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
285 my @hchars = @{$hchars_ref};
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
286
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
287 unless(scalar(@qchars) == scalar(@hchars)){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
288 $self->throw("two sequences are not equal in lengths");
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
289 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
290
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
291 $count = 0;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
292 $state = 'M'; # the current state of gaps, (M, D, I)
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
293
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
294 my $cigar_string = '';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
295 for(my $i=0; $i <= $#qchars; $i++){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
296 my $qchar = $qchars[$i];
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
297 my $hchar = $hchars[$i];
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
298 if($qchar ne '-' && $hchar ne '-'){ # Match
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
299 $cigar_string .= $self->_sub_cigar_string('M');
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
300 }elsif($qchar eq '-'){ # Deletion
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
301 $cigar_string .= $self->_sub_cigar_string('D');
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
302 }elsif($hchar eq '-'){ # Insertion
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
303 $cigar_string .= $self->_sub_cigar_string('I');
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
304 }else{
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
305 $self->throw("Impossible state that 2 gaps on each seq aligned");
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
306 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
307 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
308 $cigar_string .= $self->_sub_cigar_string('X'); # not forget the tail.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
309 return $cigar_string;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
310 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
311
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
312 sub _sub_cigar_string {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
313 my ($self, $new_state) = @_;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
314 my $sub_cigar_string = '';
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
315 if($state eq $new_state){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
316 $count++; # Remain the state and increase the counter
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
317 }else{
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
318 $sub_cigar_string .= $count unless $count == 1;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
319 $sub_cigar_string .= $state;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
320 $count = 1;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
321 $state = $new_state;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
322 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
323 return $sub_cigar_string;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
324 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
325
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
326 =head2 generate_cigar_string_by_hsp
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
327
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
328 Name : generate_cigar_string_by_hsp
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
329 Usage : my $hsp; # a ready GenericHSP object
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
330 my $cigar_string = $self->generate_cigar_string_by_hsp($hsp);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
331 Function: generate a cigar string by given HSP object.
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
332 Args : a GenericHSP object
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
333 Returns: a text string of cigar string
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
334
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
335 =cut
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
336
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
337 sub generate_cigar_string_by_hsp {
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
338 my ($self, $hsp) = @_;
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
339
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
340 unless(ref($hsp) && $hsp->isa('Bio::Search::HSP::GenericHSP')){
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
341 $self->throw("A GenericHSP object needed");
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
342 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
343
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
344 my @qchars = split(//, $hsp->query_string);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
345 my @hchars = split(//, $hsp->hit_string);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
346
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
347 return $self->generate_cigar_string(\@qchars, \@hchars);
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
348 }
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
349
21066c0abaf5 Uploaded
willmclaren
parents:
diff changeset
350 1;