annotate splitbc.pl @ 0:da4101033e10 draft default tip

planemo upload
author oinizan
date Wed, 18 Oct 2017 05:30:40 -0400
parents
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
da4101033e10 planemo upload
oinizan
parents:
diff changeset
1 #!/usr/bin/perl
da4101033e10 planemo upload
oinizan
parents:
diff changeset
2
da4101033e10 planemo upload
oinizan
parents:
diff changeset
3 # FASTX-toolkit - FASTA/FASTQ preprocessing tools. fastx_barcode_splitter.pl
da4101033e10 planemo upload
oinizan
parents:
diff changeset
4 # Copyright (C) 2009 A. Gordon (gordon@cshl.edu)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
5 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
6 # This program is free software: you can redistribute it and/or modify
da4101033e10 planemo upload
oinizan
parents:
diff changeset
7 # it under the terms of the GNU Affero General Public License as
da4101033e10 planemo upload
oinizan
parents:
diff changeset
8 # published by the Free Software Foundation, either version 3 of the
da4101033e10 planemo upload
oinizan
parents:
diff changeset
9 # License, or (at your option) any later version.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
10 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
11 # This program is distributed in the hope that it will be useful,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
12 # but WITHOUT ANY WARRANTY; without even the implied warranty of
da4101033e10 planemo upload
oinizan
parents:
diff changeset
13 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
da4101033e10 planemo upload
oinizan
parents:
diff changeset
14 # GNU Affero General Public License for more details.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
15 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
16 # You should have received a copy of the GNU Affero General Public License
da4101033e10 planemo upload
oinizan
parents:
diff changeset
17 # along with this program. If not, see <http://www.gnu.org/licenses/>.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
18 use strict;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
19 use warnings;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
20 use IO::Handle;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
21 use IO::Zlib;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
22 use PerlIO::gzip;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
23 use Data::Dumper;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
24 use Getopt::Long;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
25 use Carp;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
26 use Pod::Usage;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
27
da4101033e10 planemo upload
oinizan
parents:
diff changeset
28 =pod
da4101033e10 planemo upload
oinizan
parents:
diff changeset
29
da4101033e10 planemo upload
oinizan
parents:
diff changeset
30 =head1 NAME
da4101033e10 planemo upload
oinizan
parents:
diff changeset
31
da4101033e10 planemo upload
oinizan
parents:
diff changeset
32 splitbc.pl
da4101033e10 planemo upload
oinizan
parents:
diff changeset
33
da4101033e10 planemo upload
oinizan
parents:
diff changeset
34 =head1 SYNOPSIS
da4101033e10 planemo upload
oinizan
parents:
diff changeset
35
da4101033e10 planemo upload
oinizan
parents:
diff changeset
36 splitbc.pl read_file_r1 [read_file_r2] --bcfile FILE --prefix-r1 r1.%.fq [--prefix-r2 r2.%.fq] --bol|--eol]
da4101033e10 planemo upload
oinizan
parents:
diff changeset
37 [--mismatches N] [--exact] [--partial N] [--trim|--trim2] [--no_adapt] [--rad STR --radTAG STR] [--help] [--quiet] [--debug]
da4101033e10 planemo upload
oinizan
parents:
diff changeset
38
da4101033e10 planemo upload
oinizan
parents:
diff changeset
39 =head1 DESCRIPTION
da4101033e10 planemo upload
oinizan
parents:
diff changeset
40
da4101033e10 planemo upload
oinizan
parents:
diff changeset
41 splitbc.pl classify fasta/fastq single or paired end reads in function of barcode forward or reverse in the first or both reads.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
42 Also, whether reads are RAD sequence, it checks that barcode is followed by RAD tag.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
43
da4101033e10 planemo upload
oinizan
parents:
diff changeset
44 =head1 OPTIONS
da4101033e10 planemo upload
oinizan
parents:
diff changeset
45
da4101033e10 planemo upload
oinizan
parents:
diff changeset
46 =over
da4101033e10 planemo upload
oinizan
parents:
diff changeset
47 =item B<--bcfile FILE>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
48
da4101033e10 planemo upload
oinizan
parents:
diff changeset
49 Barcodes file name. (splitbc.pl --doc for details)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
50
da4101033e10 planemo upload
oinizan
parents:
diff changeset
51 =item B<--prefix-r1 PREFIX>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
52
da4101033e10 planemo upload
oinizan
parents:
diff changeset
53 File prefix. The sample name will be placed where the % is placed.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
54
da4101033e10 planemo upload
oinizan
parents:
diff changeset
55 =item B<--prefix-r2 PREFIX>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
56
da4101033e10 planemo upload
oinizan
parents:
diff changeset
57 File prefix. The sample name will be placed where the % is placed.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
58
da4101033e10 planemo upload
oinizan
parents:
diff changeset
59 =item B<--bol>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
60
da4101033e10 planemo upload
oinizan
parents:
diff changeset
61 Try to match barcodes at the BEGINNING of sequences.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
62 (What biologists would call the 5' end, and programmers
da4101033e10 planemo upload
oinizan
parents:
diff changeset
63 would call index 0.)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
64
da4101033e10 planemo upload
oinizan
parents:
diff changeset
65 =item B<--eol>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
66
da4101033e10 planemo upload
oinizan
parents:
diff changeset
67 Try to match barcodes at the END of sequences.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
68 (What biologists would call the 3' end, and programmers
da4101033e10 planemo upload
oinizan
parents:
diff changeset
69 would call the end of the string.). Give barcode as they appear in the sequence file, so reverse complement if needed.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
70 NOTE: one of --bol, --eol must be specified, but not both.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
71
da4101033e10 planemo upload
oinizan
parents:
diff changeset
72 =item B<--mismatches N>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
73
da4101033e10 planemo upload
oinizan
parents:
diff changeset
74 Max. number of mismatches allowed. default is 1.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
75
da4101033e10 planemo upload
oinizan
parents:
diff changeset
76 =item B<--exact>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
77
da4101033e10 planemo upload
oinizan
parents:
diff changeset
78 Same as '--mismatches 0'. If both --exact and --mismatches are specified, '--exact' takes precedence.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
79
da4101033e10 planemo upload
oinizan
parents:
diff changeset
80 =item B <--partial N>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
81
da4101033e10 planemo upload
oinizan
parents:
diff changeset
82 Allow partial overlap of barcodes. (splitbc.pl --doc for details)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
83 (Default is not partial matching)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
84
da4101033e10 planemo upload
oinizan
parents:
diff changeset
85 item B<--trim>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
86
da4101033e10 planemo upload
oinizan
parents:
diff changeset
87 Should the barecode be trimmed.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
88
da4101033e10 planemo upload
oinizan
parents:
diff changeset
89 =item B<--trim2>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
90
da4101033e10 planemo upload
oinizan
parents:
diff changeset
91 Shoud the read 2 be trimmed to have the same length as the read1
da4101033e10 planemo upload
oinizan
parents:
diff changeset
92 NOTE: one of --trim, --trim2 must be specified, but not both.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
93
da4101033e10 planemo upload
oinizan
parents:
diff changeset
94 =item B<--no_adapt>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
95
da4101033e10 planemo upload
oinizan
parents:
diff changeset
96 there is no adaptator (T or A ) between barcode and the sequence
da4101033e10 planemo upload
oinizan
parents:
diff changeset
97
da4101033e10 planemo upload
oinizan
parents:
diff changeset
98 =item B<--rad STR>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
99
da4101033e10 planemo upload
oinizan
parents:
diff changeset
100 sequence are RADSeq, barcode is immediately followed by rad sequence <STR>.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
101 NOTE: --rad implies --no_adapt
da4101033e10 planemo upload
oinizan
parents:
diff changeset
102
da4101033e10 planemo upload
oinizan
parents:
diff changeset
103 =item B<--radTAG STR>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
104
da4101033e10 planemo upload
oinizan
parents:
diff changeset
105 sequence retain at the begining of the sequence after cliping.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
106 NOTE: if define rad, radTAG must be defined to, and conversely
da4101033e10 planemo upload
oinizan
parents:
diff changeset
107 Barcode will be splited if they are folowed by radTAG, otherwise they will be recorded in the unmatched file.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
108
da4101033e10 planemo upload
oinizan
parents:
diff changeset
109 =item B<--TAG_mismatch N>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
110
da4101033e10 planemo upload
oinizan
parents:
diff changeset
111 Max. number of mismatches allowed in the radTAG sequence. default is 0.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
112
da4101033e10 planemo upload
oinizan
parents:
diff changeset
113 =item B<--quiet>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
114
da4101033e10 planemo upload
oinizan
parents:
diff changeset
115 Don't print counts and summary at the end of the run.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
116 (Default is to print.)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
117
da4101033e10 planemo upload
oinizan
parents:
diff changeset
118 =item B<--debug>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
119
da4101033e10 planemo upload
oinizan
parents:
diff changeset
120 Print lots of useless debug information to STDERR.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
121
da4101033e10 planemo upload
oinizan
parents:
diff changeset
122 =item B<--help>
da4101033e10 planemo upload
oinizan
parents:
diff changeset
123
da4101033e10 planemo upload
oinizan
parents:
diff changeset
124 This helpful help screen.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
125
da4101033e10 planemo upload
oinizan
parents:
diff changeset
126 =back
da4101033e10 planemo upload
oinizan
parents:
diff changeset
127
da4101033e10 planemo upload
oinizan
parents:
diff changeset
128 =head1 AUTHORS
da4101033e10 planemo upload
oinizan
parents:
diff changeset
129
da4101033e10 planemo upload
oinizan
parents:
diff changeset
130 Jérôme Mariette
da4101033e10 planemo upload
oinizan
parents:
diff changeset
131 modified by Maria Bernard
da4101033e10 planemo upload
oinizan
parents:
diff changeset
132
da4101033e10 planemo upload
oinizan
parents:
diff changeset
133 =head1 VERSION
da4101033e10 planemo upload
oinizan
parents:
diff changeset
134
da4101033e10 planemo upload
oinizan
parents:
diff changeset
135 1.1
da4101033e10 planemo upload
oinizan
parents:
diff changeset
136
da4101033e10 planemo upload
oinizan
parents:
diff changeset
137 =head1 DATE
da4101033e10 planemo upload
oinizan
parents:
diff changeset
138
da4101033e10 planemo upload
oinizan
parents:
diff changeset
139 modified 21/05/2013
da4101033e10 planemo upload
oinizan
parents:
diff changeset
140
da4101033e10 planemo upload
oinizan
parents:
diff changeset
141 =head1 KEYWORDS
da4101033e10 planemo upload
oinizan
parents:
diff changeset
142
da4101033e10 planemo upload
oinizan
parents:
diff changeset
143 Barcode RAD
da4101033e10 planemo upload
oinizan
parents:
diff changeset
144
da4101033e10 planemo upload
oinizan
parents:
diff changeset
145 =head1 EXAMPLE
da4101033e10 planemo upload
oinizan
parents:
diff changeset
146
da4101033e10 planemo upload
oinizan
parents:
diff changeset
147 (Assuming 's_2_100.txt' is a FASTQ file, 'mybarcodes.txt' is the barcodes file):
da4101033e10 planemo upload
oinizan
parents:
diff changeset
148 $0 cat s_2_100.txt | $0 --bcfile mybarcodes.txt --bol --mismatches 2 \\
da4101033e10 planemo upload
oinizan
parents:
diff changeset
149 --prefix-r1 /tmp/bla_%.txt"
da4101033e10 planemo upload
oinizan
parents:
diff changeset
150
da4101033e10 planemo upload
oinizan
parents:
diff changeset
151 =head1 CHANGELOG
da4101033e10 planemo upload
oinizan
parents:
diff changeset
152
da4101033e10 planemo upload
oinizan
parents:
diff changeset
153 1.1: 18/10/2013 Maria
da4101033e10 planemo upload
oinizan
parents:
diff changeset
154 construct list of best barcode to deal with equal match. In this case reads will be written in "ambiguous" files
da4101033e10 planemo upload
oinizan
parents:
diff changeset
155
da4101033e10 planemo upload
oinizan
parents:
diff changeset
156 =cut
da4101033e10 planemo upload
oinizan
parents:
diff changeset
157
da4101033e10 planemo upload
oinizan
parents:
diff changeset
158 ##
da4101033e10 planemo upload
oinizan
parents:
diff changeset
159 ## This program splits a FASTQ/FASTA file into several smaller files,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
160 ## Based on barcode matching.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
161 ##
da4101033e10 planemo upload
oinizan
parents:
diff changeset
162 ## run with "--help" for usage information
da4101033e10 planemo upload
oinizan
parents:
diff changeset
163 ##
da4101033e10 planemo upload
oinizan
parents:
diff changeset
164 ## Assaf Gordon <gordon@cshl.edu> , 11sep2008
da4101033e10 planemo upload
oinizan
parents:
diff changeset
165
da4101033e10 planemo upload
oinizan
parents:
diff changeset
166 # Forward declarations
da4101033e10 planemo upload
oinizan
parents:
diff changeset
167 sub load_barcode_file ($);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
168 sub parse_command_line ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
169 sub match_sequences ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
170 sub mismatch_count($$) ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
171 sub print_results;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
172 sub open_and_detect_input_format;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
173 sub read_record;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
174 sub write_record($);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
175 sub usage();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
176
da4101033e10 planemo upload
oinizan
parents:
diff changeset
177 # Global flags and arguments,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
178 # Set by command line argumens
da4101033e10 planemo upload
oinizan
parents:
diff changeset
179 my $barcode_file ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
180 my $barcodes_at_eol = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
181 my $barcodes_at_bol = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
182 my $exact_match = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
183 my $allow_partial_overlap = 0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
184 my $allowed_mismatches = 1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
185 my $newfile_prefix_r1 = '';
da4101033e10 planemo upload
oinizan
parents:
diff changeset
186 my $newfile_prefix_r2 = '';
da4101033e10 planemo upload
oinizan
parents:
diff changeset
187 my $quiet = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
188 my $debug = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
189 my $trim = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
190 my $trim2 = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
191 my $adapt = 1 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
192 my $no_adapt = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
193 my $rad;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
194 my $radTAG;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
195 my $TAG_mm=0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
196 my $fastq_format = 1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
197
da4101033e10 planemo upload
oinizan
parents:
diff changeset
198 # Global variables
da4101033e10 planemo upload
oinizan
parents:
diff changeset
199 # Populated by 'parse_command_line'
da4101033e10 planemo upload
oinizan
parents:
diff changeset
200 my $radTAG_length=0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
201 my $check_rad=0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
202
da4101033e10 planemo upload
oinizan
parents:
diff changeset
203 # Global variables
da4101033e10 planemo upload
oinizan
parents:
diff changeset
204 # Populated by 'create_output_files'
da4101033e10 planemo upload
oinizan
parents:
diff changeset
205 my %filenames;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
206 my %files;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
207 my %counts = ( 'unmatched' => 0 );
da4101033e10 planemo upload
oinizan
parents:
diff changeset
208 my $barcodes_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
209 my @barcodes;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
210 my $input_file_io;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
211 my $input_file_io_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
212 my $paired = 0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
213 my $write_r2 = 0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
214
da4101033e10 planemo upload
oinizan
parents:
diff changeset
215 # The Four lines per record in FASTQ format.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
216 # (when using FASTA format, only the first two are used)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
217 my $seq_name;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
218 my $seq_bases;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
219 my $seq_name2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
220 my $seq_qualities;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
221 my $seq_len;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
222 my $seq_name_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
223 my $seq_bases_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
224 my $seq_name2_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
225 my $seq_qualities_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
226 my $seq_len_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
227
da4101033e10 planemo upload
oinizan
parents:
diff changeset
228 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
229 # Start of Program
da4101033e10 planemo upload
oinizan
parents:
diff changeset
230 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
231 parse_command_line ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
232
da4101033e10 planemo upload
oinizan
parents:
diff changeset
233 load_barcode_file ( $barcode_file ) ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
234
da4101033e10 planemo upload
oinizan
parents:
diff changeset
235 open_and_detect_input_format;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
236
da4101033e10 planemo upload
oinizan
parents:
diff changeset
237 match_sequences ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
238
da4101033e10 planemo upload
oinizan
parents:
diff changeset
239 print_results unless $quiet;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
240
da4101033e10 planemo upload
oinizan
parents:
diff changeset
241 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
242 # End of program
da4101033e10 planemo upload
oinizan
parents:
diff changeset
243 ########################################################################
da4101033e10 planemo upload
oinizan
parents:
diff changeset
244
da4101033e10 planemo upload
oinizan
parents:
diff changeset
245
da4101033e10 planemo upload
oinizan
parents:
diff changeset
246 sub parse_command_line {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
247 my $help;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
248 my $doc;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
249 my $version;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
250
da4101033e10 planemo upload
oinizan
parents:
diff changeset
251 pod2usage(1) if (scalar @ARGV==0);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
252
da4101033e10 planemo upload
oinizan
parents:
diff changeset
253 my $result = GetOptions ( "bcfile=s" => \$barcode_file,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
254 "eol" => \$barcodes_at_eol,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
255 "bol" => \$barcodes_at_bol,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
256 "exact" => \$exact_match,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
257 "prefix-r1=s" => \$newfile_prefix_r1,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
258 "prefix-r2=s" => \$newfile_prefix_r2,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
259 "quiet" => \$quiet,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
260 "partial=i" => \$allow_partial_overlap,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
261 "trim" => \$trim,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
262 "trim2" => \$trim2,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
263 "debug" => \$debug,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
264 "mismatches=i" => \$allowed_mismatches,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
265 "no_adapt"=> \$no_adapt,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
266 "rad=s"=> \$rad,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
267 "radTAG=s"=> \$radTAG,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
268 "TAG_mismatch=i"=> \$TAG_mm,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
269 "help" => \$help,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
270 "doc" => \$doc,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
271 "version" => \$version
da4101033e10 planemo upload
oinizan
parents:
diff changeset
272 ) ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
273
da4101033e10 planemo upload
oinizan
parents:
diff changeset
274 pod2usage(1) if ($help);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
275 doc() if ($doc);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
276 version() if ($version);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
277
da4101033e10 planemo upload
oinizan
parents:
diff changeset
278 die "Error: barcode file not specified (use '--bcfile [FILENAME]')\n" unless defined $barcode_file;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
279 die "Error: prefix path/filename not specified (use '--prefix-r1 [PATH]%.fq')\n" unless defined $newfile_prefix_r1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
280 # If a read2 file is provided
da4101033e10 planemo upload
oinizan
parents:
diff changeset
281 if ($#ARGV == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
282 die "Error: prefix path/filename not specified (use '--prefix-r2 [PATH]%.fq')\n" unless defined $newfile_prefix_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
283 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
284
da4101033e10 planemo upload
oinizan
parents:
diff changeset
285 if ($barcodes_at_bol == $barcodes_at_eol) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
286 die "Error: can't specify both --eol & --bol\n" if $barcodes_at_eol;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
287 die "Error: must specify either --eol or --bol\n" ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
288 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
289
da4101033e10 planemo upload
oinizan
parents:
diff changeset
290 die "Error: invalid for value partial matches (valid values are 0 or greater)\n" if $allow_partial_overlap<0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
291
da4101033e10 planemo upload
oinizan
parents:
diff changeset
292 $allowed_mismatches = 0 if $exact_match;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
293
da4101033e10 planemo upload
oinizan
parents:
diff changeset
294 die "Error: invalid value for mismatches (valid values are 0 or more)\n" if ($allowed_mismatches<0);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
295
da4101033e10 planemo upload
oinizan
parents:
diff changeset
296 die "Error: partial overlap value ($allow_partial_overlap) bigger than " .
da4101033e10 planemo upload
oinizan
parents:
diff changeset
297 "max. allowed mismatches ($allowed_mismatches)\n" if ($allow_partial_overlap > $allowed_mismatches);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
298
da4101033e10 planemo upload
oinizan
parents:
diff changeset
299 $trim = 1 if $trim2 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
300
da4101033e10 planemo upload
oinizan
parents:
diff changeset
301 if ((defined $rad && !defined $radTAG) || (!defined $rad && defined $radTAG)){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
302 die "Error: you must defined --rad AND --radTAG option\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
303 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
304 if ( $no_adapt){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
305 $adapt=0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
306 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
307 if (defined $rad ){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
308 chomp $rad;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
309 chomp $radTAG;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
310 die "Error: $rad does not ended by $radTAG\n" if ($rad !~ /$radTAG$/g);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
311 $adapt=0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
312 $check_rad=1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
313 #~ warn $rad, " ",$radTAG;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
314 $radTAG_length=length($radTAG);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
315 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
316 exit unless $result;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
317 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
318
da4101033e10 planemo upload
oinizan
parents:
diff changeset
319
da4101033e10 planemo upload
oinizan
parents:
diff changeset
320
da4101033e10 planemo upload
oinizan
parents:
diff changeset
321 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
322 # Read the barcode file
da4101033e10 planemo upload
oinizan
parents:
diff changeset
323 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
324 sub load_barcode_file ($) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
325 my $filename = shift or croak "Missing barcode file name";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
326
da4101033e10 planemo upload
oinizan
parents:
diff changeset
327 open BCFILE,"<$filename" or die "Error: failed to open barcode file ($filename)\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
328 while (<BCFILE>) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
329 next if m/^#/;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
330 chomp $_;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
331 my @arr = split(' ',$_);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
332 my $ident = $arr[0];
da4101033e10 planemo upload
oinizan
parents:
diff changeset
333 my $barcode = uc($arr[1]);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
334
da4101033e10 planemo upload
oinizan
parents:
diff changeset
335 # Sanity checks on the barcodes
da4101033e10 planemo upload
oinizan
parents:
diff changeset
336 die "Error: bad data at barcode file ($filename) line $.\n" unless defined $barcode;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
337 die "Error: bad barcode value ($barcode) at barcode file ($filename) line $.\n"
da4101033e10 planemo upload
oinizan
parents:
diff changeset
338 unless $barcode =~ m/^[AGCT]+$/;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
339
da4101033e10 planemo upload
oinizan
parents:
diff changeset
340 die "Error: bad identifier value ($ident) at barcode file ($filename) line $. (must be alphanumeric)\n"
da4101033e10 planemo upload
oinizan
parents:
diff changeset
341 unless $ident =~ m/^\w+$/;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
342
da4101033e10 planemo upload
oinizan
parents:
diff changeset
343 die "Error: badcode($ident, $barcode) is shorter or equal to maximum number of " .
da4101033e10 planemo upload
oinizan
parents:
diff changeset
344 "mismatches ($allowed_mismatches). This makes no sense. Specify fewer mismatches.\n"
da4101033e10 planemo upload
oinizan
parents:
diff changeset
345 if length($barcode)<=$allowed_mismatches;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
346
da4101033e10 planemo upload
oinizan
parents:
diff changeset
347 $barcodes_length = length($barcode) unless defined $barcodes_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
348 die "Error: found barcodes in different lengths. this feature is not supported yet.\n"
da4101033e10 planemo upload
oinizan
parents:
diff changeset
349 unless $barcodes_length == length($barcode);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
350
da4101033e10 planemo upload
oinizan
parents:
diff changeset
351 push @barcodes, [$ident, $barcode];
da4101033e10 planemo upload
oinizan
parents:
diff changeset
352
da4101033e10 planemo upload
oinizan
parents:
diff changeset
353 if ($allow_partial_overlap>0) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
354 foreach my $i (1 .. $allow_partial_overlap) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
355 substr $barcode, ($barcodes_at_bol)?0:-1, 1, '';
da4101033e10 planemo upload
oinizan
parents:
diff changeset
356 push @barcodes, [$ident, $barcode];
da4101033e10 planemo upload
oinizan
parents:
diff changeset
357 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
358 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
359 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
360 close BCFILE;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
361
da4101033e10 planemo upload
oinizan
parents:
diff changeset
362 if ($debug) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
363 print STDERR "barcode\tsequence\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
364 foreach my $barcoderef (@barcodes) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
365 my ($ident, $seq) = @{$barcoderef};
da4101033e10 planemo upload
oinizan
parents:
diff changeset
366 print STDERR $ident,"\t", $seq ,"\t",length($seq),"\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
367 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
368 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
369 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
370
da4101033e10 planemo upload
oinizan
parents:
diff changeset
371 # Create one output file for each barcode.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
372 # (Also create a file for the dummy 'unmatched' barcode)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
373 sub create_output_files {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
374 my %barcodes = map { $_->[0] => 1 } @barcodes; #generate a uniq list of barcode identifiers;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
375 $barcodes{'unmatched'} = 1 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
376 $barcodes{'ambiguous'} = 1 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
377
da4101033e10 planemo upload
oinizan
parents:
diff changeset
378 foreach my $ident (keys %barcodes) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
379
da4101033e10 planemo upload
oinizan
parents:
diff changeset
380 my $new_filename = $newfile_prefix_r1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
381 $new_filename =~ s/%/$ident/g;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
382 $filenames{$ident} = $new_filename;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
383 open my $file, ">$new_filename" or die "Error: failed to create output file ($new_filename)\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
384 $files{$ident} = $file ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
385
da4101033e10 planemo upload
oinizan
parents:
diff changeset
386 if (defined $rad && $ident ne "unmatched" && $ident ne "ambiguous"){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
387 my $new_filename = $newfile_prefix_r1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
388 $new_filename =~ s/%/${ident}_2rad/g;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
389 $filenames{$ident."_2rad"} = $new_filename;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
390 open my $file, ">$new_filename" or die "Error: failed to create output file ($new_filename)\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
391 $files{$ident."_2rad"} = $file ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
392 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
393
da4101033e10 planemo upload
oinizan
parents:
diff changeset
394 if ($paired == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
395 my $new_filename_r2 = $newfile_prefix_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
396 $new_filename_r2 =~ s/%/$ident/g;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
397 $filenames{$ident."r2"} = $new_filename_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
398 open my $file, ">$new_filename_r2" or die "Error: failed to create output file ($new_filename)\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
399 $files{$ident."r2"} = $file ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
400
da4101033e10 planemo upload
oinizan
parents:
diff changeset
401 if (defined $rad && $ident ne "unmatched" && $ident ne "ambiguous"){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
402 my $new_filename_r2 = $newfile_prefix_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
403 $new_filename_r2 =~ s/%/${ident}_2rad/g;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
404 $filenames{$ident."_2radr2"} = $new_filename_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
405 open my $file, ">$new_filename_r2" or die "Error: failed to create output file ($new_filename)\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
406 $files{$ident."_2radr2"} = $file ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
407 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
408 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
409 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
410 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
411
da4101033e10 planemo upload
oinizan
parents:
diff changeset
412 sub reverse_complement {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
413 my $dna = shift;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
414
da4101033e10 planemo upload
oinizan
parents:
diff changeset
415 # reverse the DNA sequence
da4101033e10 planemo upload
oinizan
parents:
diff changeset
416 my $revcomp = reverse($dna);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
417
da4101033e10 planemo upload
oinizan
parents:
diff changeset
418 # complement the reversed DNA sequence
da4101033e10 planemo upload
oinizan
parents:
diff changeset
419 $revcomp =~ tr/ACGTacgt/TGCAtgca/;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
420 return $revcomp;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
421 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
422
da4101033e10 planemo upload
oinizan
parents:
diff changeset
423 sub match_sequences {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
424
da4101033e10 planemo upload
oinizan
parents:
diff changeset
425 my %barcodes = map { $_->[0] => 1 } @barcodes; #generate a uniq list of barcode identifiers;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
426 $barcodes{'unmatched'} = 1 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
427 $barcodes{'ambiguous'} = 1 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
428
da4101033e10 planemo upload
oinizan
parents:
diff changeset
429 #reset counters
da4101033e10 planemo upload
oinizan
parents:
diff changeset
430 foreach my $ident ( keys %barcodes ) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
431 $counts{$ident} = 0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
432 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
433
da4101033e10 planemo upload
oinizan
parents:
diff changeset
434 create_output_files;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
435
da4101033e10 planemo upload
oinizan
parents:
diff changeset
436 # Read file FASTQ file
da4101033e10 planemo upload
oinizan
parents:
diff changeset
437 # split accotding to barcodes
da4101033e10 planemo upload
oinizan
parents:
diff changeset
438 do {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
439 chomp $seq_bases;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
440 #~ print STDERR "sequence ".length($seq_bases)." $seq_bases: \n" if $debug;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
441 if ($fastq_format){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
442 chomp $seq_qualities;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
443 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
444
da4101033e10 planemo upload
oinizan
parents:
diff changeset
445 if ($paired == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
446 chomp $seq_bases_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
447 if ($fastq_format){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
448 chomp $seq_qualities_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
449 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
450 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
451 my $best_barcode_mismatches_count = $barcodes_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
452 my $best_barcode_ident = undef;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
453 my $sequence_fragment;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
454 my $sequence_fragment_radTAG;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
455 my $start = 0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
456 my $cut_len = $seq_len;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
457 my $start2 = 0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
458 my $cut_len2 = $seq_len_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
459 # warn $start , " " , $cut_len , " " , $start2 , " " , $cut_len2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
460 #Try all barcodes, find the one with the lowest mismatch count
da4101033e10 planemo upload
oinizan
parents:
diff changeset
461 if ($barcodes_at_bol) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
462 $sequence_fragment = substr $seq_bases, 0, $barcodes_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
463 if (defined $rad)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
464 {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
465 $sequence_fragment_radTAG = substr $seq_bases, $barcodes_length, $radTAG_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
466 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
467 if ($trim){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
468 $start=$barcodes_length+$adapt;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
469 $cut_len=$seq_len-$barcodes_length-$adapt;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
470 if ($trim2) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
471 $start2 = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
472 $cut_len2=$cut_len;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
473 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
474 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
475 } elsif($barcodes_at_eol) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
476 if ($paired != 1){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
477 $sequence_fragment = substr $seq_bases, - $barcodes_length ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
478 if (defined $rad){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
479 $sequence_fragment_radTAG = substr $seq_bases, - ($barcodes_length+$adapt+$radTAG_length), $radTAG_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
480 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
481 if ($trim){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
482 $start=0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
483 $cut_len=$seq_len-$barcodes_length-$adapt;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
484 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
485 } else {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
486 $sequence_fragment = substr $seq_bases_r2, - $barcodes_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
487 if (defined $rad){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
488 $sequence_fragment_radTAG = substr $seq_bases_r2, - ($barcodes_length+$adapt+$radTAG_length), $radTAG_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
489 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
490 if ($trim){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
491 $start2=0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
492 $cut_len2=$seq_len_r2-$barcodes_length-$adapt;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
493 if ($trim2) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
494 $start = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
495 $cut_len=$cut_len2 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
496 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
497 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
498 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
499 #~ warn "sequence fragment : $sequence_fragment\n" if $debug;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
500 #~ $sequence_fragment = reverse_complement $sequence_fragment;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
501 #~ warn "sequence fragment revcomp: $sequence_fragment \n" if $debug;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
502 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
503 # warn $start," ", $cut_len, " ",$start2, " ",$cut_len2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
504
da4101033e10 planemo upload
oinizan
parents:
diff changeset
505 foreach my $barcoderef (@barcodes) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
506 my ($ident, $barcode) = @{$barcoderef};
da4101033e10 planemo upload
oinizan
parents:
diff changeset
507
da4101033e10 planemo upload
oinizan
parents:
diff changeset
508 # Get DNA fragment (in the length of the barcodes)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
509 # The barcode will be tested only against this fragment
da4101033e10 planemo upload
oinizan
parents:
diff changeset
510 # (no point in testing the barcode against the whole sequence)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
511 my $mm=$barcodes_length;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
512
da4101033e10 planemo upload
oinizan
parents:
diff changeset
513 # check validity of RAD.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
514 if (defined $rad){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
515 my $mm_rad = mismatch_count($sequence_fragment_radTAG, $radTAG) ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
516 if ($mm_rad<= $TAG_mm){ $mm = mismatch_count($sequence_fragment, $barcode) ; }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
517 }else {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
518 $mm = mismatch_count($sequence_fragment, $barcode) ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
519 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
520
da4101033e10 planemo upload
oinizan
parents:
diff changeset
521 # if this is a partial match, add the non-overlap as a mismatch
da4101033e10 planemo upload
oinizan
parents:
diff changeset
522 # (partial barcodes are shorter than the length of the original barcodes)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
523 $mm += ($barcodes_length - length($barcode));
da4101033e10 planemo upload
oinizan
parents:
diff changeset
524
da4101033e10 planemo upload
oinizan
parents:
diff changeset
525 if ( $mm < $best_barcode_mismatches_count ) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
526 $best_barcode_mismatches_count = $mm ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
527 $best_barcode_ident = $ident ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
528 }elsif($mm == $best_barcode_mismatches_count){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
529 $best_barcode_ident = "ambiguous"
da4101033e10 planemo upload
oinizan
parents:
diff changeset
530 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
531 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
532
da4101033e10 planemo upload
oinizan
parents:
diff changeset
533 if ($best_barcode_mismatches_count<=$allowed_mismatches && $best_barcode_ident ne 'ambiguous'){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
534 $seq_bases = substr $seq_bases, $start, $cut_len;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
535 if ($fastq_format) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
536 $seq_qualities = substr $seq_qualities, $start, $cut_len;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
537 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
538 if ($paired == 1 ) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
539 $seq_bases_r2 = substr $seq_bases_r2, $start2, $cut_len2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
540 if ($fastq_format) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
541 $seq_qualities_r2 = substr $seq_qualities_r2, $start2, $cut_len2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
542 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
543 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
544 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
545
da4101033e10 planemo upload
oinizan
parents:
diff changeset
546 if ( (!defined $best_barcode_ident) || ($best_barcode_mismatches_count>$allowed_mismatches)){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
547 $best_barcode_ident = 'unmatched';
da4101033e10 planemo upload
oinizan
parents:
diff changeset
548 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
549
da4101033e10 planemo upload
oinizan
parents:
diff changeset
550 #get the file associated with the matched barcode.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
551 #(note: there's also a file associated with 'unmatched' barcode)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
552
da4101033e10 planemo upload
oinizan
parents:
diff changeset
553 if (defined $rad && $best_barcode_ident ne 'unmatched' && $best_barcode_ident ne 'ambiguous'){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
554 my $nb_rad =()= ($seq_bases =~ /$rad/g);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
555 #~ warn $seq_bases," ",$nb_rad;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
556 $best_barcode_ident.='_2rad' if ($nb_rad > 0);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
557 #~ warn $best_barcode_ident;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
558 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
559
da4101033e10 planemo upload
oinizan
parents:
diff changeset
560 $counts{$best_barcode_ident}++;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
561 my $file = $files{$best_barcode_ident};
da4101033e10 planemo upload
oinizan
parents:
diff changeset
562 #~ warn "$seq_name , $best_barcode_ident \n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
563 #~ warn $file ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
564 $write_r2 = 0;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
565 write_record($file);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
566 if ($paired == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
567 #~ warn $best_barcode_ident."r2";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
568 my $file = $files{$best_barcode_ident."r2"};
da4101033e10 planemo upload
oinizan
parents:
diff changeset
569 $write_r2 = 1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
570 #~ warn $best_barcode_ident."r2";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
571 write_record($file);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
572 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
573
da4101033e10 planemo upload
oinizan
parents:
diff changeset
574 } while ( read_record );
da4101033e10 planemo upload
oinizan
parents:
diff changeset
575 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
576
da4101033e10 planemo upload
oinizan
parents:
diff changeset
577 #Quickly calculate hamming distance between two strings
da4101033e10 planemo upload
oinizan
parents:
diff changeset
578 #
da4101033e10 planemo upload
oinizan
parents:
diff changeset
579 #NOTE: Strings must be same length.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
580 # returns number of different characters.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
581 #see http://www.perlmonks.org/?node_id=500235
da4101033e10 planemo upload
oinizan
parents:
diff changeset
582 sub mismatch_count($$) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
583 length( $_[ 0 ] ) - ( ( $_[ 0 ] ^ $_[ 1 ] ) =~ tr[\0][\0] )
da4101033e10 planemo upload
oinizan
parents:
diff changeset
584 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
585
da4101033e10 planemo upload
oinizan
parents:
diff changeset
586
da4101033e10 planemo upload
oinizan
parents:
diff changeset
587
da4101033e10 planemo upload
oinizan
parents:
diff changeset
588 sub print_results
da4101033e10 planemo upload
oinizan
parents:
diff changeset
589 {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
590 print "Barcode\tCount\tLocation\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
591 my $total = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
592 foreach my $ident (sort keys %counts) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
593 print $ident, "\t", $counts{$ident};
da4101033e10 planemo upload
oinizan
parents:
diff changeset
594 if ($paired) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
595 print "(*2)";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
596 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
597 print "\t",$filenames{$ident};
da4101033e10 planemo upload
oinizan
parents:
diff changeset
598 if ($paired) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
599 print " & ".$filenames{$ident."r2"};
da4101033e10 planemo upload
oinizan
parents:
diff changeset
600 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
601 print "\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
602 $total += $counts{$ident};
da4101033e10 planemo upload
oinizan
parents:
diff changeset
603 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
604 print "total\t",$total;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
605 if ($paired) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
606 print "(*2)";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
607 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
608 print "\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
609 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
610
da4101033e10 planemo upload
oinizan
parents:
diff changeset
611
da4101033e10 planemo upload
oinizan
parents:
diff changeset
612 sub read_record
da4101033e10 planemo upload
oinizan
parents:
diff changeset
613 {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
614 $seq_name = $input_file_io->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
615
da4101033e10 planemo upload
oinizan
parents:
diff changeset
616 return undef unless defined $seq_name; # End of file?
da4101033e10 planemo upload
oinizan
parents:
diff changeset
617
da4101033e10 planemo upload
oinizan
parents:
diff changeset
618 $seq_bases = $input_file_io->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
619 die "Error: bad input file, expecting line with sequences\n" unless defined $seq_bases;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
620 $seq_len = length($seq_bases)-1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
621 # If using FASTQ format, read two more lines
da4101033e10 planemo upload
oinizan
parents:
diff changeset
622 if ($fastq_format) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
623 $seq_name2 = $input_file_io->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
624 die "Error: bad input file, expecting line with sequence name2\n" unless defined $seq_name2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
625
da4101033e10 planemo upload
oinizan
parents:
diff changeset
626 $seq_qualities = $input_file_io->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
627 die "Error: bad input file, expecting line with quality scores\n" unless defined $seq_qualities;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
628 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
629
da4101033e10 planemo upload
oinizan
parents:
diff changeset
630 if ($paired == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
631 $seq_name_r2 = $input_file_io_r2->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
632
da4101033e10 planemo upload
oinizan
parents:
diff changeset
633 return undef unless defined $seq_name_r2; # End of file?
da4101033e10 planemo upload
oinizan
parents:
diff changeset
634
da4101033e10 planemo upload
oinizan
parents:
diff changeset
635 $seq_bases_r2 = $input_file_io_r2->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
636 die "Error: bad input file, expecting line with sequences\n" unless defined $seq_bases_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
637 $seq_len_r2 = length($seq_bases_r2)-1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
638 # If using FASTQ format, read two more lines
da4101033e10 planemo upload
oinizan
parents:
diff changeset
639 if ($fastq_format) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
640 $seq_name2_r2 = $input_file_io_r2->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
641 die "Error: bad input file, expecting line with sequence name2\n" unless defined $seq_name2_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
642
da4101033e10 planemo upload
oinizan
parents:
diff changeset
643 $seq_qualities_r2 = $input_file_io_r2->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
644 die "Error: bad input file, expecting line with quality scores\n" unless defined $seq_qualities_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
645 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
646 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
647
da4101033e10 planemo upload
oinizan
parents:
diff changeset
648 return 1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
649 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
650
da4101033e10 planemo upload
oinizan
parents:
diff changeset
651 sub write_record($)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
652 {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
653 if ($write_r2 == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
654 my $file = shift;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
655 croak "Bad file handle" unless defined $file;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
656
da4101033e10 planemo upload
oinizan
parents:
diff changeset
657 print $file $seq_name_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
658 print $file $seq_bases_r2,"\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
659
da4101033e10 planemo upload
oinizan
parents:
diff changeset
660 #if using FASTQ format, write two more lines
da4101033e10 planemo upload
oinizan
parents:
diff changeset
661 if ($fastq_format) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
662 print $file $seq_name2_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
663 print $file $seq_qualities_r2,"\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
664 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
665 } else {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
666 my $file = shift;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
667 croak "Bad file handle $seq_name" unless defined $file;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
668
da4101033e10 planemo upload
oinizan
parents:
diff changeset
669 print $file $seq_name;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
670 print $file $seq_bases,"\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
671
da4101033e10 planemo upload
oinizan
parents:
diff changeset
672 #if using FASTQ format, write two more lines
da4101033e10 planemo upload
oinizan
parents:
diff changeset
673 if ($fastq_format) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
674 print $file $seq_name2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
675 print $file $seq_qualities,"\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
676 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
677 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
678 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
679
da4101033e10 planemo upload
oinizan
parents:
diff changeset
680 sub open_and_detect_input_format
da4101033e10 planemo upload
oinizan
parents:
diff changeset
681 {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
682 # $input_file_io = new IO::Handle;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
683 # die "Failed to open STDIN " unless $input_file_io->fdopen(fileno(STDIN),"r");
da4101033e10 planemo upload
oinizan
parents:
diff changeset
684
da4101033e10 planemo upload
oinizan
parents:
diff changeset
685 if ($ARGV[0] =~ /.gz$/){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
686 die "Failed to open '$ARGV[0]' " unless open ($input_file_io, "<:gzip", $ARGV[0] ) ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
687 #~ $input_file_io = IO::Zlib->new();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
688 #~ die "Failed to open '$ARGV[0]' " unless $input_file_io->open($ARGV[0],"r");
da4101033e10 planemo upload
oinizan
parents:
diff changeset
689 }else{
da4101033e10 planemo upload
oinizan
parents:
diff changeset
690 $input_file_io= IO::Handle->new();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
691 die "Failed to open '$ARGV[0]' " unless open($input_file_io , $ARGV[0]);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
692 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
693
da4101033e10 planemo upload
oinizan
parents:
diff changeset
694
da4101033e10 planemo upload
oinizan
parents:
diff changeset
695 # If a read2 file is provided
da4101033e10 planemo upload
oinizan
parents:
diff changeset
696 if ($#ARGV == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
697 if ($ARGV[1] =~ /.gz$/){
da4101033e10 planemo upload
oinizan
parents:
diff changeset
698 die "Failed to open '$ARGV[0]' " unless open ($input_file_io_r2 , "<:gzip", $ARGV[1] ) ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
699 #~ $input_file_io_r2 = IO::Zlib->new();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
700 #~ die "Failed to open '$ARGV[0]' " unless $input_file_io_r2->open($ARGV[1],"r");
da4101033e10 planemo upload
oinizan
parents:
diff changeset
701 }else{
da4101033e10 planemo upload
oinizan
parents:
diff changeset
702 $input_file_io_r2 = IO::Handle->new();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
703 die "Failed to open '$ARGV[1]' " unless open($input_file_io_r2 , $ARGV[1]);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
704 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
705 $paired = 1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
706 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
707
da4101033e10 planemo upload
oinizan
parents:
diff changeset
708 # Get the first characeter, and push it back
da4101033e10 planemo upload
oinizan
parents:
diff changeset
709 my $first_char = $input_file_io->getc();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
710 $input_file_io->ungetc(ord $first_char);
da4101033e10 planemo upload
oinizan
parents:
diff changeset
711 $seq_name = $input_file_io->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
712 $seq_bases = $input_file_io->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
713 $seq_len = length($seq_bases)-1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
714 if ($paired == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
715 $seq_name_r2 = $input_file_io_r2->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
716 $seq_bases_r2 = $input_file_io_r2->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
717 $seq_len_r2 = length($seq_bases_r2)-1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
718 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
719 if ($first_char eq '>') {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
720 # FASTA format
da4101033e10 planemo upload
oinizan
parents:
diff changeset
721 $fastq_format = 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
722 print STDERR "Detected FASTA format\n" if $debug;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
723 } elsif ($first_char eq '@') {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
724 # FASTQ format
da4101033e10 planemo upload
oinizan
parents:
diff changeset
725 $fastq_format = 1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
726 $seq_name2 = $input_file_io->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
727 die "Error: bad input file, expecting line with sequence name2\n" unless defined $seq_name2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
728 $seq_qualities = $input_file_io->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
729 die "Error: bad input file, expecting line with quality scores\n" unless defined $seq_qualities;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
730 if ($paired == 1) {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
731 $seq_name2_r2 = $input_file_io_r2->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
732 die "Error: bad input file, expecting line with sequence name2\n" unless defined $seq_name2_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
733 $seq_qualities_r2 = $input_file_io_r2->getline();
da4101033e10 planemo upload
oinizan
parents:
diff changeset
734 die "Error: bad input file, expecting line with quality scores\n" unless defined $seq_qualities_r2;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
735 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
736 print STDERR "Detected FASTQ format\n" if $debug;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
737 } else {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
738 die "Error: unknown file format. First character = '$first_char' (expecting > or \@)\n";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
739 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
740 # warn "len1: ", $seq_len, "; len2: ", $seq_len_r2, " ;barcode len: ", $barcodes_length, "; no adapt: ", $adapt;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
741 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
742
da4101033e10 planemo upload
oinizan
parents:
diff changeset
743 sub doc()
da4101033e10 planemo upload
oinizan
parents:
diff changeset
744 {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
745 print<<EOF;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
746 Barcode Splitter, by Assaf Gordon (gordon\@cshl.edu), 11sep2008
da4101033e10 planemo upload
oinizan
parents:
diff changeset
747
da4101033e10 planemo upload
oinizan
parents:
diff changeset
748 This program reads FASTA/FASTQ file and splits it into several smaller files,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
749 Based on barcode matching.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
750 FASTA/FASTQ data is read from STDIN (format is auto-detected.)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
751 Output files will be writen to disk.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
752 Summary will be printed to STDOUT.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
753
da4101033e10 planemo upload
oinizan
parents:
diff changeset
754 usage: $0 read_file_r1 [read_file_r2] --bcfile FILE --prefix-r1 r1.%.fq [--prefix-r2 r2.%.fq] --bol|--eol]
da4101033e10 planemo upload
oinizan
parents:
diff changeset
755 [--mismatches N] [--exact] [--partial N] [--trim|--trim2] [--no_adapt] [--rad STR --radTAG STR] [--help] [--quiet] [--debug]
da4101033e10 planemo upload
oinizan
parents:
diff changeset
756
da4101033e10 planemo upload
oinizan
parents:
diff changeset
757 Arguments:
da4101033e10 planemo upload
oinizan
parents:
diff changeset
758
da4101033e10 planemo upload
oinizan
parents:
diff changeset
759 --bcfile FILE - Barcodes file name. (see explanation below.)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
760 --prefix-r1 PREFIX - File prefix. The sample name will be placed where the % is placed.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
761 --prefix-r2 PREFIX - File prefix. The sample name will be placed where the % is placed.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
762 --bol - Try to match barcodes at the BEGINNING of sequences.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
763 (What biologists would call the 5' end, and programmers
da4101033e10 planemo upload
oinizan
parents:
diff changeset
764 would call index 0.)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
765 --eol - Try to match barcodes at the END of sequences.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
766 (What biologists would call the 3' end, and programmers
da4101033e10 planemo upload
oinizan
parents:
diff changeset
767 would call the end of the string.). Give barcode as they appear in the sequence file, so reverse complement if needed.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
768 NOTE: one of --bol, --eol must be specified, but not both.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
769 --mismatches N - Max. number of mismatches allowed. default is 1.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
770 --exact - Same as '--mismatches 0'. If both --exact and --mismatches
da4101033e10 planemo upload
oinizan
parents:
diff changeset
771 are specified, '--exact' takes precedence.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
772 --partial N - Allow partial overlap of barcodes. (see explanation below.)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
773 (Default is not partial matching)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
774 --trim - Should the barecode be trimmed.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
775 --trim2 - Shoud the read 2 be trimmed to have the same length as the read1
da4101033e10 planemo upload
oinizan
parents:
diff changeset
776 NOTE: one of --trim, --trim2 must be specified, but not both.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
777 --no_adapt - there is no adaptator (T or A ) between barcode and the sequence
da4101033e10 planemo upload
oinizan
parents:
diff changeset
778 --rad STR - sequence are RADSeq, barcode is immediately followed by rad sequence <STR>.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
779 NOTE: --rad implies --no_adapt
da4101033e10 planemo upload
oinizan
parents:
diff changeset
780 --radTAG STR - sequence retain at the begining of the sequence after cliping.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
781 NOTE: if define rad, radTAG must be defined to, and conversely
da4101033e10 planemo upload
oinizan
parents:
diff changeset
782 Barcode will be splited if they are folowed by radTAG, otherwise they will be recorded in the unmatched file.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
783 --TAG_mismatch N - Max. number of mismatches allowed in the radTAG sequence. default is 0.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
784 --quiet - Don't print counts and summary at the end of the run.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
785 (Default is to print.)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
786 --debug - Print lots of useless debug information to STDERR.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
787 --help - This helpful help screen.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
788
da4101033e10 planemo upload
oinizan
parents:
diff changeset
789 Example (Assuming 's_2_100.txt' is a FASTQ file, 'mybarcodes.txt' is
da4101033e10 planemo upload
oinizan
parents:
diff changeset
790 the barcodes file):
da4101033e10 planemo upload
oinizan
parents:
diff changeset
791
da4101033e10 planemo upload
oinizan
parents:
diff changeset
792 \$ cat s_2_100.txt | $0 --bcfile mybarcodes.txt --bol --mismatches 2 \\
da4101033e10 planemo upload
oinizan
parents:
diff changeset
793 --prefix-r1 /tmp/bla_%.txt"
da4101033e10 planemo upload
oinizan
parents:
diff changeset
794
da4101033e10 planemo upload
oinizan
parents:
diff changeset
795 Barcode file format
da4101033e10 planemo upload
oinizan
parents:
diff changeset
796 -------------------
da4101033e10 planemo upload
oinizan
parents:
diff changeset
797 Barcode files are simple text files. Each line should contain an identifier
da4101033e10 planemo upload
oinizan
parents:
diff changeset
798 (descriptive name for the barcode), and the barcode itself (A/C/G/T),
da4101033e10 planemo upload
oinizan
parents:
diff changeset
799 separated by a TAB character. Example:
da4101033e10 planemo upload
oinizan
parents:
diff changeset
800
da4101033e10 planemo upload
oinizan
parents:
diff changeset
801 #This line is a comment (starts with a 'number' sign)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
802 BC1 GATCT
da4101033e10 planemo upload
oinizan
parents:
diff changeset
803 BC2 ATCGT
da4101033e10 planemo upload
oinizan
parents:
diff changeset
804 BC3 GTGAT
da4101033e10 planemo upload
oinizan
parents:
diff changeset
805 BC4 TGTCT
da4101033e10 planemo upload
oinizan
parents:
diff changeset
806
da4101033e10 planemo upload
oinizan
parents:
diff changeset
807 For each barcode, a new FASTQ file will be created (with the barcode's
da4101033e10 planemo upload
oinizan
parents:
diff changeset
808 identifier as part of the file name). Sequences matching the barcode
da4101033e10 planemo upload
oinizan
parents:
diff changeset
809 will be stored in the appropriate file.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
810
da4101033e10 planemo upload
oinizan
parents:
diff changeset
811 Running the above example (assuming "mybarcodes.txt" contains the above
da4101033e10 planemo upload
oinizan
parents:
diff changeset
812 barcodes), will create the following files:
da4101033e10 planemo upload
oinizan
parents:
diff changeset
813 /tmp/bla_BC1.txt
da4101033e10 planemo upload
oinizan
parents:
diff changeset
814 /tmp/bla_BC2.txt
da4101033e10 planemo upload
oinizan
parents:
diff changeset
815 /tmp/bla_BC3.txt
da4101033e10 planemo upload
oinizan
parents:
diff changeset
816 /tmp/bla_BC4.txt
da4101033e10 planemo upload
oinizan
parents:
diff changeset
817 /tmp/bla_unmatched.txt
da4101033e10 planemo upload
oinizan
parents:
diff changeset
818 The 'unmatched' file will contain all sequences that didn't match any barcode.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
819
da4101033e10 planemo upload
oinizan
parents:
diff changeset
820 Barcode matching
da4101033e10 planemo upload
oinizan
parents:
diff changeset
821 ----------------
da4101033e10 planemo upload
oinizan
parents:
diff changeset
822
da4101033e10 planemo upload
oinizan
parents:
diff changeset
823 ** Without partial matching:
da4101033e10 planemo upload
oinizan
parents:
diff changeset
824
da4101033e10 planemo upload
oinizan
parents:
diff changeset
825 Count mismatches between the FASTA/Q sequences and the barcodes.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
826 The barcode which matched with the lowest mismatches count (providing the
da4101033e10 planemo upload
oinizan
parents:
diff changeset
827 count is small or equal to '--mismatches N') 'gets' the sequences.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
828
da4101033e10 planemo upload
oinizan
parents:
diff changeset
829 Example (using the above barcodes):
da4101033e10 planemo upload
oinizan
parents:
diff changeset
830 Input Sequence:
da4101033e10 planemo upload
oinizan
parents:
diff changeset
831 GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
da4101033e10 planemo upload
oinizan
parents:
diff changeset
832
da4101033e10 planemo upload
oinizan
parents:
diff changeset
833 Matching with '--bol --mismatches 1':
da4101033e10 planemo upload
oinizan
parents:
diff changeset
834 GATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
da4101033e10 planemo upload
oinizan
parents:
diff changeset
835 GATCT (1 mismatch, BC1)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
836 ATCGT (4 mismatches, BC2)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
837 GTGAT (3 mismatches, BC3)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
838 TGTCT (3 mismatches, BC4)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
839
da4101033e10 planemo upload
oinizan
parents:
diff changeset
840 This sequence will be classified as 'BC1' (it has the lowest mismatch count).
da4101033e10 planemo upload
oinizan
parents:
diff changeset
841 If '--exact' or '--mismatches 0' were specified, this sequence would be
da4101033e10 planemo upload
oinizan
parents:
diff changeset
842 classified as 'unmatched' (because, although BC1 had the lowest mismatch count,
da4101033e10 planemo upload
oinizan
parents:
diff changeset
843 it is above the maximum allowed mismatches).
da4101033e10 planemo upload
oinizan
parents:
diff changeset
844
da4101033e10 planemo upload
oinizan
parents:
diff changeset
845 Matching with '--eol' (end of line) does the same, but from the other side
da4101033e10 planemo upload
oinizan
parents:
diff changeset
846 of the sequence.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
847
da4101033e10 planemo upload
oinizan
parents:
diff changeset
848 ** With partial matching (very similar to indels):
da4101033e10 planemo upload
oinizan
parents:
diff changeset
849
da4101033e10 planemo upload
oinizan
parents:
diff changeset
850 Same as above, with the following addition: barcodes are also checked for
da4101033e10 planemo upload
oinizan
parents:
diff changeset
851 partial overlap (number of allowed non-overlapping bases is '--partial N').
da4101033e10 planemo upload
oinizan
parents:
diff changeset
852
da4101033e10 planemo upload
oinizan
parents:
diff changeset
853 Example:
da4101033e10 planemo upload
oinizan
parents:
diff changeset
854 Input sequence is ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
da4101033e10 planemo upload
oinizan
parents:
diff changeset
855 (Same as above, but note the missing 'G' at the beginning.)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
856
da4101033e10 planemo upload
oinizan
parents:
diff changeset
857 Matching (without partial overlapping) against BC1 yields 4 mismatches:
da4101033e10 planemo upload
oinizan
parents:
diff changeset
858 ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
da4101033e10 planemo upload
oinizan
parents:
diff changeset
859 GATCT (4 mismatches)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
860
da4101033e10 planemo upload
oinizan
parents:
diff changeset
861 Partial overlapping would also try the following match:
da4101033e10 planemo upload
oinizan
parents:
diff changeset
862 -ATTTACTATGTAAAGATAGAAGGAATAAGGTGAAG
da4101033e10 planemo upload
oinizan
parents:
diff changeset
863 GATCT (1 mismatch)
da4101033e10 planemo upload
oinizan
parents:
diff changeset
864
da4101033e10 planemo upload
oinizan
parents:
diff changeset
865 Note: scoring counts a missing base as a mismatch, so the final
da4101033e10 planemo upload
oinizan
parents:
diff changeset
866 mismatch count is 2 (1 'real' mismatch, 1 'missing base' mismatch).
da4101033e10 planemo upload
oinizan
parents:
diff changeset
867 If running with '--mismatches 2' (meaning allowing upto 2 mismatches) - this
da4101033e10 planemo upload
oinizan
parents:
diff changeset
868 seqeunce will be classified as BC1.
da4101033e10 planemo upload
oinizan
parents:
diff changeset
869
da4101033e10 planemo upload
oinizan
parents:
diff changeset
870 EOF
da4101033e10 planemo upload
oinizan
parents:
diff changeset
871
da4101033e10 planemo upload
oinizan
parents:
diff changeset
872 exit 1;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
873 }
da4101033e10 planemo upload
oinizan
parents:
diff changeset
874
da4101033e10 planemo upload
oinizan
parents:
diff changeset
875 sub version()
da4101033e10 planemo upload
oinizan
parents:
diff changeset
876 {
da4101033e10 planemo upload
oinizan
parents:
diff changeset
877 print STDOUT "1.0";
da4101033e10 planemo upload
oinizan
parents:
diff changeset
878 exit 0 ;
da4101033e10 planemo upload
oinizan
parents:
diff changeset
879 }