annotate preprocessing.sh @ 2:14d6929f8aa1 draft default tip

Tool tests or/and test-data are missing.
author nikos
date Thu, 11 Sep 2014 08:30:05 -0400
#!/bin/bash

# Preprocessing workflow - trimming adapters, barcodes, etc.

# Barcode in the oligo used: NWTRYSNNNN

# Which means that each proper read must begin with: NNNN(C|G)(A|G)(C|T)A(A|T)N

# As regex:

# ^[ACGT][ACGT][ACGT][ACGT][CG][AG][CT][A][AT][ACGT]


READ1=${1}            # FASTQ file with the first reads
READ2=${2}            # FASTQ file with the second reads
BARCODE=${3}          # barcode sequence (IUPAC codes), e.g. NWTRYSNNNN
ADAPTER1=${4}         # adapter passed to cutadapt -a for read 1
ADAPTER2=${5}         # adapter passed to cutadapt -a for read 2
CUTOFF=${6}           # quality cutoff passed to cutadapt -q
CUTADAPT_LOG=${7}     # file collecting the cutadapt reports
TRIM_LENGTH=${8}      # number of nt to trim (random primer)
OVERLAP_LENGTH=${9}   # minimum overlap passed to cutadapt -O
OUT_R1=${10}          # output FASTQ for read 1
OUT_R2=${11}          # output FASTQ for read 2
OUT_BARCODES=${12}    # output table of read names and barcodes

BAR_LENGTH=${#BARCODE}

#########################################################################################

# Reverse complement the barcode sequence and create a regular expression.
# All whitespace (including the trailing newline) is stripped before reversing,
# so the result does not start with a newline.

REV_COMPLEMENT=$(perl -0777 -ne 's/\s+//g; tr/ATGCatgcNnYyRrKkMmBbVvDdHh/TACGtacgNnRrYyMmKkVvBbHhDd/; print scalar reverse $_;' <<< "$BARCODE")

REG_EXP=$(sed 's/A/[A]/g;s/C/[C]/g;s/G/[G]/g;s/T/[T]/g;s/R/[AG]/g;s/Y/[CT]/g;s/S/[GC]/g;s/W/[AT]/g;s/K/[GT]/g;s/M/[AC]/g;s/B/[CGT]/g;s/D/[AGT]/g;s/H/[ACT]/g;s/V/[ACG]/g;s/N/[AGCT]/g;' <<< "$REV_COMPLEMENT")

########################################################################
#1. Remove all the reads that do not start with the signature

# (first awk removes them, second awk removes corresponding quality strings) followed by cutadapt.

# After cutadapt, remove the last TRIM_LENGTH nt (e.g. 15), which may be derived from the random primer
# [this should be optional, as the user may: 1) use a different random primer length,
# 2) have short reads and would lose too much information], and remove all the reads
# that are shorter than BAR_LENGTH + 20 nt, e.g. 30 nt

# (10 nt barcode + 20 nt for mapping)

echo -e '------------------------Read1------------------------\n' > "$CUTADAPT_LOG"

# Note: awk's substr() is 1-based; a start position of 0 drops one extra
# character in some implementations (e.g. gawk), so position 1 is used here.
awk '{if(NR%4==2){if(/^'"$REG_EXP"'/){print}else{print ""}}else{print}}' "$READ1" |
awk 'BEGIN{trim_flag=0; trimming_stats=0; all_processed=0}
{
if(NR%4==1){print; all_processed++}
if(NR%4==2){if(length($1)==0){trim_flag=1;trimming_stats++}else{trim_flag=0};print}
if(NR%4==3){print}
if(NR%4==0){if(trim_flag==1){print ""}else{print $0}}
}END{print(trimming_stats, all_processed) > "trimming_stats.error"}' |
cutadapt -a "$ADAPTER1" -q "$CUTOFF" --format=fastq -O "$OVERLAP_LENGTH" - 2>>"$CUTADAPT_LOG" |
awk -v len="${TRIM_LENGTH}" '{if(NR%2==0){print(substr($1,1,length($1)-len))}else{print}}' |
awk -v len="${BAR_LENGTH}" '{if(NR%2==0 && length($1)<20+len){printf("\n")}else{print}}' | gzip > R1.fastq.gz &

wait

########################################################################
#2. Trim the adapter, primer and possible random barcode from the second read

echo -e '------------------------Read2------------------------\n' >> "$CUTADAPT_LOG"

cutadapt -a "$ADAPTER2" -q "$CUTOFF" --format=fastq -O "$OVERLAP_LENGTH" "$READ2" 2>>"$CUTADAPT_LOG" |
awk -v len1="${TRIM_LENGTH}" -v len2="${BAR_LENGTH}" '{if(NR%2==0){print(substr($0,len1+1,(length($0)-len1-len2)))}else{print($0)}}' - |
awk '{if(NR%2==0 && length($1)<20){printf("\n")}else{print}}' | gzip > R2.fastq.gz &

wait

########################################################################
#3. Remove empty reads - remove each pair for which at least one read of the pair got removed (they are problematic for tophat mapping)

# First define which lines to keep from both fastq files (k for keep, d for discard in the lines_to_keep file)

paste <(zcat R1.fastq.gz) <(zcat R2.fastq.gz) | awk 'BEGIN{OFS="\n"}{if(NR%4==2 && NF==2){print("k","k","k","k")}else{if(NR%4==2 && NF<2){print("d","d","d","d")}}}' > lines_to_keep

paste lines_to_keep <(zcat R1.fastq.gz) | awk '{if($1=="k")print($2,$3)}' | gzip > R1_readsANDbarcodes.fastq.gz &

paste lines_to_keep <(zcat R2.fastq.gz) | awk '{if($1=="k")print($2,$3)}' > "$OUT_R2" &

wait

########################################################################
#4. Extract the barcode sequence from the first read:
# (substr() is 1-based, so the barcode is taken from position 1; a start of 0
# would return one character fewer in some awk implementations, e.g. gawk)
zcat R1_readsANDbarcodes.fastq.gz | awk -v len="${BAR_LENGTH}" '{if(NR%2==0 && length($1)<20+len){printf("\n")}else{if(NR%2==0){print(substr($0,len+1,length($0)))}else{print($0)}}}' > "$OUT_R1" &

zcat R1_readsANDbarcodes.fastq.gz | awk -v len="${BAR_LENGTH}" '{if(NR%4==1){print($1)}else{if(NR%4==2){print(substr($0,1,len))}}}' | paste - - > "${OUT_BARCODES}" &

wait

########################################################################
#5. Remove temporary fastq files

rm R1.fastq.gz R2.fastq.gz R1_readsANDbarcodes.fastq.gz
rm lines_to_keep

########################################################################
#6. Problem: awk adds spaces at the end of the lines in the fastq files
# (print($2,$3) emits a trailing space when the third field is empty).
# Remove them by keeping only the first field of every line:

mv "$OUT_R1" R1.temp.fastq
mv "$OUT_R2" R2.temp.fastq

awk '{print($1)}' R1.temp.fastq > "$OUT_R1" &
awk '{print($1)}' R2.temp.fastq > "$OUT_R2" &

wait
rm R?.temp.fastq
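
The comments at the top of the script derive the read signature from the barcode by hand. The sketch below reproduces that derivation for the example barcode NWTRYSNNNN and checks two reads against the result. It is only an illustration, not part of the tool: the helper names (revcomp_iupac, iupac_to_regex, has_signature) are hypothetical, the two reads are made up, the input is assumed to be uppercase, and the reversal is done in pure bash rather than with perl as in the script.

```shell
#!/bin/bash
# Illustrative sketch only: revcomp_iupac, iupac_to_regex and has_signature are
# hypothetical helper names, and the reads below are made up.

# Reverse-complement an uppercase IUPAC nucleotide string.
revcomp_iupac() {
    local seq=$1 reversed="" i
    for (( i=${#seq}-1; i>=0; i-- )); do
        reversed+=${seq:i:1}
    done
    # S, W and N are their own complements, so they are left untouched.
    tr 'ACGTRYKMBVDH' 'TGCAYRMKVBHD' <<< "$reversed"
}

# Expand every IUPAC code into a bracket expression (same sed program as the script).
iupac_to_regex() {
    sed 's/A/[A]/g;s/C/[C]/g;s/G/[G]/g;s/T/[T]/g;s/R/[AG]/g;s/Y/[CT]/g;s/S/[GC]/g;s/W/[AT]/g;s/K/[GT]/g;s/M/[AC]/g;s/B/[CGT]/g;s/D/[AGT]/g;s/H/[ACT]/g;s/V/[ACG]/g;s/N/[AGCT]/g;' <<< "$1"
}

# Succeed if the read starts with the signature described by the regex.
has_signature() {
    local read_seq=$1 regex=$2
    [[ $read_seq =~ ^$regex ]]
}

signature=$(revcomp_iupac "NWTRYSNNNN")
regex=$(iupac_to_regex "$signature")
echo "$signature"   # NNNNSRYAWN
echo "$regex"       # [AGCT][AGCT][AGCT][AGCT][GC][AG][CT][A][AT][AGCT]

# The first read carries the signature, the second has an A at position 5.
for read_seq in ACGTCATAAGTTTT ACGTAATAAGTTTT; do
    if has_signature "$read_seq" "$regex"; then
        echo "$read_seq keep"
    else
        echo "$read_seq discard"
    fi
done
```

The printed regex is equivalent to the one in the header comment; only the order of the letters inside each bracket differs.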