hrf_seq: hrf-seq/preprocessing.sh annotate

annotate hrf-seq/preprocessing.sh @ 0:3ecde8c1bd83 draft

Uploaded. Tool tests or/and test files are missing.

author	nikos
date	Thu, 11 Sep 2014 08:21:15 -0400

#!/bin/bash

# Preprocessing workflow: trim adapters, barcodes, etc.

# Barcode in the oligo used: NWTRYSNNNN

# This means that each proper read must begin with the reverse complement of the barcode: NNNN(C|G)(A|G)(C|T)A(A|T)N

# As a regex:

# ^[ACGT][ACGT][ACGT][ACGT][CG][AG][CT][A][AT][ACGT]


# Positional arguments
READ1=${1}            # FASTQ file with the first reads
READ2=${2}            # FASTQ file with the second reads
BARCODE=$3            # barcode in IUPAC notation, e.g. NWTRYSNNNN
ADAPTER1=$4           # 3' adapter trimmed from read 1 (cutadapt -a)
ADAPTER2=$5           # 3' adapter trimmed from read 2 (cutadapt -a)
CUTOFF=$6             # quality cutoff (cutadapt -q)
CUTADAPT_LOG=${7}     # file collecting the cutadapt messages
TRIM_LENGTH=$8        # number of nt trimmed from the 3' end of read 1 and the 5' end of read 2
OVERLAP_LENGTH=$9     # minimum adapter overlap (cutadapt -O)
OUT_R1=${10}          # output FASTQ, read 1
OUT_R2=${11}          # output FASTQ, read 2
OUT_BARCODES=${12}    # output table of read names and barcodes

# Barcode length; plain parameter expansion, no eval needed
BAR_LENGTH=${#BARCODE}

#########################################################################################

# Reverse complement the barcode sequence and create a regular expression.

# All whitespace is stripped first so that the reversed string does not start with a newline.
REV_COMPLEMENT=$(printf '%s' "$BARCODE" | perl -0777 -ne 's/\s+//g; tr/ATGCatgcNnYyRrKkMmBbVvDdHh/TACGtacgNnRrYyMmKkVvBbHhDd/; print scalar reverse $_;')

# Expand each IUPAC code into a bracket expression (upper-case codes only).
REG_EXP=$(sed 's/A/[A]/g;s/C/[C]/g;s/G/[G]/g;s/T/[T]/g;s/R/[AG]/g;s/Y/[CT]/g;s/S/[GC]/g;s/W/[AT]/g;s/K/[GT]/g;s/M/[AC]/g;s/B/[CGT]/g;s/D/[AGT]/g;s/H/[ACT]/g;s/V/[ACG]/g;s/N/[AGCT]/g;' <<< "$REV_COMPLEMENT")

########################################################################
#1. Remove all the reads that do not start with the signature

# (the first awk empties them, the second awk empties the corresponding quality strings), followed by cutadapt.

# After cutadapt, remove the last TRIM_LENGTH nt (e.g. 15), which may be derived from the random primer [this should be optional, as the user may 1) use a different random primer length, or 2) have short reads and lose too much information], and empty all the reads that are shorter than 30 nt

# (10 nt barcode + 20 nt for mapping, i.e. 20 nt plus the barcode length)

# Note: awk strings are 1-indexed. substr() is called with a start of 1 here, because
# a start of 0 drops one extra character in some awk implementations (for example gawk).

echo -e '------------------------Read1------------------------\n' > "$CUTADAPT_LOG"

awk '{if(NR%4==2){if(/^'"$REG_EXP"'/){print}else{print ""}}else{print}}' "$READ1" |
awk 'BEGIN{trim_flag=0; trimming_stats=0; all_processed=0}
{
if(NR%4==1){print; all_processed++}
if(NR%4==2){if(length($1)==0){trim_flag=1;trimming_stats++}else{trim_flag=0};print}
if(NR%4==3){print}
if(NR%4==0){if(trim_flag==1){print ""}else{print $0}}
}END{print(trimming_stats, all_processed) > "trimming_stats.error"}' |
cutadapt -a "$ADAPTER1" -q "$CUTOFF" --format=fastq -O "$OVERLAP_LENGTH" - 2>>"$CUTADAPT_LOG" |
awk -v len="${TRIM_LENGTH}" '{if(NR%2==0){print(substr($1,1,length($1)-len))}else{print}}' |
awk -v len="${BAR_LENGTH}" '{if(NR%2==0 && length($1)<20+len){printf("\n")}else{print}}' | gzip > R1.fastq.gz &

wait

########################################################################
#2. Trim the adapter, primer and possible random barcode from the second read

echo -e '------------------------Read2------------------------\n' >> "$CUTADAPT_LOG"

cutadapt -a "$ADAPTER2" -q "$CUTOFF" --format=fastq -O "$OVERLAP_LENGTH" "$READ2" 2>>"$CUTADAPT_LOG" |
awk -v len1="${TRIM_LENGTH}" -v len2="${BAR_LENGTH}" '{if(NR%2==0){print(substr($0,len1+1,(length($0)-len1-len2)))}else{print($0)}}' - |
awk '{if(NR%2==0 && length($1)<20){printf("\n")}else{print}}' | gzip > R2.fastq.gz &

wait

########################################################################
#3. Remove empty reads: remove each pair for which at least one read of the pair got removed (they are problematic for tophat mapping)

# First define which lines to keep from both fastq files (k for keep, d for discard in the lines_to_keep file)

paste <(zcat R1.fastq.gz) <(zcat R2.fastq.gz) | awk 'BEGIN{OFS="\n"}{if(NR%4==2 && NF==2){print("k","k","k","k")}else{if(NR%4==2 && NF<2){print("d","d","d","d")}}}' > lines_to_keep

paste lines_to_keep <(zcat R1.fastq.gz) | awk '{if($1=="k")print($2,$3)}' | gzip > R1_readsANDbarcodes.fastq.gz &

paste lines_to_keep <(zcat R2.fastq.gz) | awk '{if($1=="k")print($2,$3)}' > "$OUT_R2" &

wait

########################################################################
#4. Extract the barcode sequence from the first read:
# The first command writes the reads without the barcode, the second writes read name and barcode as a two-column table.
# substr() starts at 1 for the barcode: awk strings are 1-indexed, and a start of 0 returns only len-1 characters in some awk implementations (for example gawk).
zcat R1_readsANDbarcodes.fastq.gz | awk -v len="${BAR_LENGTH}" '{if(NR%2==0 && length($1)<20+len){printf("\n")}else{if(NR%2==0){print(substr($0,len+1,length($0)))}else{print($0)}}}' > "$OUT_R1" &

zcat R1_readsANDbarcodes.fastq.gz | awk -v len="${BAR_LENGTH}" '{if(NR%4==1){print($1)}else{if(NR%4==2){print(substr($0,1,len))}}}' | paste - - > "${OUT_BARCODES}" &

wait

########################################################################
#6. Remove temporary fastq files

rm R1.fastq.gz R2.fastq.gz R1_readsANDbarcodes.fastq.gz
rm lines_to_keep

########################################################################
#7. Problem: awk adds spaces at the end of the strings in the fastq files. Remove them:

mv "$OUT_R1" R1.temp.fastq
mv "$OUT_R2" R2.temp.fastq

# Keep only the first whitespace-delimited field of every line
awk '{print($1)}' R1.temp.fastq > "$OUT_R1" &
awk '{print($1)}' R2.temp.fastq > "$OUT_R2" &

wait
rm R1.temp.fastq R2.temp.fastq
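
The barcode handling in the script can be checked in isolation. The following is a minimal sketch and not part of the original script: `revcomp_iupac`, `iupac_to_regex` and `starts_with_signature` are hypothetical helper names, only upper-case IUPAC codes are assumed, and the string reversal is done in awk instead of perl. The sed expression is the one the script uses.

```shell
#!/bin/sh
# Hypothetical helpers for testing the barcode signature outside the pipeline.

# Reverse complement an upper-case IUPAC string (S, W and N are their own complements).
revcomp_iupac() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTRYKMBVDH' 'TGCAYRMKVBHD'
}

# Expand every IUPAC code into a bracket expression.
# Plain bases are handled first, so the bases inserted by later rules are not rewritten.
iupac_to_regex() {
    printf '%s\n' "$1" |
        sed 's/A/[A]/g;s/C/[C]/g;s/G/[G]/g;s/T/[T]/g;s/R/[AG]/g;s/Y/[CT]/g;s/S/[GC]/g;s/W/[AT]/g;s/K/[GT]/g;s/M/[AC]/g;s/B/[CGT]/g;s/D/[AGT]/g;s/H/[ACT]/g;s/V/[ACG]/g;s/N/[AGCT]/g;'
}

# Succeed if the read ($1) starts with the signature regex ($2).
starts_with_signature() {
    printf '%s\n' "$1" | grep -Eq "^$2"
}

rc=$(revcomp_iupac NWTRYSNNNN)
regex=$(iupac_to_regex "$rc")
echo "$rc"    # prints NNNNSRYAWN
if starts_with_signature ACGTGACATGTTTT "$regex"; then
    echo "read carries the barcode signature"
fi
```

Running a few known reads through `starts_with_signature` before a full run is a cheap way to confirm that the barcode passed as the third argument produces the expected expression.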

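
The pair filtering of step 3 and the space clean-up of step 7 could possibly be done in one pass. The sketch below is an alternative formulation under stated assumptions, not the method used by the script: `keep_complete_pairs` is a hypothetical name, the inputs are uncompressed four-line FASTQ files of equal length, and no line contains a tab character. Splitting on tabs instead of blanks keeps the read headers intact and does not append trailing spaces.

```shell
#!/bin/sh
# Hypothetical helper: keep_complete_pairs R1.fastq R2.fastq OUT1.fastq OUT2.fastq
# Writes only those pairs in which both reads still have a non-empty sequence line.
keep_complete_pairs() {
    # Create (or truncate) the outputs so they exist even if no pair survives.
    : > "$3"
    : > "$4"
    paste "$1" "$2" |
        awk -F '\t' -v out1="$3" -v out2="$4" '
            # Buffer the current record of both files, indexed by line position.
            { a[NR % 4] = $1; b[NR % 4] = $2 }
            # At the quality line the record is complete: emit it if both sequences are non-empty.
            NR % 4 == 0 {
                if (a[2] != "" && b[2] != "") {
                    print a[1] "\n" a[2] "\n" a[3] "\n" a[0] > out1
                    print b[1] "\n" b[2] "\n" b[3] "\n" b[0] > out2
                }
            }'
}
```

With gzip-compressed intermediates such as `R1.fastq.gz`, the files would have to be decompressed first, for example with `zcat` into temporary files.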