annotate change_fasta_deflines.py @ 0:6201f462adb7 draft

planemo upload for repository https://github.com/Public-Health-Bioinformatics/flu_classification_suite commit 561cde8c8bd4a6164b1bef19ecff9809ac3340e0-dirty
author public-health-bioinformatics
date Wed, 09 Jan 2019 15:03:03 -0500
parents
children 909a1db7c7ce
#!/usr/bin/env python
import sys, argparse
'''Accepts either csv (default) or tab-delimited files with old/new sequence names, creating a dictionary of
respective key:value pairs. Parses an input fasta file for 'old' names, replacing them with 'new' names, writing
renamed sequences to a fasta file. NOTE: use of a tab-delimited text file for renaming requires '-t' on the command line.'''
#USAGE EXAMPLE 1: python change_fasta_deflines.py csv_rename_file.csv fasta_2_rename.fasta renamedSequences.fasta
#USAGE EXAMPLE 2: python change_fasta_deflines.py tab_delim_rename_file.txt -t fasta_2_rename.fasta renamedSequences.fasta

'''Author: Diane Eisler, Molecular Microbiology & Genomics, BCCDC Public Health Laboratory, Sept 2017'''

#parse command line arguments
parser = argparse.ArgumentParser()
parser.add_argument("-t", "--tab_delim", help = "name fasta definition lines from tab-delim file", action = "store_true")
parser.add_argument("inFileHandle") #csv (or tab-delim, with '-t') file with current fasta names in column 1 and desired names in column 2
parser.add_argument("inFileHandle2") #fasta file containing sequences requiring name replacement
parser.add_argument("outFileHandle") #user-specified output filename
args = parser.parse_args()

#open a writable output file that will be over-written if it already exists
outfile= open(args.outFileHandle,'w')
dict = {} #dictionary to hold old_name:new_name key:value pairs (NOTE: this name shadows the built-in dict type)
splitter = ',' #default char to split lines at
#determine if input naming file is csv (default) or tab delim text
if args.tab_delim:
    splitter = '\t' #change splitter to tab if command line args contain '-t'

#create dictionary using key/value pairs from csv or tab-delim file of old/new names
with open(args.inFileHandle,'r') as inputFile:
    #read in each line and split at the delimiter into key:value pairs
    for line in inputFile:
        #remove whitespace from end of lines, split at the delimiter, assigning to key:value pairs
        line2 = line.rstrip()
        splitLine = line2.split(splitter)
        old_name = splitLine[0]
        new_name = splitLine[1]
        dict[old_name] = new_name

#parse fasta deflines for 'old' names and, if found, replace with new names
with open(args.inFileHandle2,'r') as inputFile2:
    for line in inputFile2:
        #find the definition lines (those starting with '>'), remove trailing whitespace & '>'
        if line.startswith(">"):
            originalDefline = line.rstrip().replace(">","",1)
            #check for a match to any of the dict keys ('in' works in both Python 2 and 3; has_key() is Python 2 only)
            if originalDefline in dict:
                #look up the new name for that key
                newDefline= dict[originalDefline]
                #print("the new name", newDefline)
                # print each item to make sure the right name is being entered
                outfile.write(">" + newDefline + "\n")
            else:
                #report the unmatched name and write out the original defline sequence name
                print("Defline not in dictionary: " + originalDefline)
                outfile.write(">" + originalDefline + "\n")
        else:
            #in lines without a leading ">", write out sequence as it was
            seq = line.rstrip()
            outfile.write(seq+"\n")

#both input files are closed automatically by their 'with' blocks; only the output file needs closing
outfile.close()
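
The renaming logic of the script above can also be expressed as two small functions that work on any iterable of lines, which makes it easier to test without files on disk. This is only a minimal sketch under stated assumptions: the function names `load_name_map` and `rename_deflines` and the sample sequence names are hypothetical and do not appear in the original script, and, unlike the script, the sketch silently skips rename rows with fewer than two fields instead of raising an IndexError.

```python
def load_name_map(rows, delimiter=","):
    """Build an old_name -> new_name dictionary from delimited text rows."""
    name_map = {}
    for row in rows:
        fields = row.rstrip().split(delimiter)
        if len(fields) < 2:
            # Assumption: blank or malformed rows are skipped rather than treated as errors.
            continue
        name_map[fields[0]] = fields[1]
    return name_map


def rename_deflines(fasta_lines, name_map):
    """Yield fasta lines, replacing definition lines that have an entry in name_map."""
    for line in fasta_lines:
        line = line.rstrip()
        if line.startswith(">"):
            defline = line[1:]
            # Fall back to the original name when no replacement is listed.
            yield ">" + name_map.get(defline, defline)
        else:
            yield line


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    names = load_name_map(["seq1,A/Example/1/2017\n", "seq2,A/Example/2/2017\n"])
    fasta = [">seq1\n", "ACGT\n", ">seq3\n", "TTGA\n"]
    for out_line in rename_deflines(fasta, names):
        print(out_line)
```

Passing `delimiter="\t"` to `load_name_map` corresponds to running the script with `-t`.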