changeset 4:6709645eff5d draft
planemo upload for repository https://github.com/abims-sbr/adaptsearch commit cf1b9c905931ca2ca25faa4844d45c908756472f
--- a/pairwise.xml Wed Sep 27 10:01:55 2017 -0400 +++ b/pairwise.xml Wed Jan 17 08:53:53 2018 -0500 @@ -22,6 +22,7 @@ #end for #set $infiles = $infiles[:-1] + ln -s $__tool_directory__/scripts/functions.py . && ln -s $__tool_directory__/scripts/S02_xxx_patron_pipeline.sh . && ln -s $__tool_directory__/scripts/S03_run_blast_with_k_filter.sh . && ln -s $__tool_directory__/scripts/S04_run_blast2_with_k_filter.sh . && @@ -54,6 +55,16 @@ </outputs> <tests> + <test> + <param name="inputs" ftype="fasta" value="inputs2/PfPfiji_trinity.fasta,inputs2/ApApomp_trinity.fasta,inputs2/AmAmphi_trinity.fasta,inputs2/AcAcaud_trinity.fasta" /> + <param name="e-value" value="1e-5" /> + <output_collection name="output_fasta_dna" type="list"> + <element name="DNAalignment_corresponding_to_protein_from_RBH_AmAmphi_AcAcaud" value="outputs_dna2/DNAalignment_corresponding_to_protein_from_RBH_AmAmphi_AcAcaud.fasta" /> + <element name="DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_AcAcaud" value="outputs_dna2/DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_AcAcaud.fasta" /> + <element name="DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_AmAmphi" value="outputs_dna2/DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_AmAmphi.fasta" /> + <element name="DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_ApApomp" value="outputs_dna2/DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_ApApomp.fasta" /> + </output_collection> + </test> <test> <param name="inputs" ftype="fasta" value="inputs/PfPfiji_Trinity.fasta,inputs/ApApomp_Trinity.fasta,inputs/AmAmphi_Trinity.fasta,inputs/AcAcaud_Trinity.fasta" /> <param name="e-value" value="1e-5" /> @@ -94,185 +105,45 @@ <help> -@HELP_AUTHORS@ - -============ -What it does -============ - -| This tool takes a 'data collection list' containing nucleic fasta sequence files and searches different homologous genes from pairwise comparaisons. -| There are 3 outputs. -| - --------- + @HELPAUTHORS@ + +<![CDATA[ +--------- -========== -Parameters -========== - -The choice of parameters is possible : +**Description** -**-e** : - | is the option for the choice of the e-value. - | By default it's 10. - | - +This tool searches for different homologous genes from pairwise comparisons between a set of fasta files (one file per species). + -------- -======= -Outputs -======= - -This tool, produces the following files : - -**Pairwise**: - | is the general output. It gives the information about what the tool is doing (for each pairwise). - | +**Parameters** -**Pairwise DNA**: - | is the output wich contains nucleic sequences (of the pairwise) that are homologues. The sequences are with nucleotides. Shows: - | the name of the query sequence - | the part of the sequence in nucleotides - | the name of the match sequence - | the part of the sequence in nucleotides - | - -**Pairwise PROT**: - | is the output wich contains proteic sequences (of the pairwise) that are homologues. The sequences are with protein. 
Shows: - | the name of the query sequence (the name of the sequence || the position (Start and End) of the homologous sequences || divergence || number of gaps || real divergence || the length of the homologous sequence) - | the part of the sequence in protein - | the name of the match sequence (the name of the sequence || the position (Start and End) of the homologous sequences || divergence || number of gaps || real divergence || the length of the homologous sequence) - | the part of the sequence in protein + - 'Input files' : a collection of fasta files (one file per species) + - 'e_value' : the blast e-value. By default it's 1e-5. -------- -=============== -Working Example -=============== - ---------------------------- -The input files and options ---------------------------- +**Outputs** -**Input files** - | 3 files with 200 nucleic sequences each : Ap.fasta, Ac.fasta et Pp.fasta - | -**Parameters** - | e-value = 1e-20 - | - ----------------- -The output files ----------------- - -**Pairwise** + - 'Pairwise' : the general output. It gives the information about what the tool has done for each pairwise. -| -------------------- Pairwise Pp_Ap -------------------- -| -| database : Pp.fasta -| query file : Ap.fasta -| -| ***** START run BLAST ***** -| ***** END run BLAST ***** -| -| -| database : Ap.fasta -| query file : only the sequences of Pp.fasta who matched during the last BLAST -| -| ***** START run BLAST ***** -| ***** END run BLAST ***** -| -| [3/5] Get pairs of sequences ... -| Get list of fasta name involved in RBH -| Number of pairwises parsed = 15 -| Get subset of Alvinella db -| Get subset of Paralvinella db -| -| -------------------- Pairwise Pp_Ac -------------------- -| -| database : Pp.fasta -| query file : Ac.fasta -| -| ***** START run BLAST ***** -| ***** END run BLAST ***** -| -| -| database : Ac.fasta -| query file : only the sequences of Pp.fasta who matched during the last BLAST -| -| ***** START run BLAST ***** -| ***** END run BLAST ***** -| -| [3/5] Get pairs of sequences ... -| Get list of fasta name involved in RBH -| Number of pairwises parsed = 13 -| Get subset of Alvinella db -| Get subset of Paralvinella db -| -| -| -------------------- Pairwise Ap_Ac -------------------- -| -| database : Ap.fasta -| query file : Ac.fasta -| -| ***** START run BLAST ***** -| ***** END run BLAST ***** -| -| -| database : Ac.fasta -| query file : only the sequences of Ap.fasta who matched during the last BLAST -| -| ***** START run BLAST ***** -| ***** END run BLAST ***** -| -| [3/5] Get pairs of sequences ... -| Get list of fasta name involved in RBH -| Number of pairwises parsed = 24 -| Get subset of Alvinella db -| Get subset of Paralvinella db -| -| + - 'Pairwise_DNA' : the output which contains nucleic sequences (of the pairwise) that are homologous. The sequences are with nucleotides. It shows for both the query and match : + the name + the sequence in nucleotides + + - 'Pairwise_PROT' : the output which contains proteic sequences (of the pairwise) that are homologous. The sequences are with protein. 
It shows : + Name, position, length, and part of the sequence in protein for query and match sequences + Divergence + Number of gaps + Real divergence -**Pairwise_output_file_PROT** +-------- -| For example the 4 last sequences of the file 19_ReciprocalBestHits_Pp_Ap.fasta -| -| >Ap123_1/1_1.000_748||254...478||[[1/1]][[1/6]]||29.3333333333||0||29.3333333333||75.0 -| FVRITVGDEMSRRPKFAMITWVGPEVSPMKRAKVSTDKAFVKQIFQNFAKEIQTSERSELEEEYVRQEVMKAGGA -| >Pp_146_1/2_1.000_713||259...483||[[1/1]][[1/6]]||29.3333333333||0||29.3333333333||75.0 -| FAYIRCTNEESKRSKFAMITWIGQGVEAMKRAKVSMDKQFLKEIFQNFAREFQTSEKSELDEVCIKHALAIDDGA -| >Ap66_1/1_1.000_400||192...398||[[1/1]][[1/6]]||21.7391304348||0||21.7391304348||69.0 -| LSTSLLNWRKHTLCF*GMKLILIILLISFIIPAILFLLSIFTTMRMPESREKFRPYECGFDPNHSARTP -| >Pp_201_2/2_1.000_691||14...220||[[1/1]][[1/6]]||21.7391304348||0||21.7391304348||69.0 -| LSTSLLN*RKQPFASEEMKLLILLLFISALIPRILIILSIFTSIRTPKNREKSSPYECGFDPNHSARTP -| -| - -**Pairwise_output_file_DNA** +**The AdaptSearch Pipeline** -| For example the 4 last sequences of the file 25_DNAalignement_corresponding_to_protein_from_19_RBH_Pp_Ap.fasta -| -| >Ap123_1/1_1.000_748 -| CCAGTAACAAGCCGCCACGGGTCCGTCGTGTCTTCTCTTCAAGGAAAGGTTGACAGATTCTCGTACGCTAGACGTCGCCACCTACTCGTCCTGGACTCCGGTGCCGTAGGTGGCGCCACCTGCTTTCATCACTTCCTGCCTA -| ACGTACTCCTCTTCTAGCTCCGATCTCTCGCTCGTCTGGATCTCTTTGGCAAAGTTCTGGAATATCTGCTTGACGAACGCCTTGTCCGTGCTGACTTTGGCGCGCTTCATTGGGCTCACTTCCGGTCCGACCCACGTGATCA -| TGGCGAACTTCGGTCTTCTGCTCATTTCGTCCCCGACGGTAATACGGACAAAGGCGAACGCCCGCTGGTCATCTTGTAGTTTTGATAACAGATCCTCGTATTCGGTTCCTGTAGAGTCCAGTATAATATTGTCGCCATCATA -| CGTCACAAACGCCCAGTTTGTCTCCGTCGCGTCGCTCCTGACGTCTTCGTAAGCCTGTCCGATAGCCTCTCTGTCGATGTCTGCCATGCTGCTGGTCCCGCTCTCGACGCTAATGAGCCAATCACGACTTCTGACAGACGAG -| TAGACATGCAGACAGCCAGACGGACTGACGGACTGACG -| >Pp_146_1/2_1.000_713 -| CATTAATTGTGTGTCTGGTTGTGGGTGTGTGTTATAAGAGACATCACTTAGTGTATACTGATGTCCACGTGGTAGTTGACCAGCATGTCGAATATGGATAGGGACTCGATCTTGAATGGCTATGAGGAGGTTCGCAACGACGA -| CTCGGACATTAACTGGGCTTTCGTAACGTATTCACCTGACAACAAACTAGTACTTGATTCAACTGGCACAGACTACTTCCAGCTCCAGGAGAAATATCAAGATGATATGCGAGGATTTGCTTACATCCGGTGCACTAACGAGG -| AGAGTAAACGTTCTAAATTTGCCATGATTACCTGGATTGGACAAGGAGTGGAAGCAATGAAGCGTGCCAAGGTCAGCATGGACAAACAGTTCCTAAAGGAAATCTTCCAGAATTTCGCAAGAGAATTTCAGACGAGTGAAAAG -| TCAGAGCTTGATGAGGTCTGTATTAAACACGCGCTTGCCATTGACGATGGAGCTGGTTGCAAAGTGGAAAGCGAGGACACGAGAAAAGGGGCCTTTCTCAGGAAAGAGGATGACACTGAAGTGGAAAGGGAAACTAATGTCAA -| CAATGTCTCCGGTGTCGTGGAAGAAGATGATGACGCAAAAAATGCAAATGATTTTAATTACGAAGAGGACTGTAACAATGAATAGGTGCATGTCGATGATTTATATAGAGAACTAGACTTCGCACTCGCTAGGTGGTTGAT -| >Ap66_1/1_1.000_400 -| TGATCGTCTTATAAACCTAACTTGAAAAACCTTCCTACCATTTAGGGCTAGCAGCCCTATTAATTATCACACCTATCGCAGCGCTCTCACTATAATTATAAGTATTGCGCCGGGTTTGAACGGATAGCTCTGATGCTGCTAATT -| ACGGGACCTAATAATCCCCAATACTTTATCCTTAGAGAGCTGTACCTCTTAGCACCAGTCTTTTAAACTGGCGAAAGCACACTTTATGCTTCTAAGGAATGAAACTAATTCTTATAATCCTACTAATCTCTTTTATCATCCCCG -| CCATTCTATTTTTACTCTCGATCTTTACTACTATGCGCATGCCAGAGAGCCGTGAAAAATTTAGGCCCTACGAGTGCGGGTTTGACCCCAATCACTCGGCCCGAACCCCATT -| >Pp_201_2/2_1.000_691 -| ATCGTAGGGAAAAAGGTGTTCGTGCAGAATGATTGGGGTCAAATCCACATTCGTAGGGGCTAGATTTTTCACGGTTTTTAGGTGTACGAATAGAGGTGAAGATTGATAGGATGATTAAAATTCTTGGGATTAATGCTGAAATAAA -| GAGAAGTAGGATTAAAAGTTTCATTTCCTCAGAAGCAAAGGGTTGCTTTCGTCAGTTTAAAAGACTGGTGCTAAGTAGGTACAGCTCTCTAAGGG +.. 
image:: ../../adaptsearch_picture_helps.png :heigth: 593 :width: 852 ---------------------------------------------------- +--------- Changelog --------- @@ -284,10 +155,11 @@ **Version 1.0 - 13/04/2017** - - TEST: Add funtional test with planemo + - TEST: Add functional test with planemo - IMPROVEMENT: Use conda dependencies for blast, samtools and python + ]]> </help> <expand macro="citations" />
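The rewritten help describes the Pairwise_PROT headers only informally, while the old help kept a concrete example such as >Ap123_1/1_1.000_748||254...478||[[1/1]][[1/6]]||29.3333333333||0||29.3333333333||75.0. As a hedged illustration (not part of this changeset; parse_prot_header is a hypothetical name, and the field order is inferred from that example and from the write() calls in S05/S08), such a header can be decomposed like this:

def parse_prot_header(header):
    # ">name||start...end||[[match/matches]][[submatch/submatches]]||divergence||gaps||real_divergence||length"
    fields = header.lstrip(">").split("||")
    name = fields[0]                                         # query or match sequence name
    start, end = [int(x) for x in fields[1].split("...")]    # position of the homologous region
    divergence = float(fields[3])                            # percent divergence, gaps included
    gaps = int(fields[4])                                    # number of gap positions
    real_divergence = float(fields[5])                       # percent divergence once gaps are removed
    length = float(fields[6])                                # length of the homologous region
    return name, start, end, divergence, gaps, real_divergence, length

print(parse_prot_header(">Ap123_1/1_1.000_748||254...478||[[1/1]][[1/6]]||29.3333333333||0||29.3333333333||75.0"))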
--- a/scripts/S05_script_extract_match_v20_blastx.py Wed Sep 27 10:01:55 2017 -0400 +++ b/scripts/S05_script_extract_match_v20_blastx.py Wed Jan 17 08:53:53 2018 -0500 @@ -28,193 +28,10 @@ ## 3/ change "Strand" by "Frame" in function "get_information_on_matches" -######################################## -### DEF 1. Split each "BLASTN" event ### -######################################## -def split_file(path_in, keyword): - - file_in = open(path_in, "r") - RUN = '' - BASH1={} - - while 1: - nextline = file_in.readline() - - ################################## - ### [A] FORMATTING QUERY NAME ### - - ### Get query name ### - if nextline[0:6]=='Query=': - L1 = string.split(nextline, "||") - L2 = string.split(L1[0], " ") - query = L2[1] - if query[-1] == "\n": - query = query[:-1] - - ### [A] END FORMATTING QUERY NAME ### - ###################################### - - - ### split the file with keyword ### - if keyword in nextline: - - # Two cases here: - #1# If it is the first "RUN" in the block (i.e. the first occurence of "BLASTN" in the file), we have just to add the new lines in the "RUN" list ... 2nd , we have also to detect the 'key' of bash1, which is the "query" name ... and third we will have to save this "RUN" in the bash1, once we will have detected a new "RUN" (i.e. a new line beginning with "BLASTN". - #2# If it isn't the first run, we have the save the previous "RUN" in the "bash1", before to re-initialize the RUN list (RUN =[]), before to append lines to the new "RUN" - - if RUN == '': # case #1# - RUN = RUN + nextline # we just added the first line of the file - - else: # case #2# (there was a run before) - BASH1[query] = RUN # add the previous run to the bash - RUN = '' # re-initialize the "RUN" - RUN = RUN + nextline # add the line starting with the keyword ("BLASTN") (except the first line of the file (the first "RUN") - - else: # Treatment of the subsequent lines of the one starting with the keyword ("BLASTN") (which is not treated here but previously) - RUN = RUN + nextline - - - if not nextline: # when no more line, we should record the last "RUN" in the bash1 - BASH1[query] = RUN # add the last "RUN" - break - - file_in.close() - return(BASH1) -######################################################### - - -################################################ -### DEF2 : Parse blast output for each query ### -################################################ -### detect matches (i.e. 'Sequences producing significant alignments:' ### - -def detect_Matches(query, MATCH, WORK_DIR): - F5 = open("%s/blastRun2.tmp" %WORK_DIR, 'w') - F5.write(bash1[query]) - F5.close() - - F6 = open("%s/blastRun2.tmp" %WORK_DIR, 'r') - list1 =[] - list2 =[] - - while 1: - nexteu = F6.readline() - if not nexteu : break - - if "***** No hits found ******" in nexteu : - hit = 0 - break - - if 'Sequences producing significant alignments:' in nexteu: - hit = 1 - F6.readline() # jump a line - - while 1: - nexteu2 = F6.readline() - if nexteu2[0]==">": break - - ###################################### - ### [B] FORMAT MATCH NAME 1st STEP ### - - if nexteu2 != '\n': - LL1 = string.split(nexteu2, " ") # specific NORTH database names !!!!!!! - match = LL1[0] #### SOUTH databank // NORTH will have "|" separators - list1.append(match) - - match2 = ">" + LL1[0] # more complete name // still specific NORTH database names !!!!!!! - list2.append(match2) #### SOUTH databank // NORTH will have "|" separators - - if MATCH == 0: ## Only read the 1rst line (i.e. the First Match) - break - else: ## Read the other lines (i.e. 
All the Matches) - continue - - ### [B] END FORMAT MATCH NAME 1st STEP ### - ########################################## - - break - - F6.close() - return(list1, list2, hit) # list1 = short name // list2 = more complete name -####################################### - - -######################################### -### DEF3 : Get Information on matches ### -######################################### -### Function used in the next function (2.3.) - -def get_information_on_matches(list_of_line): - for line in list_of_line: - - ## Score and Expect - if "Score" in line: - line = line[:-1] # remove "\n" - S_line = string.split(line, " = ") - Expect = S_line[-1] ## ***** Expect - S_line2 = string.split(S_line[1], " bits ") - Score = string.atof(S_line2[0]) - - ## Identities/gaps/percent/divergence/length_matched - elif "Identities" in line: - line = line[:-1] # remove "\n" - g = 0 - - if "Gaps" in line: - pre_S_line = string.split(line, ",") - identity_line = pre_S_line[0] - gaps_line = pre_S_line[1] - g = 1 - else: - identity_line = line - g = 0 - - ## treat identity line - S_line = string.split(identity_line, " ") - - identities = S_line[-2] ## ***** identities - - S_line2 = string.split(identities, "/") - hits = string.atof(S_line2[0]) ## ***** hits - length_matched = string.atof(S_line2[1]) ## ***** length_matched - abs_nb_differences = length_matched - hits ## ***** abs_nb_differences - - - identity_percent = hits/length_matched * 100 ## ***** identity_percent - - divergence_percent = abs_nb_differences/length_matched*100 ## ***** divergence_percent - - ## treat gap line if any - if g ==1: # means there are gaps - S_line3 = string.split(gaps_line, " ") - gaps_part = S_line3[-2] - S_line4 = string.split(gaps_part, "/") - gaps_number = string.atoi(S_line4[0]) ## ***** gaps_number - - real_differences = abs_nb_differences - gaps_number ## ***** real_differences - real_divergence_percent = (real_differences/length_matched)*100 ## ***** real_divergence_percent - else: - gaps_number = 0 - real_differences = 0 - real_divergence_percent = divergence_percent - - ## Frame - elif "Frame" in line: - line = line[:-1] - S_line = string.split(line, " = ") - frame = S_line[1] - - list_informations=[length_matched, Expect, Score, identities, hits, identity_percent, divergence_percent,gaps_number, real_divergence_percent, frame, length_matched] - - return(list_informations) -######################################## - - ############################ ### DEF4 : get sequences ### ############################ ### [+ get informations from the function 2.2.] - def get_sequences(query, list2, SUBMATCHEU,WORK_DIR): list_Pairwise = [] @@ -235,11 +52,12 @@ i = l miniList.append(i) # content positions in the list "text1", of all begining of match (e.g. 
>gnl|UG|Apo#S51012099 [...]) - miniList.reverse() + miniList.reverse() + if miniList != []: length = len(miniList) + ii = 0 - Listing1 = [] while ii < length: iii = miniList[ii] @@ -279,12 +97,15 @@ BigFastaName = e1 ### LIST OF LINES <=> What is remaining after removing all the hit with "Score =", so all the text comprise between ">" and the first "Score =" ==> Include Match name & "Length & empty lines - SmallFastaName = BigFastaName[0] ## First line <=> MATCH NAME + SmallFastaName = BigFastaName[0] ## First line <=> MATCH NAME + SmallFastaName = SmallFastaName[1:-1] ### remove ">" and "\n" + """ + 3 lines below : only difference with S08 + """ if SmallFastaName[-1] == " ": - SmallFastaName = SmallFastaName[:-1] - + SmallFastaName = SmallFastaName[:-1] PutInFastaName1 = SmallFastaName ### [C] END FORMAT MATCH NAME 2nd STEP ### @@ -389,20 +210,24 @@ ######################################### -###################### -### 2. RUN RUN RUN ### -###################### +################### +### RUN RUN RUN ### +################### import string, os, time, re, sys +from functions import split_file, detect_Matches, get_information_on_matches ## 1 ## INPUT/OUTPUT -SHORT_FILE = sys.argv[1] #short-name-query_short-name-db +SHORT_FILE = sys.argv[1] ## short-name-query_short-name-db +""" +04 and 06 for S05, 11 and 13 for S08 +""" path_in = "%s/04_outputBlast_%s.txt" %(SHORT_FILE, SHORT_FILE) file_out = open("%s/06_PairwiseMatch_%s.fasta" %(SHORT_FILE, SHORT_FILE),"w") ## 2 ## RUN ## create Bash1 ## -bash1 = split_file(path_in, "TBLASTX") ## DEF1 ## +bash1 = split_file(path_in, "TBLASTX") ### DEF1 ### ## detect and save match ## list_hits =[] @@ -414,7 +239,7 @@ j = j+1 ## 2.1. detect matches ## - list_match, list_match2, hit=detect_Matches(query, MATCH, SHORT_FILE) ### DEF2 ### + list_match, list_match2, hit=detect_Matches(query, MATCH, SHORT_FILE, bash1) ### DEF2 ### if hit == 1: # match(es) list_hits.append(query) @@ -423,7 +248,7 @@ ## 2.2. get sequences ## if hit ==1: - list_pairwiseMatch, list_info = get_sequences(query, list_match2, SUBMATCH, SHORT_FILE) ### FUNCTION ### + list_pairwiseMatch, list_info = get_sequences(query, list_match2, SUBMATCH, SHORT_FILE)# # divergencve divergence = list_info[6] @@ -445,6 +270,9 @@ len_query_seq = len(query_seq) + """ + 4 lines below : only in S05 + """ Lis1 = string.split(query_name, "||") short_query_name = Lis1[0] Lis2 = string.split(match_name, "||")
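Both S05 and S08 now pull their parsing helpers from the new scripts/functions.py, and detect_Matches receives the bash1 dictionary as an explicit argument instead of relying on a module-level global. A minimal sketch of the resulting call sequence, using the S05 file numbering noted in the diff (the SHORT_FILE value is a hypothetical example):

from functions import split_file, detect_Matches

MATCH = 0                               # keep only the first match, as in the script
SHORT_FILE = "Ap_Ac"                    # hypothetical short-name-query_short-name-db directory
path_in = "%s/04_outputBlast_%s.txt" % (SHORT_FILE, SHORT_FILE)

bash1 = split_file(path_in, "TBLASTX")  # one TBLASTX report block per query name
for query in bash1.keys():
    # bash1 is passed explicitly now that detect_Matches lives in functions.py
    list_match, list_match2, hit = detect_Matches(query, MATCH, SHORT_FILE, bash1)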
--- a/scripts/S06_post_processing_of_pairwise.py Wed Sep 27 10:01:55 2017 -0400 +++ b/scripts/S06_post_processing_of_pairwise.py Wed Jan 17 08:53:53 2018 -0500 @@ -2,96 +2,13 @@ ## AUTHOR: Eric Fontanillas ## LAST VERSION: 14/08/14 by Julie BAFFARD -MINIMUM_LENGTH = 1 #bp - - -############################ -##### DEF1 : Get Pairs ##### -############################ -def get_pairs(fasta_file_path): - F2 = open(fasta_file_path, "r") - list_pairwises = [] - while 1: - next2 = F2.readline() - if not next2: - break - if next2[0] == ">": - fasta_name_query = next2[:-1] - next3 = F2.readline() - fasta_seq_query = next3[:-1] - next3 = F2.readline() ## jump one empty line (if any after the sequence) - fasta_name_match = next3[:-1] - next3 = F2.readline() - fasta_seq_match = next3[:-1] - pairwise = [fasta_name_query,fasta_seq_query,fasta_name_match,fasta_seq_match] - - ## ADD pairwise with condition - list_pairwises.append(pairwise) - F2.close() - - return(list_pairwises) -############################################## - -################################# -##### DEF2 : Extract length ##### -################################# -def extract_length(length_string): # format length string = 57...902 - l3 = string.split(length_string, "...") - n1 = string.atoi(l3[0]) - n2 = string.atoi(l3[1]) - length = n2-n1 - - return(length) -############################################## - - -#################################### -##### DEF3 : Remove Redondancy ##### -#################################### -def filter_redondancy_and_length(list_paireu, MIN_LENGTH): - - bash1 = {} - list_pairout = [] - - for pair in list_paireu: - query_name = pair[0] - query_seq = pair[1] - match_name = pair[2] - match_seq = pair[3] - - l1 = string.split(query_name, "||") - short_query_name = l1[0][1:] - length_matched = extract_length(l1[1]) ### DEF2 ### - l2 = string.split(match_name, "||") - short_match_name = l2[0][1:] - binom = "%s_%s" %(short_query_name, short_match_name) - - ## TEST FOR REDONDANCY - ## REDONDANCY OF BINOME!!!! => MATCHE BETWEEN THE SAME 2 CONTIGS, BUT AT DIFFERENT POSITIONS ON THE CONTIG - ## REDONDANCY NOT REMOVED HERE: - ## 1/ Several "TERA" match with one "APN" (Counted in script "09_formatMatch_getBackNucleotides.py") - ## 2/ Several "APN" match with one "TERA" (Counted - if binom not in bash1.keys(): - bash1[binom] = [query_name, query_seq, match_name, match_seq, length_matched] - else: - old_length = bash1[binom][-1] - if length_matched > old_length: - bash1[binom] = [query_name, query_seq, match_name, match_seq, length_matched] - - - for bino in bash1.keys(): - length = bash1[bino][-1] - if length > MIN_LENGTH: - list_pairout.append(bash1[bino]) - - return(list_pairout) -############################################## - +MINIMUM_LENGTH = 1 ####################### ##### RUN RUN RUN ##### ####################### import string, os, time, re, sys +from functions import get_pairs, extract_length, filter_redondancy ## 1 ## INPUT/OUTPUT SHORT_FILE = sys.argv[1] #short-name-query_short-name-db @@ -106,7 +23,7 @@ ## 2 ## RUN list_pairwises = get_pairs(F_IN) ### DEF1 ### -list_pairwises_filtered1 = filter_redondancy_and_length(list_pairwises, MINIMUM_LENGTH) ### DEF3 ### +list_pairwises_filtered1 = filter_redondancy(list_pairwises, MINIMUM_LENGTH) ### DEF3 ### i = 0 for pair in list_pairwises_filtered1:
--- a/scripts/S08_script_extract_match_v20_blastx.py Wed Sep 27 10:01:55 2017 -0400 +++ b/scripts/S08_script_extract_match_v20_blastx.py Wed Jan 17 08:53:53 2018 -0500 @@ -6,7 +6,8 @@ ### MATCH = Only the first match keeped MATCH = 0 # Only 1rst match Wanted -#MATCH = 1 # All match want +#MATCH = 1 # All match wanted + ### SUBMATCH = several part of a same sequence match with the query SUBMATCH = 0 # SUBMATCH NOT WANTED (ONLY 1rst HIT) #SUBMATCH =1 # SUBMATCH WANTED @@ -27,212 +28,6 @@ ## 3/ change "Strand" by "Frame" in function "get_information_on_matches" - -######################################## -### DEF 1. Split each "BLASTN" event ### -######################################## -def split_file(path_in, keyword): - - file_in = open(path_in, "r") - - RUN = '' - BASH1={} - - while 1: - nextline = file_in.readline() - - ################################## - ### [A] FORMATTING QUERY NAME ### - - # Get query name - if nextline[0:6]=='Query=': - L1 = string.split(nextline, "||") - L2 = string.split(L1[0], " ") - query = L2[1] - if query[-1] == "\n": - query = query[:-1] - - ### [A] END FORMATTING QUERY NAME ### - ###################################### - - - ### split the file with keyword ### - if keyword in nextline: - # Two cases here: - #1# If it is the first "RUN" in the block (i.e. the first occurence of "BLASTN" in the file), we have just to add the new lines in the "RUN" list ... 2nd , we have also to detect the 'key' of bash1, which is the "query" name ... and third we will have to save this "RUN" in the bash1, once we will have detected a new "RUN" (i.e. a new line beginning with "BLASTN". - #2# If it isn't the first run, we have the save the previous "RUN" in the "bash1", before to re-initialize the RUN list (RUN =[]), before to append lines to the new "RUN" - - if RUN == '': # case #1# - RUN = RUN + nextline # we just added the first line of the file - - else: # case #2# (there was a run before) - BASH1[query] = RUN # add the previous run to the bash - RUN = '' # re-initialize the "RUN" - RUN = RUN + nextline # add the line starting with the keyword ("BLASTN") (except the first line of the file (the first "RUN") - - else: # Treatment of the subsequent lines of the one starting with the keyword ("BLASTN") (which is not treated here but previously) - RUN = RUN + nextline - - - if not nextline: # when no more line, we should record the last "RUN" in the bash1 - BASH1[query] = RUN # add the last "RUN" - break - - - file_in.close() - return(BASH1) -######################################################### - - -############################################### -### DEF2 : Parse blast output for each query### -############################################### - -### 2.1. detect matches (i.e. 'Sequences producing significant alignments:' ### -def detect_Matches(query, MATCH, WORK_DIR): - F5 = open("%s/blastRun2.tmp" %WORK_DIR, 'w') - F5.write(bash1[query]) - F5.close() - - F6 = open("%s/blastRun2.tmp" %WORK_DIR, 'r') - list1 =[] - list2 =[] - - while 1: - nexteu = F6.readline() - - if not nexteu : break - - if "***** No hits found ******" in nexteu : - hit = 0 - break - - if 'Sequences producing significant alignments:' in nexteu: - hit = 1 - F6.readline() # jump a line - - while 1: - nexteu2 = F6.readline() - - if nexteu2[0]==">": break - - ###################################### - ### [B] FORMAT MATCH NAME 1st STEP ### - - if nexteu2 != '\n': - LL1 = string.split(nexteu2, " ") # specific NORTH database names !!!!!!! 
- match = LL1[0] #### SOUTH databank // NORTH will have "|" separators - list1.append(match) - - match2 = ">" + LL1[0] # more complete name // still specific NORTH database names !!!!!!! - list2.append(match2) #### SOUTH databank // NORTH will have "|" separators - - if MATCH == 0: ## Only read the 1rst line (i.e. the First Match) - break - else: ## Read the other lines (i.e. All the Matches) - continue - - ### [B] END FORMAT MATCH NAME 1st STEP ### - ########################################## - - break - - F6.close() - return(list1, list2, hit) # list1 = short name // list2 = more complete name -###################################### - - -######################################### -### DEF3 : Get Information on matches ### -######################################### -### Function used in the next function (2.3.) -def get_information_on_matches(list_of_line): - - for line in list_of_line: - - ## Score and Expect - if "Score" in line: - #print line - line = line[:-1] # remove "\n" - S_line = string.split(line, " = ") - Expect = S_line[-1] ## ***** Expect - S_line2 = string.split(S_line[1], " bits ") - Score = string.atof(S_line2[0]) - - - ## Identities/gaps/percent/divergence/length_matched - elif "Identities" in line: - line = line[:-1] # remove "\n" - - g = 0 - if "Gaps" in line: - #print "HIT!!!" - pre_S_line = string.split(line, ",") - identity_line = pre_S_line[0] - gaps_line = pre_S_line[1] - g = 1 - else: - identity_line = line - g = 0 - - ## treat identity line - S_line = string.split(identity_line, " ") - - identities = S_line[-2] ## ***** identities - #print "\t\tIdentities = %s" %identities - - S_line2 = string.split(identities, "/") - hits = string.atof(S_line2[0]) ## ***** hits - length_matched = string.atof(S_line2[1]) ## ***** length_matched - abs_nb_differences = length_matched - hits ## ***** abs_nb_differences - - - ## identity_percent = S_line[-1] - identity_percent = hits/length_matched * 100 ## ***** identity_percent - #print "\t\tIdentity (percent) = %.2f" %identity_percent - - divergence_percent = abs_nb_differences/length_matched*100 ## ***** divergence_percent - #print "\t\tDivergence (percent) = %.2f" %divergence_percent - - ## treat gap line if any - if g ==1: # means there are gaps - S_line3 = string.split(gaps_line, " ") - gaps_part = S_line3[-2] - S_line4 = string.split(gaps_part, "/") - gaps_number = string.atoi(S_line4[0]) ## ***** gaps_number - #print "\t\tGaps number = %s" %gaps_number - - real_differences = abs_nb_differences - gaps_number ## ***** real_differences - real_divergence_percent = (real_differences/length_matched)*100 ## ***** real_divergence_percent - #print "\t\tReal divergence (percent)= %.2f" %real_divergence_percent - else: - gaps_number = 0 - #print "\t\tGaps number = %s" %gaps_number - real_differences = 0 - real_divergence_percent = divergence_percent - - ## Strand - #elif "Strand" in line: - # line = line[:-1] # remove "\n" - # S_line = string.split(line, " = ") - # strand = S_line[1] - # print "\t\tStrand = %s" %strand - - ## Frame - elif "Frame" in line: - line = line[:-1] # remove "\n" - S_line = string.split(line, " = ") - frame = S_line[1] - #print "\t\tFrame = %s" %frame - - - - list_informations=[length_matched, Expect, Score, identities, hits, identity_percent, divergence_percent,gaps_number, real_divergence_percent, frame, length_matched] - - return(list_informations) -################################## - - ############################ ### DEF4 : get sequences ### ############################ @@ -276,7 +71,6 @@ Listing2 = 
Listing1[1:] # remove the first thing ("BLASTN ...") and keep only table beginning with a line with ">" SEK = len(Listing2) - NB_SEK = 0 for e1 in Listing2: # "Listing2" contents all the entries begining with ">" @@ -290,8 +84,7 @@ index = l list51.append(l) # index of the lines with score - list51.reverse() - + list51.reverse() Listing3 = [] for i5 in list51: @@ -309,11 +102,11 @@ SmallFastaName = SmallFastaName[1:-2] ### remove ">" and "\n" - + """ + 3 lines below : only difference with S05 + """ S1 = string.split(SmallFastaName, "||") - S2 = string.split(S1[0], " ") - - + S2 = string.split(S1[0], " ") PutInFastaName1 = S2[0] ### [C] END FORMAT MATCH NAME 2nd STEP ### @@ -322,16 +115,15 @@ SUBSEK = len(Listing3) NB_SUBSEK = 0 list_inBatch = [] + ### IF NO SUBMATCH WANTED !!!! => ONLY KEEP THE FIRST HIT OF "LISTING3": if SUBMATCHEU == 0: # NO SUBMATCH WANTED !!!! Listing4 = [] Listing4.append(Listing3[-1]) # Remove this line if submatch wanted!!! elif SUBMATCHEU == 1: Listing4 = Listing3 - - for l in Listing4: ## "listing3" contents - + for l in Listing4: ## "listing3" contents NB_SUBSEK = NB_SUBSEK+1 ll1 = string.replace(l[0], " ", "") @@ -343,6 +135,7 @@ pos_query = [] seq_match = "" pos_match = [] + for line in l: if "Query:" in line: line = string.replace(line, " ", " ") # remove multiple spaces in line @@ -360,8 +153,7 @@ pos_query.append(pos2) seq = lll1[2] - seq_query = seq_query + seq - + seq_query = seq_query + seq if "Sbjct:" in line: line = string.replace(line, " ", " ") # remove multiple spaces in line @@ -381,29 +173,25 @@ seq = lll2[2] seq_match = seq_match + seq - ## Get the query and matched sequences and the corresponding positions - + ## Get the query and matched sequences and the corresponding positions pos_query.sort() # rank small to big pos_query_start = pos_query[0] # get the smaller pos_query_end = pos_query[-1] # get the bigger PutInFastaName3 = "%d...%d" %(pos_query_start, pos_query_end) - ###################################### ### [D] FORMAT QUERY NAME 2nd STEP ### FINAL_fasta_Name_Query = ">" + query + "||"+ PutInFastaName3 + "||[[%d/%d]][[%d/%d]]" %(NB_SEK, SEK, NB_SUBSEK,SUBSEK) ### [D] END FORMAT QUERY NAME 2nd STEP ### - ########################################## - + ########################################## pos_match.sort() pos_match_start = pos_match[0] pos_match_end = pos_match[-1] PutInFastaName4 = "%d...%d" %(pos_match_start, pos_match_end) - ###################################### ### [E] FORMAT MATCH NAME 3rd STEP ### @@ -412,14 +200,11 @@ ### [E] END FORMAT MATCH NAME 3rd STEP ### ########################################## - - Pairwise = [FINAL_fasta_Name_Query , seq_query , FINAL_fasta_Name_Match , seq_match] # list with 4 members list_Pairwise.append(Pairwise) ### Get informations about matches - list_info = get_information_on_matches(l) ### DEF3 ### - + list_info = get_information_on_matches(l) ### DEF3 ### F8.close() return(list_Pairwise, list_info) @@ -430,16 +215,20 @@ ### RUN RUN RUN ### ################### import string, os, time, re, sys +from functions import split_file, detect_Matches, get_information_on_matches ## 1 ## INPUT/OUTPUT SHORT_FILE = sys.argv[1] ## short-name-query_short-name-db +""" +04 and 06 for S05, 11 and 13 for S08 +""" path_in = "%s/11_outputBlast_%s.txt" %(SHORT_FILE, SHORT_FILE) file_out = open("%s/13_PairwiseMatch_%s.fasta" %(SHORT_FILE, SHORT_FILE),"w") ## 2 ## RUN ## create Bash1 ## -bash1 = split_file(path_in, "TBLASTX") ### DEF1 ### +bash1 = split_file(path_in, "TBLASTX") ### DEF1 ### ## detect and save match ## 
list_hits =[] @@ -451,7 +240,7 @@ j = j+1 ## 2.1. detect matches ## - list_match, list_match2, hit=detect_Matches(query, MATCH, SHORT_FILE) ### DEF2 ### + list_match, list_match2, hit=detect_Matches(query, MATCH, SHORT_FILE, bash1) ### DEF2 ### if hit == 1: # match(es) list_hits.append(query) @@ -460,7 +249,7 @@ ## 2.2. get sequences ## if hit ==1: - list_pairwiseMatch, list_info = get_sequences(query, list_match2, SUBMATCH, SHORT_FILE) ### DEF4 ### + list_pairwiseMatch, list_info = get_sequences(query, list_match2, SUBMATCH, SHORT_FILE) # divergencve divergence = list_info[6] @@ -482,7 +271,6 @@ len_query_seq = len(query_seq) - # If NO CONTROL FOR LENGTH, USE THE FOLLOWING LINES INSTEAD: file_out.write("%s||%s||%s||%s||%s" %(query_name,divergence,gap_number,real_divergence,length_matched))
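For reference, the headers written by S05/S08 are assembled in get_sequences() from the query name, the hit coordinates and the match/submatch counters, then extended on write() with the statistics returned by get_information_on_matches. A hedged reconstruction of that layout (format strings copied from the script, example values invented):

query = "Ap123_1/1_1.000_748"                      # invented example query name
PutInFastaName3 = "%d...%d" % (254, 478)           # start...end of the hit on the query
FINAL_fasta_Name_Query = ">" + query + "||" + PutInFastaName3 + "||[[%d/%d]][[%d/%d]]" % (1, 1, 1, 6)

divergence, gap_number, real_divergence, length_matched = 29.33, 0, 29.33, 75.0
print("%s||%s||%s||%s||%s" % (FINAL_fasta_Name_Query, divergence, gap_number, real_divergence, length_matched))
# -> >Ap123_1/1_1.000_748||254...478||[[1/1]][[1/6]]||29.33||0||29.33||75.0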
--- a/scripts/S09_post_processing_of_pairwise.py Wed Sep 27 10:01:55 2017 -0400 +++ b/scripts/S09_post_processing_of_pairwise.py Wed Jan 17 08:53:53 2018 -0500 @@ -4,88 +4,11 @@ MINIMUM_LENGTH = 1 -############################ -##### DEF1 : Get Pairs ##### -############################ -def get_pairs(fasta_file_path): - F2 = open(fasta_file_path, "r") - list_pairwises = [] - while 1: - next2 = F2.readline() - if not next2: - break - if next2[0] == ">": - fasta_name_query = next2[:-1] - next3 = F2.readline() - fasta_seq_query = next3[:-1] - next3 = F2.readline() ## jump one empty line (if any after the sequence) - fasta_name_match = next3[:-1] - next3 = F2.readline() - fasta_seq_match = next3[:-1] - pairwise = [fasta_name_query,fasta_seq_query,fasta_name_match,fasta_seq_match] - - ## ADD pairwise with condition - list_pairwises.append(pairwise) - F2.close() - return(list_pairwises) -############################################## - - -################################# -##### DEF2 : Extract length ##### -################################# -def extract_length(length_string): # format length string = 57...902 - l3 = string.split(length_string, "...") - n1 = string.atoi(l3[0]) - n2 = string.atoi(l3[1]) - length = n2-n1 - return(length) -############################################## - - -#################################### -##### DEF3 : Remove Redondancy ##### -#################################### -def filter_redondancy(list_paireu, MIN_LENGTH): - - bash1 = {} - list_pairout = [] - - for pair in list_paireu: - query_name = pair[0] - query_seq = pair[1] - match_name = pair[2] - match_seq = pair[3] - - l1 = string.split(query_name, "||") - short_query_name = l1[0][1:] - length_matched = extract_length(l1[1]) ### DEF2 ### - l2 = string.split(match_name, "||") - short_match_name = l2[0][1:] - binom = "%s_%s" %(short_query_name, short_match_name) - - if binom not in bash1.keys(): - bash1[binom] = [query_name, query_seq, match_name, match_seq, length_matched] - else: - old_length = bash1[binom][-1] - if length_matched > old_length: - bash1[binom] = [query_name, query_seq, match_name, match_seq, length_matched] - - - for bino in bash1.keys(): - length = bash1[bino][-1] - if length > MIN_LENGTH: - list_pairout.append(bash1[bino]) - - - return(list_pairout) -############################################## - - ####################### ##### RUN RUN RUN ##### ####################### import string, os, time, re, sys +from functions import get_pairs, extract_length, filter_redondancy ## 1 ## INPUT/OUTPUT SHORT_FILE = sys.argv[1] ## short-name-query_short-name-db
--- a/scripts/S11_post_processing_of_pairwise.py Wed Sep 27 10:01:55 2017 -0400 +++ b/scripts/S11_post_processing_of_pairwise.py Wed Jan 17 08:53:53 2018 -0500 @@ -4,88 +4,13 @@ MINIMUM_LENGTH = 1 -############################ -##### DEF1 : Get Pairs ##### -############################ -def get_pairs(fasta_file_path): - F2 = open(fasta_file_path, "r") - list_pairwises = [] - while 1: - next2 = F2.readline() - if not next2: - break - if next2[0] == ">": - fasta_name_query = next2[:-1] - next3 = F2.readline() - fasta_seq_query = next3[:-1] - next3 = F2.readline() ## jump one empty line (if any after the sequence) - fasta_name_match = next3[:-1] - next3 = F2.readline() - fasta_seq_match = next3[:-1] - pairwise = [fasta_name_query,fasta_seq_query,fasta_name_match,fasta_seq_match] - - ## ADD pairwise with condition - list_pairwises.append(pairwise) - F2.close() - return(list_pairwises) -############################################## - - -################################# -##### DEF2 : Extract length ##### -################################# -def extract_length(length_string): # format length string = 57...902 - l3 = string.split(length_string, "...") - n1 = string.atoi(l3[0]) - n2 = string.atoi(l3[1]) - length = n2-n1 - return(length) -############################################## - - -#################################### -##### DEF3 : Remove Redondancy ##### -#################################### -def filter_redondancy(list_paireu, MIN_LENGTH): - - bash1 = {} - list_pairout = [] - - for pair in list_paireu: - query_name = pair[0] - query_seq = pair[1] - match_name = pair[2] - match_seq = pair[3] - - l1 = string.split(query_name, "||") - short_query_name = l1[0][1:] - length_matched = extract_length(l1[1]) ### DEF2 ### - l2 = string.split(match_name, "||") - short_match_name = l2[0][1:] - binom = "%s_%s" %(short_query_name, short_match_name) - - if binom not in bash1.keys(): - bash1[binom] = [query_name, query_seq, match_name, match_seq, length_matched] - else: - old_length = bash1[binom][-1] - if length_matched > old_length: - bash1[binom] = [query_name, query_seq, match_name, match_seq, length_matched] - - - for bino in bash1.keys(): - length = bash1[bino][-1] - if length > MIN_LENGTH: - list_pairout.append(bash1[bino]) - - return(list_pairout) -############################################## - - ####################### ##### RUN RUN RUN ##### ####################### import string, os, time, re, sys +from functions import get_pairs, extract_length, filter_redondancy +## 1 ## INPUT/OUTPUT SHORT_FILE = sys.argv[1] ## short-name-query_short-name-db F_IN = "%s/17_ReciprocalHits_%s.fasta" %(SHORT_FILE, SHORT_FILE)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/scripts/functions.py	Wed Jan 17 08:53:53 2018 -0500
@@ -0,0 +1,229 @@
+import string
+
+# Used in S05 and
+def split_file(path_in, keyword):
+
+    file_in = open(path_in, "r")
+    RUN = ''
+    BASH1={}
+
+    with open(path_in, "r") as file_in:
+        for nextline in file_in.readlines():
+
+            ##################################
+            ### [A] FORMATTING QUERY NAME ###
+
+            # Get query name
+            if nextline[0:6]=='Query=':
+                L1 = string.split(nextline, "||")
+                L2 = string.split(L1[0], " ")
+                query = L2[1]
+                if query[-1] == "\n":
+                    query = query[:-1]
+
+            ### [A] END FORMATTING QUERY NAME ###
+            ######################################
+
+
+            ### split the file with keyword ###
+            if keyword in nextline:
+                # Two cases here:
+                #1# If it is the first "RUN" in the block (i.e. the first occurence of "BLASTN" in the file), we have just to add the new lines in the "RUN" list ... 2nd , we have also to detect the 'key' of bash1, which is the "query" name ... and third we will have to save this "RUN" in the bash1, once we will have detected a new "RUN" (i.e. a new line beginning with "BLASTN".
+                #2# If it isn't the first run, we have the save the previous "RUN" in the "bash1", before to re-initialize the RUN list (RUN =[]), before to append lines to the new "RUN"
+
+                if RUN == '':   # case #1#
+                    RUN = RUN + nextline    # we just added the first line of the file
+
+                else:   # case #2# (there was a run before)
+                    BASH1[query] = RUN    # add the previous run to the bash
+                    RUN = ''              # re-initialize the "RUN"
+                    RUN = RUN + nextline  # add the line starting with the keyword ("BLASTN") (except the first line of the file (the first "RUN")
+
+            else:   # Treatment of the subsequent lines of the one starting with the keyword ("BLASTN") (which is not treated here but previously)
+                RUN = RUN + nextline
+
+
+    if RUN:
+        BASH1[query] = RUN    # add the last "RUN"
+
+
+    return(BASH1)
+
+def detect_Matches(query, MATCH, WORK_DIR, bash1):
+    F5 = open("%s/blastRun2.tmp" %WORK_DIR, 'w')
+    F5.write(bash1[query])
+    F5.close()
+
+    F6 = open("%s/blastRun2.tmp" %WORK_DIR, 'r')
+    list1 =[]
+    list2 =[]
+
+    while 1:
+        nexteu = F6.readline()
+
+        if not nexteu : break
+
+        if "***** No hits found ******" in nexteu :
+            hit = 0
+            break
+
+        if 'Sequences producing significant alignments:' in nexteu:
+            hit = 1
+            F6.readline()   # jump a line
+
+            while 1:
+                nexteu2 = F6.readline()
+
+                if nexteu2[0]==">": break
+
+                ######################################
+                ### [B] FORMAT MATCH NAME 1st STEP ###
+
+                if nexteu2 != '\n':
+                    LL1 = string.split(nexteu2, " ")   # specific NORTH database names !!!!!!!
+                    match = LL1[0]    #### SOUTH databank // NORTH will have "|" separators
+                    list1.append(match)
+
+                    match2 = ">" + LL1[0]   # more complete name // still specific NORTH database names !!!!!!!
+                    list2.append(match2)    #### SOUTH databank // NORTH will have "|" separators
+
+                    if MATCH == 0:   ## Only read the 1rst line (i.e. the First Match)
+                        break
+                    else:            ## Read the other lines (i.e. All the Matches)
+                        continue
+
+                ### [B] END FORMAT MATCH NAME 1st STEP ###
+                ##########################################
+
+            break
+
+    F6.close()
+    return(list1, list2, hit)   # list1 = short name // list2 = more complete name
+
+def get_information_on_matches(list_of_line):
+
+    for line in list_of_line:
+
+        ## Score and Expect
+        if "Score" in line:
+            line = line[:-1]   # remove "\n"
+            S_line = string.split(line, " = ")
+            Expect = S_line[-1]   ## ***** Expect
+            S_line2 = string.split(S_line[1], " bits ")
+            Score = string.atof(S_line2[0])
+
+        ## Identities/gaps/percent/divergence/length_matched
+        elif "Identities" in line:
+            line = line[:-1]   # remove "\n"
+            g = 0
+
+            if "Gaps" in line:
+                pre_S_line = string.split(line, ",")
+                identity_line = pre_S_line[0]
+                gaps_line = pre_S_line[1]
+                g = 1
+            else:
+                identity_line = line
+                g = 0
+
+            ## treat identity line
+            S_line = string.split(identity_line, " ")
+
+            identities = S_line[-2]   ## ***** identities
+
+            S_line2 = string.split(identities, "/")
+            hits = string.atof(S_line2[0])   ## ***** hits
+            length_matched = string.atof(S_line2[1])   ## ***** length_matched
+            abs_nb_differences = length_matched - hits   ## ***** abs_nb_differences
+
+            identity_percent = hits/length_matched * 100   ## ***** identity_percent
+
+            divergence_percent = abs_nb_differences/length_matched*100   ## ***** divergence_percent
+
+            ## treat gap line if any
+            if g ==1:   # means there are gaps
+                S_line3 = string.split(gaps_line, " ")
+                gaps_part = S_line3[-2]
+                S_line4 = string.split(gaps_part, "/")
+                gaps_number = string.atoi(S_line4[0])   ## ***** gaps_number
+
+                real_differences = abs_nb_differences - gaps_number   ## ***** real_differences
+                real_divergence_percent = (real_differences/length_matched)*100   ## ***** real_divergence_percent
+            else:
+                gaps_number = 0
+                real_differences = 0
+                real_divergence_percent = divergence_percent
+
+        ## Frame
+        elif "Frame" in line:
+            line = line[:-1]   # remove "\n"
+            S_line = string.split(line, " = ")
+            frame = S_line[1]
+
+    list_informations=[length_matched, Expect, Score, identities, hits, identity_percent, divergence_percent,gaps_number, real_divergence_percent, frame, length_matched]
+
+    return(list_informations)
+
+# Used in S06, S09, S11
+def get_pairs(fasta_file_path):
+    F2 = open(fasta_file_path, "r")
+    list_pairwises = []
+    while 1:
+        next2 = F2.readline()
+        if not next2:
+            break
+        if next2[0] == ">":
+            fasta_name_query = next2[:-1]
+            next3 = F2.readline()
+            fasta_seq_query = next3[:-1]
+            next3 = F2.readline()   ## jump one empty line (if any after the sequence)
+            fasta_name_match = next3[:-1]
+            next3 = F2.readline()
+            fasta_seq_match = next3[:-1]
+            pairwise = [fasta_name_query,fasta_seq_query,fasta_name_match,fasta_seq_match]
+
+            ## ADD pairwise with condition
+            list_pairwises.append(pairwise)
+    F2.close()
+
+    return(list_pairwises)
+
+def extract_length(length_string):   # format length string = 57...902
+    l3 = string.split(length_string, "...")
+    n1 = string.atoi(l3[0])
+    n2 = string.atoi(l3[1])
+    length = n2-n1
+    return(length)
+
+def filter_redondancy(list_paireu, MIN_LENGTH):
+
+    bash1 = {}
+    list_pairout = []
+
+    for pair in list_paireu:
+        query_name = pair[0]
+        query_seq = pair[1]
+        match_name = pair[2]
+        match_seq = pair[3]
+
+        l1 = string.split(query_name, "||")
+        short_query_name = l1[0][1:]
+        length_matched = extract_length(l1[1])
+        l2 = string.split(match_name, "||")
+        short_match_name = l2[0][1:]
+        binom = "%s_%s" %(short_query_name, short_match_name)
+
+        if binom not in bash1.keys():
+            bash1[binom] = [query_name, query_seq, match_name, match_seq, length_matched]
+        else:
+            old_length = bash1[binom][-1]
+            if length_matched > old_length:
+                bash1[binom] = [query_name, query_seq, match_name, match_seq, length_matched]
+
+
+    for bino in bash1.keys():
+        length = bash1[bino][-1]
+        if length > MIN_LENGTH:
+            list_pairout.append(bash1[bino])
+
+    return(list_pairout)
\ No newline at end of file
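The post-processing scripts S06, S09 and S11 now simply chain these shared helpers. A minimal usage sketch mirroring their run sections (the input path is a hypothetical example; the real scripts build it from sys.argv[1]):

from functions import get_pairs, filter_redondancy

MINIMUM_LENGTH = 1
pairs = get_pairs("Ap_Ac/06_PairwiseMatch_Ap_Ac.fasta")   # [query_name, query_seq, match_name, match_seq] per hit
kept = filter_redondancy(pairs, MINIMUM_LENGTH)           # keep only the longest hit per query/match contig pair
for query_name, query_seq, match_name, match_seq, length_matched in kept:
    print("%s vs %s (matched length %d)" % (query_name, match_name, length_matched))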
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/inputs2/AcAcaud_trinity.fasta Wed Jan 17 08:53:53 2018 -0500 @@ -0,0 +1,38 @@ +>Ac1_1/1_1.000_151 +TCGCTCTCCTCCGCCTTTTCTCTAAGCTTAAAAATTATGAAGAGTCTGCACCAGAGACAACTCCTTAGCACATCGACCGACCAGCTGGCCATAAAATGTCTTATTTATACCTTTGTCATGAGCCTTGATAATCCTTTGTGCCAGTGGTGGC +>Ac2_1/1_1.000_160 +ATGTTAGTAAAAGAGATTAAAGAGTACCGAGAGATAAAAGAGAAGGCTAGAACCTATCTATGTTATATTATAAGTAGTAACCTATCTTATGGTTCAAGCATAAATGAGGAGACTCTTCAAGAGAGTATGGAGATGTTAAAGAGGGCAATCCCAAAGAGTG +>Ac3_1/1_1.000_160 +ATCTGTAATGTCGTTTACCACACACTGGACACTGATATTTCCGCTCGCCAGTGTGTGGTAAACGATATTACAGATCAGATGTGCTGGCAATCCATATCAGTACACACAGCAGTGAAAAAAATCATAAATGTGACATCTGTGGCAAGGCTTTCTCAAATGC +>Ac4_1/1_1.000_160 +AAACAATGCAATCCTCTACCATTGCCAAGATATGAAGAACAAGTAAATGGCACATCAACAACAATGATAATAATAAGTGCTAATAACAATAAGAGTAATACAATTACCACAATATCTGAGAACAAGGGGCTTAAGCATAGCTATCATTATTTGGGAGGGG +>Ac5_1/1_1.000_160 +GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT +>Ac6_1/1_1.000_160 +CAGCCTACCACTGAGAAGAGATACTTCAACATGTCTTACTGGGGTAGAAGTGGTGGTCGTACAGCGGGTGGTAATGCAGGACGTGGTCGTGGCGGCGGCAGCGGCAGTGGCAGTAGTCAAAGTGGTGGTGGCAGCTTTCTACAGGAACGTATCAAAGAGA +>Ac7_1/1_1.000_160 +GCACCTAGAATTACCCGAAGTTGCTTGGCAATAGCGACACCTAACGGTCGCCATGATATTTGCAGGAAGAAGGCATGTGGTACCATTGGGAACCGTCAAGCGTTTCCTCAGCCCTGTGGCAGCTGCCCGTCTGCGCCCGTGTTTGACCTTGAGCACCAAG +>Ac8_1/1_1.000_160 +ATCAAAGAAGAGCAACATCGAGCTACTGGCACTGGCAATGGAATCCTAATTATAGCAGAAACAAGCACTGGTTGCCTGTTGTCTGGGTCAGCAATTGGTAGTAGAGGTGTTCCTGCTGAAGAAGTTGGGGTCAAAGCAGGACAGATGCTTTTGGATAACT +>Ac9_1/1_1.000_160 +GCCATTCGTCTTAGGAGAAGTTTGTCGTCAGGAAAGATACATGAGGCCTGGATTCTTTCTGACACCGACTCGACGATGTCATTACCTTGTCCACCTGGAACCAACCCCTCATCGACTTCAGCGGATCCATATCTGGTGATCACCAGAAAAACGAACACTA +>Ac10_1/1_1.000_160 +GCCTGGGTATTATTTACCACAGTAACCTTTCATCAGTTTGTGGTGAAAGTACGTGACGTTATGCATTGGCAAGATTGGACATTTTGGTTCGCCCTGTTTTGTACGCATAATAATGTATGTAGTTGTATTTTCCAAAATAATTGTTATATTAGCTATCCAA +>Ac11_1/1_1.000_160 +ACAATTACACAGGTATCAACAAATGTTCACTGCACCTGTCAGTTCCACAAACATAAAGATTACACACATGTACACATCTTTACAAAATATTTACAATTTTGTATTCTTAATTCTATCCACTTGGCTCTGGAAGGCCTTCAGCCATCAGATGATGTGTTTA +>Ac12_1/1_1.000_160 +CAATCCAGCACTAGCAGGAGTGTTGGCCGGAAGGTTGATGATATTTTTCAGTCAAAGAATCTGCATGCTCCAGATGATCGCCTATCAGACAAGGATAACCGTGACAAGTCCAAGAACCCTTTACTTAACAATGAGATGACTCCTCAGTCATTTTCTCGAG +>Ac13_1/1_1.000_141 +GATTACATGCAAAACATAATAGAAATGTTTGTCCCAAGGTCTTACCAGTTTATAGTTTTACATTCGTGTCTTGAAATAAGAAAATGCCTTTATGAGAGTGTATTATTACTCAGTAGATGGAAATTAGCTTACCGGGGGATA +>Ac14_1/1_1.000_160 +CCTGTTGTGACTCGTTCCCTGACGTCGTGCACGCAAGCGCACGCGCGTGCGCGCCGGGTTAGGCACACATACGCGGCACAGGTGCGCAGTATTAGACAGACGCAGACGCAGGCGTCCAGACACGCCAGCCAGCACGGTTACAATGTCCATATCACAATGA +>Ac15_1/1_1.000_147 +CTGAATGTCAACCAGTCACTGACCATCAGCTACATGTCTCTAATGGTCACTAGCATGAAACATGAAATGCCTGCTTATAGTGGGTCTGTAACTGGTAGGATACTGATTACATGTGGAGGCTTATTAAAGGGGTATCCTATTATTTTT +>Ac16_1/1_1.000_160 +CTATGTTGGCTACTGCTAAGGATGTGCTACTTGCCTGATGTAAACAATTCCCAGAATGAATATAAACCAATCATAAGGAGAACTATGGAACCATCCTTAAATGTATTAATCTTATTTAAAATTATGTGCACATCTTGTTTGGCAGAAGGTACATTAAAGC +>Ac17_1/1_1.000_160 +ATCTGTAATGTCGTTTACCACACACTGGACACTGATATTTCCGCTCGCCAGTGTGTGGTAAACGATATTACAGATCAGATGTGCTGGCAATCCATATCAGTACACACAGCAGTGAAAAAAATCATAAATGTGACATCTGTGGCAAGGCTTTCTCAAATGC +>Ac18_1/1_1.000_160 +ATGTTAGTAAAAGAGATTAAAGAGTACCGAGAGATAAAAGAGAAGGCTAGAACCTATCTATGTTATATTATAAGTAGTAACCTATCTTATGGTTCAAGCATAAATGAGGAGACTCTTCAAGAGAGTATGGAGATGTTAAAGAGGGCAATCCCAAAGAGTG +>Ac19_1/1_1.000_160 
+GCCAGTGAACTCATTAGGCTCTGTCTGCGCCGATATGATAAGAACGAGTTTGACAGCGATGGTTATATTATGAACAGCGAGCTCGCCTACGATGATGACTCTGAAATGCCTGATGACCTAATTGACCGTTTAGAAGCTGGAAATATTACAAGCTTTGTGC
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/inputs2/AmAmphi_trinity.fasta Wed Jan 17 08:53:53 2018 -0500 @@ -0,0 +1,40 @@ +>Am1_1/1_1.000_160 +GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT +>Am2_1/1_1.000_160 +CAGCCTACCACTGAGAAGAGATACTTCAACATGTCTTACTGGGGTAGAAGTGGTGGTCGTACAGCGGGTGGTAATGCAGGACGTGGTCGTGGCGGCGGCAGCGGCAGTGGCAGTAGTCAAAGTGGTGGTGGCAGCTTTCTACAGGAACGTATCAAAGAGA +>Am3_1/1_1.000_160 +GCACCTAGAATTACCCGAAGTTGCTTGGCAATAGCGACACCTAACGGTCGCCATGATATTTGCAGGAAGAAGGCATGTGGTACCATTGGGAACCGTCAAGCGTTTCCTCAGCCCTGTGGCAGCTGCCCGTCTGCGCCCGTGTTTGACCTTGAGCACCAAG +>Am4_1/1_1.000_147 +ACAGTTCTAAATAGCATTCGCCAAATGATATTGAAAGCATATTTTATGAATAGTGGTTCACAGATGAAAGATCATTATTGGGAACCCGTTCCAGCTTTTGTAGATCATTTTGTTCTTGCTATAGATCATCGACCCAGAATACAAGTT +>Am5_1/1_1.000_160 +TACTGCTGTCGAAGTGATGGCGCTTGGAACAGTGTGATAACCTTGCCTACAAATAGGCCATTCTATTTGCTCAGATACACAAGTCAATGTCAGCTGGTGAAAGGAATGAAAGTCAGAAGGGAAGTATTCTACTGGGATAATGAAGATATTAACAATATTG +>Am6_1/1_1.000_160 +ACCACACAATCATCTAGTGAAACATTTACCAAAACACCCTCTGAAACTGTTCGGTCAGAAGAAGACAGTGCTAAAAAGCAGAAAACTTTCATCAAAAGCCCACAGGCAGTTGCTGTATCTGAAGAATCTACAAAAACAAATTTAACTTTTGTTGCTAAAT +>Am7_1/1_1.000_160 +GGGGCCCGGGATGTCGAGGATGTTGTACTCGTAGTTGCTGTGGTTGACGTGGTATTGTTGTTCCTGGCTGTGGCTCCGTTACCACTTGACGTGGCTGTTGTGGTGCTGGTGGAAAAACGACTAGTGGTTGTGTCTGCAACTAATTCCTCGCCTCTAATCA +>Am8_1/1_1.000_160 +GTATTAATAAAAGGACAAGACTATTATTTAATACCAAGAAATCTGGCCTTAATAAGCATGGTTGCTTATATCATAAGCATGGTAAATCACATTGTGTTTTCCATGTGTTTACCCATCAGATGTAAAAATATTCTGCATGAAATAAAGAGCTTCTATGGTT +>Am9_1/1_1.000_160 +AGCCATCAAAGGTTGCTAATCATGTTGAGTTGTATTGTCAAGCATTTCTACTCGAGGCAACCATTAATAATTGAAGTTATCAGTTATATTGTCAACTCAATGGAAAATCAAATGGATTTAATTAAGGAAGGAAGATTTTGTACTTGTAAATCTGCATTTT +>Am10_1/1_1.000_160 +AGCAACTTTGCCACCACCACCACCACCAACAATAACAACAACAGCTGCAACACCGCGACCAATAACAACAACAACACTGGTAACACTAACCACGTCACCACCAGAAATAACATCTTCACAACCATCACCAACACTACCATTAACTCAACTTATAACAATA +>Am11_1/1_1.000_160 +ATTTTTTGTAGATATTTTTATCTATTTACATTTCTATCTTTATTTCTCCCTACGACCGAAGGCCTCAAAGAGCATATGATTTATTTCTGTCCATTTATCGACCGTGCGTCTGTCTGTCCATCCGCCTCCCTTCCACAAACAGCTGCCGCTACTGTTATAT +>Am12_1/1_1.000_160 +ATAAAGTGGGCCATACAAAGTGAGACCTTACACATATATGAATGTCTAAATTGTCCTATGTTGTCCCATTTAACATGTTTAGTTTGTGGTATTTTTCAAACAAATATAACATGGGGTTTGAATTGGCCGGAATCTAGCACATTAGTCTCTGAGGAAATTT +>Am13_1/1_1.000_160 +ACAACTCCTCCAATAGTACTACCAGCTTGTCAGGGTGAACCAGATGGTAATGCACCTGATCCTGACTCGTGCTCACGATATGTAGTATGTCTCAATCAGGAACCGGTTAATGACTATCCATGTGATCCAAGTACCTTCTTCAACGACCTACCCGAGTACA +>Am14_1/1_1.000_160 +TACTTTCCTGAAATTAGTTCTAATGGTTTCATTGTGATGACAAAGTTTAATATGTCAAGCATAATAAACTTGATTGTGTTTTTCCACCTCAGCAGTCATTTATACATTTTATTGAATGTAGAGGGTGTTGAGACATTCAGTGTTGTTATACTTGCAACAA +>Am15_1/1_1.000_160 +ACCACACAATCATCTAGTGAAACATTTACCAAAACACCCTCTGAAACTGTTCGGTCAGAAGAAGACAGTGCTAAAAAGCAGAAAACTTTCATCAAAAGCCCACAGGCAGTTGCTGTATCTGAAGAATCTACAAAAACAAATTTAACTTTTGTTGCTAAAT +>Am16_1/1_1.000_160 +TACTGCTGTCGAAGTGATGGCGCTTGGAACAGTGTGATAACCTTGCCTACAAATAGGCCATTCTATTTGCTCAGATACACAAGTCAATGTCAGCTGGTGAAAGGAATGAAAGTCAGAAGGGAAGTATTCTACTGGGATAATGAAGATATTAACAATATTG +>Am17_1/1_1.000_160 +ACAGTTCTAAATAGCATTCGCCAAATGATATTGAAAGCATATTTTATGAATAGTGGTTCACAGATGAAAGATCATTATTGGGAACCCGTTCCAGCTTTTGTAGATCATTTTGTTCTTGCTATAGATCATCGACCCAGAATACAAAACCAGTATGGCCAAA +>Am18_1/1_1.000_160 +CGAACATCTCATGAAGTGACTCAGACCTCATTTACCCCTAAGGGCTCTATGTTAGGGGGACATGTCATTCCACAGATATGCATGGATGACTCGCATGCATCCAGGGCATTACGGTACAGTAAACGTCCAACAGATGCCCCTCAGATACGACCCATAGAGC +>Am19_1/1_1.000_160 
+TTTAGTGAAGAATTTATAATGACTCATGATGTTTGTGTCAATGTACTAAACTTAATGGCACAAATTGCGAACTCATTTTTTTTCTATTATATGTGGGTCTTTTCATTCAACCAGCATTTAGTATATAGACAATTTCTATCTATTCACTTGGATAAAGCAA +>Am20_1/1_1.000_160 +ACCTGTCCCCAACCCCAATCATTAAATCTTTCCCCTTTCGCATGTTCACAGCCCAGCTGCAGCACAACTAGAAAATCCAGACATGCTGAAAATACATGTTATTTGTGTATCTGGCATGTAATTTGTTTCACTCAAAACAGTAACTCTTCTTCAGGAGGTA
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/inputs2/ApApomp_trinity.fasta Wed Jan 17 08:53:53 2018 -0500 @@ -0,0 +1,40 @@ +>Ap1_1/1_1.000_160 +GGTCGCCTTATAAAAACCAATCCGAAACAGTTTTCCTTTGAAACGTGCCAAAAACCAAAAACATACTTCAAATCTTCCAGTGTCTGTTATAAAGGGGTGAGCGTAGAGAGGGCACTTGTGAGATTGGTGTCTGGGTTAAAGATTTTGCCAAAAAGCGATT +>Ap2_1/1_1.000_160 +ATACTCAGGCACACAGCATTTGTCGTACTAGGCGAGAGAGAGAGAGGAACGACTAATTGCAACCACGATTACGTTACATTTGTTTACAAACCAAACGTACTGGCGTCGAAGATAATTAAGAGGAAGCTGACTGAATGCGATTGGCGTTGGTCTACGGGTT +>Ap3_1/1_1.000_160 +GCCATGCAGTACACTGGACTTCTGTTATTCTGTTTGTTTGCCTTGACGGCAGCCAAACCCGCGGAAGACCTTCAAATGCTCATCCGAGCCCTGCTCCATGAAATAGAAGAGGAAGGTGAACTCCAAGAGCGAGGCATTGGCGCCGTGAAGTATGGTGGAA +>Ap4_1/1_1.000_135 +CGTTTAACCAGGCCCTGCTACCCTCCAATCTCGTCCAATCGGTCTCTACGCATCCACTCAATAATTATTGACATATTACAATTGATTCGGATTAAAAAAATGGCGCTAGGCTTAAAACACAGACAGTTCGCTAGC +>Ap5_1/1_1.000_160 +AATCTACTGACAGATACCTGGAACGAGATGCAGGTCAAGTGGTCGTGTTGTGGTGTGGATGGCTACTCCGACTGGACGCAAGCTGAAGGTCTGGCCACGGGTCACTACGTGCCGCAGTCCTGCTGTCAGAACACGATGAGTACAAGCTGCACGTCACAGA +>Ap6_1/1_1.000_160 +TGGCGAAATGTAGTGGTCATTGATGGATTTTATTGCAATCAGTGTTACATATTACAAGCATTTCTTAATAAACAAAAAGTTGCACGAGATATTTTTTACTTAAAGGTTTTATGGGATGAACACAGTCAATTATATTCATGTAAAAGGCCTTATCCGAGAA +>Ap7_1/1_1.000_160 +TTCGTAATGAATCTTTTTGACTGGTATTCCGCAGGATACTCAATAATTATTGTCGCATTCTTCGAAGTTATCGCCATTTCTTGGATATACGGTCTCCAACGGTTCAAGAAGGACATTCAGATGATGGTTGGCAAGGGGCGATGGATCAATGCTAGTTTCT +>Ap8_1/1_1.000_160 +GCGAAAACTGGTTTTAACACAAATAATTGTTACAGTACCAGGTTTCGGAACACGTTTGCATATAACCAGCGAGAGTGGTGCTCAGTTCTGTTATGTATGACAGTCCTTCTCCTCAACATGCAACGGAAGCGAGCACTTCCATCATCACATTTGTCAATAA +>Ap9_1/1_1.000_160 +TGTCTTTACTTCTATCCTTCTCATCATGTTTTACATCATTTTTATTGCTGCCTCTCTTCTCAGCCCTTTCCACACTTTCATGTTTATCTTTTGATTTTTCAACTTCAACTCCATCTTCATCATCATTCTCATGCATTAATTCTTCTATTTCTTCTTCCAA +>Ap10_1/1_1.000_160 +GCAGTGGTGGGAAGTTGTTCACCCTGGCTTGGTGTCCCATGTTTCTCTGTAATTCCTGTTCCTTTCTCTGTAGTTCCTCAGCCTTCCTCTCCAGTTCTTCCTGACGTCTCTTCAGGTCATCTGTGGCAGCCTGGGCCGTGGTCTTGGCGGCTGAGTATGG +>Ap11_1/1_1.000_160 +ACGACAGAGGTCCTCTGCTTGATGAATATGGTTACACCAGAGGATTTGGAAGATGAAGAGGAATATGAAGAAATTTTGGAGGATGTCAAAGAAGAGTGCAGCAAATATGGTTATGTGAAGAGTATAGAGATCCCACGGCCCATTAAGGGTGTGGAAGTGC +>Ap12_1/1_1.000_160 +TGTCTTTACTTCTATCCTTCTCATCATGTTTTACATCATTTTTATTGCTGCCTCTCTTCTCAGCCCTTTCCACACTTTCATGTTTATCTTTTGATTTTTCAACTTCAACTCCATCTTCATCATCATTCTCATGCATTAATTCTTCTATTTCTTCTTCCAA +>Ap13_1/1_1.000_160 +GCGAAAACTGGTTTTAACACAAATAATTGTTACAGTACCAGGTTTCGGAACACGTTTGCATATAACCAGCGAGAGTGGTGCTCAGTTCTGTTATGTATGACAGTCCTTCTCCTCAACATGCAACGGAAGCGAGCACTTCCATCATCACATTTGTCAATAA +>Ap14_1/1_1.000_160 +TTCGTAATGAATCTTTTTGACTGGTATTCCGCAGGATACTCAATAATTATTGTCGCATTCTTCGAAGTTATCGCCATTTCTTGGATATACGGTCTCCAACGGTTCAAGAAGGACATTCAGATGATGGTTGGCAAGGGGCGATGGATCAATGCTAGTTTCT +>Ap15_1/1_1.000_160 +GTTGTCAGTGGATCTCGTGATGCAACACTGAGGCTATGGAATGTCGATACTGGCCAGTGTCTGCATGTTCTGATGGGACATATGGCAGCTGTACGGTGTGTGCAGTATGATGGCAAGCGTGTTGTTAGTGGTGCCTATGATTATACAGTTAGAGTGTGGG +>Ap16_1/1_1.000_160 +AGATTTATATTTGAGAATGTTTTGAGTACGACTTCTGTACAGACACACAGCAGAATGACCCTTGTATTGTTTAACAACGTTCAAAATTTCCTGATTCTTCTACCGAAAAAAATACATAAGAAGAGCCACCAAGACGATCAGATCACGGAGGTACTGGCAT +>Ap17_1/1_1.000_160 +GCAGACTCGGCTGGCACGGCCACCGCCTTCCTCTGTGGAGTGAAGGCTCGCTACGGAACGCTGGGTCTGGGACCGAGAGCCACACGATCTGACTGTAGACAGAGTCACATCAACAAACTGAAGTGTATAGGAGACATGGCACAACAAGCAGGTATGAGGA +>Ap18_1/1_1.000_160 +CCGGCCTGCAAGACGCCATTTTACTTCGTCTGTCAATCGAGGTCAAAGGTCACTACCGTTGTCTCCGAGAAGCACACAGACGCCGAGCTGGTTCACACGCTGTGTATTCGGCACAGATCTACTGTTGCTTGGGATATTTTAGCCGGCGAACGAGCGAAAT +>Ap19_1/1_1.000_160 
+CCGGCGATCGTTCAGAGGGCCAGCGGTCTGGCCATGTCAGAGATCTATCACCTGCGCTTCTGCGATGGGGATCGGCTGAACGTCAGCTGCCCGGACAACTGGCAGATCCACATCTCGTCCAGCTACTTCGTCTACGTCAGCGGCGTCGACGGCCGCGGCG +>Ap20_1/1_1.000_160 +CATGAAGGACCTGTGTGGCAGGTGGCTTGGGCACATCCAATGTTTGGTAATCTGATAGCATCATGTAGTTATGACAGAAAGGTGATTATTTGGAAGGAGACTGGAGGGACATGGGCAAAGCTTTATGAATACAACAATCATGATTCCTCAGTTAATTCAG
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/inputs2/PfPfiji_trinity.fasta Wed Jan 17 08:53:53 2018 -0500
@@ -0,0 +1,40 @@
+>Pf1_1/1_1.000_160
+AGCATTGTCCGTGTTGCGCGGGTCGTCGACGTAACCTCGGTACACCTCAGCGTGCCCGGCCATCTGGTGCGTGAGCCGCTTGAAGACGACTCTCGCCGGCGGCTGCTTGCTGTCCAGCGTCATGCTCTCGAACGCCTTGATCCAGGACTCCTTGACACTG
+>Pf2_1/1_1.000_160
+ATCACGCCCCCTGTCGTGGACAACAGGCTAGCCCATTGTAATGGGAAATCTACAGTGGTAAACAAACTGATATACAATATAACCAATATGTATATATATATAACAGATATGGAGACTGATCATAAAATCAACGATTCACTTTTAGGATTAGTCATGTTTG
+>Pf3_1/1_1.000_160
+TGTGCGTCGGTGCCGGACTGCAGGCTGTTGTTTGTCGGGAGCACCTCCGGAGTCATCAGTGTCATCAACACCAAGTTCAACCAGAGCAAGCAAAGCCACCTGCAGGTGTTCGGCCACAAAACCTGCCTGTACGGCCATTCCGGGGCCGTCACAGCCTTCT
+>Pf4_1/1_1.000_160
+GGTCGCCTTATAAAAACCAATCCGAAACAGTTTTCCTTTGAAACGTGCCAAAAACCAAAAACATACTTCAAATCTTCCAGTGTCTGTTATAAAGGGGTGAGCGTAGAGAGGGCACTTGTGAGATTGGTGTCTGGGTTAAAGATTTTGCCAAAAAGCGATT
+>Pf5_1/1_1.000_160
+ATACTCAGGCACACAGCATTTGTCGTACTAGGCGAGAGAGAGAGAGGAACGACTAATTGCAACCACGATTACGTTACATTTGTTTACAAACCAAACGTACTGGCGTCGAAGATAATTAAGAGGAAGCTGACTGAATGCGATTGGCGTTGGTCTACGGGTT
+>Pf6_1/1_1.000_160
+GCCATGCAGTACACTGGACTTCTGTTATTCTGTTTGTTTGCCTTGACGGCAGCCAAACCCGCGGAAGACCTTCAAATGCTCATCCGAGCCCTGCTCCATGAAATAGAAGAGGAAGGTGAACTCCAAGAGCGAGGCATTGGCGCCGTGAAGTATGGTGGAA
+>Pf7_1/1_1.000_160
+GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT
+>Pf8_1/1_1.000_160
+ATCAAAGAAGAGCAACATCGAGCTACTGGCACTGGCAATGGAATCCTAATTATAGCAGAAACAAGCACTGGTTGCCTGTTGTCTGGGTCAGCAATTGGTAGTAGAGGTGTTCCTGCTGAAGAAGTTGGGGTCAAAGCAGGACAGATGCTTTTGGATAACT
+>Pf9_1/1_1.000_160
+GTATTAATAAAAGGACAAGACTATTATTTAATACCAAGAAATCTGGCCTTAATAAGCATGGTTGCTTATATCATAAGCATGGTAAATCACATTGTGTTTTCCATGTGTTTACCCATCAGATGTAAAAATATTCTGCATGAAATAGGTAATTTCCCGATTA
+>Pf10_1/1_1.000_160
+CGGCCGCGGCGCGTCGTTCTCAGCCAAGCTGACTTCGACTTGAGCCGTCCATTCGCTTATTTACACGACGACTGCTCGACCCTTTACGACTTAGTCACACTTCCGTTTAACCAGGCCCTGCTACCCTCCAATCTCGTCCAATCGGTCTCTACGCATCCGA
+>Pf11_1/1_1.000_160
+AGCATTGTCCGTGTTGCGCGGGTCGTCGACGTAACCTCGGTACACCTCAGCGTGCCCGGCCATCTGGTGCGTGAGCCGCTTGAAGACGACTCTCGCCGGCGGCTGCTTGCTGTCCAGCGTCATGCTCTCGAACGCCTTGATCCAGGACTCCTTGACACTG
+>Pf12_1/1_1.000_160
+GCCCTCGGCCACCAAGCCCAAGAGTCCCAACGTGATGCCCAACCTGCCCAAGCACGTGCTGCAGGCCATCGAAGAGAACATGATCTACTACAACAAAATGTACAGTCTCCGAGTCAAGCCGGACCTGCTCCAGGTTCACTAGAGGGCGCTGTGGTGTTCG
+>Pf13_1/1_1.000_160
+CGCGTCCACGACCGCCACGCGCACCGAGGTCTACGACAAACTCGCGCCGCAGGAGGCTCCTCTCAACCTGCACAAGCCTCGCGCCGACAGCGTCCCGACCGACGGCAACGGCTGACGGCAGACACTCGAGCCTTGACTACGTGTATGCACAAAGCTACCC
+>Pf14_1/1_1.000_160
+ATCACGCCCCCTGTCGTGGACAACAGGCTAGCCCATTGTAATGGGAAATCTACAGTGGTAAACAAACTGATATACAATATAACCAATATGTATATATATATAACAGATATGGAGACTGATCATAAAATCAACGATTCACTTTTAGGATTAGTCATGTTTG
+>Pf15_1/1_1.000_160
+TCGTCCCAACAGCAGTCCATCAGTACAAGCAGTGTACAGAAGAAATTTGACAAAAATACTATTGATGCAGTCAAGAGATGGAACACAGAAAATCTTGACATTTATGGACCACTTCGGAACCCCAAAACCGATGGAGGTTCCTCTCCAAACCCAACCACTC
+>Pf16_1/1_1.000_160
+CACGTCACGGACGTGCTCGTCTCGAAAATCATCGATATGGTCAAAAAGAAGGAAAAGAAAGGAGGGATCACCATCAAGCCATTCCAGGTCAAGAACCATGTCTGGGTGTTCGTCAACTGTCTAATAGAGAACCCGACGTTCGACTCGCAGACGAAGGAGA
+>Pf17_1/1_1.000_160
+TGTGCGTCGGTGCCGGACTGCAGGCTGTTGTTTGTCGGGAGCACCTCCGGAGTCATCAGTGTCATCAACACCAAGTTCAACCAGAGCAAGCAAAGCCACCTGCAGGTGTTCGGCCACAAAACCTGCCTGTACGGCCATTCCGGGGCCGTCACAGCCTTCT
+>Pf18_1/1_1.000_160
+TTACGACACCTGCCCCAGATCCTGTCCGTGTCACTGCTGAGGTTCAGCTTTGACTTCCAGAAAATGGAAAGATATAAGGAAACTGGCAAGTTTGTGTTTCCGATAGAGTTGGATATGGCACCTTATGTTGATAAGTTATCAACTGCTGGGTGCACAGAAT
+>Pf19_1/1_1.000_160
+GATCTGAAAGAATCTTCTCCAGTAGCAGAGATTAATGGTCCAGAACTTATCGATGAAGCACTGCCAATTTCCGTTGATAACAGTCAAGGAACAACGGCAGAAATTAGACTTGAGCGCAGTCAAAGCCATACTGGAAAAAACGAGGCGGATTTATTTGCAC
+>Pf20_1/1_1.000_160
+TTGTACATTCAAAAGCTAGGTGGCCAAATTGTGGAATTGAATACATGTAATAATTATATATCTTTCACCATGAAAAGATGTCCATGGACCAAGTCTATGCATTGCCGATTTTCGTCATTAATATATATCAGAGATGTGATAAAAGATATATGTACTCATA
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/outputs_dna2/DNAalignment_corresponding_to_protein_from_RBH_AmAmphi_AcAcaud.fasta Wed Jan 17 08:53:53 2018 -0500
@@ -0,0 +1,12 @@
+>Ac5_1/1_1.000_160
+GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT
+>Am1_1/1_1.000_160
+GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT
+>Ac7_1/1_1.000_160
+GCACCTAGAATTACCCGAAGTTGCTTGGCAATAGCGACACCTAACGGTCGCCATGATATTTGCAGGAAGAAGGCATGTGGTACCATTGGGAACCGTCAAGCGTTTCCTCAGCCCTGTGGCAGCTGCCCGTCTGCGCCCGTGTTTGACCTTGAGCACCAAG
+>Am3_1/1_1.000_160
+GCACCTAGAATTACCCGAAGTTGCTTGGCAATAGCGACACCTAACGGTCGCCATGATATTTGCAGGAAGAAGGCATGTGGTACCATTGGGAACCGTCAAGCGTTTCCTCAGCCCTGTGGCAGCTGCCCGTCTGCGCCCGTGTTTGACCTTGAGCACCAAG
+>Ac6_1/1_1.000_160
+CAGCCTACCACTGAGAAGAGATACTTCAACATGTCTTACTGGGGTAGAAGTGGTGGTCGTACAGCGGGTGGTAATGCAGGACGTGGTCGTGGCGGCGGCAGCGGCAGTGGCAGTAGTCAAAGTGGTGGTGGCAGCTTTCTACAGGAACGTATCAAAGAGA
+>Am2_1/1_1.000_160
+CAGCCTACCACTGAGAAGAGATACTTCAACATGTCTTACTGGGGTAGAAGTGGTGGTCGTACAGCGGGTGGTAATGCAGGACGTGGTCGTGGCGGCGGCAGCGGCAGTGGCAGTAGTCAAAGTGGTGGTGGCAGCTTTCTACAGGAACGTATCAAAGAGA
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/outputs_dna2/DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_AcAcaud.fasta Wed Jan 17 08:53:53 2018 -0500
@@ -0,0 +1,8 @@
+>Ac8_1/1_1.000_160
+ATCAAAGAAGAGCAACATCGAGCTACTGGCACTGGCAATGGAATCCTAATTATAGCAGAAACAAGCACTGGTTGCCTGTTGTCTGGGTCAGCAATTGGTAGTAGAGGTGTTCCTGCTGAAGAAGTTGGGGTCAAAGCAGGACAGATGCTTTTGGATAACT
+>Pf8_1/1_1.000_160
+ATCAAAGAAGAGCAACATCGAGCTACTGGCACTGGCAATGGAATCCTAATTATAGCAGAAACAAGCACTGGTTGCCTGTTGTCTGGGTCAGCAATTGGTAGTAGAGGTGTTCCTGCTGAAGAAGTTGGGGTCAAAGCAGGACAGATGCTTTTGGATAACT
+>Ac5_1/1_1.000_160
+GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT
+>Pf7_1/1_1.000_160
+GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/outputs_dna2/DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_AmAmphi.fasta Wed Jan 17 08:53:53 2018 -0500
@@ -0,0 +1,8 @@
+>Am1_1/1_1.000_160
+GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT
+>Pf7_1/1_1.000_160
+GCACCGGGATGCGGATTTGCTGACGATATGGCAAAAGCATTGTCAGCGTGCGGAACCTGTTTATGTCACACCACTGGCATCTTCCTGGCCGTCGCAGCCTTCGTTCTGACGGCACTCGGTATTGTCTGCGTCACGCGATCAGCTGACCCGAGCCTTTGGT
+>Am8_1/1_1.000_160
+GTATTAATAAAAGGACAAGACTATTATTTAATACCAAGAAATCTGGCCTTAATAAGCATGGTTGCTTATATCATAAGCATGGTAAATCACATTGTGTTTTCCATGTGTTTACCCATCAGATGTAAAAATATTCTGCATGAAATAAAGAGCTTCTATGGTT
+>Pf9_1/1_1.000_160
+GTATTAATAAAAGGACAAGACTATTATTTAATACCAAGAAATCTGGCCTTAATAAGCATGGTTGCTTATATCATAAGCATGGTAAATCACATTGTGTTTTCCATGTGTTTACCCATCAGATGTAAAAATATTCTGCATGAAATAGGTAATTTCCCGATTA
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/outputs_dna2/DNAalignment_corresponding_to_protein_from_RBH_PfPfiji_ApApomp.fasta Wed Jan 17 08:53:53 2018 -0500
@@ -0,0 +1,16 @@
+>Ap3_1/1_1.000_160
+GCCATGCAGTACACTGGACTTCTGTTATTCTGTTTGTTTGCCTTGACGGCAGCCAAACCCGCGGAAGACCTTCAAATGCTCATCCGAGCCCTGCTCCATGAAATAGAAGAGGAAGGTGAACTCCAAGAGCGAGGCATTGGCGCCGTGAAGTATGGTGGAA
+>Pf6_1/1_1.000_160
+GCCATGCAGTACACTGGACTTCTGTTATTCTGTTTGTTTGCCTTGACGGCAGCCAAACCCGCGGAAGACCTTCAAATGCTCATCCGAGCCCTGCTCCATGAAATAGAAGAGGAAGGTGAACTCCAAGAGCGAGGCATTGGCGCCGTGAAGTATGGTGGAA
+>Ap1_1/1_1.000_160
+GGTCGCCTTATAAAAACCAATCCGAAACAGTTTTCCTTTGAAACGTGCCAAAAACCAAAAACATACTTCAAATCTTCCAGTGTCTGTTATAAAGGGGTGAGCGTAGAGAGGGCACTTGTGAGATTGGTGTCTGGGTTAAAGATTTTGCCAAAAAGCGATT
+>Pf4_1/1_1.000_160
+GGTCGCCTTATAAAAACCAATCCGAAACAGTTTTCCTTTGAAACGTGCCAAAAACCAAAAACATACTTCAAATCTTCCAGTGTCTGTTATAAAGGGGTGAGCGTAGAGAGGGCACTTGTGAGATTGGTGTCTGGGTTAAAGATTTTGCCAAAAAGCGATT
+>Ap2_1/1_1.000_160
+ATACTCAGGCACACAGCATTTGTCGTACTAGGCGAGAGAGAGAGAGGAACGACTAATTGCAACCACGATTACGTTACATTTGTTTACAAACCAAACGTACTGGCGTCGAAGATAATTAAGAGGAAGCTGACTGAATGCGATTGGCGTTGGTCTACGGGTT
+>Pf5_1/1_1.000_160
+ATACTCAGGCACACAGCATTTGTCGTACTAGGCGAGAGAGAGAGAGGAACGACTAATTGCAACCACGATTACGTTACATTTGTTTACAAACCAAACGTACTGGCGTCGAAGATAATTAAGAGGAAGCTGACTGAATGCGATTGGCGTTGGTCTACGGGTT
+>Ap4_1/1_1.000_135
+CGTTTAACCAGGCCCTGCTACCCTCCAATCTCGTCCAATCGGTCTCTACGCATCCACTCAATAATTATTGACATATTACAATTGATTCGGATTAAAAAAATGGCGCTAGGCTTAAAACACAGACAGTTCGCTAGC
+>Pf10_1/1_1.000_160
+CGGCCGCGGCGCGTCGTTCTCAGCCAAGCTGACTTCGACTTGAGCCGTCCATTCGCTTATTTACACGACGACTGCTCGACCCTTTACGACTTAGTCACACTTCCGTTTAACCAGGCCCTGCTACCCTCCAATCTCGTCCAATCGGTCTCTACGCATCCGA
