annotate tools/protein_analysis/tmhmm2.py @ 27:9e36a1b9302d draft
planemo upload for repository https://github.com/peterjc/pico_galaxy/tools/protein_analysis commit 3c6f0dca0e1318eecd1e07d177ffc5752b4f6c95
author | peterjc |
---|---|
date | Thu, 21 May 2015 10:57:40 -0400 |
parents | 20139cb4c844 |
children | 3cb02adf4326 |
#!/usr/bin/env python
"""Wrapper for TMHMM v2.0 for use in Galaxy.

This script takes exactly three command line arguments - number of threads,
an input protein FASTA filename, and an output tabular filename. It then
calls the standalone TMHMM v2.0 program (not the webservice) requesting
the short output (one line per protein).

The first major feature is cleaning up the tabular output. The short form raw
output from TMHMM v2.0 looks like this (six columns tab separated):

gi|2781234|pdb|1JLY|B	len=304	ExpAA=0.01	First60=0.00	PredHel=0	Topology=o
gi|4959044|gb|AAD34209.1|AF069992_1	len=600	ExpAA=0.00	First60=0.00	PredHel=0	Topology=o
gi|671626|emb|CAA85685.1|	len=473	ExpAA=0.19	First60=0.00	PredHel=0	Topology=o
gi|3298468|dbj|BAA31520.1|	len=107	ExpAA=59.37	First60=31.17	PredHel=3	Topology=o23-45i52-74o89-106i

If there are any additional 'comment' lines starting with the hash (#)
character these are ignored by this script.

In order to make it easier to use in Galaxy, this wrapper script simplifies
this to remove the redundant tags, and instead adds a comment line at the
top with the column names:

#ID	len	ExpAA	First60	PredHel	Topology
gi|2781234|pdb|1JLY|B	304	0.01	0.00	0	o
gi|4959044|gb|AAD34209.1|AF069992_1	600	0.00	0.00	0	o
gi|671626|emb|CAA85685.1|	473	0.19	0.00	0	o
gi|3298468|dbj|BAA31520.1|	107	59.37	31.17	3	o23-45i52-74o89-106i

The second major potential feature is taking advantage of multiple cores
(since TMHMM v2.0 itself is single threaded) by dividing the input FASTA file
into chunks and running multiple copies of TMHMM in parallel. I would normally
use Python's multiprocessing library in this situation but it requires at
least Python 2.6 and at the time of writing Galaxy still supports Python 2.4.

Note that this is somewhat redundant with job-splitting available in Galaxy
itself (see the SignalP XML file for settings).

Also tmhmm2 can fail without returning an error code, for example if run on a
64 bit machine with only the 32 bit binaries installed. This script will spot
when there is no output from tmhmm2, and raise an error.
"""
import sys
import os
import tempfile
from seq_analysis_utils import sys_exit, split_fasta, run_jobs, thread_count

FASTA_CHUNK = 500

if len(sys.argv) != 4:
    sys_exit("Require three arguments, number of threads (int), input protein FASTA file & output tabular file")

num_threads = thread_count(sys.argv[1], default=4)
fasta_file = sys.argv[2]
tabular_file = sys.argv[3]

tmp_dir = tempfile.mkdtemp()

def clean_tabular(raw_handle, out_handle):
    """Clean up tabular TMHMM output, returns output line count."""
    count = 0
    for line in raw_handle:
        if not line.strip() or line.startswith("#"):
            # Ignore any blank lines or comment lines
            continue
        parts = line.rstrip("\r\n").split("\t")
        try:
            identifier, length, expAA, first60, predhel, topology = parts
        except ValueError:
            # Unpacking only fails when the column count is not six
            assert len(parts) != 6
            sys_exit("Bad line: %r" % line)
        # Check each tag is present, then strip it to leave just the value
        assert length.startswith("len="), line
        length = length[4:]
        assert expAA.startswith("ExpAA="), line
        expAA = expAA[6:]
        assert first60.startswith("First60="), line
        first60 = first60[8:]
        assert predhel.startswith("PredHel="), line
        predhel = predhel[8:]
        assert topology.startswith("Topology="), line
        topology = topology[9:]
        out_handle.write("%s\t%s\t%s\t%s\t%s\t%s\n"
                         % (identifier, length, expAA, first60, predhel, topology))
        count += 1
    return count

# Note that if the input FASTA file contains no sequences,
# split_fasta returns an empty list (i.e. zero temp files).
fasta_files = split_fasta(fasta_file, os.path.join(tmp_dir, "tmhmm"), FASTA_CHUNK)
temp_files = [f + ".out" for f in fasta_files]
# One shell command per FASTA chunk, each writing to its own output file
jobs = ["tmhmm -short %s > %s" % (fasta, temp)
        for fasta, temp in zip(fasta_files, temp_files)]

def clean_up(file_list):
    """Remove the given temp files, then the temp directory if it is empty."""
    for f in file_list:
        if os.path.isfile(f):
            os.remove(f)
    try:
        os.rmdir(tmp_dir)
    except OSError:
        # Directory is not empty or is already gone, leave it alone
        pass

if len(jobs) > 1 and num_threads > 1:
    # A small "info" message for Galaxy to show the user.
    # The parenthesised form works as a print statement or a print function.
    print("Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs)))
results = run_jobs(jobs, num_threads)
for fasta, temp, cmd in zip(fasta_files, temp_files, jobs):
    error_level = results[cmd]
    if error_level:
        # Report the first line of output from the failed task, if any
        try:
            output = open(temp).readline()
        except IOError:
            output = ""
        clean_up(fasta_files + temp_files)
        sys_exit("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output),
                 error_level)
del results
del jobs

out_handle = open(tabular_file, "w")
out_handle.write("#ID\tlen\tExpAA\tFirst60\tPredHel\tTopology\n")
for temp in temp_files:
    data_handle = open(temp)
    count = clean_tabular(data_handle, out_handle)
    data_handle.close()
    if not count:
        clean_up(fasta_files + temp_files)
        sys_exit("No output from tmhmm2")
out_handle.close()

clean_up(fasta_files + temp_files)
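
The tag-stripping step described in the module docstring can be tried on its own, without TMHMM or Galaxy installed. The following is a minimal sketch, not part of the wrapper: the helper name `strip_tmhmm_tags` and the `EXPECTED_TAGS` tuple are hypothetical, and it raises `ValueError` where the script itself uses `assert` and `sys_exit`.

```python
# Hypothetical helper, not part of tmhmm2.py: strips the "tag=" prefixes
# from one line of TMHMM v2.0 short-format output.
EXPECTED_TAGS = ("len=", "ExpAA=", "First60=", "PredHel=", "Topology=")


def strip_tmhmm_tags(line):
    """Return the six cleaned column values for one raw TMHMM line."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 6:
        raise ValueError("Expected 6 tab separated columns: %r" % line)
    cleaned = [parts[0]]  # the identifier column carries no tag
    for tag, value in zip(EXPECTED_TAGS, parts[1:]):
        if not value.startswith(tag):
            raise ValueError("Missing %s in line: %r" % (tag, line))
        cleaned.append(value[len(tag):])
    return cleaned


# Sample line taken from the module docstring.
raw = ("gi|3298468|dbj|BAA31520.1|\tlen=107\tExpAA=59.37\tFirst60=31.17"
       "\tPredHel=3\tTopology=o23-45i52-74o89-106i\n")
print("\t".join(strip_tmhmm_tags(raw)))
```

Looping over a tuple of expected tags keeps each prefix and its length in one place, whereas the script spells out every slice offset by hand. That is only a stylistic choice for this sketch and does not change what is written to the tabular file.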