Mercurial > repos > peterjc > tmhmm_and_signalp
changeset 0:a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
author | peterjc |
---|---|
date | Tue, 07 Jun 2011 17:37:26 -0400 |
parents | |
children | 9a8a7f680dd6 |
files | test-data/four_human_proteins.fasta test-data/four_human_proteins.signalp3.tsv test-data/four_human_proteins.tmhmm2.tsv tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.py tools/protein_analysis/signalp3.xml tools/protein_analysis/suite_config.xml tools/protein_analysis/tmhmm2.py tools/protein_analysis/tmhmm2.xml |
diffstat | 11 files changed, 736 insertions(+), 0 deletions(-) [+] |
line wrap: on
line diff
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/four_human_proteins.fasta Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,61 @@ +>sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1 +MHPAVFLSLPDLRCSLLLLVTWVFTPVTTEITSLDTENIDEILNNADVALVNFYADWCRF +SQMLHPIFEEASDVIKEEFPNENQVVFARVDCDQHSDIAQRYRISKYPTLKLFRNGMMMK +REYRGQRSVKALADYIRQQKSDPIQEIRDLAEITTLDRSKRNIIGYFEQKDSDNYRVFER +VANILHDDCAFLSAFGDVSKPERYSGDNIIYKPPGHSAPDMVYLGAMTNFDVTYNWIQDK +CVPLVREITFENGEELTEEGLPFLILFHMKEDTESLEIFQNEVARQLISEKGTINFLHAD +CDKFRHPLLHIQKTPADCPVIAIDSFRHMYVFGDFKDVLIPGKLKQFVFDLHSGKLHREF +HHGPDPTDTAPGEQAQDVASSPPESSFQKLAPSEYRYTLLRDRDEL +>sp|Q9NSY1|BMP2K_HUMAN BMP-2-inducible protein kinase OS=Homo sapiens GN=BMP2K PE=1 SV=2 +MKKFSRMPKSEGGSGGGAAGGGAGGAGAGAGCGSGGSSVGVRVFAVGRHQVTLEESLAEG +GFSTVFLVRTHGGIRCALKRMYVNNMPDLNVCKREITIMKELSGHKNIVGYLDCAVNSIS +DNVWEVLILMEYCRAGQVVNQMNKKLQTGFTEPEVLQIFCDTCEAVARLHQCKTPIIHRD +LKVENILLNDGGNYVLCDFGSATNKFLNPQKDGVNVVEEEIKKYTTLSYRAPEMINLYGG +KPITTKADIWALGCLLYKLCFFTLPFGESQVAICDGNFTIPDNSRYSRNIHCLIRFMLEP +DPEHRPDIFQVSYFAFKFAKKDCPVSNINNSSIPSALPEPMTASEAAARKSQIKARITDT +IGPTETSIAPRQRPKANSATTATPSVLTIQSSATPVKVLAPGEFGNHRPKGALRPGNGPE +ILLGQGPPQQPPQQHRVLQQLQQGDWRLQQLHLQHRHPHQQQQQQQQQQQQQQQQQQQQQ +QQQQQQHHHHHHHHLLQDAYMQQYQHATQQQQMLQQQFLMHSVYQPQPSASQYPTMMPQY +QQAFFQQQMLAQHQPSQQQASPEYLTSPQEFSPALVSYTSSLPAQVGTIMDSSYSANRSV +ADKEAIANFTNQKNISNPPDMSGWNPFGEDNFSKLTEEELLDREFDLLRSNRLEERASSD +KNVDSLSAPHNHPPEDPFGSVPFISHSGSPEKKAEHSSINQENGTANPIKNGKTSPASKD +QRTGKKTSVQGQVQKGNDESESDFESDPPSPKSSEEEEQDDEEVLQGEQGDFNDDDTEPE +NLGHRPLLMDSEDEEEEEKHSSDSDYEQAKAKYSDMSSVYRDRSGSGPTQDLNTILLTSA +QLSSDVAVETPKQEFDVFGAVPFFAVRAQQPQQEKNEKNLPQHRFPAAGLEQEEFDVFTK +APFSKKVNVQECHAVGPEAHTIPGYPKSVDVFGSTPFQPFLTSTSKSESNEDLFGLVPFD +EITGSQQQKVKQRSLQKLSSRQRRTKQDMSKSNGKRHHGTPTSTKKTLKPTYRTPERARR +HKKVGRRDSQSSNEFLTISDSKENISVALTDGKDRGNVLQPEESLLDPFGAKPFHSPDLS +WHPPHQGLSDIRADHNTVLPGRPRQNSLHGSFHSADVLKMDDFGAVPFTELVVQSITPHQ +SQQSQPVELDPFGAAPFPSKQ +>sp|P06213|INSR_HUMAN Insulin receptor OS=Homo sapiens GN=INSR PE=1 SV=4 +MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVCPGMDIRNNLTRLHELENCSVIEGHL +QILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYAL +VIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNYIVLNKDDNE +ECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECL +GNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQG +CHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGC +TVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETL +EIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQE +RNDIALKTNGDQASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQ +NVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFS +DERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWE +RQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQIL +KELEESSFRKTFEDYLHNVVFVPRKTSSGTGAEDPRPSRKRRSLGDVGNVTVAVPTVAAF +PNTSSTSVPTSPEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYV +SARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCV +SRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAKIIIG +PLIFVFLFSVVIGSIYLFLRKRQPDGPLGPLYASSNPEYLSASDVFPCSVYVPDEWEVSR +EKITLLRELGQGSFGMVYEGNARDIIKGEAETRVAVKTVNESASLRERIEFLNEASVMKG +FTCHHVVRLLGVVSKGQPTLVVMELMAHGDLKSYLRSLRPEAENNPGRPPPTLQEMIQMA +AEIADGMAYLNAKKFVHRDLAARNCMVAHDFTVKIGDFGMTRDIYETDYYRKGGKGLLPV +RWMAPESLKDGVFTTSSDMWSFGVVLWEITSLAEQPYQGLSNEQVLKFVMDGGYLDQPDN +CPERVTDLMRMCWQFNPKMRPTFLEIVNLLKDDLHPSFPEVSFFHSEENKAPESEELEME +FEDMENVPLDRSSHCQREEAGGRDGGSSLGFKRSYEEHIPYTHMNGGKKNGRILTLPRSN +PS +>sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens GN=RHO PE=1 SV=1 +MNGTEGPNFYVPFSNATGVVRSPFEYPQYYLAEPWQFSMLAAYMFLLIVLGFPINFLTLY +VTVQHKKLRTPLNYILLNLAVADLFMVLGGFTSTLYTSLHGYFVFGPTGCNLEGFFATLG +GEIALWSLVVLAIERYVVVCKPMSNFRFGENHAIMGVAFTWVMALACAAPPLAGWSRYIP +EGLQCSCGIDYYTLKPEVNNESFVIYMFVVHFTIPMIIIFFCYGQLVFTVKEAAAQQQES +ATTQKAEKEVTRMVIIMVIAFLICWVPYASVAFYIFTHQGSNFGPIFMTIPAFFAKSAAI +YNPVIYIMMNKQFRNCMLTTICCGKNPLGDDEASATVSKTETSQVAPA
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/four_human_proteins.signalp3.tsv Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,5 @@ +#ID NN_Cmax_score NN_Cmax_pos NN_Cmax_pred NN_Ymax_score NN_Ymax_pos NN_Ymax_pred NN_Smax_score NN_Smax_pos NN_Smax_pred NN_Smean_score NN_Smean_pred NN_D_score NN_D_pred HMM_type HMM_Cmax_score HMM_Cmax_pos HMM_Cmax_pred HMM_Sprob_score HMM_Sprob_pred +sp|Q9BS26|ERP44_HUMAN 0.565 30 Y 0.686 30 Y 0.986 12 Y 0.818 Y 0.752 Y S 0.945 30 Y 0.966 Y +sp|Q9NSY1|BMP2K_HUMAN 0.153 869 N 0.050 270 N 0.229 12 N 0.008 N 0.029 N Q 0.000 0 N 0.000 N +sp|P06213|INSR_HUMAN 0.396 28 Y 0.561 28 Y 0.993 19 Y 0.902 Y 0.731 Y Q 0.205 28 N 0.341 N +sp|P08100|OPSD_HUMAN 0.211 52 N 0.344 52 Y 0.945 50 Y 0.245 N 0.295 N Q 0.000 52 N 0.000 N
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/four_human_proteins.tmhmm2.tsv Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,5 @@ +#ID len ExpAA First60 PredHel Topology +sp|Q9BS26|ERP44_HUMAN 406 0.23 0.23 0 o +sp|Q9NSY1|BMP2K_HUMAN 1161 0.35 0.26 0 o +sp|P06213|INSR_HUMAN 1382 50.22 20.76 2 i9-31o957-979i +sp|P08100|OPSD_HUMAN 348 157.99 21.69 7 o39-61i74-96o111-133i153-175o202-224i254-276o286-308i
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/protein_analysis/LICENSE Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,23 @@ +Copyright (c) 2010 Peter Cock, SCRI, UK + +License for TMHMM 2.0 and SignalP 3.0 wrappers for Galaxy (note that +TMHMM 2.0 and SignalP 3.0 are copyright and licensed separately). + +Permission to use, copy, modify, and distribute this software and its +documentation with or without modifications and for any purpose and +without fee is hereby granted, provided that any copyright notices +appear in all copies and that both those copyright notices and this +permission notice appear in supporting documentation, and that the +names of the contributors or copyright holders not be used in +advertising or publicity pertaining to distribution of the software +without specific prior permission. + +THE CONTRIBUTORS AND COPYRIGHT HOLDERS OF THIS SOFTWARE DISCLAIM ALL +WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED +WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE +CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY SPECIAL, INDIRECT +OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS +OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE +OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE +OR PERFORMANCE OF THIS SOFTWARE. +
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/protein_analysis/README Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,59 @@ +This package contains Galaxy wrappers for two standalone command line protein +analysis tools (SignalP 3.0 and THMHMM 2.0) from the Center for Biological +Sequence Analysis at the Technical University of Denmark, +http://www.cbs.dtu.dk/cbs/ + +To use these Galaxy wrappers you must first install the CBS command line +tools. At the time of writing both SignalP 3.0 and TMHMM 2.0 are free for +academic use. + +These wrappers are copyright 2010 by Peter Cock, SCRI, UK. All rights +reserved. See the included LICENCE file for details. + +Installation +============ + +1. Install the command line version of SignalP 3.0 and ensure it is on the + PATH, see: http://www.cbs.dtu.dk/services/SignalP/ + +2. Install the command line version of TMHMM 2.0 and ensure it is on the + PATH, see: http://www.cbs.dtu.dk/services/TMHMM/ + +3. Create a folder tools/protein_analysis under your Galaxy installation. + +4. Copy/move the following files (from this archive) there: + +tmhmm2.xml (Galaxy tool definition) +tmhmm2.py (Python wrapper script) +signalp3.xml (Galaxy tool definition) +signalp3.py (Python wrapper script) +seq_analysis_utils.py (shared Python code) +README (optional) + +5. Edit your Galaxy conjuration file tool_conf.xml (to use the tools) AND + also tool_conf.xml.sample (to run the tests) to include the two new tools + by adding: + + <section name="Protein sequence analysis" id="protein_analysis"> + <tool file="protein_analysis/tmhmm2.xml" /> + <tool file="protein_analysis/signalp3.xml" /> + </section> + +6. Copy/move the following test files (from these archive) to Galaxy + subfolder test-data: + +four_human_proteins.fasta +four_human_proteins.signalp3.tsv +four_human_proteins.tmhmm2.tsv + +7. Run the Galaxy functional tests for these new wrappers with: + +./run_functional_tests.sh -id tmhmm2 +./run_functional_tests.sh -id signalp3 + +Alternatively, this should work (assuming you left the name and id as shown in +the XML file tool_conf.xml.sample): + +./run_functional_tests.sh -sid Protein_sequence_analysis-protein_analysis + +8. Restart Galaxy and check the new tools are shown and work.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/protein_analysis/seq_analysis_utils.py Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,129 @@ +"""A few useful functions for working with FASTA files and running jobs. + +This module was originally written to hold common code used in both the TMHMM +and SignalP wrappers in Galaxy. + +Given Galaxy currently supports Python 2.4+ this cannot use the Python module +multiprocessing so the function run_jobs instead is a simple pool approach +using just the subprocess library. +""" +import sys +import os +import subprocess +from time import sleep + +__version__ = "0.0.1" + +def stop_err(msg, error_level=1): + """Print error message to stdout and quit with given error level.""" + sys.stderr.write("%s\n" % msg) + sys.exit(error_level) + +def fasta_iterator(filename, max_len=None, truncate=None): + """Simple FASTA parser yielding tuples of (title, sequence) strings.""" + handle = open(filename) + title, seq = "", "" + for line in handle: + if line.startswith(">"): + if title: + if truncate: + seq = seq[:truncate] + if max_len and len(seq) > max_len: + raise ValueError("Sequence %s is length %i, max length %i" \ + % (title.split()[0], len(seq), max_len)) + yield title, seq + title = line[1:].rstrip() + seq = "" + elif title: + seq += line.strip() + elif not line.strip() or line.startswith("#"): + #Ignore blank lines, and any comment lines + #between records (starting with hash). + pass + else: + raise ValueError("Bad FASTA line %r" % line) + handle.close() + if title: + if truncate: + seq = seq[:truncate] + if max_len and len(seq) > max_len: + raise ValueError("Sequence %s is length %i, max length %i" \ + % (title.split()[0], len(seq), max_len)) + yield title, seq + raise StopIteration + +def split_fasta(input_filename, output_filename_base, n=500, truncate=None, keep_descr=False, max_len=None): + """Split FASTA file into sub-files each of at most n sequences. + + Returns a list of the filenames used (based on the input filename). + Each sequence can also be truncated (since we only need the start for + SignalP), and have its description discarded (since we don't usually + care about it and some tools don't like very long title lines). + + If a max_len is given and any sequence exceeds it no temp files are + created and an exception is raised. + """ + iterator = fasta_iterator(input_filename, max_len, truncate) + files = [] + try: + while True: + records = [] + for i in range(n): + try: + records.append(iterator.next()) + except StopIteration: + break + if not records: + break + new_filename = "%s.%i.tmp" % (output_filename_base, len(files)) + handle = open(new_filename, "w") + if keep_descr: + for title, seq in records: + handle.write(">%s\n" % title) + for i in range(0, len(seq), 60): + handle.write(seq[i:i+60] + "\n") + else: + for title, seq in records: + handle.write(">%s\n" % title.split()[0]) + for i in range(0, len(seq), 60): + handle.write(seq[i:i+60] + "\n") + handle.close() + files.append(new_filename) + #print "%i records in %s" % (len(records), new_filename) + except ValueError, err: + #Max length failure from parser - clean up + try: + handle.close() + except: + pass + for f in files: + if os.path.isfile(f): + os.remove(f) + raise err + return files + +def run_jobs(jobs, threads): + """Takes list of cmd strings, returns dict with error levels.""" + pending = jobs[:] + running = [] + results = {} + while pending or running: + #print "%i jobs pending, %i running, %i completed" \ + # % (len(jobs), len(running), len(results)) + #See if any have finished + for (cmd, process) in running: + return_code = process.wait() + if return_code is not None: + results[cmd] = return_code + running = [(cmd, process) for (cmd, process) in running \ + if cmd not in results] + #See if we can start any new threads + while pending and len(running) < threads: + cmd = pending.pop(0) + process = subprocess.Popen(cmd, shell=True) + running.append((cmd, process)) + #Loop... + sleep(1) + #print "%i jobs completed" % len(results) + assert set(jobs) == set(results) + return results
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/protein_analysis/signalp3.py Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,137 @@ +#!/usr/bin/env python +"""Wrapper for SignalP v3.0 for use in Galaxy. + +This script takes exactly fives command line arguments: + * the organism type (euk, gram+ or gram-) + * length to truncate sequences to (integer) + * number of threads to use (integer) + * an input protein FASTA filename + * output tabular filename. + +It then calls the standalone SignalP v3.0 program (not the webservice) +requesting the short output (one line per protein) using both NN and HMM +for predictions. + +First major feature is cleaning up the output. The raw output from SignalP +v3.0 looks like this (21 columns space separated): + +# SignalP-NN euk predictions # SignalP-HMM euk predictions +# name Cmax pos ? Ymax pos ? Smax pos ? Smean ? D ? # name ! Cmax pos ? Sprob ? +gi|2781234|pdb|1JLY| 0.061 17 N 0.043 17 N 0.199 1 N 0.067 N 0.055 N gi|2781234|pdb|1JLY|B Q 0.000 17 N 0.000 N +gi|4959044|gb|AAD342 0.099 191 N 0.012 38 N 0.023 12 N 0.014 N 0.013 N gi|4959044|gb|AAD34209.1|AF069992_1 Q 0.000 0 N 0.000 N +gi|671626|emb|CAA856 0.139 381 N 0.020 8 N 0.121 4 N 0.067 N 0.044 N gi|671626|emb|CAA85685.1| Q 0.000 0 N 0.000 N +gi|3298468|dbj|BAA31 0.208 24 N 0.184 38 N 0.980 32 Y 0.613 Y 0.398 N gi|3298468|dbj|BAA31520.1| Q 0.066 24 N 0.139 N + +In order to make it easier to use in Galaxy, this wrapper script reformats +this to use tab separators. Also it removes the redundant truncated name +column, and assigns unique column names in the header: + +#ID NN_Cmax_score NN_Cmax_pos NN_Cmax_pred NN_Ymax_score NN_Ymax_pos NN_Ymax_pred NN_Smax_score NN_Smax_pos NN_Smax_pred NN_Smean_score NN_Smean_pred NN_D_score NN_D_pred HMM_bang HMM_Cmax_score HMM_Cmax_pos HMM_Cmax_pred HMM_Sprob_score HMM_Sprob_pred +gi|2781234|pdb|1JLY|B 0.061 17 N 0.043 17 N 0.199 1 N 0.067 N 0.055 N Q 0.000 17 N 0.000 N +gi|4959044|gb|AAD34209.1|AF069992_1 0.099 191 N 0.012 38 N 0.023 12 N 0.014 N 0.013 N Q 0.000 0 N 0.000 N +gi|671626|emb|CAA85685.1| 0.139 381 N 0.020 8 N 0.121 4 N 0.067 N 0.044 N Q 0.000 0 N 0.000 N +gi|3298468|dbj|BAA31520.1| 0.208 24 N 0.184 38 N 0.980 32 Y 0.613 Y 0.398 N Q 0.066 24 N 0.139 N + +The second major feature is overcoming SignalP's built in limit of 4000 +sequences by breaking up the input FASTA file into chunks. This also allows +us to pre-trim the sequences since SignalP only needs their starts. + +The third major feature is taking advantage of multiple cores (since SignalP +v3.0 itself is single threaded) by using the individual FASTA input files to +run multiple copies of TMHMM in parallel. I would normally use Python's +multiprocessing library in this situation but it requires at least Python 2.6 +and at the time of writing Galaxy still supports Python 2.4. +""" +import sys +import os +from seq_analysis_utils import stop_err, split_fasta, run_jobs + +FASTA_CHUNK = 500 +MAX_LEN = 6000 #Found by trial and error + +if len(sys.argv) != 6: + stop_err("Require five arguments, organism, truncate, threads, input protein FASTA file & output tabular file") + +organism = sys.argv[1] +if organism not in ["euk", "gram+", "gram-"]: + stop_err("Organism argument %s is not one of euk, gram+ or gram-" % organism) + +try: + truncate = int(sys.argv[2]) +except: + truncate = 0 +if truncate < 0: + stop_err("Truncate argument %s is not a positive integer (or zero)" % sys.argv[2]) + +try: + num_threads = int(sys.argv[3]) +except: + num_threads = 0 +if num_threads < 1: + stop_err("Threads argument %s is not a positive integer" % sys.argv[3]) + +fasta_file = sys.argv[4] + +tabular_file = sys.argv[5] + +def clean_tabular(raw_handle, out_handle): + """Clean up SignalP output to make it tabular.""" + for line in raw_handle: + if not line or line.startswith("#"): + continue + parts = line.rstrip("\r\n").split() + assert len(parts)==21, repr(line) + assert parts[14].startswith(parts[0]) + #Remove redundant truncated name column (col 0) + #and put full name at start (col 14) + parts = parts[14:15] + parts[1:14] + parts[15:] + out_handle.write("\t".join(parts) + "\n") + +fasta_files = split_fasta(fasta_file, tabular_file, n=FASTA_CHUNK, truncate=truncate, max_len=MAX_LEN) +temp_files = [f+".out" for f in fasta_files] +assert len(fasta_files) == len(temp_files) +jobs = ["signalp -short -t %s %s > %s" % (organism, fasta, temp) + for (fasta, temp) in zip(fasta_files, temp_files)] +assert len(fasta_files) == len(temp_files) == len(jobs) + +def clean_up(file_list): + for f in file_list: + if os.path.isfile(f): + os.remove(f) + +if len(jobs) > 1 and num_threads > 1: + #A small "info" message for Galaxy to show the user. + print "Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs)) +results = run_jobs(jobs, num_threads) +assert len(fasta_files) == len(temp_files) == len(jobs) +for fasta, temp, cmd in zip(fasta_files, temp_files, jobs): + error_level = results[cmd] + try: + output = open(temp).readline() + except IOError: + output = "" + if error_level or output.lower().startswith("error running"): + clean_up(fasta_files) + clean_up(temp_files) + stop_err("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output), + error_level) +del results + +out_handle = open(tabular_file, "w") +fields = ["ID"] +#NN results: +for name in ["Cmax", "Ymax", "Smax"]: + fields.extend(["NN_%s_score"%name, "NN_%s_pos"%name, "NN_%s_pred"%name]) +fields.extend(["NN_Smean_score", "NN_Smean_pred", "NN_D_score", "NN_D_pred"]) +#HMM results: +fields.extend(["HMM_type", "HMM_Cmax_score", "HMM_Cmax_pos", "HMM_Cmax_pred", + "HMM_Sprob_score", "HMM_Sprob_pred"]) +out_handle.write("#" + "\t".join(fields) + "\n") +for temp in temp_files: + data_handle = open(temp) + clean_tabular(data_handle, out_handle) + data_handle.close() +out_handle.close() + +clean_up(fasta_files) +clean_up(temp_files)
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/protein_analysis/signalp3.xml Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,116 @@ +<tool id="signalp3" name="SignalP 3.0" version="0.0.1"> + <description>Find signal peptides in protein sequences</description> + <command interpreter="python"> + signalp3.py $organism $truncate 8 $fasta_file $tabular_file + ##I want the number of threads to be a Galaxy config option... + </command> + <inputs> + <param name="fasta_file" type="data" format="fasta" label="FASTA file of protein sequences"/> + <param name="organism" type="select" display="radio" label="Organism"> + <option value="euk">Eukaryote</option> + <option value="gram+">Gram positive</option> + <option value="gram-">Gram negative</option> + </param> + <param name="truncate" type="integer" label="Truncate sequences to this many amino acids" value="60" help="Use zero for no truncation, maximum value 6000"> + <validator type="in_range" min="0" max="6000" message="Truncation value should be at most 6000. Use zero for no truncation."/> + </param> + </inputs> + <outputs> + <data name="tabular_file" format="tabular" label="SignalP $organism results" /> + </outputs> + <requirements> + <requirement type="binary">signalp</requirement> + </requirements> + <tests> + <test> + <param name="fasta_file" value="four_human_proteins.fasta" ftype="fasta"/> + <param name="organism" value="euk"/> + <param name="truncate" value="0"/> + <output name="tabular_file" file="four_human_proteins.signalp3.tsv" ftype="tabular"/> + </test> + </tests> + <help> + +**What it does** + +This calls the SignalP v3.0 tool for prediction of signal peptides, which uses both a neural network (NN) and Hidden Markmov Model (HMM) to produce two sets of scores. + +The input is a FASTA file of protein sequences, and the output is tabular with twenty columns (one row per protein): + + * Sequence identifier + * Neural Network (NN) predictions (13 columns) + * Hidden Markov Model (HMM) predictions (6 columns) + +Internally the input FASTA file is divided into parts (to allow multiple processors to be used), and the proteins truncated as specified (see below). The raw output from SignalP is then reformatted into a tabular layout suitable for Galaxy (see below). + +**Neural Network Scores** + +For each organism class (Eukaryote, Gram-negative and Gram-positive), two different neural networks are used, one for predicting the actual signal peptide and one for predicting the position of the signal peptidase I (SPase I) cleavage site. + +The NN output comprises three different scores (C-max, S-max and Y-max) and two scores derived from them (S-mean and D-score). + +The C-score is the 'cleavage site' score. For each position in the submitted sequence, a C-score is reported, which should only be significantly high at the cleavage site. Confusion is often seen with the position numbering of the cleavage site. When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein, meaning that a reported cleavage site between amino acid 26-27 corresponds to that the mature protein starts at (and include) position 27. + +The S-score for the signal peptide prediction is calculateded for every single amino acid position in the submitted sequence (not shown in the output via Galaxy), with high scores indicating that the corresponding amino acid is part of a signal peptide, and low scores indicating that the amino acid is part of a mature protein. + +Y-max is a derivative of the C-score combined with the S-score resulting in a better cleavage site prediction than the raw C-score alone. This is due to the fact that multiple high-peaking C-scores can be found in one sequence, where only one is the true cleavage site. The cleavage site is assigned from the Y-score where the slope of the S-score is steep and a significant C-score is found. + +The S-mean is the average of the S-score, ranging from the N-terminal amino acid to the amino acid assigned with the highest Y-max score, thus the S-mean score is calculated for the length of the predicted signal peptide. The S-mean score was in SignalP version 2.0 used as the criteria for discrimination of secretory and non-secretory proteins. + +The D-score is introduced in SignalP version 3.0 and is a simple average of the S-mean and Y-max score. The score shows superior discrimination performance of secretory and non-secretory proteins to that of the S-mean score which was used in SignalP version 1 and 2. + +For non-secretory proteins all the scores represented in the SignalP3-NN output should ideally be very low. + +**Hidden Markov Model Scores** + +The hidden Markov model calculates the probability of whether the submitted sequence contains a signal peptide or not. The eukaryotic HMM model also reports the probability of a signal anchor, previously named uncleaved signal peptides. Furthermore, the cleavage site is assigned by a probability score together with scores for the n-region, h-region, and c-region of the signal peptide, if such one is found. + +The 'type' column uses 'S' for a signal peptide (i.e. secretory protein) and 'Q' for non-secretory protein. + +**Notes** + +The raw output 'short' output from TMHMM v2.0 looks something like this (21 columns space separated - shown here formatted nicely). Notice that the identifiers are given twice, the first time truncated (as part of the NN predictions) and the second time in full (in the HMM predictions). + +==================== ===== === = ===== === = ===== === = ===== = ===== = =================================== = ===== === = ===== = +# SignalP-NN euk predictions # SignalP-HMM euk predictions +----------------------------------------------------------------------------- ------------------------------------------------------------ +# name Cmax pos ? Ymax pos ? Smax pos ? Smean ? D ? # name ! Cmax pos ? Sprob ? +gi|2781234|pdb|1JLY| 0.061 17 N 0.043 17 N 0.199 1 N 0.067 N 0.055 N gi|2781234|pdb|1JLY|B Q 0.000 17 N 0.000 N +gi|4959044|gb|AAD342 0.099 191 N 0.012 38 N 0.023 12 N 0.014 N 0.013 N gi|4959044|gb|AAD34209.1|AF069992_1 Q 0.000 0 N 0.000 N +gi|671626|emb|CAA856 0.139 381 N 0.020 8 N 0.121 4 N 0.067 N 0.044 N gi|671626|emb|CAA85685.1| Q 0.000 0 N 0.000 N +gi|3298468|dbj|BAA31 0.208 24 N 0.184 38 N 0.980 32 Y 0.613 Y 0.398 N gi|3298468|dbj|BAA31520.1| Q 0.066 24 N 0.139 N +==================== ===== === = ===== === = ===== === = ===== = ===== = =================================== = ===== === = ===== = + +In order to make this easier to use in Galaxy, the wrapper script simplifies this to remove the redundant column and use tabs for separation. It also includes a header line with unique column names. + +=================================== ============= =========== ============ ============= =========== ============ ============= =========== ============ ============== ============= ========== ========= ======== ============== ============ ============= =============== ============== +#ID NN_Cmax_score NN_Cmax_pos NN_Cmax_pred NN_Ymax_score NN_Ymax_pos NN_Ymax_pred NN_Smax_score NN_Smax_pos NN_Smax_pred NN_Smean_score NN_Smean_pred NN_D_score NN_D_pred HMM_type HMM_Cmax_score HMM_Cmax_pos HMM_Cmax_pred HMM_Sprob_score HMM_Sprob_pred +gi|2781234|pdb|1JLY|B 0.061 17 N 0.043 17 N 0.199 1 N 0.067 N 0.055 N Q 0.000 17 N 0.000 N +gi|4959044|gb|AAD34209.1|AF069992_1 0.099 191 N 0.012 38 N 0.023 12 N 0.014 N 0.013 N Q 0.000 0 N 0.000 N +gi|671626|emb|CAA85685.1| 0.139 381 N 0.020 8 N 0.121 4 N 0.067 N 0.044 N Q 0.000 0 N 0.000 N +gi|3298468|dbj|BAA31520.1| 0.208 24 N 0.184 38 N 0.980 32 Y 0.613 Y 0.398 N Q 0.066 24 N 0.139 N +=================================== ============= =========== ============ ============= =========== ============ ============= =========== ============ ============== ============= ========== ========= ======== ============== ============ ============= =============== ============== + +**Truncation** + +Signal peptides are found at the start of a protein, so there is limited value in providing the full length sequence, and providing the full sequence slows down the analysis. Furthermore, SignalP has an upper bound on the sequence length it will accept (6000bp). Thus for practical reasons it is useful to truncate the proteins before passing them to SignalP. However, the precise point they are truncated does have a small influence on some score values, and thus to the results. + +**References** + +Bendtsen, Nielsen, von Heijne, and Brunak. +Improved prediction of signal peptides: SignalP 3.0. +J. Mol. Biol., 340:783-795, 2004. + +Nielsen, Engelbrecht, Brunak and von Heijne. +Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. +Protein Engineering, 10:1-6, 1997. + +Nielsen and Krogh. +Prediction of signal peptides and signal anchors by a hidden Markov model. +Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology (ISMB 6), +AAAI Press, Menlo Park, California, pp. 122-130, 1998. + +http://www.cbs.dtu.dk/services/SignalP-3.0/output.php + + </help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/protein_analysis/suite_config.xml Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,9 @@ + <suite id="tmhmm_and_signalp" name="TMHMM and SignalP" version="0.0.1"> + <description>Wrappers for TMHMM and SignalP</description> + <tool id="tmhmm2" name="TMHMM 2.0" version="0.0.1"> + <description>Find transmembrane domains in protein sequences</description> + </tool> + <tool id="signalp3" name="SignalP 3.0" version="0.0.1"> + <description>Find signal peptides in protein sequences</description> + </tool> + </suite>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/protein_analysis/tmhmm2.py Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,111 @@ +#!/usr/bin/env python +"""Wrapper for TMHMM v2.0 for use in Galaxy. + +This script takes exactly two command line arguments - an input protein FASTA +filename and an output tabular filename. It then calls the standalone TMHMM +v2.0 program (not the webservice) requesting the short output (one line per +protein). + +First major feature is cleaning up the tabular output. The raw output from +TMHMM v2.0 looks like this (six columns tab separated): + + gi|2781234|pdb|1JLY|B len=304 ExpAA=0.01 First60=0.00 PredHel=0 Topology=o + gi|4959044|gb|AAD34209.1|AF069992_1 len=600 ExpAA=0.00 First60=0.00 PredHel=0 Topology=o + gi|671626|emb|CAA85685.1| len=473 ExpAA=0.19 First60=0.00 PredHel=0 Topology=o + gi|3298468|dbj|BAA31520.1| len=107 ExpAA=59.37 First60=31.17 PredHel=3 Topology=o23-45i52-74o89-106i + +In order to make it easier to use in Galaxy, this wrapper script simplifies +this to remove the redundant tags, and instead adds a comment line at the +top with the column names: + + #ID len ExpAA First60 PredHel Topology + gi|2781234|pdb|1JLY|B 304 0.01 60 0.00 0 o + gi|4959044|gb|AAD34209.1|AF069992_1 600 0.00 0 0.00 0 o + gi|671626|emb|CAA85685.1| 473 0.19 0.00 0 o + gi|3298468|dbj|BAA31520.1| 107 59.37 31.17 3 o23-45i52-74o89-106i + +The second major potential feature is taking advantage of multiple cores +(since TMHMM v2.0 itself is single threaded) by dividing the input FASTA file +into chunks and running multiple copies of TMHMM in parallel. I would normally +use Python's multiprocessing library in this situation but it requires at +least Python 2.6 and at the time of writing Galaxy still supports Python 2.4. +""" +import sys +import os +from seq_analysis_utils import stop_err, split_fasta, run_jobs + +FASTA_CHUNK = 500 + +if len(sys.argv) != 4: + stop_err("Require three arguments, number of threads (int), input protein FASTA file & output tabular file") +try: + num_threads = int(sys.argv[1]) +except: + num_threads = 0 +if num_threads < 1: + stop_err("Threads argument %s is not a positive integer" % sys.argv[1]) +fasta_file = sys.argv[2] +tabular_file = sys.argv[3] + +def clean_tabular(raw_handle, out_handle): + """Clean up tabular TMHMM output.""" + for line in raw_handle: + if not line: + continue + parts = line.rstrip("\r\n").split("\t") + try: + identifier, length, expAA, first60, predhel, topology = parts + except: + assert len(parts)!=6 + stop_err("Bad line: %r" % line) + assert length.startswith("len="), line + length = length[4:] + assert expAA.startswith("ExpAA="), line + expAA = expAA[6:] + assert first60.startswith("First60="), line + first60 = first60[8:] + assert predhel.startswith("PredHel="), line + predhel = predhel[8:] + assert topology.startswith("Topology="), line + topology = topology[9:] + out_handle.write("%s\t%s\t%s\t%s\t%s\t%s\n" \ + % (identifier, length, expAA, first60, predhel, topology)) + +fasta_files = split_fasta(fasta_file, tabular_file, FASTA_CHUNK) +temp_files = [f+".out" for f in fasta_files] +jobs = ["tmhmm %s > %s" % (fasta, temp) + for fasta, temp in zip(fasta_files, temp_files)] + +def clean_up(file_list): + for f in file_list: + if os.path.isfile(f): + os.remove(f) + +if len(jobs) > 1 and num_threads > 1: + #A small "info" message for Galaxy to show the user. + print "Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs)) +results = run_jobs(jobs, num_threads) +for fasta, temp, cmd in zip(fasta_files, temp_files, jobs): + error_level = results[cmd] + if error_level: + try: + output = open(temp).readline() + except IOError: + output = "" + clean_up(fasta_files) + clean_up(temp_files) + stop_err("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output), + error_level) +del results +del jobs + +out_handle = open(tabular_file, "w") +out_handle.write("#ID\tlen\tExpAA\tFirst60\tPredHel\tTopology\n") +for temp in temp_files: + data_handle = open(temp) + clean_tabular(data_handle, out_handle) + data_handle.close() +out_handle.close() + +clean_up(fasta_files) +clean_up(temp_files)
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/protein_analysis/tmhmm2.xml Tue Jun 07 17:37:26 2011 -0400 @@ -0,0 +1,81 @@ +<tool id="tmhmm2" name="TMHMM 2.0" version="0.0.1"> + <description>Find transmembrane domains in protein sequences</description> + <command interpreter="python"> + tmhmm2.py 8 $fasta_file $tabular_file + ##I want the number of threads to be a Galaxy config option... + </command> + <inputs> + <param name="fasta_file" type="data" format="fasta" label="FASTA file of protein sequences"/> + <!-- + <param name="version" type="select" display="radio" label="Model version"> + <option value="">Version 1 (old)</option> + <option value="" selected="True">Version 2 (default)</option> + </param> + --> + </inputs> + <outputs> + <data name="tabular_file" format="tabular" label="TMHMM results" /> + </outputs> + <requirements> + <requirement type="binary">tmhmm</requirement> + </requirements> + <tests> + <test> + <param name="fasta_file" value="four_human_proteins.fasta" ftype="fasta"/> + <output name="tabular_file" file="four_human_proteins.tmhmm2.tsv" ftype="tabular"/> + </test> + </tests> + <help> + +**What it does** + +This calls the TMHMM v2.0 tool for prediction of transmembrane (TM) helices in proteins using a hidden Markov model (HMM). + +The input is a FASTA file of protein sequences, and the output is tabular with six columns (one row per protein): + + 1. Sequence identifier + 2. Sequence length + 3. Expected number of amino acids in TM helices (ExpAA). If this number is larger than 18 it is very likely to be a transmembrane protein (OR have a signal peptide). + 4. Expected number of amino acids in TM helices in the first 60 amino acids of the protein (Exp60). If this number more than a few, be aware that a predicted transmembrane helix in the N-term could be a signal peptide. + 5. Number of transmembrane helices predicted by N-best. + 6. Topology predicted by N-best (encoded as a strip using o for output and i for inside) + +Predicted TM segments in the n-terminal region sometime turn out to be signal peptides. + +One of the most common mistakes by the program is to reverse the direction of proteins with one TM segment. + +Do not use the program to predict whether a non-membrane protein is cytoplasmic or not. + +**Notes** + +The raw output from TMHMM v2.0 looks like this (six columns tab separated): + +=================================== ======= =========== ============= ========= ============================= +gi|2781234|pdb|1JLY|B len=304 ExpAA=0.01 First60=0.00 PredHel=0 Topology=o +gi|4959044|gb|AAD34209.1|AF069992_1 len=600 ExpAA=0.00 First60=0.00 PredHel=0 Topology=o +gi|671626|emb|CAA85685.1| len=473 ExpAA=0.19 First60=0.00 PredHel=0 Topology=o +gi|3298468|dbj|BAA31520.1| len=107 ExpAA=59.37 First60=31.17 PredHel=3 Topology=o23-45i52-74o89-106i +=================================== ======= =========== ============= ========= ============================= + +In order to make it easier to use in Galaxy, the wrapper script simplifies this to remove the redundant tags, and instead adds a comment line at the top with the column names: + +=================================== === ===== ======= ======= ==================== +#ID len ExpAA First60 PredHel Topology +gi|2781234|pdb|1JLY|B 304 0.01 0.00 0 o +gi|4959044|gb|AAD34209.1|AF069992_1 600 0.00 0.00 0 o +gi|671626|emb|CAA85685.1| 473 0.19 0.00 0 o +gi|3298468|dbj|BAA31520.1| 107 59.37 31.17 3 o23-45i52-74o89-106i +=================================== === ===== ======= ======= ==================== + +**References** + +Krogh, Larsson, von Heijne, and Sonnhammer. +Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes. +J. Mol. Biol. 305:567-580, 2001. + +Sonnhammer, von Heijne, and Krogh. +A hidden Markov model for predicting transmembrane helices in protein sequences. +In J. Glasgow et al., eds.: Proc. Sixth Int. Conf. on Intelligent Systems for Molecular Biology, pages 175-182. AAAI Press, 1998. + + </help> +</tool>