changeset 5:ef7ceca37e3f
Migrated tool version 0.0.8 from old tool shed archive to new tool shed repository
author | peterjc |
---|---|
date | Tue, 07 Jun 2011 17:40:55 -0400 |
parents | 1426b2bae76d |
children | 39a6e46cdda3 |
files | tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/signalp3.py tools/protein_analysis/suite_config.xml tools/protein_analysis/wolf_psort.py tools/protein_analysis/wolf_psort.xml |
diffstat | 6 files changed, 292 insertions(+), 30 deletions(-) |
--- a/tools/protein_analysis/LICENSE Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/LICENSE Tue Jun 07 17:40:55 2011 -0400
@@ -1,7 +1,8 @@
-Copyright (c) 2010 Peter Cock, SCRI, UK
+Copyright (c) 2010-2011 Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK.
 
-License for TMHMM 2.0 and SignalP 3.0 wrappers for Galaxy (note that
-TMHMM 2.0 and SignalP 3.0 are copyright and licensed separately).
+License for TMHMM 2.0, SignalP 3.0, and WoLF PSORT wrappers for Galaxy
+(note that tools themselves are copyright and licensed separately).
 
 Permission to use, copy, modify, and distribute this software and its
 documentation with or without modifications and for any purpose and
--- a/tools/protein_analysis/README Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/README Tue Jun 07 17:40:55 2011 -0400
@@ -1,45 +1,70 @@
-This package contains Galaxy wrappers for two standalone command line protein
-analysis tools (SignalP 3.0 and THMHMM 2.0) from the Center for Biological
-Sequence Analysis at the Technical University of Denmark,
-http://www.cbs.dtu.dk/cbs/
+This package contains Galaxy wrappers for a selection of standalone command
+line protein analysis tools:
+
+* SignalP 3.0 and THMHMM 2.0, from the Center for Biological
+  Sequence Analysis at the Technical University of Denmark,
+  http://www.cbs.dtu.dk/cbs/
+
+* WoLF PSORT v0.2 from http://wolfpsort.org/
+
+To use these Galaxy wrappers you must first install the command line tools.
+At the time of writing they are all free for academic use.
+
+These wrappers are copyright 2010-2011 by Peter Cock, James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the included LICENCE file for details.
 
-To use these Galaxy wrappers you must first install the CBS command line
-tools. At the time of writing both SignalP 3.0 and TMHMM 2.0 are free for
-academic use.
+Requirements
+============
+
+First install those command line tools you wish to use the wrappers for:
+
+1. Install the command line version of SignalP 3.0 and ensure "signalp" is
+   on the PATH, see: http://www.cbs.dtu.dk/services/SignalP/
 
-These wrappers are copyright 2010 by Peter Cock, SCRI, UK. All rights
-reserved. See the included LICENCE file for details.
+2. Install the command line version of TMHMM 2.0 and ensure "tmhmm" is on
+   the PATH, see: http://www.cbs.dtu.dk/services/TMHMM/
+
+3. Install the WoLF PSORT v0.2 package, and ensure "runWolfPsortSummary"
+   is on the PATH (we use an extra wrapper script to change to the WoLF PSORT
+   directory, run runWolfPsortSummary, and then change back to the original
+   directory), see: http://wolfpsort.org/WoLFPSORT_package/version0.2/
+
+Verify each of the tools is installed and working from the command line
+(when logged in at the Galaxy user if appropriate).
 
 Installation
 ============
 
-1. Install the command line version of SignalP 3.0 and ensure it is on the
-   PATH, see: http://www.cbs.dtu.dk/services/SignalP/
+1. Create a folder tools/protein_analysis under your Galaxy installation.
 
-2. Install the command line version of TMHMM 2.0 and ensure it is on the
-   PATH, see: http://www.cbs.dtu.dk/services/TMHMM/
-
-3. Create a folder tools/protein_analysis under your Galaxy installation.
-
-4. Copy/move the following files (from this archive) there:
+2. Copy/move the following files (from this archive) there:
 
 tmhmm2.xml (Galaxy tool definition)
 tmhmm2.py (Python wrapper script)
+
 signalp3.xml (Galaxy tool definition)
 signalp3.py (Python wrapper script)
+
+wolf_psort.xml (Galaxy tool definition)
+wolf_psort.py (Python wrapper script)
+
 seq_analysis_utils.py (shared Python code)
 README (optional)
 
-5. Edit your Galaxy conjuration file tool_conf.xml (to use the tools) AND
-   also tool_conf.xml.sample (to run the tests) to include the two new tools
+3. Edit your Galaxy conjuration file tool_conf.xml (to use the tools) AND
+   also tool_conf.xml.sample (to run the tests) to include the new tools
    by adding:
 
    <section name="Protein sequence analysis" id="protein_analysis">
      <tool file="protein_analysis/tmhmm2.xml" />
      <tool file="protein_analysis/signalp3.xml" />
+     <tool file="protein_analysis/wolf_psort.xml" />
    </section>
 
-6. Copy/move the following test files (from these archive) to Galaxy
+   Leave out the lines for any tools you do not wish to use in Galaxy.
+
+4. Copy/move the following test files (from these archive) to Galaxy
    subfolder test-data:
 
 four_human_proteins.fasta
@@ -49,7 +74,7 @@
 empty_tmhmm2.tabular
 empty_signalp3.tabular
 
-7. Run the Galaxy functional tests for these new wrappers with:
+5. Run the Galaxy functional tests for these new wrappers with:
 
 ./run_functional_tests.sh -id tmhmm2
 ./run_functional_tests.sh -id signalp3
@@ -59,7 +84,7 @@
 
 ./run_functional_tests.sh -sid Protein_sequence_analysis-protein_analysis
 
-8. Restart Galaxy and check the new tools are shown and work.
+6. Restart Galaxy and check the new tools are shown and work.
 
 
 History
@@ -75,6 +100,7 @@
 v0.0.6 - Improvement to how sub-jobs are run (should be faster)
 v0.0.7 - Change SignalP default truncation from 60 to 70 to match
          the SignalP webservice.
+v0.0.8 - Added WoLF PSORT wrapper to the suite.
 
 
 Developers
@@ -89,11 +115,11 @@
 For making the "Galaxy Tool Shed" http://community.g2.bx.psu.edu/ tarball use
 the following command from the Galaxy root folder:
 
-tar -czf tmhmm_and_signalp.tar.gz tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/suite_config.xml tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.xml tools/protein_analysis/signalp3.py tools/protein_analysis/tmhmm2.xml tools/protein_analysis/tmhmm2.py test-data/four_human_proteins.* test-data/empty.fasta test-data/empty_tmhmm2.tabular test-data/empty_signalp3.tabular
+tar -czf tmhmm_signalp_wolfpsort.tar.gz tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/suite_config.xml tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.xml tools/protein_analysis/signalp3.py tools/protein_analysis/tmhmm2.xml tools/protein_analysis/tmhmm2.py tools/protein_analysis/wolf_psort.xml tools/protein_analysis/wolf_psort.py test-data/four_human_proteins.* test-data/empty.fasta test-data/empty_tmhmm2.tabular test-data/empty_signalp3.tabular
 
 Check this worked:
 
-$ tar -tzf tmhmm_and_signalp.tar.gz
+$ tar -tzf tmhmm_signalp_wolfpsort.tar.gz
 tools/protein_analysis/LICENSE
 tools/protein_analysis/README
 tools/protein_analysis/suite_config.xml
@@ -102,6 +128,8 @@
 tools/protein_analysis/signalp3.py
 tools/protein_analysis/tmhmm2.xml
 tools/protein_analysis/tmhmm2.py
+tools/protein_analysis/wolf_psort.xml
+tools/protein_analysis/wolf_psort.py
 test-data/four_human_proteins.fasta
 test-data/four_human_proteins.signalp3.tabular
 test-data/four_human_proteins.tmhmm2.tabular
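The README's new Requirements section asks that "signalp", "tmhmm" and "runWolfPsortSummary" all be on the PATH before the wrappers are installed. A minimal sketch of that check follows. It is not part of this changeset: it assumes Python 3 (the wrappers themselves target Python 2.4), and the helper name `missing_tools` is hypothetical.

```python
import shutil

# Executable names taken from the README's Requirements section.
REQUIRED_TOOLS = ("signalp", "tmhmm", "runWolfPsortSummary")


def missing_tools(names=REQUIRED_TOOLS):
    """Return the names that cannot be found on the current PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on PATH: " + ", ".join(missing))
    else:
        print("All wrapped tools were found on PATH")
```

Run it as the account Galaxy runs under, since that account's PATH is the one the wrappers will see. It only confirms the executables can be found, not that they work.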
--- a/tools/protein_analysis/signalp3.py Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/signalp3.py Tue Jun 07 17:40:55 2011 -0400
@@ -1,7 +1,7 @@
 #!/usr/bin/env python
 """Wrapper for SignalP v3.0 for use in Galaxy.
 
-This script takes exactly fives command line arguments:
+This script takes exactly five command line arguments:
  * the organism type (euk, gram+ or gram-)
  * length to truncate sequences to (integer)
  * number of threads to use (integer)
--- a/tools/protein_analysis/suite_config.xml Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/suite_config.xml Tue Jun 07 17:40:55 2011 -0400
@@ -1,9 +1,12 @@
- <suite id="tmhmm_and_signalp" name="TMHMM and SignalP" version="0.0.7">
-    <description>Wrappers for TMHMM and SignalP</description>
+ <suite id="tmhmm_and_signalp" name="TMHMM, SignalP, WoLF PSORT" version="0.0.8">
+    <description>Wrappers for TMHMM, SignalP and WoLF PSORT</description>
     <tool id="tmhmm2" name="TMHMM 2.0" version="0.0.6">
         <description>Find transmembrane domains in protein sequences</description>
     </tool>
     <tool id="signalp3" name="SignalP 3.0" version="0.0.7">
         <description>Find signal peptides in protein sequences</description>
     </tool>
+    <tool id="wolf_psort" name="WoLF PSORT" version="0.0.1">
+        <description>Eukaryote protein subcellular localization prediction</description>
+    </tool>
  </suite>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/protein_analysis/wolf_psort.py Tue Jun 07 17:40:55 2011 -0400
@@ -0,0 +1,128 @@
+#!/usr/bin/env python
+"""Wrapper for WoLF PSORT v0.2 for use in Galaxy.
+
+This script takes exactly four command line arguments:
+ * the organism type (animal, plant or fungi)
+ * number of threads to use (integer)
+ * an input protein FASTA filename
+ * output tabular filename.
+
+It then calls the standalone WoLF PSORT v0.2 program runWolfPsortSummary
+(not the webservice), and coverts the output from something like this:
+
+# k used for kNN is: 27
+gi|301087619|ref|XP_002894699.1| extr 12, mito 4, E.R. 3, golg 3, mito_nucl 3
+gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2
+
+In order to make it easier to use in Galaxy, this wrapper script reformats
+this to use tab separators, with one line per compartment prediction:
+
+#ID Compartment Score Rank
+gi|301087619|ref|XP_002894699.1| extr 12 1
+gi|301087619|ref|XP_002894699.1| mito 4 2
+gi|301087619|ref|XP_002894699.1| E.R. 3 3
+gi|301087619|ref|XP_002894699.1| golg 3 4
+gi|301087619|ref|XP_002894699.1| mito_nucl 3 5
+gi|301087623|ref|XP_002894700.1| extr 21 1
+gi|301087623|ref|XP_002894700.1| mito 2 2
+gi|301087623|ref|XP_002894700.1| cyto 2 3
+gi|301087623|ref|XP_002894700.1| cyto_mito 2 4
+
+Additionally in order to take full advantage of multiple cores, by subdividing
+the input FASTA file multiple copies of WoLF PSORT are run in parallel. I would
+normally use Python's multiprocessing library in this situation but it requires
+at least Python 2.6 and at the time of writing Galaxy still supports Python 2.4.
+"""
+import sys
+import os
+from seq_analysis_utils import stop_err, split_fasta, run_jobs
+
+FASTA_CHUNK = 500
+exe = "runWolfPsortSummary"
+
+"""
+Note: I had trouble getting runWolfPsortSummary on the path, so used a wrapper
+python script called runWolfPsortSummary as follows:
+
+#!/usr/bin/env python
+#Wrapper script to call WoLF PSORT from its own directory.
+import os
+import sys
+import subprocess
+saved_dir = os.path.abspath(os.curdir)
+os.chdir("/opt/WoLFPSORT_package_v0.2/bin")
+args = ["./runWolfPsortSummary"] + sys.argv[1:]
+return_code = subprocess.call(args)
+os.chdir(saved_dir)
+sys.exit(return_code)
+"""
+
+if len(sys.argv) != 5:
+    stop_err("Require four arguments, organism, threads, input protein FASTA file & output tabular file")
+
+organism = sys.argv[1]
+if organism not in ["animal", "plant", "fungi"]:
+    stop_err("Organism argument %s is not one of animal, plant, fungi" % organism)
+
+try:
+    num_threads = int(sys.argv[2])
+except:
+    num_threads = 0
+if num_threads < 1:
+    stop_err("Threads argument %s is not a positive integer" % sys.argv[3])
+
+fasta_file = sys.argv[3]
+
+tabular_file = sys.argv[4]
+
+def clean_tabular(raw_handle, out_handle):
+    """Clean up WoLF PSORT output to make it tabular."""
+    for line in raw_handle:
+        if not line or line.startswith("#"):
+            continue
+        name, data = line.rstrip("\r\n").split(None,1)
+        for rank, comp_data in enumerate(data.split(",")):
+            comp, score = comp_data.split()
+            out_handle.write("%s\t%s\t%s\t%i\n" \
+                             % (name, comp, score, rank+1))
+
+fasta_files = split_fasta(fasta_file, tabular_file, n=FASTA_CHUNK)
+temp_files = [f+".out" for f in fasta_files]
+assert len(fasta_files) == len(temp_files)
+jobs = ["%s %s < %s > %s" % (exe, organism, fasta, temp)
+        for (fasta, temp) in zip(fasta_files, temp_files)]
+assert len(fasta_files) == len(temp_files) == len(jobs)
+
+def clean_up(file_list):
+    for f in file_list:
+        if os.path.isfile(f):
+            os.remove(f)
+
+if len(jobs) > 1 and num_threads > 1:
+    #A small "info" message for Galaxy to show the user.
+    print "Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs))
+results = run_jobs(jobs, num_threads)
+assert len(fasta_files) == len(temp_files) == len(jobs)
+for fasta, temp, cmd in zip(fasta_files, temp_files, jobs):
+    error_level = results[cmd]
+    try:
+        output = open(temp).readline()
+    except IOError:
+        output = ""
+    if error_level or output.lower().startswith("error running"):
+        clean_up(fasta_files)
+        clean_up(temp_files)
+        stop_err("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output),
+                 error_level)
+del results
+
+out_handle = open(tabular_file, "w")
+out_handle.write("#ID\tCompartment\tScore\tRank\n")
+for temp in temp_files:
+    data_handle = open(temp)
+    clean_tabular(data_handle, out_handle)
+    data_handle.close()
+out_handle.close()
+
+clean_up(fasta_files)
+clean_up(temp_files)
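The reformatting step described in the wolf_psort.py docstring can be tried on its own, without WoLF PSORT installed. The sketch below is a standalone Python 3 adaptation of `clean_tabular`, not the committed code. It differs in two labelled ways: it strips each line before testing for emptiness, and it uses `enumerate(..., start=1)` for the rank. The sample input is one of the docstring's example lines.

```python
import io


def clean_tabular(raw_handle, out_handle):
    """Write one tab-separated row per compartment prediction."""
    for line in raw_handle:
        # Assumption: strip first so blank lines are actually skipped.
        line = line.strip()
        # Skip blank lines and comments such as "# k used for kNN is: 27".
        if not line or line.startswith("#"):
            continue
        # The identifier is the first whitespace-delimited field.
        name, data = line.split(None, 1)
        # The remainder is a comma-separated list of "compartment score".
        for rank, comp_data in enumerate(data.split(","), start=1):
            comp, score = comp_data.split()
            out_handle.write("%s\t%s\t%s\t%i\n" % (name, comp, score, rank))


raw = io.StringIO(
    "# k used for kNN is: 27\n"
    "gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2\n"
)
out = io.StringIO()
clean_tabular(raw, out)
print(out.getvalue(), end="")
```

The score is passed through as text rather than converted to a number, as in the wrapper, so whatever the tool prints is preserved unchanged.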
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/protein_analysis/wolf_psort.xml Tue Jun 07 17:40:55 2011 -0400
@@ -0,0 +1,102 @@
+<tool id="wolf_psort" name="WoLF PSORT" version="0.0.1">
+    <description>Eukaryote protein subcellular localization prediction</description>
+    <command interpreter="python">
+        wolf_psort.py $organism 8 $fasta_file $tabular_file
+        ##I want the number of threads to be a Galaxy config option...
+    </command>
+    <inputs>
+        <param name="fasta_file" type="data" format="fasta" label="FASTA file of protein sequences"/>
+        <param name="organism" type="select" display="radio" label="Organism">
+            <option value="animal">Animal</option>
+            <option value="plant">Plant</option>
+            <option value="fungi">Fungi</option>
+        </param>
+    </inputs>
+    <outputs>
+        <data name="tabular_file" format="tabular" label="WoLF PSORT $organism results" />
+    </outputs>
+    <requirements>
+        <requirement type="binary">runWolfPsortSummary</requirement>
+    </requirements>
+    <help>
+
+**What it does**
+
+This calls the WoLF PSORT tool for prediction of eukaryote protein subcellular localization.
+
+The input is a FASTA file of protein sequences, and the output is tabular with four columns (multiple rows per protein):
+
+ * Sequence identifier
+ * Compartment
+ * Score
+ * Prediction rank
+
+
+**Localization Compartments**
+
+The table below gives the WoLF PSORT localization site definitions, and the corresponding Gene Ontology (GO) term.
+
+====== ===================== =====================
+Abbrev Localization Site     GO Cellular Component
+------ --------------------- ---------------------
+chlo   chloroplast           0009507, 0009543
+cyto   cytosol               0005829
+cysk   cytoskeleton          0005856(2)
+E.R.   endoplasmic reticulum 0005783
+extr   extracellular         0005576, 0005618
+golg   Golgi apparatus       0005794(1)
+lyso   lysosome              0005764
+mito   mitochondria          0005739
+nucl   nuclear               0005634
+pero   peroxisome            0005777(2)
+plas   plasma membrane       0005886
+vacu   vacuolar membrane     0005774(2)
+====== ===================== =====================
+
+Additionally compound predictions like mito_nucl are also given.
+
+
+**Notes**
+
+The raw output from WoLF PSORT looks like this (space separated), showing two proteins:
+
+================================ ============================================
+gi|301087619|ref|XP_002894699.1| extr 12, mito 4, E.R. 3, golg 3, mito_nucl 3
+gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2
+================================ ============================================
+
+This is reformatted into a tabular file as follows for use in Galaxy:
+
+================================ =========== ===== ====
+#ID                              Compartment Score Rank
+-------------------------------- ----------- ----- ----
+gi|301087619|ref|XP_002894699.1| extr        12    1
+gi|301087619|ref|XP_002894699.1| mito        4     2
+gi|301087619|ref|XP_002894699.1| E.R.        3     3
+gi|301087619|ref|XP_002894699.1| golg        3     4
+gi|301087619|ref|XP_002894699.1| mito_nucl   3     5
+gi|301087623|ref|XP_002894700.1| extr        21    1
+gi|301087623|ref|XP_002894700.1| mito        2     2
+gi|301087623|ref|XP_002894700.1| cyto        2     3
+gi|301087623|ref|XP_002894700.1| cyto_mito   2     4
+================================ =========== ===== ====
+
+This way you can easily filter for things like having a top prediction for
+mitochondria (c2=='mito' and c4==1), or extracellular with a score of at
+least 10 (c2=='extr' and 10<=c3), and so on.
+
+
+**References**
+
+Paul Horton, Keun-Joon Park, Takeshi Obayashi, Naoya Fujita, Hajime Harada, C.J. Adams-Collier, and Kenta Nakai,
+WoLF PSORT: Protein Localization Predictor.
+Nucleic Acids Research, 35(S2), W585-W587, doi:10.1093/nar/gkm259, 2007.
+
+Paul Horton, Keun-Joon Park, Takeshi Obayashi and Kenta Nakai.
+Protein Subcellular Localization Prediction with WoLF PSORT.
+Proceedings of the 4th Annual Asia Pacific Bioinformatics Conference APBC06, Taipei, Taiwan. pp. 39-48, 2006.
+
+http://wolfpsort.org
+
+    </help>
+</tool>
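The help text suggests filters such as `c2=='extr' and 10<=c3` on the four-column output. Outside Galaxy, the same filter could be sketched in Python 3 as below. This is illustrative only: `filter_rows` and its parameters are hypothetical names, the sample rows are taken from the help text, and scores are parsed as floats on the assumption that they are not always whole numbers.

```python
import io

# A few rows from the help text, joined with tabs as in the wrapper's output.
TABULAR = (
    "#ID\tCompartment\tScore\tRank\n"
    "gi|301087619|ref|XP_002894699.1|\textr\t12\t1\n"
    "gi|301087619|ref|XP_002894699.1|\tmito\t4\t2\n"
    "gi|301087623|ref|XP_002894700.1|\textr\t21\t1\n"
    "gi|301087623|ref|XP_002894700.1|\tmito\t2\t2\n"
)


def filter_rows(handle, compartment, min_score=0.0, max_rank=None):
    """Return (id, compartment, score, rank) tuples matching the filter."""
    hits = []
    for line in handle:
        # Skip the "#ID ..." header line and any blank lines.
        if not line.strip() or line.startswith("#"):
            continue
        seq_id, comp, score, rank = line.rstrip("\r\n").split("\t")
        score, rank = float(score), int(rank)
        if comp != compartment or score < min_score:
            continue
        if max_rank is not None and rank > max_rank:
            continue
        hits.append((seq_id, comp, score, rank))
    return hits


# Extracellular with a score of at least 10 (c2=='extr' and 10<=c3).
for hit in filter_rows(io.StringIO(TABULAR), "extr", min_score=10):
    print(hit)
```

Passing `max_rank=1` with `compartment="mito"` corresponds to the help text's "top prediction for mitochondria" filter (c2=='mito' and c4==1). None of the sample rows above match it.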