# HG changeset patch
# User peterjc
# Date 1307482855 14400
# Node ID ef7ceca37e3f2388b562e8034f0800650364fcda
# Parent 1426b2bae76df0b26e74cecb25a033db5e30c9be
Migrated tool version 0.0.8 from old tool shed archive to new tool shed repository

diff -r 1426b2bae76d -r ef7ceca37e3f tools/protein_analysis/LICENSE
--- a/tools/protein_analysis/LICENSE Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/LICENSE Tue Jun 07 17:40:55 2011 -0400
@@ -1,7 +1,8 @@
-Copyright (c) 2010 Peter Cock, SCRI, UK
+Copyright (c) 2010-2011 Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK.
 
-License for TMHMM 2.0 and SignalP 3.0 wrappers for Galaxy (note that
-TMHMM 2.0 and SignalP 3.0 are copyright and licensed separately).
+License for TMHMM 2.0, SignalP 3.0, and WoLF PSORT wrappers for Galaxy
+(note that tools themselves are copyright and licensed separately).
 
 Permission to use, copy, modify, and distribute this software and its
 documentation with or without modifications and for any purpose and
diff -r 1426b2bae76d -r ef7ceca37e3f tools/protein_analysis/README
--- a/tools/protein_analysis/README Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/README Tue Jun 07 17:40:55 2011 -0400
@@ -1,45 +1,70 @@
-This package contains Galaxy wrappers for two standalone command line protein
-analysis tools (SignalP 3.0 and THMHMM 2.0) from the Center for Biological
-Sequence Analysis at the Technical University of Denmark,
-http://www.cbs.dtu.dk/cbs/
+This package contains Galaxy wrappers for a selection of standalone command
+line protein analysis tools:
+
+* SignalP 3.0 and TMHMM 2.0, from the Center for Biological
+  Sequence Analysis at the Technical University of Denmark,
+  http://www.cbs.dtu.dk/cbs/
+
+* WoLF PSORT v0.2 from http://wolfpsort.org/
+
+To use these Galaxy wrappers you must first install the command line tools.
+At the time of writing they are all free for academic use.
+
+These wrappers are copyright 2010-2011 by Peter Cock, James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the included LICENSE file for details.
 
-To use these Galaxy wrappers you must first install the CBS command line
-tools. At the time of writing both SignalP 3.0 and TMHMM 2.0 are free for
-academic use.
+Requirements
+============
+
+First install those command line tools you wish to use the wrappers for:
+
+1. Install the command line version of SignalP 3.0 and ensure "signalp" is
+   on the PATH, see: http://www.cbs.dtu.dk/services/SignalP/
 
-These wrappers are copyright 2010 by Peter Cock, SCRI, UK. All rights
-reserved. See the included LICENCE file for details.
+2. Install the command line version of TMHMM 2.0 and ensure "tmhmm" is on
+   the PATH, see: http://www.cbs.dtu.dk/services/TMHMM/
+
+3. Install the WoLF PSORT v0.2 package, and ensure "runWolfPsortSummary"
+   is on the PATH (we use an extra wrapper script to change to the WoLF PSORT
+   directory, run runWolfPsortSummary, and then change back to the original
+   directory), see: http://wolfpsort.org/WoLFPSORT_package/version0.2/
+
+Verify each of the tools is installed and working from the command line
+(when logged in as the Galaxy user if appropriate).
 
 Installation
 ============
 
-1. Install the command line version of SignalP 3.0 and ensure it is on the
-   PATH, see: http://www.cbs.dtu.dk/services/SignalP/
+1. Create a folder tools/protein_analysis under your Galaxy installation.
 
-2. Install the command line version of TMHMM 2.0 and ensure it is on the
-   PATH, see: http://www.cbs.dtu.dk/services/TMHMM/
-
-3. Create a folder tools/protein_analysis under your Galaxy installation.
-
-4. Copy/move the following files (from this archive) there:
+2. Copy/move the following files (from this archive) there:
 
 tmhmm2.xml (Galaxy tool definition)
 tmhmm2.py (Python wrapper script)
+
 signalp3.xml (Galaxy tool definition)
 signalp3.py (Python wrapper script)
+
+wolf_psort.xml (Galaxy tool definition)
+wolf_psort.py (Python wrapper script)
+
 seq_analysis_utils.py (shared Python code)
 README (optional)
 
-5. Edit your Galaxy conjuration file tool_conf.xml (to use the tools) AND
-   also tool_conf.xml.sample (to run the tests) to include the two new tools
+3. Edit your Galaxy configuration file tool_conf.xml (to use the tools) AND
+   also tool_conf.xml.sample (to run the tests) to include the new tools
    by adding:
+
-6. Copy/move the following test files (from these archive) to Galaxy
+   Leave out the lines for any tools you do not wish to use in Galaxy.
+
+4. Copy/move the following test files (from this archive) to Galaxy
    subfolder test-data:
 
 four_human_proteins.fasta
@@ -49,7 +74,7 @@
 empty_tmhmm2.tabular
 empty_signalp3.tabular
 
-7. Run the Galaxy functional tests for these new wrappers with:
+5. Run the Galaxy functional tests for these new wrappers with:
 
 ./run_functional_tests.sh -id tmhmm2
 ./run_functional_tests.sh -id signalp3
@@ -59,7 +84,7 @@
 
 ./run_functional_tests.sh -sid Protein_sequence_analysis-protein_analysis
 
-8. Restart Galaxy and check the new tools are shown and work.
+6. Restart Galaxy and check the new tools are shown and work.
 
 
 History
@@ -75,6 +100,7 @@
 v0.0.6 - Improvement to how sub-jobs are run (should be faster)
 v0.0.7 - Change SignalP default truncation from 60 to 70 to match the
          SignalP webservice.
+v0.0.8 - Added WoLF PSORT wrapper to the suite.
 
 
 Developers
@@ -89,11 +115,11 @@
 For making the "Galaxy Tool Shed" http://community.g2.bx.psu.edu/ tarball use
 the following command from the Galaxy root folder:
 
-tar -czf tmhmm_and_signalp.tar.gz tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/suite_config.xml tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.xml tools/protein_analysis/signalp3.py tools/protein_analysis/tmhmm2.xml tools/protein_analysis/tmhmm2.py test-data/four_human_proteins.* test-data/empty.fasta test-data/empty_tmhmm2.tabular test-data/empty_signalp3.tabular
+tar -czf tmhmm_signalp_wolfpsort.tar.gz tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/suite_config.xml tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.xml tools/protein_analysis/signalp3.py tools/protein_analysis/tmhmm2.xml tools/protein_analysis/tmhmm2.py tools/protein_analysis/wolf_psort.xml tools/protein_analysis/wolf_psort.py test-data/four_human_proteins.* test-data/empty.fasta test-data/empty_tmhmm2.tabular test-data/empty_signalp3.tabular
 
 Check this worked:
 
-$ tar -tzf tmhmm_and_signalp.tar.gz
+$ tar -tzf tmhmm_signalp_wolfpsort.tar.gz
 tools/protein_analysis/LICENSE
 tools/protein_analysis/README
 tools/protein_analysis/suite_config.xml
@@ -102,6 +128,8 @@
 tools/protein_analysis/signalp3.py
 tools/protein_analysis/tmhmm2.xml
 tools/protein_analysis/tmhmm2.py
+tools/protein_analysis/wolf_psort.xml
+tools/protein_analysis/wolf_psort.py
 test-data/four_human_proteins.fasta
 test-data/four_human_proteins.signalp3.tabular
 test-data/four_human_proteins.tmhmm2.tabular
diff -r 1426b2bae76d -r ef7ceca37e3f tools/protein_analysis/signalp3.py
--- a/tools/protein_analysis/signalp3.py Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/signalp3.py Tue Jun 07 17:40:55 2011 -0400
@@ -1,7 +1,7 @@
 #!/usr/bin/env python
 """Wrapper for SignalP v3.0 for use in Galaxy.
 
-This script takes exactly fives command line arguments:
+This script takes exactly five command line arguments:
  * the organism type (euk, gram+ or gram-)
  * length to truncate sequences to (integer)
  * number of threads to use (integer)
diff -r 1426b2bae76d -r ef7ceca37e3f tools/protein_analysis/suite_config.xml
--- a/tools/protein_analysis/suite_config.xml Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/suite_config.xml Tue Jun 07 17:40:55 2011 -0400
@@ -1,9 +1,12 @@
-
- Wrappers for TMHMM and SignalP
+
+ Wrappers for TMHMM, SignalP and WoLF PSORT
 Find transmembrane domains in protein sequences
 Find signal peptides in protein sequences
+
+ Eukaryote protein subcellular localization prediction
+
+
diff -r 1426b2bae76d -r ef7ceca37e3f tools/protein_analysis/wolf_psort.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/protein_analysis/wolf_psort.py Tue Jun 07 17:40:55 2011 -0400
@@ -0,0 +1,128 @@
+#!/usr/bin/env python
+"""Wrapper for WoLF PSORT v0.2 for use in Galaxy.
+
+This script takes exactly four command line arguments:
+ * the organism type (animal, plant or fungi)
+ * number of threads to use (integer)
+ * an input protein FASTA filename
+ * output tabular filename.
+
+It then calls the standalone WoLF PSORT v0.2 program runWolfPsortSummary
+(not the webservice), and converts the output from something like this:
+
+# k used for kNN is: 27
+gi|301087619|ref|XP_002894699.1| extr 12, mito 4, E.R. 3, golg 3, mito_nucl 3
+gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2
+
+In order to make it easier to use in Galaxy, this wrapper script reformats
+this to use tab separators, with one line per compartment prediction:
+
+#ID Compartment Score Rank
+gi|301087619|ref|XP_002894699.1| extr 12 1
+gi|301087619|ref|XP_002894699.1| mito 4 2
+gi|301087619|ref|XP_002894699.1| E.R. 3 3
+gi|301087619|ref|XP_002894699.1| golg 3 4
+gi|301087619|ref|XP_002894699.1| mito_nucl 3 5
+gi|301087623|ref|XP_002894700.1| extr 21 1
+gi|301087623|ref|XP_002894700.1| mito 2 2
+gi|301087623|ref|XP_002894700.1| cyto 2 3
+gi|301087623|ref|XP_002894700.1| cyto_mito 2 4
+
+Additionally, to take full advantage of multiple cores, the input FASTA file
+is subdivided and multiple copies of WoLF PSORT are run in parallel. I would
+normally use Python's multiprocessing library in this situation but it requires
+at least Python 2.6 and at the time of writing Galaxy still supports Python 2.4.
+"""
+import sys
+import os
+from seq_analysis_utils import stop_err, split_fasta, run_jobs
+
+FASTA_CHUNK = 500
+exe = "runWolfPsortSummary"
+
+"""
+Note: I had trouble getting runWolfPsortSummary on the path, so used a wrapper
+python script called runWolfPsortSummary as follows:
+
+#!/usr/bin/env python
+#Wrapper script to call WoLF PSORT from its own directory.
+import os
+import sys
+import subprocess
+saved_dir = os.path.abspath(os.curdir)
+os.chdir("/opt/WoLFPSORT_package_v0.2/bin")
+args = ["./runWolfPsortSummary"] + sys.argv[1:]
+return_code = subprocess.call(args)
+os.chdir(saved_dir)
+sys.exit(return_code)
+"""
+
+if len(sys.argv) != 5:
+    stop_err("Require four arguments, organism, threads, input protein FASTA file & output tabular file")
+
+organism = sys.argv[1]
+if organism not in ["animal", "plant", "fungi"]:
+    stop_err("Organism argument %s is not one of animal, plant, fungi" % organism)
+
+try:
+    num_threads = int(sys.argv[2])
+except:
+    num_threads = 0
+if num_threads < 1:
+    stop_err("Threads argument %s is not a positive integer" % sys.argv[2])
+
+fasta_file = sys.argv[3]
+
+tabular_file = sys.argv[4]
+
+def clean_tabular(raw_handle, out_handle):
+    """Clean up WoLF PSORT output to make it tabular."""
+    for line in raw_handle:
+        if not line or line.startswith("#"):
+            continue
+        name, data = line.rstrip("\r\n").split(None,1)
+        for rank, comp_data in enumerate(data.split(",")):
+            comp, score = comp_data.split()
+            out_handle.write("%s\t%s\t%s\t%i\n" \
+                             % (name, comp, score, rank+1))
+
+fasta_files = split_fasta(fasta_file, tabular_file, n=FASTA_CHUNK)
+temp_files = [f+".out" for f in fasta_files]
+assert len(fasta_files) == len(temp_files)
+jobs = ["%s %s < %s > %s" % (exe, organism, fasta, temp)
+        for (fasta, temp) in zip(fasta_files, temp_files)]
+assert len(fasta_files) == len(temp_files) == len(jobs)
+
+def clean_up(file_list):
+    for f in file_list:
+        if os.path.isfile(f):
+            os.remove(f)
+
+if len(jobs) > 1 and num_threads > 1:
+    #A small "info" message for Galaxy to show the user.
+    print "Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs))
+results = run_jobs(jobs, num_threads)
+assert len(fasta_files) == len(temp_files) == len(jobs)
+for fasta, temp, cmd in zip(fasta_files, temp_files, jobs):
+    error_level = results[cmd]
+    try:
+        output = open(temp).readline()
+    except IOError:
+        output = ""
+    if error_level or output.lower().startswith("error running"):
+        clean_up(fasta_files)
+        clean_up(temp_files)
+        stop_err("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output),
+                 error_level)
+del results
+
+out_handle = open(tabular_file, "w")
+out_handle.write("#ID\tCompartment\tScore\tRank\n")
+for temp in temp_files:
+    data_handle = open(temp)
+    clean_tabular(data_handle, out_handle)
+    data_handle.close()
+out_handle.close()
+
+clean_up(fasta_files)
+clean_up(temp_files)
diff -r 1426b2bae76d -r ef7ceca37e3f tools/protein_analysis/wolf_psort.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/protein_analysis/wolf_psort.xml Tue Jun 07 17:40:55 2011 -0400
@@ -0,0 +1,102 @@
+ Eukaryote protein subcellular localization prediction
+
+ wolf_psort.py $organism 8 $fasta_file $tabular_file
+ ##I want the number of threads to be a Galaxy config option...
+
+
+
+
+
+
+
+
+
+
+
+
+
+ runWolfPsortSummary
+
+
+
+**What it does**
+
+This calls the WoLF PSORT tool for prediction of eukaryote protein subcellular localization.
+
+The input is a FASTA file of protein sequences, and the output is tabular with four columns (multiple rows per protein):
+
+ * Sequence identifier
+ * Compartment
+ * Score
+ * Prediction rank
+
+
+**Localization Compartments**
+
+The table below gives the WoLF PSORT localization site definitions, and the corresponding Gene Ontology (GO) term.
+
+====== ===================== =====================
+Abbrev Localization Site     GO Cellular Component
+------ --------------------- ---------------------
+chlo   chloroplast           0009507, 0009543
+cyto   cytosol               0005829
+cysk   cytoskeleton          0005856(2)
+E.R.   endoplasmic reticulum 0005783
+extr   extracellular         0005576, 0005618
+golg   Golgi apparatus       0005794(1)
+lyso   lysosome              0005764
+mito   mitochondria          0005739
+nucl   nuclear               0005634
+pero   peroxisome            0005777(2)
+plas   plasma membrane       0005886
+vacu   vacuolar membrane     0005774(2)
+====== ===================== =====================
+
+Additionally compound predictions like mito_nucl are also given.
+
+
+**Notes**
+
+The raw output from WoLF PSORT looks like this (space separated), showing two proteins:
+
+================================ ============================================
+gi|301087619|ref|XP_002894699.1| extr 12, mito 4, E.R. 3, golg 3, mito_nucl 3
+gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2
+================================ ============================================
+
+This is reformatted into a tabular file as follows for use in Galaxy:
+
+================================ =========== ===== ====
+#ID                              Compartment Score Rank
+-------------------------------- ----------- ----- ----
+gi|301087619|ref|XP_002894699.1| extr        12    1
+gi|301087619|ref|XP_002894699.1| mito        4     2
+gi|301087619|ref|XP_002894699.1| E.R.        3     3
+gi|301087619|ref|XP_002894699.1| golg        3     4
+gi|301087619|ref|XP_002894699.1| mito_nucl   3     5
+gi|301087623|ref|XP_002894700.1| extr        21    1
+gi|301087623|ref|XP_002894700.1| mito        2     2
+gi|301087623|ref|XP_002894700.1| cyto        2     3
+gi|301087623|ref|XP_002894700.1| cyto_mito   2     4
+================================ =========== ===== ====
+
+This way you can easily filter for things like having a top prediction for
+mitochondria (c2=='mito' and c4==1), or extracellular with a score of at
+least 10 (c2=='extr' and 10<=c3), and so on.
+
+
+**References**
+
+Paul Horton, Keun-Joon Park, Takeshi Obayashi, Naoya Fujita, Hajime Harada, C.J. Adams-Collier, and Kenta Nakai.
+WoLF PSORT: Protein Localization Predictor.
+Nucleic Acids Research, 35(S2), W585-W587, doi:10.1093/nar/gkm259, 2007.
+
+Paul Horton, Keun-Joon Park, Takeshi Obayashi and Kenta Nakai.
+Protein Subcellular Localization Prediction with WoLF PSORT.
+Proceedings of the 4th Annual Asia Pacific Bioinformatics Conference APBC06, Taipei, Taiwan. pp. 39-48, 2006.
+
+http://wolfpsort.org
+
+
+
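
As a standalone illustration, separate from the changeset itself, the reformatting that the wolf_psort.py docstring describes (one tab-separated row per compartment prediction, with a rank) can be sketched in Python 3 as below. The function name wolf_psort_to_rows and the in-memory sample list are assumptions made for this example only; the wrapper in the patch works on file handles and targets Python 2.

```python
def wolf_psort_to_rows(raw_lines):
    """Yield (identifier, compartment, score, rank) for each prediction.

    Hypothetical helper for illustration; it mirrors the parsing logic of
    clean_tabular in the patch but is not part of it.
    """
    for line in raw_lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and comment lines such as "# k used for kNN is: 27".
        if not line.strip() or line.startswith("#"):
            continue
        # The identifier is the first whitespace-delimited field.
        name, data = line.split(None, 1)
        # Predictions are comma separated; rank follows the listed order.
        for rank, comp_data in enumerate(data.split(","), start=1):
            comp, score = comp_data.split()
            # The score is kept as text, as the wrapper does.
            yield name, comp, score, rank


# Sample input copied from the docstring in the patch.
raw = [
    "# k used for kNN is: 27",
    "gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2",
]
rows = list(wolf_psort_to_rows(raw))
for row in rows:
    print("\t".join(str(field) for field in row))

# Equivalent of the Galaxy filter c2=='extr' and c4==1 from the help text.
top_extr = [r for r in rows if r[1] == "extr" and r[3] == 1]
```

Keeping one row per prediction is what makes the column filters quoted in the tool help possible, since each compartment, score and rank ends up in its own column.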