changeset 5:ef7ceca37e3f

Migrated tool version 0.0.8 from old tool shed archive to new tool shed repository
author peterjc
date Tue, 07 Jun 2011 17:40:55 -0400
parents 1426b2bae76d
children 39a6e46cdda3
files tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/signalp3.py tools/protein_analysis/suite_config.xml tools/protein_analysis/wolf_psort.py tools/protein_analysis/wolf_psort.xml
diffstat 6 files changed, 292 insertions(+), 30 deletions(-)
--- a/tools/protein_analysis/LICENSE	Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/LICENSE	Tue Jun 07 17:40:55 2011 -0400
@@ -1,7 +1,8 @@
-Copyright (c) 2010 Peter Cock, SCRI, UK
+Copyright (c) 2010-2011 Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK.
 
-License for TMHMM 2.0 and SignalP 3.0 wrappers for Galaxy (note that
-TMHMM 2.0 and SignalP 3.0 are copyright and licensed separately).
+License for TMHMM 2.0, SignalP 3.0, and WoLF PSORT wrappers for Galaxy
+(note that tools themselves are copyright and licensed separately).
 
 Permission to use, copy, modify, and distribute this software and its
 documentation with or without modifications and for any purpose and
--- a/tools/protein_analysis/README	Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/README	Tue Jun 07 17:40:55 2011 -0400
@@ -1,45 +1,70 @@
-This package contains Galaxy wrappers for two standalone command line protein
-analysis tools (SignalP 3.0 and THMHMM 2.0) from the Center for Biological
-Sequence Analysis at the Technical University of Denmark,
-http://www.cbs.dtu.dk/cbs/
+This package contains Galaxy wrappers for a selection of standalone command
+line protein analysis tools:
+
+* SignalP 3.0 and TMHMM 2.0, from the Center for Biological
+  Sequence Analysis at the Technical University of Denmark,
+  http://www.cbs.dtu.dk/cbs/
+
+* WoLF PSORT v0.2 from http://wolfpsort.org/
+
+To use these Galaxy wrappers you must first install the command line tools.
+At the time of writing they are all free for academic use.
+
+These wrappers are copyright 2010-2011 by Peter Cock, James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the included LICENSE file for details.
 
-To use these Galaxy wrappers you must first install the CBS command line
-tools. At the time of writing both SignalP 3.0 and TMHMM 2.0 are free for
-academic use.
+Requirements
+============
+
+First install those command line tools you wish to use the wrappers for:
+
+1. Install the command line version of SignalP 3.0 and ensure "signalp" is
+   on the PATH, see: http://www.cbs.dtu.dk/services/SignalP/
 
-These wrappers are copyright 2010 by Peter Cock, SCRI, UK. All rights
-reserved. See the included LICENCE file for details.
+2. Install the command line version of TMHMM 2.0 and ensure "tmhmm" is on
+   the PATH, see: http://www.cbs.dtu.dk/services/TMHMM/
+
+3. Install the WoLF PSORT v0.2 package, and ensure "runWolfPsortSummary"
+   is on the PATH (we use an extra wrapper script to change to the WoLF PSORT
+   directory, run runWolfPsortSummary, and then change back to the original
+   directory), see: http://wolfpsort.org/WoLFPSORT_package/version0.2/
+
+Verify each of the tools is installed and working from the command line
+(when logged in as the Galaxy user if appropriate).
 
 Installation
 ============
 
-1. Install the command line version of SignalP 3.0 and ensure it is on the
-   PATH, see: http://www.cbs.dtu.dk/services/SignalP/
+1. Create a folder tools/protein_analysis under your Galaxy installation.
 
-2. Install the command line version of TMHMM 2.0 and ensure it is on the
-   PATH, see: http://www.cbs.dtu.dk/services/TMHMM/
-
-3. Create a folder tools/protein_analysis under your Galaxy installation.
-
-4. Copy/move the following files (from this archive) there:
+2. Copy/move the following files (from this archive) there:
 
 tmhmm2.xml (Galaxy tool definition)
 tmhmm2.py (Python wrapper script)
+
 signalp3.xml (Galaxy tool definition)
 signalp3.py (Python wrapper script)
+
+wolf_psort.xml (Galaxy tool definition)
+wolf_psort.py (Python wrapper script)
+
 seq_analysis_utils.py (shared Python code)
 README (optional)
 
-5. Edit your Galaxy conjuration file tool_conf.xml (to use the tools) AND
-   also tool_conf.xml.sample (to run the tests) to include the two new tools
+3. Edit your Galaxy configuration file tool_conf.xml (to use the tools) AND
+   also tool_conf.xml.sample (to run the tests) to include the new tools
    by adding:
 
   <section name="Protein sequence analysis" id="protein_analysis">
     <tool file="protein_analysis/tmhmm2.xml" />
     <tool file="protein_analysis/signalp3.xml" />
+    <tool file="protein_analysis/wolf_psort.xml" />
   </section>
 
-6. Copy/move the following test files (from these archive) to Galaxy
+   Leave out the lines for any tools you do not wish to use in Galaxy.
+
+4. Copy/move the following test files (from this archive) to Galaxy
    subfolder test-data:
 
 four_human_proteins.fasta
@@ -49,7 +74,7 @@
 empty_tmhmm2.tabular
 empty_signalp3.tabular
 
-7. Run the Galaxy functional tests for these new wrappers with:
+5. Run the Galaxy functional tests for these new wrappers with:
 
 ./run_functional_tests.sh -id tmhmm2
 ./run_functional_tests.sh -id signalp3
@@ -59,7 +84,7 @@
 
 ./run_functional_tests.sh -sid Protein_sequence_analysis-protein_analysis
 
-8. Restart Galaxy and check the new tools are shown and work.
+6. Restart Galaxy and check the new tools are shown and work.
 
 
 History
@@ -75,6 +100,7 @@
 v0.0.6 - Improvement to how sub-jobs are run (should be faster)
 v0.0.7 - Change SignalP default truncation from 60 to 70 to match the
          SignalP webservice.
+v0.0.8 - Added WoLF PSORT wrapper to the suite.
 
 
 Developers
@@ -89,11 +115,11 @@
 For making the "Galaxy Tool Shed" http://community.g2.bx.psu.edu/ tarball use
 the following command from the Galaxy root folder:
 
-tar -czf tmhmm_and_signalp.tar.gz tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/suite_config.xml tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.xml tools/protein_analysis/signalp3.py tools/protein_analysis/tmhmm2.xml tools/protein_analysis/tmhmm2.py test-data/four_human_proteins.* test-data/empty.fasta test-data/empty_tmhmm2.tabular test-data/empty_signalp3.tabular
+tar -czf tmhmm_signalp_wolfpsort.tar.gz tools/protein_analysis/LICENSE tools/protein_analysis/README tools/protein_analysis/suite_config.xml tools/protein_analysis/seq_analysis_utils.py tools/protein_analysis/signalp3.xml tools/protein_analysis/signalp3.py tools/protein_analysis/tmhmm2.xml tools/protein_analysis/tmhmm2.py tools/protein_analysis/wolf_psort.xml tools/protein_analysis/wolf_psort.py test-data/four_human_proteins.* test-data/empty.fasta test-data/empty_tmhmm2.tabular test-data/empty_signalp3.tabular
 
 Check this worked:
 
-$ tar -tzf tmhmm_and_signalp.tar.gz
+$ tar -tzf tmhmm_signalp_wolfpsort.tar.gz
 tools/protein_analysis/LICENSE
 tools/protein_analysis/README
 tools/protein_analysis/suite_config.xml
@@ -102,6 +128,8 @@
 tools/protein_analysis/signalp3.py
 tools/protein_analysis/tmhmm2.xml
 tools/protein_analysis/tmhmm2.py
+tools/protein_analysis/wolf_psort.xml
+tools/protein_analysis/wolf_psort.py
 test-data/four_human_proteins.fasta
 test-data/four_human_proteins.signalp3.tabular
 test-data/four_human_proteins.tmhmm2.tabular
--- a/tools/protein_analysis/signalp3.py	Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/signalp3.py	Tue Jun 07 17:40:55 2011 -0400
@@ -1,7 +1,7 @@
 #!/usr/bin/env python
 """Wrapper for SignalP v3.0 for use in Galaxy.
 
-This script takes exactly fives command line arguments:
+This script takes exactly five command line arguments:
  * the organism type (euk, gram+ or gram-)
  * length to truncate sequences to (integer)
  * number of threads to use (integer)
--- a/tools/protein_analysis/suite_config.xml	Tue Jun 07 17:40:04 2011 -0400
+++ b/tools/protein_analysis/suite_config.xml	Tue Jun 07 17:40:55 2011 -0400
@@ -1,9 +1,12 @@
-    <suite id="tmhmm_and_signalp" name="TMHMM and SignalP" version="0.0.7">
-        <description>Wrappers for TMHMM and SignalP</description>
+    <suite id="tmhmm_and_signalp" name="TMHMM, SignalP, WoLF PSORT" version="0.0.8">
+        <description>Wrappers for TMHMM, SignalP and WoLF PSORT</description>
         <tool id="tmhmm2" name="TMHMM 2.0" version="0.0.6">
             <description>Find transmembrane domains in protein sequences</description>
         </tool>
         <tool id="signalp3" name="SignalP 3.0" version="0.0.7">
             <description>Find signal peptides in protein sequences</description>
         </tool>
+        <tool id="wolf_psort" name="WoLF PSORT" version="0.0.1">
+            <description>Eukaryote protein subcellular localization prediction</description>
+        </tool>
     </suite>
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/protein_analysis/wolf_psort.py	Tue Jun 07 17:40:55 2011 -0400
@@ -0,0 +1,128 @@
+#!/usr/bin/env python
+"""Wrapper for WoLF PSORT v0.2 for use in Galaxy.
+
+This script takes exactly four command line arguments:
+ * the organism type (animal, plant or fungi)
+ * number of threads to use (integer)
+ * an input protein FASTA filename
+ * output tabular filename.
+
+It then calls the standalone WoLF PSORT v0.2 program runWolfPsortSummary
+(not the webservice), and converts the output from something like this:
+
+# k used for kNN is: 27
+gi|301087619|ref|XP_002894699.1| extr 12, mito 4, E.R. 3, golg 3, mito_nucl 3
+gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2
+
+In order to make it easier to use in Galaxy, this wrapper script reformats
+this to use tab separators, with one line per compartment prediction:
+
+#ID	Compartment	Score	Rank
+gi|301087619|ref|XP_002894699.1|	extr	12	1
+gi|301087619|ref|XP_002894699.1|	mito	4	2
+gi|301087619|ref|XP_002894699.1|	E.R.	3	3
+gi|301087619|ref|XP_002894699.1|	golg	3	4
+gi|301087619|ref|XP_002894699.1|	mito_nucl	3	5
+gi|301087623|ref|XP_002894700.1|	extr	21	1
+gi|301087623|ref|XP_002894700.1|	mito	2	2
+gi|301087623|ref|XP_002894700.1|	cyto	2	3
+gi|301087623|ref|XP_002894700.1|	cyto_mito	2	4
+
+Additionally, to take full advantage of multiple cores, the input FASTA file is
+subdivided and multiple copies of WoLF PSORT are run in parallel. I would
+normally use Python's multiprocessing library in this situation but it requires
+at least Python 2.6 and at the time of writing Galaxy still supports Python 2.4.
+"""
+import sys
+import os
+from seq_analysis_utils import stop_err, split_fasta, run_jobs
+
+FASTA_CHUNK = 500
+exe = "runWolfPsortSummary"
+
+"""
+Note: I had trouble getting runWolfPsortSummary on the path, so used a wrapper
+python script called runWolfPsortSummary as follows:
+
+#!/usr/bin/env python
+#Wrapper script to call WoLF PSORT from its own directory.
+import os
+import sys
+import subprocess
+saved_dir = os.path.abspath(os.curdir)
+os.chdir("/opt/WoLFPSORT_package_v0.2/bin")
+args = ["./runWolfPsortSummary"] + sys.argv[1:]
+return_code = subprocess.call(args)
+os.chdir(saved_dir)
+sys.exit(return_code)
+"""
+
+if len(sys.argv) != 5:
+   stop_err("Require four arguments, organism, threads, input protein FASTA file & output tabular file")
+
+organism = sys.argv[1]
+if organism not in ["animal", "plant", "fungi"]:
+   stop_err("Organism argument %s is not one of animal, plant, fungi" % organism)
+
+try:
+   num_threads = int(sys.argv[2])
+except ValueError:
+   num_threads = 0
+if num_threads < 1:
+   stop_err("Threads argument %s is not a positive integer" % sys.argv[2])
+
+fasta_file = sys.argv[3]
+
+tabular_file = sys.argv[4]
+
+def clean_tabular(raw_handle, out_handle):
+    """Clean up WoLF PSORT output to make it tabular."""
+    for line in raw_handle:
+        if not line.strip() or line.startswith("#"):
+            continue
+        name, data = line.rstrip("\r\n").split(None,1)
+        for rank, comp_data in enumerate(data.split(",")):
+            comp, score = comp_data.split()
+            out_handle.write("%s\t%s\t%s\t%i\n" \
+                             % (name, comp, score, rank+1))
+
+fasta_files = split_fasta(fasta_file, tabular_file, n=FASTA_CHUNK)
+temp_files = [f+".out" for f in fasta_files]
+assert len(fasta_files) == len(temp_files)
+jobs = ["%s %s < %s > %s" % (exe, organism, fasta, temp)
+        for (fasta, temp) in zip(fasta_files, temp_files)]
+assert len(fasta_files) == len(temp_files) == len(jobs)
+
+def clean_up(file_list):
+    for f in file_list:
+        if os.path.isfile(f):
+            os.remove(f)
+
+if len(jobs) > 1 and num_threads > 1:
+    #A small "info" message for Galaxy to show the user.
+    print "Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs))
+results = run_jobs(jobs, num_threads)
+assert len(fasta_files) == len(temp_files) == len(jobs)
+for fasta, temp, cmd in zip(fasta_files, temp_files, jobs):
+    error_level = results[cmd]
+    try:
+        output = open(temp).readline()
+    except IOError:
+        output = ""
+    if error_level or output.lower().startswith("error running"):
+        clean_up(fasta_files)
+        clean_up(temp_files)
+        stop_err("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output),
+                 error_level)
+del results
+
+out_handle = open(tabular_file, "w")
+out_handle.write("#ID\tCompartment\tScore\tRank\n")
+for temp in temp_files:
+    data_handle = open(temp)
+    clean_tabular(data_handle, out_handle)
+    data_handle.close()
+out_handle.close()
+
+clean_up(fasta_files)
+clean_up(temp_files)
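
The reformatting step in wolf_psort.py above can be tried in isolation. The following is a minimal sketch, not part of the changeset: it is written for Python 3 (the wrapper itself targets Python 2), it uses in-memory handles in place of the temporary files, and the sample input is one of the lines quoted in the wrapper's docstring.

```python
import io


def clean_tabular(raw_handle, out_handle):
    """Write one tab-separated row per compartment prediction."""
    for line in raw_handle:
        line = line.strip()
        # Skip blank lines and comments such as "# k used for kNN is: 27".
        if not line or line.startswith("#"):
            continue
        # The identifier is everything up to the first run of whitespace.
        name, data = line.split(None, 1)
        # Predictions are comma separated, already ordered best first.
        for rank, comp_data in enumerate(data.split(","), start=1):
            comp, score = comp_data.split()
            out_handle.write("%s\t%s\t%s\t%i\n" % (name, comp, score, rank))


raw = io.StringIO(
    "# k used for kNN is: 27\n"
    "gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2\n"
)
out = io.StringIO()
clean_tabular(raw, out)
result = out.getvalue()
```

Each input line becomes as many output rows as it has predictions, so the rank column is simply the position of the prediction within its line.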
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/protein_analysis/wolf_psort.xml	Tue Jun 07 17:40:55 2011 -0400
@@ -0,0 +1,102 @@
+<tool id="wolf_psort" name="WoLF PSORT" version="0.0.1">
+    <description>Eukaryote protein subcellular localization prediction</description>
+    <command interpreter="python">
+      wolf_psort.py $organism 8 $fasta_file $tabular_file
+      ##I want the number of threads to be a Galaxy config option...
+    </command>
+    <inputs>
+        <param name="fasta_file" type="data" format="fasta" label="FASTA file of protein sequences"/> 
+        <param name="organism" type="select" display="radio" label="Organism">
+            <option value="animal">Animal</option>
+            <option value="plant">Plant</option>
+            <option value="fungi">Fungi</option>
+        </param>
+    </inputs>
+    <outputs>
+        <data name="tabular_file" format="tabular" label="WoLF PSORT $organism results" />
+    </outputs>
+    <requirements>
+        <requirement type="binary">runWolfPsortSummary</requirement>
+    </requirements>
+    <help>
+    
+**What it does**
+
+This calls the WoLF PSORT tool for prediction of eukaryote protein subcellular localization.
+
+The input is a FASTA file of protein sequences, and the output is tabular with four columns (multiple rows per protein):
+
+ * Sequence identifier
+ * Compartment
+ * Score
+ * Prediction rank
+
+
+**Localization Compartments**
+
+The table below gives the WoLF PSORT localization site definitions, and the corresponding Gene Ontology (GO) term.
+
+====== ===================== =====================
+Abbrev Localization Site     GO Cellular Component
+------ --------------------- ---------------------
+chlo   chloroplast           0009507, 0009543
+cyto   cytosol               0005829
+cysk   cytoskeleton          0005856(2)
+E.R.   endoplasmic reticulum 0005783
+extr   extracellular         0005576, 0005618
+golg   Golgi apparatus       0005794(1)
+lyso   lysosome              0005764
+mito   mitochondria          0005739
+nucl   nuclear               0005634
+pero   peroxisome            0005777(2)
+plas   plasma membrane       0005886
+vacu   vacuolar membrane     0005774(2)
+====== ===================== =====================
+
+Compound predictions like mito_nucl are also given.
+
+
+**Notes**
+
+The raw output from WoLF PSORT looks like this (space separated), showing two proteins:
+
+================================ ============================================
+gi|301087619|ref|XP_002894699.1| extr 12, mito 4, E.R. 3, golg 3, mito_nucl 3
+gi|301087623|ref|XP_002894700.1| extr 21, mito 2, cyto 2, cyto_mito 2
+================================ ============================================
+
+This is reformatted into a tabular file as follows for use in Galaxy:
+
+================================ =========== ===== ====
+#ID                              Compartment Score Rank
+-------------------------------- ----------- ----- ----
+gi|301087619|ref|XP_002894699.1| extr           12    1
+gi|301087619|ref|XP_002894699.1| mito            4    2
+gi|301087619|ref|XP_002894699.1| E.R.            3    3
+gi|301087619|ref|XP_002894699.1| golg            3    4
+gi|301087619|ref|XP_002894699.1| mito_nucl       3    5
+gi|301087623|ref|XP_002894700.1| extr           21    1
+gi|301087623|ref|XP_002894700.1| mito            2    2
+gi|301087623|ref|XP_002894700.1| cyto            2    3
+gi|301087623|ref|XP_002894700.1| cyto_mito       2    4
+================================ =========== ===== ====
+
+This way you can easily filter for things like having a top prediction for
+mitochondria (c2=='mito' and c4==1), or extracellular with a score of at
+least 10 (c2=='extr' and 10&lt;=c3), and so on.
+
+
+**References**
+
+Paul Horton, Keun-Joon Park, Takeshi Obayashi, Naoya Fujita, Hajime Harada, C.J. Adams-Collier, and Kenta Nakai,
+WoLF PSORT: Protein Localization Predictor.
+Nucleic Acids Research, 35(S2), W585-W587, doi:10.1093/nar/gkm259, 2007.
+
+Paul Horton, Keun-Joon Park, Takeshi Obayashi and Kenta Nakai.
+Protein Subcellular Localization Prediction with WoLF PSORT.
+Proceedings of the 4th Annual Asia Pacific Bioinformatics Conference APBC06, Taipei, Taiwan. pp. 39-48, 2006.
+
+http://wolfpsort.org
+
+    </help>
+</tool>
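
The filters mentioned in the tool help (c2=='mito' and c4==1, or c2=='extr' and 10<=c3) can also be applied outside Galaxy. The sketch below is an illustration only, written for Python 3; the function name filter_rows and its parameters are hypothetical and do not appear in the changeset. The sample rows are taken from the help text.

```python
import io


def filter_rows(handle, compartment, max_rank=None, min_score=None):
    """Return (id, compartment, score, rank) tuples matching the criteria."""
    matches = []
    for line in handle:
        line = line.rstrip("\r\n")
        # Skip blank lines and the "#ID ..." header line.
        if not line or line.startswith("#"):
            continue
        seq_id, comp, score, rank = line.split("\t")
        score, rank = int(score), int(rank)
        if comp != compartment:
            continue
        if max_rank is not None and rank > max_rank:
            continue
        if min_score is not None and score < min_score:
            continue
        matches.append((seq_id, comp, score, rank))
    return matches


table = (
    "#ID\tCompartment\tScore\tRank\n"
    "gi|301087619|ref|XP_002894699.1|\textr\t12\t1\n"
    "gi|301087619|ref|XP_002894699.1|\tmito\t4\t2\n"
    "gi|301087623|ref|XP_002894700.1|\textr\t21\t1\n"
    "gi|301087623|ref|XP_002894700.1|\tmito\t2\t2\n"
)

# Equivalent of the Galaxy filter c2=='mito' and c4==1
top_mito = filter_rows(io.StringIO(table), "mito", max_rank=1)
# Equivalent of the Galaxy filter c2=='extr' and 10<=c3
strong_extr = filter_rows(io.StringIO(table), "extr", min_score=10)
```

With these sample rows neither protein has mitochondria as its top prediction, while both pass the extracellular score threshold.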