comparison README.rst @ 11:99209ed2ec87 draft

planemo upload for repository https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow commit 4bd49529e9ca2096cd875e98daf7190d13fa8d0b-dirty
author peterjc
date Wed, 01 Feb 2017 13:21:32 -0500
parents 2c8931827fa5
children cb25a70933ea
10:2c8931827fa5 11:99209ed2ec87
1 Introduction 1 This package is a Galaxy workflow for the identification of candidate
2 secreted proteins from a given protein FASTA file.
3
4 It runs SignalP v3.0 (Bendtsen et al. 2004) and selects only proteins with a
5 strong predicted signal peptide, and then runs TMHMM v2.0 (Krogh et al. 2001)
6 on those, and selects only proteins without a predicted trans-membrane helix.
7 This workflow was used in Kikuchi et al. (2011), and is a simplification of
8 the candidate effector protocol described in Jones et al. (2009).
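The two-stage selection described above can be sketched in Python. This only illustrates the filtering logic and is not the workflow's implementation: the ``Prediction`` record, its field names, and the sample identifiers are hypothetical, and real SignalP and TMHMM output would first need to be parsed into this form.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical per-protein summary; not a real SignalP/TMHMM format."""

    seq_id: str
    has_signal_peptide: bool  # stage 1: strong predicted signal peptide
    tm_helix_count: int  # stage 2: number of predicted trans-membrane helices


def candidate_secreted(predictions):
    """Return IDs that have a signal peptide and no trans-membrane helix."""
    # Stage 1: keep only proteins with a predicted signal peptide.
    with_signal = [p for p in predictions if p.has_signal_peptide]
    # Stage 2: of those, keep only proteins without a trans-membrane helix.
    return [p.seq_id for p in with_signal if p.tm_helix_count == 0]


# Made-up example records, for illustration only.
example = [
    Prediction("protA", True, 0),
    Prediction("protB", True, 2),
    Prediction("protC", False, 0),
]
print(candidate_secreted(example))
```

Only ``protA`` passes both stages here: ``protB`` is dropped for its predicted helices and ``protC`` for lacking a signal peptide.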
9
10 See http://www.galaxyproject.org for information about the Galaxy Project.
11
12
13 Availability
2 ============ 14 ============
3 15
4 Galaxy is a web-based platform for biological data analysis, supporting 16 This workflow is available to download and/or install from the main
5 extension with additional tools (often wrappers for existing command line 17 Galaxy Tool Shed:
6 tools) and datatypes. See http://www.galaxyproject.org/ and the public
7 server at http://usegalaxy.org for an example.
8 18
9 The NCBI BLAST suite is a widely used set of tools for biological sequence 19 http://toolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
10 comparison. It is available as standalone binaries for use at the command
11 line, and via the NCBI website for smaller searches. For more details see
12 http://blast.ncbi.nlm.nih.gov/Blast.cgi
13 20
14 This is an example workflow using the Galaxy wrappers for NCBI BLAST+, 21 Test releases (which should not normally be used) are on the Test Tool Shed:
15 see https://github.com/peterjc/galaxy_blast
16 22
23 http://testtoolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
17 24
18 Galaxy workflow for counting species of top BLAST hits 25 Development is being done on GitHub here:
19 ======================================================
20 26
21 This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an 27 https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow
22 initial assessment of a transcriptome assembly to give a crude indication of
23 any major contamination present based on the species of the top BLAST hit
24 of 1000 representative sequences.
25
26 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png
27
28 In words, the workflow proceeds as follows:
29
30 1. Upload/import your transcriptome assembly or any nucleotide FASTA file.
31 2. Sample 1000 representative sequences, selected uniformly/evenly through
32 the file.
33 3. Convert the sampled FASTA file into a three column tabular file.
34 4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr``
35 database (assuming this is already set up on your local Galaxy
36 under the alias ``nr``), requesting tabular output including the taxonomy
37 fields, and at most one matching target sequence.
38 5. Remove any duplicate alignments (multiple HSPs for the same match).
39 6. Combine the filtered BLAST output with the tabular version of the 1000
40 sequences to give a new tabular file with exactly 1000 lines, adding
41 ``None`` for sequences missing a BLAST hit.
42 7. Count the BLAST species names in this file.
43 8. Sort the counts.
44
45 Finally we would suggest visualising the sorted tally table as a Pie Chart.
46 28
47 29
48 Sample Data 30 Sample Data
49 =========== 31 ===========
50 32
51 As an example, you can upload the transcriptome assembly of the nematode 33 This workflow was developed and run on several nematode species. For example,
52 *Nacobbus aberrans* from Eves van den Akker *et al.* (2015), 34 try the protein set for *Bursaphelenchus xylophilus* (Kikuchi et al. 2011):
53 http://dx.doi.org/10.1093/gbe/evu171 using this URL:
54 35
55 http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip 36 ftp://ftp.sanger.ac.uk/pub/pathogens/Bursaphelenchus/xylophilus/Assembly-v1.2/BUX.v1.2.genedb.protein.fa.gz
56 37
57 Running this workflow with a copy of the NCBI non-redundant ``nr`` database 38 You can upload this directly into Galaxy via this URL. Galaxy will handle
58 from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave 39 removing the gzip compression to give you the FASTA protein file which has
59 the following results - note 609 out of the 1000 sequences gave no BLAST hit. 40 18,074 sequences. The expected result (selecting organism type Eukaryote)
60 41 is a FASTA protein file of 2,297 predicted secreted protein sequences.
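To check a downloaded copy against the sequence counts quoted above before uploading it, the records in a gzip-compressed FASTA file can be counted with Python's standard library. This is a minimal sketch: the function name ``count_fasta_records`` is made up for illustration, and it simply counts header lines starting with ``>``.

```python
import gzip


def count_fasta_records(path):
    """Count records (header lines starting with '>') in a gzipped FASTA file."""
    # "rt" opens the compressed file in text mode, decompressing on the fly.
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Assuming the sample file has been saved locally under its original name, ``count_fasta_records("BUX.v1.2.genedb.protein.fa.gz")`` should report 18,074 if the download matches the one described here.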
61 ===== ==================
62 Count Subject Blast Name
63 ----- ------------------
64 609 None
65 244 nematodes
66 30 ascomycetes
67 27 eukaryotes
68 8 basidiomycetes
69 6 aphids
70 5 eudicots
71 5 flies
72 ... ...
73 ===== ==================
74
75 As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,
76 this transcriptome assembly has already had obvious contamination removed.
77
78 At the time of writing, Galaxy's visualizations could not be included in
79 a workflow. You can generate a pie chart from the final count file using
80 the counts (c1) and labels (c2), like this:
81
82 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png
83
84 Note the nematode count in this image was shown as a mouse-over effect.
85
86
87 Disclaimer
88 ==========
89
90 Species assignment by top BLAST hit is not suitable for any in depth
91 analysis. It is particularly prone to false positives where contaminants
92 in public datasets are mislabelled. See for example Ed Yong (2015),
93 "There's No Plague on the NYC Subway. No Platypuses Either.":
94
95 http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/
96
97
98 Known Issues
99 ============
100
101 Counts
102 ------
103
104 This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with
105 the current stable release (Galaxy v15.03, i.e. March 2015).
106
107 The updated "Count" tool version 1.0.1 includes a fix not to remove spaces
108 in the fields being counted. In the example above, while the top hits are
109 not affected, minor entries like "cellular slime molds" are shown as
110 "cellularslimemolds" instead (look closely at the Pie Chart key).
111
112 The updated "Count" tool version 1.0.1 also adds a new option to sort the
113 output, which avoids the additional sorting step in the current version of
114 the workflow.
115
116 A future update to this workflow will use the revised "Count" tool, once
117 this is included in the next stable Galaxy release - or migrated to the
118 Galaxy Tool Shed.
119
120 NCBI nr database
121 ----------------
122
123 The use of external datasets within Galaxy via the ``*.loc`` configuration
124 files undermines provenance tracking within Galaxy. This is exacerbated
125 by the lack of officially versioned BLAST database releases by the NCBI.
126
127 This workflow assumes that you have an entry ``nr`` in your ``blastdb_p.loc``
128 (the configuration file listing locally installed BLAST databases external
129 to Galaxy - consult the NCBI BLAST+ wrapper documentation for more details),
130 and that this points to a mirror of the latest NCBI "non-redundant" database
131 from ftp://ftp.ncbi.nlm.nih.gov/blast/db/
132
133 i.e. The workflow is intended to be used against the *latest* nr database,
134 and thus is not reproducible over the long term as the database changes.
135
136
137 Availability
138 ============
139
140 This workflow is available to download and/or install from the main Galaxy Tool Shed:
141
142 http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
143
144 Test releases (which should not normally be used) are on the Test Tool Shed:
145
146 http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
147
148 Development is being done on GitHub here:
149
150 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
151 42
152 43
153 Citation 44 Citation
154 ======== 45 ========
155 46
156 Please cite the following paper (currently available as a preprint): 47 If you use this workflow directly, or a derivative of it, in work leading
48 to a scientific publication, please cite:
157 49
158 NCBI BLAST+ integrated into Galaxy. 50 Cock, P.J.A. and Pritchard, L. (2014). Galaxy as a platform for identifying
159 P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo 51 candidate pathogen effectors. Chapter 1 in "Plant-Pathogen Interactions:
160 bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint) 52 Methods and Protocols (Second Edition)"; P. Birch, J. Jones, and J.I. Bos, eds.
53 Methods in Molecular Biology. Humana Press, Springer. ISBN 978-1-62703-985-7.
54 http://www.springer.com/life+sciences/plant+sciences/book/978-1-62703-985-7
161 55
162 You should also cite Galaxy, and the NCBI BLAST+ tools: 56 Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013).
57 Galaxy tools and workflows for sequence analysis with applications
58 in molecular plant pathology. PeerJ 1:e167
59 http://dx.doi.org/10.7717/peerj.167
163 60
164 BLAST+: architecture and applications. 61 Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S. (2004)
165 C. Camacho et al. BMC Bioinformatics 2009, 10:421. 62 Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340: 783–95.
166 DOI: http://dx.doi.org/10.1186/1471-2105-10-421 63 http://dx.doi.org/10.1016/j.jmb.2004.05.028
64
65 Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E. (2001)
66 Predicting transmembrane protein topology with a hidden Markov model:
67 application to complete genomes. J Mol Biol 305: 567–580.
68 http://dx.doi.org/10.1006/jmbi.2000.4315
167 69
168 70
169 Automated Installation 71 Additional References
170 ====================== 72 =====================
171 73
172 Installation via the Galaxy Tool Shed should take care of the dependencies 74 Kikuchi, T., Cotton, J.A., Dalzell, J.J., Hasegawa, K., et al. (2011)
173 on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries. 75 Genomic insights into the origin of parasitism in the emerging plant
76 pathogen *Bursaphelenchus xylophilus*. PLoS Pathog 7: e1002219.
77 http://dx.doi.org/10.1371/journal.ppat.1002219
174 78
175 However, this workflow requires a current version of the NCBI nr protein 79 Jones, J.T., Kumar, A., Pylypenko, L.A., Thirugnanasambandam, A., et al. (2009)
176 BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower 80 Identification and functional characterization of effectors in expressed
177 case). 81 sequence tags from various life cycle stages of the potato cyst nematode
82 *Globodera pallida*. Mol Plant Pathol 10: 815–28.
83 http://dx.doi.org/10.1111/j.1364-3703.2009.00585.x
84
85
86 Dependencies
87 ============
88
89 These dependencies should be resolved automatically via the Galaxy Tool Shed:
90
91 * http://toolshed.g2.bx.psu.edu/view/peterjc/tmhmm_and_signalp
92 * http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id
93
94 However, at the time of writing those Galaxy tools have their own
95 dependencies required for this workflow which require manual
96 installation (SignalP v3.0 and TMHMM v2.0).
178 97
179 98
180 History 99 History
181 ======= 100 =======
182 101
183 ======= ====================================================================== 102 ======= ======================================================================
184 Version Changes 103 Version Changes
185 ------- ---------------------------------------------------------------------- 104 ------- ----------------------------------------------------------------------
186 v0.1.0 - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29 105 v0.0.1 - Initial release to Tool Shed (May, 2013)
106 - Expanded README file to include example data
107 v0.0.2 - Updated versions of the tools used, including core Galaxy Filter
108 tool to avoid warning about new ``header_lines`` parameter.
109 - Added link to Tool Shed in the workflow annotation explaining there
110 is a README file with sample data, and a requested citation.
111 v0.0.3 - Use MIT licence.
187 ======= ====================================================================== 112 ======= ======================================================================
188 113
189 114
190 Developers 115 Developers
191 ========== 116 ==========
192 117
193 This workflow is under source code control here: 118 This workflow is under source code control here:
194 119
195 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species 120 https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow
196 121
197 To prepare the tar-ball for uploading to the Tool Shed, I use this: 122 To prepare the tar-ball for uploading to the Tool Shed, I use this:
198 123
199 $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png 124 $ tar -czf secreted_protein_workflow.tar.gz README.rst repository_dependencies.xml secreted_protein_workflow.ga
200 125
201 Check this, 126 Check this,
202 127
203 $ tar -tzf blast_top_hit_species.tar.gz 128 $ tar -tzf secreted_protein_workflow.tar.gz
204 README.rst 129 README.rst
205 repository_dependencies.xml 130 repository_dependencies.xml
206 blast_top_hit_species.ga 131 secreted_protein_workflow.ga
207 blast_top_hit_species.png
208 N_abberans_piechart_mouseover.png
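The same packaging and check can also be done with Python's standard ``tarfile`` module, which may be convenient where the ``tar`` command is unavailable. This is a minimal sketch, not part of the repository: the function names ``build_tarball`` and ``list_tarball`` are made up for illustration, and the files are assumed to be in the current directory.

```python
import tarfile


def build_tarball(archive_path, filenames):
    """Write a gzip-compressed tar archive containing the given files."""
    # Mode "w:gz" applies gzip compression, matching the .tar.gz extension.
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in filenames:
            archive.add(name)


def list_tarball(archive_path):
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

For the new revision this would be called with ``secreted_protein_workflow.tar.gz`` and the three files named in the command above.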
209 132
210 133
211 Licence (MIT) 134 Licence (MIT)
212 ============= 135 =============
213 136