diff README.rst @ 11:99209ed2ec87 draft
planemo upload for repository https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow commit 4bd49529e9ca2096cd875e98daf7190d13fa8d0b-dirty
author: peterjc
date: Wed, 01 Feb 2017 13:21:32 -0500
parents: 2c8931827fa5
children: cb25a70933ea
--- a/README.rst	Mon Mar 30 11:46:13 2015 -0400
+++ b/README.rst	Wed Feb 01 13:21:32 2017 -0500
@@ -1,180 +1,99 @@
-Introduction
-============
-
-Galaxy is a web-based platform for biological data analysis, supporting
-extension with additional tools (often wrappers for existing command line
-tools) and datatypes. See http://www.galaxyproject.org/ and the public
-server at http://usegalaxy.org for an example.
+This is package is a Galaxy workflow for the identification of candidate
+secreted proteins from a given protein FASTA file.
 
-The NCBI BLAST suite is a widely used set of tools for biological sequence
-comparison. It is available as standalone binaries for use at the command
-line, and via the NCBI website for smaller searches. For more details see
-http://blast.ncbi.nlm.nih.gov/Blast.cgi
+It runs SignalP v3.0 (Bendtsen et al. 2004) and selects only proteins with a
+strong predicted signal peptide, and then runs TMHMM v2.0 (Krogh et al. 2001)
+on those, and selects only proteins without a predicted trans-membrane helix.
+This workflow was used in Kikuchi et al. (2011), and is a simplification of
+the candidate effector protocol described in Jones et al. (2009).
 
-This is an example workflow using the Galaxy wrappers for NCBI BLAST+,
-see https://github.com/peterjc/galaxy_blast
+See http://www.galaxyproject.org for information about the Galaxy Project.
 
 
-Galaxy workflow for counting species of top BLAST hits
-======================================================
+Availability
+============
 
-This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an
-initial assessment of a transcriptome assembly to give a crude indication of
-any major contamination present based on the species of the top BLAST hit
-of 1000 representative sequences.
+This workflow is available to download and/or install from the main
+Galaxy Tool Shed:
 
-.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png
-
-In words, the workflow proceeds as follows:
+http://toolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
 
-1. Upload/import your transcriptome assembly or any nucleotide FASTA file.
-2. Samples 1000 representative sequences, selected uniformly/evenly though
-   the file.
-3. Convert the sampled FASTA file into a three column tabular file.
-4. Runs NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr``
-   database (assuming this is already available setup on your local Galaxy
-   under the alias ``nr``), requesting tabular output including the taxonomy
-   fields, and at most one matching target sequence.
-5. Remove any duplicate alignments (multiple HSPs for the same match).
-6. Combine the filtered BLAST output with the tabular version of the 1000
-   sequences to give a new tabular file with exactly 1000 lines, adding
-   ``None`` for sequences missing a BLAST hit.
-7. Count the BLAST species names in this file.
-8. Sort the counts.
+Test releases (which should not normally be used) are on the Test Tool Shed:
+
+http://testtoolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
 
-Finally we would suggest visualising the sorted tally table as a Pie Chart.
+Development is being done on github here:
+
+https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow
 
 
 Sample Data
 ===========
 
-As an example, you can upload the transcriptome assembly of the nematode
-*Nacobbus abberans* from Eves van den Akker *et al.* (2015),
-http://dx.doi.org/10.1093/gbe/evu171 using this URL:
-
-http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip
-
-Running this workflow with a copy of the NCBI non-redundant ``nr`` database
-from 16 Oct 2014 (which did **not** contain this *N. abberans* dataset) gave
-the following results - note 609 out of the 1000 sequences gave no BLAST hit.
-
-===== ==================
-Count Subject Blast Name
------ ------------------
-  609 None
-  244 nematodes
-   30 ascomycetes
-   27 eukaryotes
-    8 basidiomycetes
-    6 aphids
-    5 eudicots
-    5 flies
-  ... ...
-===== ==================
+This workflow was developed and run on several nematode species. For example,
+try the protein set for *Bursaphelenchus xylophilus* (Kikuchi et al. 2011):
 
-As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,
-this transcriptome assembly has already had obvious contamination removed.
-
-At the time of writing, Galaxy's visualizations could not be included in
-a workflow. You can generate a pie chart from the final count file using
-the counts (c1) and labels (c2), like this:
-
-.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png
-
-Note the nematode count in this image was shown as a mouse-over effect.
-
-
-Disclaimer
-==========
-
-Species assignment by top BLAST hit is not suitable for any in depth
-analysis. It is particularly prone to false positives where contaminants
-in public datasets are mislabelled. See for example Ed Yong (2015),
-"There's No Plague on the NYC Subway. No Platypuses Either.":
-
-http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/
-
-
-Known Issues
-============
+ftp://ftp.sanger.ac.uk/pub/pathogens/Bursaphelenchus/xylophilus/Assembly-v1.2/BUX.v1.2.genedb.protein.fa.gz
 
-Counts
-------
-
-This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with
-the current stable release (Galaxy v15.03, i.e. March 2015).
-
-The updated "Count" tool version 1.0.1 includes a fix not to remove spaces
-in the fields being counted. In the example above, while the top hits are
-not affected, minor entries like "cellular slime molds" are shown as
-"cellularslimemolds" instead (look closely at the Pie Chart key)..
-
-The updated "Count" tool version 1.0.1 also adds a new option to sort the
-output, which avoids the additional sorting step in the current version of
-the workflow.
-
-A future update to this workflow will use the revised "Count" tool, once
-this is included in the next stable Galaxy release - or migrated to the
-Galaxy Tool Shed.
-
-NCBI nr database
-----------------
-
-The use of external datasets within Galaxy via the ``*.loc`` configuration
-files undermines provenance tracking within Galaxy. This is exacerbated
-by the lack of officially versioned BLAST database releases by the NCBI.
-
-This workflow assumes that you have an entry ``nr`` in your ``blastdb_p.loc``
-(the configuration file listing locally installed BLAST databases external
-to Galaxy - consult the NCBI BLAST+ wrapper documentation for more details),
-and that this points to a mirror of the latest NCBI "non-redundant" database
-from ftp://ftp.ncbi.nlm.nih.gov/blast/db/
-
-i.e. The workflow is intended to be used against the *latest* nr database,
-and thus is not reproducible over the long term as the database changes.
-
-
-Availability
-============
-
-This workflow is available to download and/or install from the main Galaxy Tool Shed:
-
-http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
-
-Test releases (which should not normally be used) are on the Test Tool Shed:
-
-http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
-
-Development is being done on github here:
-
-https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
+You can upload this directly into Galaxy via this URL. Galaxy will handle
+removing the gzip compression to give you the FASTA protein file which has
+18,074 sequences. The expected result (selecting organism type Eukaryote)
+is a FASTA protein file of 2,297 predicted secreted protein sequences.
 
 
 Citation
 ========
 
-Please cite the following paper (currently available as a preprint):
+If you use this workflow directly, or a derivative of it, in work leading
+to a scientific publication, please cite:
 
-NCBI BLAST+ integrated into Galaxy.
-P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
-bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint)
+Cock, P.J.A. and Pritchard, L. (2014). Galaxy as a platform for identifying
+candidate pathogen effectors. Chapter 1 in "Plant-Pathogen Interactions:
+Methods and Protocols (Second Edition)"; P. Birch, J. Jones, and J.I. Bos, eds.
+Methods in Molecular Biology. Humana Press, Springer. ISBN 978-1-62703-985-7.
+http://www.springer.com/life+sciences/plant+sciences/book/978-1-62703-985-7
 
-You should also cite Galaxy, and the NCBI BLAST+ tools:
+Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013).
+Galaxy tools and workflows for sequence analysis with applications
+in molecular plant pathology. PeerJ 1:e167
+http://dx.doi.org/10.7717/peerj.167
 
-BLAST+: architecture and applications.
-C. Camacho et al. BMC Bioinformatics 2009, 10:421.
-DOI: http://dx.doi.org/10.1186/1471-2105-10-421
+Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S. (2004)
+Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340: 783–95.
+http://dx.doi.org/10.1016/j.jmb.2004.05.028
+
+Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E. (2001)
+Predicting transmembrane protein topology with a hidden Markov model:
+application to complete genomes. J Mol Biol 305: 567- 580.
+http://dx.doi.org/10.1006/jmbi.2000.4315
 
 
-Automated Installation
-======================
+Additional References
+=====================
+
+Kikuchi, T., Cotton, J.A., Dalzell, J.J., Hasegawa. K., et al. (2011)
+Genomic insights into the origin of parasitism in the emerging plant
+pathogen *Bursaphelenchus xylophilus*. PLoS Pathog 7: e1002219.
+http://dx.doi.org/10.1371/journal.ppat.1002219
 
-Installation via the Galaxy Tool Shed should take care of the dependencies
-on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries.
+Jones, J.T., Kumar, A., Pylypenko, L.A., Thirugnanasambandam, A., et al. (2009)
+Identification and functional characterization of effectors in expressed
+sequence tags from various life cycle stages of the potato cyst nematode
+*Globodera pallida*. Mol Plant Pathol 10: 815–28.
+http://dx.doi.org/10.1111/j.1364-3703.2009.00585.x
+
 
-However, this workflow requires a current version of the NCBI nr protein
-BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower
-case).
+Dependencies
+============
+
+These dependencies should be resolved automatically via the Galaxy Tool Shed:
+
+* http://toolshed.g2.bx.psu.edu/view/peterjc/tmhmm_and_signalp
+* http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id
+
+However, at the time of writing those Galaxy tools have their own
+dependencies required for this workflow which require manual
+installation (SignalP v3.0 and TMHMM v2.0).
 
 
 History
@@ -183,7 +102,13 @@
 ======= ======================================================================
 Version Changes
 ------- ----------------------------------------------------------------------
-v0.1.0  - Initial Tool Shed release, targetting NCBI BLAST+ 2.2.29
+v0.0.1  - Initial release to Tool Shed (May, 2013)
+        - Expanded README file to include example data
+v0.0.2  - Updated versions of the tools used, inclulding core Galaxy Filter
+          tool to avoid warning about new ``header_lines`` parameter.
+        - Added link to Tool Shed in the workflow annotation explaining there
+          is a README file with sample data, and a requested citation.
+v0.0.3  - Use MIT licence.
 ======= ======================================================================
 
 
@@ -192,20 +117,18 @@
 
 This workflow is under source code control here:
 
-https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
+https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow
 
 To prepare the tar-ball for uploading to the Tool Shed, I use this:
 
-    $ tar -cf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png
+    $ tar -cf secreted_protein_workflow.tar.gz README.rst repository_dependencies.xml secreted_protein_workflow.ga
 
 Check this,
 
-    $ tar -tzf blast_top_hit_species.tar.gz
+    $ tar -tzf secreted_protein_workflow.tar.gz
     README.rst
     repository_dependencies.xml
-    blast_top_hit_species.ga
-    blast_top_hit_species.png
-    N_abberans_piechart_mouseover.png
+    secreted_protein_workflow.ga
 
 
 Licence (MIT)