annotate readme.rst @ 0:ba161910b46f draft

Uploaded
author rnateam
date Mon, 21 Oct 2013 12:27:17 -0400
parents
children d6553277b759

This package is a Galaxy workflow for the BlockClust pipeline.

It uses the Glimmer3 tool (Delcher et al. 2007) trained on a known set of
genes to generate gene predictions on a new genome, and then calls EMBOSS
(Rice et al. 2000) to translate the predictions into a FASTA file of
predicted protein sequences. The workflow requires two input files:

* Nucleotide FASTA file of known gene sequences (training set)
* Nucleotide FASTA file of genome sequence or assembled contigs

First, an interpolated context model (ICM) is built from the set of known
genes, preferably from the closest relative organism(s) available. Next, this
ICM is used to predict genes in the genomic FASTA file. This produces
a FASTA file of the predicted gene nucleotide sequences, which is translated
into protein sequences using the EMBOSS tool transeq.

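As a rough illustration of the final translation step, the sketch below
translates a nucleotide sequence into a protein sequence in pure Python. It is
not how EMBOSS transeq is implemented, and the workflow itself does not use it;
the function name ``translate``, the choice of the standard genetic code, and
the fixed reading frame 1 are all assumptions made for this example.

```python
from itertools import product

# Standard genetic code with codons enumerated in TCAG order.
# Assumption: the standard table is adequate for this illustration.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a nucleotide sequence in frame 1.

    Stop codons become '*', codons with ambiguous bases become 'X',
    and any trailing partial codon is ignored.
    """
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate("ATGGCCAAATAA"))
```

Applying such a function to every record in the predicted gene FASTA file gives
one protein sequence per predicted gene, which is the shape of output the
workflow's translation step is described as producing.
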
Glimmer is intended for finding genes in microbial DNA, especially bacteria,
archaea, and viruses.

See http://www.galaxyproject.org for information about the Galaxy Project.


Sample Data
===========

As an example, we will use the first public assembly of the 2011 Shiga-toxin
producing *Escherichia coli* O104:H4 outbreak in Germany. This was part of the
open-source crowd-sourcing analysis described in Rohde et al. (2011) and here:
https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/wiki

You can upload this assembly directly into Galaxy using the "Upload File" tool
with either of these URLs; Galaxy should recognise this as a FASTA file with
3,057 sequences:

* http://static.xbase.ac.uk/files/results/nick/TY2482/TY2482.fasta.txt
* https://github.com/ehec-outbreak-crowdsourced/BGI-data-analysis/blob/master/strains/TY2482/seqProject/BGI/assemblies/NickLoman/TY2482.fasta.txt

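If you want to check a downloaded copy outside Galaxy, the sequence count can
be reproduced by counting FASTA header lines. The following is a minimal
sketch; the function name ``count_fasta_records`` is hypothetical and is not
part of Galaxy or of this workflow.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines, which start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


# Small in-memory example; an open file handle can be passed the same way.
example = [">contig1", "ACGT", ">contig2", "GGCC"]
print(count_fasta_records(example))
```

For a real file, pass the open handle, for example
``count_fasta_records(open("TY2482.fasta.txt"))``. An intact download of the
assembly above should then report the 3,057 sequences mentioned.
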
This FASTA file ``TY2482.fasta.txt`` was the initial TY-2482 strain assembled
by Nick Loman from 5 runs of Ion Torrent data released by the BGI, using the
MIRA 3.2 assembler. It was initially released via his blog,
http://pathogenomics.bham.ac.uk/blog/2011/06/ehec-genome-assembly/

We will also need a training set of known *E. coli* genes, for example the
model strain *Escherichia coli* str. K-12 substr. MG1655 which is well
annotated. You can upload the NCBI FASTA file ``NC_000913.ffn`` of the
gene nucleotide sequences directly into Galaxy via this URL, which Galaxy
should recognise as a FASTA file with 4,321 sequences:

* ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.ffn

Then run the workflow, which should produce 2,333 predicted genes for the
TY2482 assembly (two FASTA files, nucleotide and protein sequences).


Citation
========

If you use this workflow directly, or a derivative of it, or the associated
wrappers for Galaxy, in work leading to a scientific publication,
please cite:

P. Videm et al...

For Glimmer3 please cite:

Delcher, A.L., Bratke, K.A., Powers, E.C., and Salzberg, S.L. (2007)
Identifying bacterial genes and endosymbiont DNA with Glimmer.
Bioinformatics 23(6), 673-679.
http://dx.doi.org/10.1093/bioinformatics/btm009

For EMBOSS please cite:

Rice, P., Longden, I. and Bleasby, A. (2000)
EMBOSS: The European Molecular Biology Open Software Suite.
Trends in Genetics 16(6), 276-277.
http://dx.doi.org/10.1016/S0168-9525(00)02024-2


Additional References
=====================

Rohde, H., Qin, J., Cui, Y., Li, D., Loman, N.J., et al. (2011)
Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4.
New England Journal of Medicine 365, 718-724.
http://dx.doi.org/10.1056/NEJMoa1107643


Availability
============

This workflow is available on the main Galaxy Tool Shed:

http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer_gene_calling_workflow

Development is being done on GitHub:

https://github.com/bgruening/galaxytools/workflows/glimmer3/


Dependencies
============

These dependencies should be resolved automatically via the Galaxy Tool Shed:

* http://toolshed.g2.bx.psu.edu/view/bgruening/glimmer3
* http://toolshed.g2.bx.psu.edu/view/devteam/emboss_5