comparison flye.xml @ 6:b94de04ca7c2 draft

planemo upload for repository https://github.com/quadram-institute-bioscience/galaxy-tools/tree/master/tools/flye commit e5796f490952b36c7f1360351be90ec0bb60de55-dirty
author thanhlv
date Tue, 17 Sep 2019 11:08:59 -0400
parents af87635c6888
children 36721dedba06
5:693c8773a40c 6:b94de04ca7c2
4 <import>macros.xml</import> 4 <import>macros.xml</import>
5 </macros> 5 </macros>
6 <expand macro="requirements" /> 6 <expand macro="requirements" />
7 <version_command>flye --version</version_command> 7 <version_command>flye --version</version_command>
8 <command detect_errors="exit_code"> 8 <command detect_errors="exit_code">
9 <![CDATA[ 9 <![CDATA[
10 10
11 #for $counter, $input in enumerate($inputs): 11 #for $counter, $input in enumerate($inputs):
12 12
13 #if $input.is_of_type('fastqsanger', 'fastq'): 13 #if $input.is_of_type('fastqsanger', 'fastq'):
14 #set $ext = 'fastq' 14 #set $ext = 'fastq'
46 #end if 46 #end if
47 #if $no_trestle: 47 #if $no_trestle:
48 '$no_trestle' 48 '$no_trestle'
49 #end if 49 #end if
50 2>&1 50 2>&1
51 ]]></command> 51 ]]> </command>
52 <inputs> 52 <inputs>
53 <param name="inputs" type="data" format="fasta,fasta.gz,fastq,fastq.gz,fastqsanger.gz,fastqsanger" multiple="true" label="Input reads" /> 53 <param name="inputs" type="data" format="fasta,fasta.gz,fastq,fastq.gz,fastqsanger.gz,fastqsanger" multiple="true" label="Input reads" >
54 <help><![CDATA[
55
56 Input reads could be in FASTA or FASTQ format, uncompressed
57 or compressed with gz. Currently, raw and corrected reads
58 from PacBio and ONT are supported. The expected error rates are
59 <30% for raw and <2% for corrected reads. Additionally,
60 --subassemblies option performs a consensus assembly of multiple
61 sets of high-quality contigs. You may specify multiple
62 files with reads (separated by spaces). Mixing different read
63 types is not yet supported.
64 ]]> </help>
65 </param>
54 <param name="mode" type="select" label="Mode"> 66 <param name="mode" type="select" label="Mode">
55 <option value="--nano-raw">Nanopore raw</option> 67 <option value="--nano-raw">Nanopore raw</option>
56 <option value="--nano-corr">Nanopore corrected</option> 68 <option value="--nano-corr">Nanopore corrected</option>
57 <option value="--pacbio-raw">PacBio raw</option> 69 <option value="--pacbio-raw">PacBio raw</option>
58 <option value="--pacbio-corr">PacBio corrected</option> 70 <option value="--pacbio-corr">PacBio corrected</option>
59 <option value="--subassemblies">high-quality contig-like input</option> 71 <option value="--subassemblies">high-quality contig-like input</option>
60 </param> 72 </param>
61 <param argument="-g" type="text" label="estimated genome size (for example, 5m or 2.6g)"> 73 <param argument="-g" type="text" label="estimated genome size (for example, 5m or 2.6g)">
74 <help>
75 <![CDATA[
76 <span>The genome size estimate is used for solid k-mer selection in the
77 initial disjointig assembly stage. <b>Flye is not very sensitive to this
78 parameter, and the estimate could be rough</b>. It is ok if the parameter is
79 within 0.5x-2x of the actual genome size. If the final assembly size is
80 very different from the initial guess, consider re-running the pipeline
81 with an updated estimate for better results.</span>
82 <br>
83 <span>An alternative option is to run Flye in <b>--meta</b> mode, which uses a different
84 approach for solid k-mer selection. This mode is almost independent from the
85 genome size parameter (you still need to provide an estimate for the selection
86 of some other parameters). When assembly is completed, you can re-run in the
87 normal mode with the inferred genome size.</span>
88 ]]>
89 </help>
62 <validator type="regex" message="Genome size must be a float or integer, optionally followed by a unit prefix (kmg)">^([0-9]*[.])?[0-9]+[kmg]?$</validator> 90 <validator type="regex" message="Genome size must be a float or integer, optionally followed by a unit prefix (kmg)">^([0-9]*[.])?[0-9]+[kmg]?$</validator>
63 </param> 91 </param>
64 <param argument="-i" type="integer" value="1" label="number of polishing iterations" /> 92 <param argument="-i" type="integer" value="1" label="number of polishing iterations" />
65 <param argument="-m" type="integer" optional="true" label="minimum overlap between reads (default: auto)" /> 93 <param argument="-m" type="integer" optional="true" label="minimum overlap between reads (default: auto)" help="This sets a minimum overlap length for two reads to be considered overlapping. In the latest Flye versions, this parameter is chosen automatically based on the read length distribution (reads N90) and does not require manual setting. Typical value is 3k-5k (and down to 1k for datasets with shorter read length). Intuitively, we want to set this parameter as high as possible, so the repeat graph is less tangled. However, higher values might lead to assembly gaps. In some rare cases (for example in case of biased read length distribution) it makes sense to set this parameter manually."/>
66 <param argument="--asm_coverage" type="integer" optional="true" label="reduced coverage for initial contig assembly (default: not set)" /> 94 <param argument="--asm_coverage" type="integer" optional="true" label="reduced coverage for initial contig assembly (default: not set)" />
67 <param argument="--plasmid" type="boolean" truevalue="--plasmid" falsevalue="" checked="False" label="rescue short unassembled plasmids" /> 95 <param argument="--plasmid" type="boolean" truevalue="--plasmid" falsevalue="" checked="False" label="rescue short unassembled plasmids" />
68 <param argument="--meta" type="boolean" truevalue="--meta" falsevalue="" checked="False" label="metagenome / uneven coverage mode" /> 96 <param argument="--meta" type="boolean" truevalue="--meta" falsevalue="" checked="False" label="metagenome / uneven coverage mode" />
69 <param argument="--no_trestle" type="boolean" truevalue="--no-trestle" falsevalue="" checked="False" label="skip Trestle stage" /> 97 <param argument="--no_trestle" type="boolean" truevalue="--no-trestle" falsevalue="" checked="False" label="skip Trestle stage" help="After resolving bridged repeats, Trestle module attempts to resolve simple unbridged repeats (of multiplicity 2) using the heterogeneities between repeat copies"/>
70 </inputs> 98 </inputs>
71 <outputs> 99 <outputs>
72 <data name="scaffolds" format="fasta" from_work_dir="out_dir/scaffolds.fasta" label="${tool.name} on ${on_string} (scaffolds)"/> 100 <data name="scaffolds" format="fasta" from_work_dir="out_dir/scaffolds.fasta" label="${tool.name} on ${on_string} (scaffolds)"/>
73 <data name="assembly_info" format="tabular" from_work_dir="out_dir/assembly_info.txt" label="${tool.name} on ${on_string} (assembly_info)"/> 101 <data name="assembly_info" format="tabular" from_work_dir="out_dir/assembly_info.txt" label="${tool.name} on ${on_string} (assembly_info)"/>
74 <data name="assembly_graph" format="graph_dot" from_work_dir="out_dir/assembly_graph.gv" label="${tool.name} on ${on_string} (assembly_graph)"/> 102 <data name="assembly_graph" format="graph_dot" from_work_dir="out_dir/assembly_graph.gv" label="${tool.name} on ${on_string} (assembly_graph)"/>
101 <param name="i" value="2"/> 129 <param name="i" value="2"/>
102 <output name="scaffolds" file="result3_scaffolds.fasta" ftype="fasta" compare="sim_size"/> 130 <output name="scaffolds" file="result3_scaffolds.fasta" ftype="fasta" compare="sim_size"/>
103 <output name="assembly_gfa" file="result2_assembly_graph.gfa" ftype="txt" compare="sim_size"/> 131 <output name="assembly_gfa" file="result2_assembly_graph.gfa" ftype="txt" compare="sim_size"/>
104 </test> 132 </test>
105 </tests> 133 </tests>
106 <help><![CDATA[ 134 <help>
135 <![CDATA[
136 Flye output
137 The main output files are:
107 138
108 Input reads could be in FASTA or FASTQ format, uncompressed 139 - **assembly.fasta** - Final assembly. Contains contigs and possibly scaffolds (see below).
109 or compressed with gz. Currently, raw and corrected reads
110 from PacBio and ONT are supported. The expected error rates are
111 <30% for raw and <2% for corrected reads. Additionally,
112 --subassemblies option performs a consensus assembly of multiple
113 sets of high-quality contigs. You may specify multiple
114 files with reads (separated by spaces). Mixing different read
115 types is not yet supported.
116 140
117 You must provide an estimate of the genome size as input, 141 - **assembly_graph.{gfa|gv}** - Final repeat graph. Note that the edge sequences might be different (shorter) than contig sequences, because contigs might include multiple graph edges (see below).
118 which is used for solid k-mers selection. The estimate could 142
119 be rough (e.g. within 0.5x-2x range) and does not affect 143 - **assembly_info.txt** - Extra information about contigs (such as length or coverage).
120 the other assembly stages. Standard size modificators are 144
121 supported (e.g. 5m or 2.6g). 145 Each contig is formed by a single unique graph edge. If possible, unique contigs are extended with the sequence from flanking unresolved repeats on the graph. Thus, a contig fully contains the corresponding graph edge (with the same id), but might be longer than this edge. This is somewhat similar to unitig-contig relation in OLC assemblers. In a rare case when a repetitive graph edge is not covered by the set of "extended" contigs, it will also be output in the assembly file.
122 146
123 ]]></help> 147 Sometimes it is possible to further order contigs into scaffolds based on the repeat graph structure. These ordered contigs will be output as a part of scaffold in the assembly file (with a scaffold\_ prefix). Since it is hard to give a reliable estimate of the gap size, those gaps are represented with the default 100 Ns. The assembly_info.txt file (below) contains additional information about how scaffolds were formed.
148
149 Extra information about contigs/scaffolds is output into the assembly_info.txt file. It is a tab-delimited table with the columns as follows:
150
151 - Contig/scaffold id
152
153 - Length
154
155 - Coverage
156
157 - Is circular (representing circular sequence, such as bacterial chromosome or plasmid)
158
159 - Is repetitive (represents repeated, rather than unique sequence)
160
161 - Multiplicity (inferred multiplicity based on coverage)
162
163 - Graph path (repeat graph path corresponding to this contig/scaffold). Scaffold gaps are marked with ?? symbols, and * symbol denotes a terminal graph node.
164
165 The scaffolds.fasta file is a symlink to assembly.fasta, which is retained for backward compatibility.
166 ]]>
167 </help>
124 <expand macro="citations" /> 168 <expand macro="citations" />
125 </tool> 169 </tool>