Mercurial > repos > wolma > mimodd_tool_wrappers
changeset 3:d6ec32ce882b draft default tip
Uploaded
| author | wolma |
|---|---|
| date | Tue, 28 Mar 2017 04:34:04 -0400 |
| parents | 7f7028112439 |
| children | |
| files | annotate_variants.xml bamsort.xml cloudmap.xml convert.xml covstats.xml deletion_predictor.xml fileinfo.xml mimodd_bitbucket_wrappers/annotate_variants.xml mimodd_bitbucket_wrappers/bamsort.xml mimodd_bitbucket_wrappers/cloudmap.xml mimodd_bitbucket_wrappers/convert.xml mimodd_bitbucket_wrappers/covstats.xml mimodd_bitbucket_wrappers/deletion_predictor.xml mimodd_bitbucket_wrappers/fileinfo.xml mimodd_bitbucket_wrappers/reheader.xml mimodd_bitbucket_wrappers/sam_header.xml mimodd_bitbucket_wrappers/snap_caller.xml mimodd_bitbucket_wrappers/snp_caller_caller.xml mimodd_bitbucket_wrappers/snpeff_genomes.xml mimodd_bitbucket_wrappers/toolshed_macros.xml mimodd_bitbucket_wrappers/varextract.xml mimodd_bitbucket_wrappers/vcf_filter.xml reheader.xml sam_header.xml snap_caller.xml snp_caller_caller.xml snpeff_genomes.xml toolshed_macros.xml varextract.xml vcf_filter.xml |
| diffstat | 30 files changed, 1767 insertions(+), 1767 deletions(-) [+] |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/annotate_variants.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,169 @@ +<tool id="annotate_variants" name="Variant Annotation" version="0.1.7.3"> + <description>Predict the effects of SNPs and indels on known genes in the reference genome using SnpEff</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd annotate + + "$inputfile" + + #if $str($annotool.name)=='snpeff': + --genome "${annotool.genomeVersion}" + #if $annotool.ori_output: + --snpeff-out "$snpeff_file" + #end if + #if $annotool.stats: + --stats "$summary_file" + #end if + ${annotool.snpeff_settings.chr} ${annotool.snpeff_settings.no_us} ${annotool.snpeff_settings.no_ds} ${annotool.snpeff_settings.no_intron} ${annotool.snpeff_settings.no_intergenic} ${annotool.snpeff_settings.no_utr} + #if $annotool.snpeff_settings.min_cov: + --minC "${annotool.snpeff_settings.min_cov}" + #end if + #if $annotool.snpeff_settings.min_qual: + --minQ "${annotool.snpeff_settings.min_qual}" + #end if + #if $annotool.snpeff_settings.ud: + --ud "${annotool.snpeff_settings.ud}" + #end if + #end if + + --ofile "$outputfile" + #if $str($formatting.oformat) == "text": + --oformat text + #end if + #if $str($formatting.oformat) == "html": + #if $formatting.formatter_file: + --link "${formatting.formatter_file}" + #end if + #if $formatting.species + --species "${formatting.species}" + #end if + #end if + + #if $str($grouping): + --grouping $grouping + #end if + --verbose + </command> + + <inputs> + <param format="vcf" label="vcf inputfile to be annotated" name="inputfile" type="data" /> + <param label="Group variants by" name="grouping" type="select"> + <option value="">order in the input file</option> + <option value="by_sample">sample</option> + <option value="by_genes">most affected genes</option> + </param> + <conditional name="formatting"> + <param 
label="Format of the annotation output file" name="oformat" type="select"> + <option value="html">HTML</option> + <option value="text">Tab-separated plain text</option> + </param> + <when value="html"> + <param format="txt" label="Optional file with hyperlink formatting instructions" name="formatter_file" optional="true" type="data" /> + <param help="Overwrite the species guess from the SnpEff genome, often not necessary" label="Species" name="species" type="text" /> + </when> + </conditional> + <conditional name="annotool"> + <param help="Select SnpEff here, if you want to have the vcf input annotated with genomic feature information. Select None if you do not want additional annotation, if you do not have SnpEff installed, or if you have no appropriate SnpEff annotation file for the input." label="Use this tool to annotate the input file" name="name" type="select"> + <option value="snpeff">SnpEff</option> + <option value="None">None</option> + </param> + <when value="snpeff"> + <param format="tabular" label="genome list" name="genome_list" type="data" /> + <param label="Genome" name="genomeVersion" type="select"> + <options from_dataset="genome_list"> + <column index="0" name="name" /> + <column index="1" name="value" /> + </options> + </param> + <param checked="true" label="Keep the original SnpEff output" name="ori_output" type="boolean" /> + <param checked="true" label="Produce a summary file of results" name="stats" type="boolean" /> + + <conditional name="snpeff_settings"> + <param help="This section lets you specify the detailed parameter settings for the SnpEff tool." 
label="SnpEff-specific parameter settings" name="detail_level" type="select"> + <option value="default">default settings</option> + <option value="change">change settings</option> + </param> + <when value="default"> + ## default settings for SnpEff + <param name="chr" type="hidden" value="" /> + <param name="min_cov" type="hidden" value="" /> + <param name="min_qual" type="hidden" value="" /> + <param name="no_ds" type="hidden" value="" /> + <param name="no_us" type="hidden" value="" /> + <param name="no_intron" type="hidden" value="" /> + <param name="no_intergenic" type="hidden" value="" /> + <param name="no_utr" type="hidden" value="" /> + <param name="ud" type="hidden" value="" /> + </when> + <when value="change"> + <param checked="false" falsevalue="" label="prepend 'chr' to chromosome names, e.g., 'chr7' instead of '7'" name="chr" truevalue="-chr" type="boolean" /> + <param help="do not include variants with a coverage lower than this value" label="minimum coverage (default = not used)" name="min_cov" optional="true" type="integer" /> + <param help="do not include variants with a quality lower than this value" label="minimum quality (default = not used)" name="min_qual" optional="true" type="integer" /> + <param checked="false" falsevalue="" help="annotation of effects on the downstream region of genes can be suppressed" label="do not show downstream changes" name="no_ds" truevalue="--no-downstream" type="boolean" /> + <param checked="false" falsevalue="" help="annotation of effects on the upstream region of genes can be suppressed" label="do not show upstream changes" name="no_us" truevalue="--no-upstream" type="boolean" /> + <param checked="false" falsevalue="" help="annotation of effects on introns of genes can be suppressed" label="do not show intron changes" name="no_intron" truevalue="--no-intron" type="boolean" /> + <param checked="false" falsevalue="" help="annotation of effects on intergenic regions can be suppressed" label="do not show intergenic 
changes" name="no_intergenic" truevalue="--no-intergenic" type="boolean" /> + <param checked="false" falsevalue="" help="annotation of effects on the untranslated regions of genes can be suppressed" label="do not show UTR changes" name="no_utr" truevalue="--no-utr" type="boolean" /> + <param help="specify the upstream/downstream interval length, i.e., variants more than INTERVAL nts from the next annotated gene are considered to be intergenic" label="upstream downstream interval length (default = 5000 bases)" name="ud" optional="true" type="integer" /> + </when> + </conditional> + </when> + </conditional> + </inputs> + + <outputs> + <data format="html" name="outputfile"> + <change_format> + <when format="tabular" input="formatting.oformat" value="text" /> + </change_format> + </data> + <data format="vcf" name="snpeff_file"> + <filter>(annotool['name']=="snpeff" and annotool['ori_output'])</filter> + </data> + <data format="html" name="summary_file"> + <filter>(annotool['name']=="snpeff" and annotool['stats'])</filter> + </data> + </outputs> + + <help> +.. class:: infomark + + **What it does** + +The tool turns a variant list in VCF format into a more readable summary table listing variant sites and effects. + +If installed, the variant annotation tool SnpEff can be used transparently to determine the genomic features, e.g., genes or transcripts, affected by the variants. + +Use of this feature requires that you have an appropriate SnpEff genome file installed on the host machine. You can use the *List installed SnpEff genomes* tool to generate a list of all available SnpEff genomes. +This list can then be used (by selecting the dataset as the *genome list*) to populate the *genome* dropdown menu, from which you can select the SnpEff genome file to be used for the annotation. + +As output file formats HTML or plain text are supported. 
+In HTML mode, variant positions and/or affected genomic features can be turned into hyperlinks to corresponding views in web-based genome browsers and databases. + +The behavior of this feature depends on: + +1) Recognition of the species that is analyzed + + You can declare the species you are working with using the *Species* text field. + If you are not declaring the species explicitly, but are choosing SnpEff for effect annotation, the tool will usually be able to auto-detect the species from the SnpEff genome you are using. + If no species gets assigned in either way, no hyperlinks will be generated and the html output will look essentially like plain text. + +2) Available hyperlink formatting rules for this species + + When the species has been recognized, the tool checks if you have selected an *optional file with hyperlink formatting instructions*. + If you did and that file contains an entry matching the recognized species, that entry will be used as a template to construct the hyperlinks. + If no matching entry is found in the file, an error will be raised. + + If you did not supply a hyperlink formatting instruction file, the tool will consult an internal lookup table to see if it finds default rules for the construction of the hyperlinks for the species. + If not, no hyperlinks will be generated and the html output will look essentially like plain text. + + **TIP:** + MiModD's internal hyperlink formatting lookup tables are maintained and growing with every new version, but since weblinks are changing frequently as well, it is possible that you will encounter broken hyperlinks for your species of interest. In such a case, you can resort to two things: `tell us about the problem`_ to make sure it gets fixed in the next release and, in the meantime, use a custom file with hyperlink formatting instructions to overwrite the default entry for your species. + +.. _tell us about the problem: mailto:mimodd@googlegroups.com + </help> +</tool>
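The hyperlink rules described in the help text above can be read as a small decision procedure. The following is a minimal Python sketch of that procedure, not MiModD's actual code: the function name `choose_link_template`, the dict-based rule tables and the example.org templates are all hypothetical and only serve to make the order of the checks explicit.

```python
def choose_link_template(species, custom_rules=None, builtin_rules=None):
    """Return a hyperlink template for `species`, or None for plain output.

    Hypothetical model of the rules in the help text:
    - no recognized species           -> no hyperlinks
    - custom file given, entry found  -> use that entry
    - custom file given, no entry     -> error
    - no custom file                  -> fall back to the built-in table
    """
    if not species:
        return None
    if custom_rules is not None:
        if species not in custom_rules:
            raise KeyError(f"no formatting entry for species {species!r}")
        return custom_rules[species]
    return (builtin_rules or {}).get(species)


# Hypothetical rule tables; real formatting files use MiModD's own syntax.
builtin = {"species_a": "http://example.org/builtin/{gene}"}
custom = {"species_a": "http://example.org/custom/{gene}"}

print(choose_link_template("species_a", custom, builtin))  # custom entry wins
print(choose_link_template("species_b", None, builtin))    # None: plain output
```

Under this reading, a custom formatting file always takes precedence over the built-in table, which is why the TIP in the help text recommends it as a workaround for broken default links.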
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/bamsort.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,52 @@ +<tool id="bamsort" name="Sort BAM file" version="0.1.7.3"> + <description>Sort a BAM file by coordinates (or names) of the mapped reads</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd sort "$input.ifile" -o "$output" --iformat $input.iformat --oformat $oformat $by_name + </command> + + <inputs> + <conditional name="input"> + <param label="Input data format" name="iformat" type="select"> + <option value="bam">bam</option> + <option value="sam">sam</option> + </param> + <when value="bam"> + <param format="bam" label="BAM input file to sort" name="ifile" type="data" /> + </when> + <when value="sam"> + <param format="sam" label="SAM input file to sort" name="ifile" type="data" /> + </when> + </conditional> + <param label="Output format for the sorted data" name="oformat" type="select"> + <option value="bam">bam</option> + <option value="sam">sam</option> + </param> + <param checked="false" falsevalue="" help="A less common option, but necessary, e.g., if you want to re-align sorted output from a previous run of the SNAP Read Alignment tool." label="Sort by read names instead of coordinates" name="by_name" truevalue="-n" type="boolean" /> + </inputs> + + <outputs> + <data format="bam" label="Sorted output from MiModD ${tool.name} on ${on_string}" name="output"> + <change_format> + <when format="sam" input="oformat" value="sam" /> + </change_format> + </data> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool sorts a BAM file of aligned reads, typically by the reference genome coordinates that the reads have been mapped to. 
+ +Coordinate-sorted input files are expected by most downstream MiModD tools, but note that the *SNAP Read Alignment* tool produces coordinate-sorted output by default, so it is only necessary to sort files that come from other sources or from *SNAP Read Alignment* jobs with a custom sort order. + +The option *Sort by read names instead of coordinates* is useful if you want to re-align coordinate-sorted paired-end data. In *paired-end mode*, the *SNAP Read Alignment* tool expects the reads in the input file to be arranged in read pairs, i.e., the forward read information of a pair must be followed immediately by its reverse mate information, which is typically not the case in coordinate-sorted files. Re-sorting such files by read names fixes this problem. + +</help> +</tool>
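For illustration, the `<command>` template of bamsort.xml renders to a single `mimodd sort` call. The Python sketch below mirrors that template; the helper name `bamsort_command` and the file names are hypothetical, while the flags (`-o`, `--iformat`, `--oformat`, `-n`) are the ones used in the wrapper above.

```python
import shlex


def bamsort_command(ifile, output, iformat="bam", oformat="bam", by_name=False):
    """Build the command line the bamsort wrapper template would render."""
    args = [
        "mimodd", "sort", ifile,
        "-o", output,
        "--iformat", iformat,
        "--oformat", oformat,
    ]
    if by_name:
        # truevalue of the "by_name" boolean parameter in the wrapper
        args.append("-n")
    # shlex.join quotes arguments where needed (Python 3.8+)
    return shlex.join(args)


# Hypothetical file names: re-sort coordinate-sorted reads by read name.
print(bamsort_command("reads.bam", "sorted.sam", oformat="sam", by_name=True))
```

With `by_name=False` the trailing `-n` is simply omitted, which corresponds to the empty `falsevalue` of the boolean parameter.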
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/cloudmap.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,340 @@ +<tool id="nacreousmap" name="NacreousMap" version="0.1.7.3"> + <description>Map causative mutations by multi-variant linkage analysis.</description> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd map ${opt.mode} "${opt.source.ifile}" + #if $str($opt.source.sample): + -m "${opt.source.sample}" + #end if + #if $str($opt.source.related_parent_sample): + -r "${opt.source.related_parent_sample}" + #end if + #if $str($opt.source.unrelated_parent_sample): + -u "${opt.source.unrelated_parent_sample}" + #end if + $opt.source.infer_missing + -o "$ofile" + #if $str($opt.source.seqdict_required.required) == "yes": + -s "${opt.source.seqdict_required.seqdict}" + #end if + $opt.source.norm + #if $len($opt.source.bin_sizes): + --bin-sizes + #for $size in $opt.source.bin_sizes: + "${size.bin_size}" + #end for + #end if + #if $str($opt.source.tabfile): + $str($opt.source.tabfile) $tfile + #end if + #if $str($opt.source.plotopts.plots): + $str($opt.source.plotopts.plots) "$pfile" + $str($opt.source.plotopts.xlim) + #if $str($opt.source.plotopts.hylim): + --ylim-hist $str($opt.source.plotopts.hylim) + #end if + #if $str($opt.source.plotopts.hcols) and $len($opt.source.plotopts.hcols): + --hist-colors + #for $color in $opt.source.plotopts.hcols: + "${color.hcolor}" + #end for + #end if + #if $str($opt.source.plotopts.sylim): + --ylim-scatter $str($opt.source.plotopts.sylim) + #end if + #if $str($opt.source.plotopts.pcol): + --points-color "$str($opt.source.plotopts.pcol)" + #end if + #if $str($opt.source.plotopts.lcol): + --loess-color "$str($opt.source.plotopts.lcol)" + #end if + #if $str($opt.source.plotopts.span): + --loess-span "$str($opt.source.plotopts.span)" + #end if + #end if + + </command> + + <macros> + <import>toolshed_macros.xml</import> + <macro name="svd_unconditional"> + <expand 
macro="hidden_vaf_algo_params" /> + <expand macro="seqdict_param" /> + <expand macro="bins" /> + <param checked="true" falsevalue="--no-normalize" help="without normalization the tool will just report the number of nucleotides per bin; with normalization the results for different bin-widths will be comparable." label="normalize variant counts to bin-width" name="norm" truevalue="" type="boolean" /> + <conditional name="plotopts"> + <param label="graphical output settings" name="plots" type="select"> + <option value="">Do not generate graphs.</option> + <option value="-p">Give me graphics.</option> + </param> + <when value="-p"> + <expand macro="scatter_default" /> + <param label="upper limit for the histogram y-axis (leave blank for automatic scaling)" name="hylim" type="text" /> + <param label="x-axis scaling" name="xlim" type="select"> + <option value="">preserve relative contig sizes</option> + <option value="--fit-width">scale each contig to fit the plot width</option> + </param> + <expand macro="hist_colors" /> + </when> + </conditional> + </macro> + <macro name="vaf_unconditional"> + <expand macro="bins" /> + <param checked="true" falsevalue="--no-normalize" label="normalize variant counts to bin-width" name="norm" truevalue="" type="boolean" /> + <conditional name="plotopts"> + <param label="graphical output settings" name="plots" type="select"> + <option value="">Do not generate graphs.</option> + <option value="--no-scatter -p">Generate only histograms</option> + <option value="--no-hist -p">Generate only scatter plots</option> + <option value="-p">Give me everything (scatter plots and histograms)</option> + </param> + <when value="--no-scatter -p"> + <expand macro="scatter_default" /> + <param label="upper limit for the histogram y-axis (leave blank for automatic scaling)" name="hylim" type="text" /> + <param label="x-axis scaling" name="xlim" type="select"> + <option value="">preserve relative contig sizes</option> + <option value="--fit-width">scale 
each contig to fit the plot width</option> + </param> + <expand macro="hist_colors" /> + </when> + <when value="--no-hist -p"> + <expand macro="hist_default" /> + <param label="upper limit for the scatter plot y-axis (default: 1)" name="sylim" type="text" /> + <param label="x-axis scaling" name="xlim" type="select"> + <option value="">preserve relative contig sizes</option> + <option value="--fit-width">scale each contig to fit the plot width</option> + </param> + <param help="smaller values give a more responsive curve that often picks up local evidence for tight linkage better, but too small values lead to plotting failures (in that case just rerun the tool with a larger value)." label="span value to be used in calculating the Loess regression line through the scatter data (default=0.1, specify 0 to prevent calculation)" name="span" type="text" /> + <expand macro="scatter_colors" /> + </when> + <when value="-p"> + <expand macro="plot_all" /> + </when> + </conditional> + </macro> + <macro name="hidden_vaf_algo_params"> + <param name="sample" type="hidden" value="" /> + <param name="related_parent_sample" type="hidden" value="" /> + <param name="unrelated_parent_sample" type="hidden" value="" /> + <param name="infer_missing" type="hidden" value="" /> + </macro> + <macro name="bins"> + <repeat default="0" help="Values can be entered in bases (e.g., 1000000), kilobases (e.g., 500Kb) or megabases (e.g., 1Mb), but must be integral, i.e. no decimal numbers are allowed." 
min="0" name="bin_sizes" title="bin sizes to analyze variants in (defaults to: 1Mb and 500Kb)"> + <param name="bin_size" type="text" /> + </repeat> + </macro> + <macro name="scatter_default"> + <param name="sylim" type="hidden" value="" /> + <param name="span" type="hidden" value="" /> + <param name="pcol" type="hidden" value="" /> + <param name="lcol" type="hidden" value="" /> + </macro> + <macro name="hist_default"> + <param name="hylim" type="hidden" value="" /> + <param name="hcols" type="hidden" value="" /> + </macro> + <macro name="hist_colors"> + <repeat default="0" help="For each bin size chosen above a histogram will be generated with its color selected from the list provided here (defaults to alternating darkgrey, red)." min="0" name="hcols" title="histogram colors"> + <param name="hcolor" type="color" value="darkgrey"> + <sanitizer><valid><add value="#" /></valid></sanitizer> + </param> + </repeat> + </macro> + <macro name="scatter_colors"> + <param label="color to be used for the scatter plot data points (default: gray27)" name="pcol" type="color" value="#454545"> + <sanitizer><valid><add value="#" /></valid></sanitizer> + </param> + <param label="color to be used for the regression line (default: red)" name="lcol" type="color" value="red"> + <sanitizer><valid><add value="#" /></valid></sanitizer> + </param> + </macro> + <macro name="plot_all"> + <param label="upper limit for the histogram y-axis (leave blank for automatic scaling)" name="hylim" type="text" /> + <param label="upper limit for the scatter plot y-axis (default: 1)" name="sylim" type="text" /> + <param label="x-axis scaling" name="xlim" type="select"> + <option value="">preserve relative contig sizes</option> + <option value="--fit-width">scale each contig to fit the plot width</option> + </param> + <param help="smaller values give a more responsive curve that often picks up local evidence for tight linkage better, but too small values lead to plotting failures (in that case just rerun the 
tool with a larger value)." label="span value to be used in calculating the Loess regression line through the scatter data (default=0.1, specify 0 to prevent calculation)" name="span" type="text" /> + <expand macro="hist_colors" /> + <expand macro="scatter_colors" /> + </macro> + <macro name="seqdict_param"> + <conditional name="seqdict_required"> + <param help="A sequence dictionary file is required ONLY if the input file does not provide information about the sizes of the chromosomes defined in it. It is NEVER needed for MiModD-generated input files." label="does this input file require a CloudMap-style sequence dictionary?" name="required" type="select"> + <option value="no">No</option> + <option value="yes">Yes</option> + </param> + <when value="yes"> + <param format="tabular" label="CloudMap-style sequence dictionary file" name="seqdict" type="data" /> + </when> + </conditional> + </macro> + </macros> + + <inputs> + <conditional name="opt"> + <param help="Select Simple Variant Density (SVD) Mapping to map mutations based on linked inheritance in near isogenic populations, Variant Allele Frequency (VAF) Mapping for bulk segregant analysis. Select Reprocess for rapidly replotting the result of a previous VAF analysis." 
label="type of mapping analysis to perform" name="mode" type="select"> + <option value="SVD">Simple Variant Density Mapping</option> + <option value="VAF">Variant Allele Frequency Mapping</option> + </param> + <when value="SVD"> + <conditional name="source"> + <param label="data source to use" name="inputtype" type="select"> + <option value="vcf">VCF file of variants (for de-novo mapping)</option> + <option value="rep">per-variant report file (for remapping a previous analysis)</option> + </param> + <when value="vcf"> + <param format="vcf" label="input file with variants to analyze" name="ifile" type="data" /> + <expand macro="svd_unconditional" /> + <param help="You can either choose to produce a tabular per-variant report, which is useful for fast replotting with different plot settings or a vcf-like CloudMap-compatibility file that can be used as input for the CloudMap EMS Variant Density Mapping tool as an alternative plotting tool." label="additional per-variant output file" name="tabfile" type="select"> + <option value="">Do not generate per-variant output</option> + <option value="-t">Tabular per-variant report</option> + <option value="--cloudmap -t">CloudMap compatibility file</option> + </param> + </when> + <when value="rep"> + <param format="tabular" label="input file with variants to analyze" name="ifile" type="data" /> + <param name="tabfile" type="hidden" value="" /> + <expand macro="svd_unconditional" /> + </when> + </conditional> + </when> + <when value="VAF"> + <conditional name="source"> + <param label="data source to use" name="inputtype" type="select"> + <option value="vcf">VCF file of variants (for de-novo mapping)</option> + <option value="rep">per-variant report file (for remapping a previous analysis)</option> + </param> + <when value="vcf"> + <param format="vcf" label="input file with variants to analyze" name="ifile" type="data" /> + <expand macro="seqdict_param" /> + <param help="the sample to perform mutation mapping for" label="mapping 
sample name" name="sample" type="text" /> + <param help="the sample that provides variants present in your original mutant strain or in an ancestor (like the pre-mutagenesis strain); leave blank if not available" label="name of the related parent sample" name="related_parent_sample" type="text" /> + <param help="the sample that provides variants present in the unrelated mapping strain (or in an ancestor of it) used in the mapping cross; leave blank if not available" label="name of the unrelated parent sample" name="unrelated_parent_sample" type="text" /> + <param checked="false" falsevalue="" help="if variant data for either the related or the unrelated parent strain is not available, the tool can try to infer the alleles present in that parent from the allele spectrum found in the mapping sample. Use with caution on carefully filtered variant lists only!" label="Infer alleles for missing parent" name="infer_missing" truevalue="--infer-missing" type="boolean" /> + <expand macro="vaf_unconditional" /> + <param help="You can either choose to produce a tabular per-variant report, which is useful for fast replotting with different plot settings or a vcf-like CloudMap-compatibility file that can be used as input for the CloudMap Hawaiian Variant Mapping tool as an alternative plotting tool." 
label="additional per-variant output file" name="tabfile" type="select"> + <option value="">Do not generate per-variant output</option> + <option value="-t">Tabular per-variant report</option> + <option value="--cloudmap -t">CloudMap compatibility file</option> + </param> + </when> + <when value="rep"> + <param format="tabular" label="input file with variants to analyze" name="ifile" type="data" /> + <expand macro="seqdict_param" /> + <param name="tabfile" type="hidden" value="" /> + <expand macro="hidden_vaf_algo_params" /> + <expand macro="vaf_unconditional" /> + </when> + </conditional> + </when> + </conditional> + </inputs> + + <outputs> + <data format="tabular" label="MiModD ${opt.mode} Mapping - binned variant counts for ${on_string}" name="ofile" /> + <data format="tabular" label="MiModD ${opt.mode} Mapping - per-variant report for ${on_string}" name="tfile"> + <filter>(opt['source']['tabfile'])</filter> + </data> + <data format="pdf" label="MiModD ${opt.mode} Mapping - linkage plots for ${on_string}" name="pfile"> + <filter>(opt['source']['plotopts']['plots'])</filter> + </data> + </outputs> + + <help> +.. class:: infomark + + **What it does** + +This tool is a complete rewrite of and improves the EMS Variant Density and Hawaiian Variant Mapping tools of `CloudMap`_. It is the most downstream tool in `mapping-by-sequencing analysis workflows in MiModD`_. + +It can be used to analyze and visualize the inheritance pattern of variants detected and selected by other MiModD tools or as an alternative (and more versatile) plotting engine for data generated with `CloudMap`_. + +------------- + +**Usage Modes:** + +This tool can be run in one of two different modes depending on the type of mapping analysis that should be performed: + +1) *Simple Variant Density (SVD) Mapping* mode analyzes the density of variants along the reference genome by dividing each chromosome into regions of user-defined size (bins) and counting the variants found in each bin. 
+ + All variants listed in the input file are analyzed in this mode, which means that as input you will typically want to use filtered lists of variants (as produced by the VCF Filter tool). + + The aim of SVD analysis is to identify clusters of variants in an outcrossed strain carrying a selectable unknown mutation, which is interpreted as linkage between the corresponding genomic region and the unknown mutation. + + This mode corresponds roughly to EMS Variant Density Mapping in CloudMap. + +2) *Variant Allele Frequency (VAF) Mapping* mode analyzes the inheritance pattern in cross-progeny at sites at which the parents are homozygous for different alleles. + + The aim of VAF analysis is to identify clusters of variants with (near) homozygous inheritance in an F2 (or later generation) population obtained from a cross between a strain carrying a selectable unknown mutation and an unrelated mapping strain. Such a cluster is interpreted as linkage between the corresponding genomic region and the unknown mutation selected for in the F2 generation. + + This mode corresponds roughly to Hawaiian Variant Mapping in CloudMap, but can simultaneously take into account non-reference alleles found in either parent strain (CloudMap users may think of this as a combined Hawaiian Variant and Variant Discovery Mapping analysis). + +------------- + +**Input:** + +Valid inputs for this tool are VCF files (any VCF file in SVD mode, a MiModD-generated multi-sample VCF file in VAF mode) or a CloudMap tabular report file as generated by the Hawaiian Variant Mapping tool. Alternatively, the tool can generate (in both modes) its own tabular report file, which can be used as input instead of the original VCF file when rerunning the tool with different plotting parameters to reduce analysis time. + +..
class:: infomark + + CloudMap-generated tabular input files require, as additional input, a CloudMap-style sequence dictionary (even if the original CloudMap analysis was possible without one) as described in the original CloudMap paper. This file has a simple two-column tab-delimited format, in which each line lists the chromosome name (as it appears in the input VCF file) and the up-rounded length of the chromosome in megabases. + +------------- + +**Output:** + +The tool produces up to three output files: + +1) a default tabular file of binned variant counts that can be used to plot the data with external software such as Excel, + + +2) an optional pdf containing linkage plots, which should look just like the plots produced by CloudMap, but are optimized for file size and display speed and offer more user-configurable parameters and + + +3) an optional tabular per-variant report file, which can be configured to be either a valid input file for the corresponding original CloudMap tool (for users who really, really want to continue using CloudMap for plotting) or to be reusable in fast reruns of the tool (which can be useful to experiment with different plotting parameters). + +------------- + +**Settings:** + +1) Analysis settings + + *bin size to analyze variants in* - determines the width of the regions along each chromosome, in which variants are counted and analyzed together. + + Several bin sizes can be specified and for each size you will get a corresponding report section in the binned variant counts file and a histogram plot in the linkage plots file. 
+ + *normalize variant counts to bin-width* - if selected (the default), the variant counts for different bin sizes are not absolute, but normalized to the bin width + + *sample names (in VAF mode only)* - to analyze inheritance patterns, VAF mode needs information about the relationship between the samples defined in the input VCF file: + + The *mapping sample name* should be set to the name of the sample for which the inheritance pattern is to be analyzed (the pooled progeny population). + + The *name of the related parent sample* should be that of the parent sample that carried and brought in the unknown mutation to be mapped (or, alternatively, that of a closely related ancestor). + + Finally, the *name of the unrelated parent sample* should be that of the other parent strain used in the cross. + + At least one of the parent samples MUST be specified, but if the input file contains variant information for both parents, they can be analyzed together for higher mapping accuracy. If you are reanalyzing a tabular report file from a previous tool run or from CloudMap, the association between variants and samples is already incorporated into the input file and cannot be specified again. + +2) Graphical output settings + + .. class:: warningmark + + To generate plots, the system running MiModD needs to have the statistical programming environment R and its Python interface rpy2 installed. + + + *y-axis scaling* - set upper limits for the histogram and scatter plot y-axes if you want to override the defaults + + *x-axis scaling* - choose *preserve relative contig sizes* if you want the largest chromosome to fit the page width and smaller chromosomes to appear according to their relative size, or choose *scale each contig to fit the plot width* if all chromosomes should use the full available width + + *span value to be used in calculating the Loess regression line* - this value determines the degree of smoothing of the regression line through the scatterplot data. 
Information on loess regression and the loess span parameter can be found at http://en.wikipedia.org/wiki/Local_regression. The default is 0.1 as in CloudMap. + + *colors used for plotting* - can be selected freely from the offered palette. For histogram colors, the list of selected colors will be used to provide the colors for the different histograms plotted. If fewer colors than histograms (one histogram is plotted per selected bin size) are specified, colors from the list will be recycled. + + +.. _CloudMap: https://usegalaxy.org/u/gm2123/p/cloudmap +.. _mapping-by-sequencing analysis workflows in MiModD: http://mimodd.readthedocs.org/en/latest/cloudmap.html + </help> +</tool>
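Note on the CloudMap-style sequence dictionary described in the help text of cloudmap.xml: the file is two tab-separated columns, chromosome name and chromosome length in megabases rounded up. A minimal sketch of how such a file could be generated follows. The function name and the example lengths are hypothetical and are not part of MiModD or CloudMap.

```python
def sequence_dictionary_lines(chrom_lengths):
    """Return CloudMap-style sequence dictionary lines.

    chrom_lengths maps chromosome names (spelled as in the input VCF)
    to lengths in bases. This helper is a hypothetical illustration.
    """
    lines = []
    for name, length in chrom_lengths.items():
        # Round the length up to whole megabases using integer arithmetic.
        megabases = (length + 999_999) // 1_000_000
        lines.append(f"{name}\t{megabases}")
    return lines


# Made-up lengths, chosen only to show the rounding behavior.
example = {"chrA": 15_000_001, "chrB": 2_000_000, "chrC": 13_794}
for line in sequence_dictionary_lines(example):
    print(line)
```

Integer arithmetic is used instead of floating-point division so that a length that is an exact multiple of one megabase is not rounded up by accident.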
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/convert.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,170 @@ +<tool id="convert" name="Convert" version="0.1.7.3"> + <description>between different sequence data formats</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + #if $str($mode.split_on_rgs) or $str($mode.oformat)=="fastq" or $str($mode.oformat)=="gz": + echo "Your input data is now getting processed by MiModD. The output will be split into several files based on the read groups found in the input.\nThis history item will remain in the busy state until the job is finished.\nAfter the job is showing as finished, Galaxy will start adding the results files to your history one by one.\n\nThis may take a while to complete! \n\nYou should refresh your history to see if new files have arrived.\n\nThis message is for your information only and can be deleted from the history once the job has finished." > $output_split_on_read_groups; + + mkdir converted_data; + #end if + + mimodd convert + + #for $i in $mode.input_list + "${i.file1}" + #if $str($mode.iformat) in ("fastq_pe", "gz_pe"): + "${i.file2}" + #end if + #end for + #if $str($mode.header) != "None": + --header "$(mode.header)" + #end if + + #if $str($outputname) == "None": + --ofile converted_data/read_group + #else + --ofile "$outputname" + #end if + --iformat $(mode.iformat) + --oformat $(mode.oformat) + ${mode.split_on_rgs} + </command> + + <inputs> + <conditional name="mode"> + <param help="Your choice will update the interface to display further choices appropriate for your type of input data." 
label="input file format" name="iformat" type="select"> + <option value="fastq">fastq: single-end (one file)</option> + <option value="fastq_pe">fastq: paired-end (two files)</option> + <option value="gz">gzip compressed fastq: single-end (one file)</option> + <option value="gz_pe">gzip compressed fastq: paired-end (two files)</option> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <when value="fastq"> + <param label="output file format" name="oformat" type="select"> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <repeat default="1" min="1" name="input_list" title="fastq input dataset"> + <param format="fastq" label="inputfile" name="file1" type="data" /> + </repeat> + <param format="sam" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file." label="Use Header File" name="header" type="data" /> + <param name="split_on_rgs" type="hidden" value="" /> + </when> + <when value="fastq_pe"> + <param label="output file format" name="oformat" type="select"> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <repeat default="1" min="1" name="input_list" title="fastq input datasets"> + <param format="fastq" label="inputfile with the first set of reads of paired-end data" name="file1" type="data" /> + <param format="fastq" label="inputfile with the second set of reads of paired-end data" name="file2" type="data" /> + </repeat> + <param format="sam" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file." 
label="Use Header File" name="header" type="data" /> + <param name="split_on_rgs" type="hidden" value="" /> + </when> + <when value="gz"> + <param label="output file format" name="oformat" type="select"> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <repeat default="1" min="1" name="input_list" title="fastq.gz input dataset"> + <param format="data" label="inputfile" name="file1" type="data" /> + </repeat> + <param format="sam" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file." label="Use Header File" name="header" type="data" /> + <param name="split_on_rgs" type="hidden" value="" /> + </when> + <when value="gz_pe"> + <param label="output file format" name="oformat" type="select"> + <option value="sam">sam</option> + <option value="bam">bam</option> + </param> + <repeat default="1" min="1" name="input_list" title="fastq.gz input datasets"> + <param format="data" label="inputfile with the first set of reads of paired-end data" name="file1" type="data" /> + <param format="data" label="inputfile with the second set of reads of paired-end data" name="file2" type="data" /> + </repeat> + <param format="sam" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file." 
label="Use Header File" name="header" type="data" /> + <param name="split_on_rgs" type="hidden" value="" /> + </when> + <when value="sam"> + <param label="output file format" name="oformat" type="select"> + <option value="bam">bam</option> + <option value="sam">sam</option> + <option value="fastq">fastq</option> + <option value="gz">gzipped fastq</option> + </param> + <repeat default="1" max="1" min="1" name="input_list" title="sam input dataset"> + <param format="sam" label="inputfile" name="file1" type="data" /> + </repeat> + <param name="header" type="hidden" value="None" /> + <param checked="false" falsevalue="" help="If the input file contains reads from different read groups, write them to separate output files; implied automatically for conversions to fastq and gzipped fastq format" label="Split output based on read group IDs" name="split_on_rgs" truevalue="--split-on-rgs" type="boolean" /> + </when> + <when value="bam"> + <param label="output file format" name="oformat" type="select"> + <option value="sam">sam</option> + <option value="bam">bam</option> + <option value="fastq">fastq</option> + <option value="gz">gzipped fastq</option> + </param> + <repeat default="1" max="1" min="1" name="input_list" title="bam input dataset"> + <param format="bam" label="inputfile" name="file1" type="data" /> + </repeat> + <param name="header" type="hidden" value="None" /> + <param checked="false" falsevalue="" help="If the input file contains reads from different read groups, write them to separate output files; implied automatically for conversions to fastq and gzipped fastq format" label="Split output based on read group IDs" name="split_on_rgs" truevalue="--split-on-rgs" type="boolean" /> + </when> + </conditional> + </inputs> + + <outputs> + <data format="bam" label="Converted reads from MiModd ${tool.name} on ${on_string}" name="outputname"> + <change_format> + <when format="sam" input="mode.oformat" value="sam" /> + </change_format> + <filter> + (not 
mode['split_on_rgs'] and mode['oformat'] not in ("fastq", "gz")) + </filter> + </data> + + <data format="txt" label="MiModD ${tool.name} run on ${on_string}" name="output_split_on_read_groups"> + <filter> + (mode['split_on_rgs'] or mode['oformat'] in ("fastq", "gz")) + </filter> + <discover_datasets directory="converted_data" pattern="__designation_and_ext__" visible="true" /> + </data> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool converts between different file formats used for storing next-generation sequencing data. + +It accepts uncompressed or gzipped fastq, SAM, or BAM input. Fastq input can be converted to SAM or BAM format; SAM/BAM input can be converted to SAM, BAM, fastq, or gzipped fastq format. + +**Notes:** + +1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to convert gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. + +2) The tool can convert fastq files representing data from paired-end sequencing runs to appropriate SAM/BAM format provided that the mate information is split over two fastq files in corresponding order. + + **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. + +3) Merging partial fastq (or gzipped fastq) files into a single SAM/BAM file is supported both for single-end and paired-end data. Simply add additional input datasets and select the appropriate files (pairs of files in case of paired-end data). + + Concatenation of SAM/BAM files during conversion is currently not supported. + +4) For input in fastq format a SAM header file providing run metadata **has to be specified**.
The information in this file will be used as the header data of the new SAM/BAM file. You can use the *NGS Run Annotation* tool to generate a new header file for your data. + + For input in SAM/BAM format the tool will simply copy the existing header data to the new file. To modify the header of an existing SAM/BAM file, use the *Reheader BAM file* tool instead. + +.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 +.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy +.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest + +</help> +</tool>
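Note on the command template of convert.xml: the Cheetah block emits the input files first (one per dataset, or two per dataset for paired-end formats), then an optional header, then the output and format options. The sketch below mirrors that ordering in plain Python. The helper `build_convert_args` is hypothetical and exists only for illustration; the flags are the ones that appear in the template.

```python
def build_convert_args(inputs, iformat, oformat, ofile,
                       header=None, split_on_rgs=False):
    """Assemble an argument list in the order used by the template.

    inputs is a list of (file1, file2) tuples; file2 is ignored
    unless the input format is one of the paired-end formats.
    Hypothetical helper, not part of MiModD.
    """
    args = ["mimodd", "convert"]
    paired = iformat in ("fastq_pe", "gz_pe")
    for file1, file2 in inputs:
        args.append(file1)
        if paired:
            args.append(file2)
    if header is not None:
        # Only fastq-type input carries a separate SAM header file.
        args += ["--header", header]
    args += ["--ofile", ofile, "--iformat", iformat, "--oformat", oformat]
    if split_on_rgs:
        args.append("--split-on-rgs")
    return args


print(build_convert_args([("r1.fq", "r2.fq")], "fastq_pe", "bam",
                         "out.bam", header="hdr.sam"))
```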
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/covstats.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,31 @@ +<tool id="coverage_stats" name="Coverage Statistics" version="0.1.7.3"> + <description>Calculate coverage statistics for a BCF file as generated by the Variant Calling tool</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd covstats "$ifile" --ofile "$output_vcf" + </command> + + <inputs> + <param format="bcf" help="Use the Variant Calling tool to generate input for this tool." label="BCF input file" name="ifile" type="data" /> + </inputs> + <outputs> + <data format="tabular" label="Coverage Statistics for ${on_string}" name="output_vcf" /> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool takes as input a BCF file produced by the *Variant Calling* tool, and calculates per-chromosome read coverage from it. + +.. class:: warningmark + + The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. + +</help> +</tool>
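Note on the warning in covstats.xml: because positions missing from the BCF input count as zero coverage, a file that lists only variant sites would give a misleadingly low result. The sketch below illustrates the effect with a simple mean. The mean is an assumption made for illustration; the help text does not say which statistics the tool actually reports, and `mean_coverage` is a hypothetical helper.

```python
def mean_coverage(depth_by_position, chrom_length):
    """Mean depth over a chromosome, counting absent positions as zero.

    depth_by_position maps 1-based positions to read depth for the
    sites present in the input. Hypothetical helper for illustration.
    """
    if chrom_length <= 0:
        raise ValueError("chrom_length must be positive")
    # Positions not listed contribute nothing to the sum,
    # which is equivalent to treating them as zero coverage.
    return sum(depth_by_position.values()) / chrom_length


# Two covered sites on a 10-base chromosome; the other 8 count as zero.
print(mean_coverage({1: 10, 2: 20}, 10))
```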
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/deletion_predictor.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,65 @@ +<tool id="deletion_prediction" name="Deletion Prediction for paired-end data" version="0.1.7.3"> + <description>Predicts deletions in one or more aligned read samples based on coverage of the reference genome and on insert sizes</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd delcall + #for $l in $list_input + "${l.bamfile}" + #end for + "$covfile" -o "$outputfile" + --max-cov "$max_cov" --min-size "$min_size" $include_uncovered $group_by_id --verbose + </command> + + <inputs> + <repeat default="1" min="1" name="list_input" title="Aligned reads input source"> + <param format="bam" label="input BAM file" name="bamfile" type="data" /> + </repeat> + <param format="bcf" help="Use the Variant Calling tool to generate this file." label="BCF variant call file to extract coverage from" name="covfile" type="data" /> + <param checked="false" falsevalue="" help="If selected, reads from different read groups will be treated strictly separate. If turned off, read groups with identical sample names are used together for identifying uncovered regions, but are still treated separately for the prediction of deletions." label="group reads based on read group id only" name="group_by_id" truevalue="-i" type="boolean" /> + <param checked="false" falsevalue="" help="If selected, regions that fulfill the coverage criteria below, but are not statistically significant deletions, will be included in the output." 
label="include low-coverage regions" name="include_uncovered" truevalue="-u" type="boolean" /> + <param help="The maximal coverage at a site allowed to consider it as part of a low-coverage region" label="maximal coverage allowed inside a low-coverage region (default: 0)" name="max_cov" type="integer" value="0" /> + <param help="A low-coverage region must consist of at least this number of consecutive bases below the maximal coverage to consider it in further analyses." label="minimal deletion size (default: 100)" name="min_size" type="integer" value="100" /> + </inputs> + + <outputs> + <data format="gff" name="outputfile" /> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool predicts deletions from paired-end data in a two-step process: + +1) It finds regions of low coverage, i.e., candidate regions for deletions, by scanning a BCF file produced by the *Variant Calling* tool. + + The *maximal coverage allowed inside a low-coverage region* and the *minimal deletion size* parameters are used at this step to define what is considered a low-coverage region. + + .. class:: warningmark + + The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. + +2) It assesses every low-coverage region statistically for evidence of it being a real deletion. **This step requires paired-end data** since it relies on shifts in the distribution of read pair insert sizes around real deletions. + +By default, the tool only reports deletions, i.e., the subset of low-coverage regions that pass the statistical test. +If *include low-coverage regions* is selected, regions that failed the test will also be reported. + +With *group reads based on read group id only* selected (the option is off by default in this wrapper), grouping of reads into samples is done strictly based on their read group IDs.
+With the option deselected, grouping is done based on sample names in the first step of the analysis, i.e., the reads of all read groups with a shared sample name are used to identify low-coverage regions. +In the second step, however, reads will be regrouped by their read group IDs again, i.e., the statistical assessment for real deletions is always done on a per read group basis. + +**TIP:** +Leaving *group reads based on read group id only* deselected can be useful, for example, if you have both paired-end and single-end sequencing data for the same sample. + +In this case, the two sets of reads will usually share a common sample name, but differ in their read groups. +With grouping based on sample names, the single-end data can be used together with the paired-end data to identify low-coverage regions, thus increasing overall coverage and reliability of this step. +Still, the assessment of deletions will use only the paired-end data (auto-detecting that the single-end reads do not provide insert size information). + +</help> + +</tool>
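Note on step 1 of deletion_predictor.xml: a low-coverage region is a run of consecutive positions whose coverage does not exceed the *maximal coverage* setting and whose length is at least the *minimal deletion size*. The sketch below shows one way such a scan could work on a per-base coverage list. It is an illustration under those assumptions, not MiModD's implementation, and `low_coverage_regions` is a hypothetical name.

```python
def low_coverage_regions(coverage, max_cov=0, min_size=100):
    """Find runs of positions with depth <= max_cov of length >= min_size.

    coverage is a list of per-base depths for one chromosome
    (index 0 corresponds to position 1). Returns 1-based, inclusive
    (start, end) tuples. Hypothetical helper for illustration.
    """
    regions = []
    run_start = None
    for i, depth in enumerate(coverage):
        if depth <= max_cov:
            if run_start is None:
                run_start = i  # a new low-coverage run begins here
        else:
            if run_start is not None and i - run_start >= min_size:
                regions.append((run_start + 1, i))
            run_start = None
    # Close a run that extends to the end of the chromosome.
    if run_start is not None and len(coverage) - run_start >= min_size:
        regions.append((run_start + 1, len(coverage)))
    return regions


depths = [5, 0, 0, 0, 5, 0, 0]
print(low_coverage_regions(depths, max_cov=0, min_size=3))
```

Only the regions found this way would go on to the statistical test in step 2, which needs paired-end insert size information.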
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/fileinfo.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,38 @@ +<tool id="fileinfo" name="Retrieve File Information" version="0.1.7.3"> + <description>for supported data formats.</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd info "$ifile" -o "$outputfile" --verbose --oformat $oformat + </command> + + <inputs> + <param format="bam,sam,vcf,bcf,fasta" label="input file" name="ifile" type="data" /> + <param label="output format" name="oformat" type="select"> + <option value="txt">text</option> + <option value="html">html</option> + </param> + </inputs> + + <outputs> + <data format="txt" label="Sample Info on ${on_string}" name="outputfile"> + <change_format> + <when format="html" input="oformat" value="html" /> + </change_format> + </data> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool inspects the input file and generates a report summarizing its contents. + +It autodetects and works with most file formats produced by MiModD, i.e., **SAM / BAM, vcf / bcf and fasta**, and produces a standardized report for all of them. + +</help> +</tool>
--- a/mimodd_bitbucket_wrappers/annotate_variants.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,169 +0,0 @@ -<tool id="annotate_variants" name="Variant Annotation" version="0.1.7.3"> - <description>Predict the effects of SNPs and indels on known genes in the reference genome using SnpEff</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd annotate - - "$inputfile" - - #if $str($annotool.name)=='snpeff': - --genome "${annotool.genomeVersion}" - #if $annotool.ori_output: - --snpeff-out "$snpeff_file" - #end if - #if $annotool.stats: - --stats "$summary_file" - #end if - ${annotool.snpeff_settings.chr} ${annotool.snpeff_settings.no_us} ${annotool.snpeff_settings.no_ds} ${annotool.snpeff_settings.no_intron} ${annotool.snpeff_settings.no_intergenic} ${annotool.snpeff_settings.no_utr} - #if $annotool.snpeff_settings.min_cov: - --minC "${annotool.snpeff_settings.min_cov}" - #end if - #if $annotool.snpeff_settings.min_qual: - --minQ "${annotool.snpeff_settings.min_qual}" - #end if - #if $annotool.snpeff_settings.ud: - --ud "${annotool.snpeff_settings.ud}" - #end if - #end if - - --ofile "$outputfile" - #if $str($formatting.oformat) == "text": - --oformat text - #end if - #if $str($formatting.oformat) == "html": - #if $formatting.formatter_file: - --link "${formatting.formatter_file}" - #end if - #if $formatting.species - --species "${formatting.species}" - #end if - #end if - - #if $str($grouping): - --grouping $grouping - #end if - --verbose - </command> - - <inputs> - <param format="vcf" label="vcf inputfile to be annotated" name="inputfile" type="data" /> - <param label="Group variants by" name="grouping" type="select"> - <option value="">order in the input file</option> - <option value="by_sample">sample</option> - <option value="by_genes">most affected genes</option> - </param> - <conditional 
name="formatting"> - <param label="Format of the annotation output file" name="oformat" type="select"> - <option value="html">HTML</option> - <option value="text">Tab-separated plain text</option> - </param> - <when value="html"> - <param format="txt" label="Optional file with hyperlink formatting instructions" name="formatter_file" optional="true" type="data" /> - <param help="Overwrite the species guess from the SnpEff genome, often not necessary" label="Species" name="species" type="text" /> - </when> - </conditional> - <conditional name="annotool"> - <param help="Select SnpEff here, if you want to have the vcf input annotated with genomic feature information. Select None if you do not want additional annotation, if you do not have SnpEff installed, or if you have no appropriate SnpEff annotation file for the input." label="Use this tool to annotate the input file" name="name" type="select"> - <option value="snpeff">SnpEff</option> - <option value="None">None</option> - </param> - <when value="snpeff"> - <param format="tabular" label="genome list" name="genome_list" type="data" /> - <param label="Genome" name="genomeVersion" type="select"> - <options from_dataset="genome_list"> - <column index="0" name="name" /> - <column index="1" name="value" /> - </options> - </param> - <param checked="true" label="Keep the original SnpEff output" name="ori_output" type="boolean" /> - <param checked="true" label="Produce a summary file of results" name="stats" type="boolean" /> - - <conditional name="snpeff_settings"> - <param help="This section lets you specify the detailed parameter settings for the SnpEff tool." 
label="SnpEff-specific parameter settings" name="detail_level" type="select"> - <option value="default">default settings</option> - <option value="change">change settings</option> - </param> - <when value="default"> - ## default settings for SnpEff - <param name="chr" type="hidden" value="" /> - <param name="min_cov" type="hidden" value="" /> - <param name="min_qual" type="hidden" value="" /> - <param name="no_ds" type="hidden" value="" /> - <param name="no_us" type="hidden" value="" /> - <param name="no_intron" type="hidden" value="" /> - <param name="no_intergenic" type="hidden" value="" /> - <param name="no_utr" type="hidden" value="" /> - <param name="ud" type="hidden" value="" /> - </when> - <when value="change"> - <param checked="false" falsevalue="" label="prepend 'chr' to chromosome names, e.g., 'chr7' instead of '7'" name="chr" truevalue="-chr" type="boolean" /> - <param help="do not include variants with a coverage lower than this value" label="minimum coverage (default = not used)" name="min_cov" optional="true" type="integer" /> - <param help="do not include variants with a quality lower than this value" label="minimum quality (default = not used)" name="min_qual" optional="true" type="integer" /> - <param checked="false" falsevalue="" help="annotation of effects on the downstream region of genes can be suppressed" label="do not show downstream changes" name="no_ds" truevalue="--no-downstream" type="boolean" /> - <param checked="false" falsevalue="" help="annotation of effects on the upstream region of genes can be suppressed" label="do not show upstream changes" name="no_us" truevalue="--no-upstream" type="boolean" /> - <param checked="false" falsevalue="" help="annotation of effects on introns of genes can be suppressed" label="do not show intron changes" name="no_intron" truevalue="--no-intron" type="boolean" /> - <param checked="false" falsevalue="" help="annotation of effects on intergenic regions can be suppressed" label="do not show intergenic 
changes" name="no_intergenic" truevalue="--no-intergenic" type="boolean" /> - <param checked="false" falsevalue="" help="annotation of effects on the untranslated regions of genes can be suppressed" label="do not show UTR changes" name="no_utr" truevalue="--no-utr" type="boolean" /> - <param help="specify the upstream/downstream interval length, i.e., variants more than INTERVAL nts from the next annotated gene are considered to be intergenic" label="upstream downstream interval length (default = 5000 bases)" name="ud" optional="true" type="integer" /> - </when> - </conditional> - </when> - </conditional> - </inputs> - - <outputs> - <data format="html" name="outputfile"> - <change_format> - <when format="tabular" input="formatting.oformat" value="text" /> - </change_format> - </data> - <data format="vcf" name="snpeff_file"> - <filter>(annotool['name']=="snpeff" and annotool['ori_output'])</filter> - </data> - <data format="html" name="summary_file"> - <filter>(annotool['name']=="snpeff" and annotool['stats'])</filter> - </data> - </outputs> - - <help> -.. class:: infomark - - **What it does** - -The tool turns a variant list in VCF format into a more readable summary table listing variant sites and effects. - -If installed, the variant annotation tool SnpEff can be used transparently to determine the genomic features, e.g., genes or transcripts, affected by the variants. - -Use of this feature requires that you have an appropriate SnpEff genome file installed on the host machine. You can use the *List installed SnpEff genomes* tool to generate a list of all available SnpEff genomes. -This list can then be used (by selecting the dataset as the *genome list*) to populate the *genome* dropdown menu, from which you can select the SnpEff genome file to be used for the annotation. - -As output file formats HTML or plain text are supported. 
-In HTML mode, variant positions and/or affected genomic features can be turned into hyperlinks to corresponding views in web-based genome browsers and databases. - -The behavior of this feature depends on: - -1) Recognition of the species that is analyzed - - You can declare the species you are working with using the *Species* text field. - If you are not declaring the species explicitly, but are choosing SnpEff for effect annotation, the tool will usually be able to auto-detect the species from the SnpEff genome you are using. - If no species gets assigned in either way, no hyperlinks will be generated and the html output will look essentially like plain text. - -2) Available hyperlink formatting rules for this species - - When the species has been recognized, the tool checks if you have selected an *optional file with hyperlink formatting instructions*. - If you did and that file contains an entry matching the recognized species, that entry will be used as a template to construct the hyperlinks. - If no matching entry is found in the file, an error will be raised. - - If you did not supply a hyperlink formatting instruction file, the tool will consult an internal lookup table to see if it finds default rules for the construction of the hyperlinks for the species. - If not, no hyperlinks will be generated and the html output will look essentially like plain text. - - **TIP:** - MiModD's internal hyperlink formatting lookup tables are maintained and growing with every new version, but since weblinks are changing frequently as well, it is possible that you will encounter broken hyperlinks for your species of interest. In such a case, you can resort to two things: `tell us about the problem`_ to make sure it gets fixed in the next release and, in the meantime, use a custom file with hyperlink formatting instructions to overwrite the default entry for your species. - -.. _tell us about the problem: mailto:mimodd@googlegroups.com - </help> -</tool>
--- a/mimodd_bitbucket_wrappers/bamsort.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,52 +0,0 @@ -<tool id="bamsort" name="Sort BAM file" version="0.1.7.3"> - <description>Sort a BAM file by coordinates (or names) of the mapped reads</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd sort "$input.ifile" -o "$output" --iformat $input.iformat --oformat $oformat $by_name - </command> - - <inputs> - <conditional name="input"> - <param label="Input data format" name="iformat" type="select"> - <option value="bam">bam</option> - <option value="sam">sam</option> - </param> - <when value="bam"> - <param format="bam" label="BAM input file to sort" name="ifile" type="data" /> - </when> - <when value="sam"> - <param format="sam" label="SAM input file to sort" name="ifile" type="data" /> - </when> - </conditional> - <param label="Output format for the sorted data" name="oformat" type="select"> - <option value="bam">bam</option> - <option value="sam">sam</option> - </param> - <param checked="false" falsevalue="" help="A less common option, but necessary, e.g., if you want to re-align sorted output from a previous run of the Snap Align Tool." label="Sort by read names instead of coordinates" name="by_name" truevalue="-n" type="boolean" /> - </inputs> - - <outputs> - <data format="bam" label="Sorted output from MiModd ${tool.name} on ${on_string}" name="output"> - <change_format> - <when format="sam" input="oformat" value="sam" /> - </change_format> - </data> - </outputs> - -<help> -.. class:: infomark - - **What it does** - -The tool sorts a BAM file of aligned reads, typically by the reference genome coordinates that the reads have been mapped to. 
- -Coordinate-sorted input files are expected by most downstream MiModD tools, but note that the *SNAP Read Alignment* produces coordinate-sorted output by default and it is only necessary to sort files that come from other sources or from *SNAP Read Alignment* jobs with a custom sort order. - -The option *Sort by read names instead of coordinates* is useful if you want to re-align coordinate-sorted paired-end data. In *paired-end mode*, the *SNAP Read Alignment* tool expects the reads in the input file to be arranged in read pairs, i.e., the forward read information of a pair must be followed immediately by its reverse mate information, which is typically not the case in coordinate-sorted files. Resorting such files by read names fixes this problem. - -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/cloudmap.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,340 +0,0 @@ -<tool id="nacreousmap" name="NacreousMap" version="0.1.7.3"> - <description>Map causative mutations by multi-variant linkage analysis.</description> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd map ${opt.mode} "${opt.source.ifile}" - #if $str($opt.source.sample): - -m "${opt.source.sample}" - #end if - #if $str($opt.source.related_parent_sample): - -r "${opt.source.related_parent_sample}" - #end if - #if $str($opt.source.unrelated_parent_sample): - -u "${opt.source.unrelated_parent_sample}" - #end if - $opt.source.infer_missing - -o "$ofile" - #if $str($opt.source.seqdict_required.required) == "yes": - -s "${opt.source.seqdict_required.seqdict}" - #end if - $opt.source.norm - #if $len($opt.source.bin_sizes): - --bin-sizes - #for $size in $opt.source.bin_sizes: - "${size.bin_size}" - #end for - #end if - #if $str($opt.source.tabfile): - $str($opt.source.tabfile) $tfile - #end if - #if $str($opt.source.plotopts.plots): - $str($opt.source.plotopts.plots) "$pfile" - $str($opt.source.plotopts.xlim) - #if $str($opt.source.plotopts.hylim): - --ylim-hist $str($opt.source.plotopts.hylim) - #end if - #if $str($opt.source.plotopts.hcols) and $len($opt.source.plotopts.hcols): - --hist-colors - #for $color in $opt.source.plotopts.hcols: - "${color.hcolor}" - #end for - #end if - #if $str($opt.source.plotopts.sylim): - --ylim-scatter $str($opt.source.plotopts.sylim) - #end if - #if $str($opt.source.plotopts.pcol): - --points-color "$str($opt.source.plotopts.pcol)" - #end if - #if $str($opt.source.plotopts.lcol): - --loess-color "$str($opt.source.plotopts.lcol)" - #end if - #if $str($opt.source.plotopts.span): - --loess-span "$str($opt.source.plotopts.span)" - #end if - #end if - - </command> - - <macros> - <import>toolshed_macros.xml</import> - <macro name="svd_unconditional"> 
- <expand macro="hidden_vaf_algo_params" /> - <expand macro="seqdict_param" /> - <expand macro="bins" /> - <param checked="true" falsevalue="--no-normalize" help="without normalization the tool will just report the number of nucleotides per bin; with normalization the results for different bin-widths will be comparable." label="normalize variant counts to bin-width" name="norm" truevalue="" type="boolean" /> - <conditional name="plotopts"> - <param label="graphical output settings" name="plots" type="select"> - <option value="">Do not generate graphs.</option> - <option value="-p">Give me graphics.</option> - </param> - <when value="-p"> - <expand macro="scatter_default" /> - <param label="upper limit for the histogram y-axis (leave blank for automatic scaling)" name="hylim" type="text" /> - <param label="x-axis scaling" name="xlim" type="select"> - <option value="">preserve relative contig sizes</option> - <option value="--fit-width">scale each contig to fit the plot width</option> - </param> - <expand macro="hist_colors" /> - </when> - </conditional> - </macro> - <macro name="vaf_unconditional"> - <expand macro="bins" /> - <param checked="true" falsevalue="--no-normalize" label="normalize variant counts to bin-width" name="norm" truevalue="" type="boolean" /> - <conditional name="plotopts"> - <param label="graphical output settings" name="plots" type="select"> - <option value="">Do not generate graphs.</option> - <option value="--no-scatter -p">Generate only histograms</option> - <option value="--no-hist -p">Generate only scatter plots</option> - <option value="-p">Give me everything (scatter plots and histograms)</option> - </param> - <when value="--no-scatter -p"> - <expand macro="scatter_default" /> - <param label="upper limit for the histogram y-axis (leave blank for automatic scaling)" name="hylim" type="text" /> - <param label="x-axis scaling" name="xlim" type="select"> - <option value="">preserve relative contig sizes</option> - <option 
value="--fit-width">scale each contig to fit the plot width</option> - </param> - <expand macro="hist_colors" /> - </when> - <when value="--no-hist -p"> - <expand macro="hist_default" /> - <param label="upper limit for the scatter plot y-axis (default: 1)" name="sylim" type="text" /> - <param label="x-axis scaling" name="xlim" type="select"> - <option value="">preserve relative contig sizes</option> - <option value="--fit-width">scale each contig to fit the plot width</option> - </param> - <param help="smaller values give a more responsive curve that often picks up local evidence for tight linkage better, but too small values lead to plotting failures (in that case just rerun the tool with a larger value)." label="span value to be used in calculating the Loess regression line through the scatter data (default=0.1, specify 0 to prevent calculation)" name="span" type="text" /> - <expand macro="scatter_colors" /> - </when> - <when value="-p"> - <expand macro="plot_all" /> - </when> - </conditional> - </macro> - <macro name="hidden_vaf_algo_params"> - <param name="sample" type="hidden" value="" /> - <param name="related_parent_sample" type="hidden" value="" /> - <param name="unrelated_parent_sample" type="hidden" value="" /> - <param name="infer_missing" type="hidden" value="" /> - </macro> - <macro name="bins"> - <repeat default="0" help="Values can be entered in bases (e.g., 1000000), kilobases (e.g., 500Kb) or megabases (e.g., 1Mb), but must be integral, i.e. no decimal numbers are allowed." 
min="0" name="bin_sizes" title="bin sizes to analyze variants in (defaults to: 1Mb and 500Kb)"> - <param name="bin_size" type="text" /> - </repeat> - </macro> - <macro name="scatter_default"> - <param name="sylim" type="hidden" value="" /> - <param name="span" type="hidden" value="" /> - <param name="pcol" type="hidden" value="" /> - <param name="lcol" type="hidden" value="" /> - </macro> - <macro name="hist_default"> - <param name="hylim" type="hidden" value="" /> - <param name="hcols" type="hidden" value="" /> - </macro> - <macro name="hist_colors"> - <repeat default="0" help="For each bin size chosen above a histogram will be generated with its color selected from the list provided here (defaults to alternating darkgrey, red)." min="0" name="hcols" title="histogram colors"> - <param name="hcolor" type="color" value="darkgrey"> - <sanitizer><valid><add value="#" /></valid></sanitizer> - </param> - </repeat> - </macro> - <macro name="scatter_colors"> - <param label="color to be used for the scatter plot data points (default: gray27)" name="pcol" type="color" value="#454545"> - <sanitizer><valid><add value="#" /></valid></sanitizer> - </param> - <param label="color to be used for the regression line (default: red)" name="lcol" type="color" value="red"> - <sanitizer><valid><add value="#" /></valid></sanitizer> - </param> - </macro> - <macro name="plot_all"> - <param label="upper limit for the histogram y-axis (leave blank for automatic scaling)" name="hylim" type="text" /> - <param label="upper limit for the scatter plot y-axis (default: 1)" name="sylim" type="text" /> - <param label="x-axis scaling" name="xlim" type="select"> - <option value="">preserve relative contig sizes</option> - <option value="--fit-width">scale each contig to fit the plot width</option> - </param> - <param help="smaller values give a more responsive curve that often picks up local evidence for tight linkage better, but too small values lead to plotting failures (in that case just rerun the 
tool with a larger value)." label="span value to be used in calculating the Loess regression line through the scatter data (default=0.1, specify 0 to prevent calculation)" name="span" type="text" /> - <expand macro="hist_colors" /> - <expand macro="scatter_colors" /> - </macro> - <macro name="seqdict_param"> - <conditional name="seqdict_required"> - <param help="A sequence dictionary file is required ONLY if the input file does not provide information about the sizes of the chromosomes defined in it. It is NEVER needed for MiModD-generated input files." label="does this input file require a CloudMap-style sequence dictionary?" name="required" type="select"> - <option value="no">No</option> - <option value="yes">Yes</option> - </param> - <when value="yes"> - <param format="tabular" label="CloudMap-style sequence dictionary file" name="seqdict" type="data" /> - </when> - </conditional> - </macro> - </macros> - - <inputs> - <conditional name="opt"> - <param help="Select Simple Variant Density (SVD) Mapping to map mutations based on linked inheritance in near isogenic populations, Variant Allele Frequency (VAF) Mapping for bulk segregant analysis. Select Reprocess for rapidly replotting the result of a previous VAF analysis." 
label="type of mapping analysis to perform" name="mode" type="select"> - <option value="SVD">Simple Variant Density Mapping</option> - <option value="VAF">Variant Allele Frequency Mapping</option> - </param> - <when value="SVD"> - <conditional name="source"> - <param label="data source to use" name="inputtype" type="select"> - <option value="vcf">VCF file of variants (for de-novo mapping)</option> - <option value="rep">per-variant report file (for remapping a previous analysis)</option> - </param> - <when value="vcf"> - <param format="vcf" label="input file with variants to analyze" name="ifile" type="data" /> - <expand macro="svd_unconditional" /> - <param help="You can either choose to produce a tabular per-variant report, which is useful for fast replotting with different plot settings or a vcf-like CloudMap-compatibility file that can be used as input for the CloudMap EMS Variant Density Mapping tool as an alternative plotting tool." label="additional per-variant output file" name="tabfile" type="select"> - <option value="">Do not generate per-variant output</option> - <option value="-t">Tabular per-variant report</option> - <option value="--cloudmap -t">CloudMap compatibility file</option> - </param> - </when> - <when value="rep"> - <param format="tabular" label="input file with variants to analyze" name="ifile" type="data" /> - <param name="tabfile" type="hidden" value="" /> - <expand macro="svd_unconditional" /> - </when> - </conditional> - </when> - <when value="VAF"> - <conditional name="source"> - <param label="data source to use" name="inputtype" type="select"> - <option value="vcf">VCF file of variants (for de-novo mapping)</option> - <option value="rep">per-variant report file (for remapping a previous analysis)</option> - </param> - <when value="vcf"> - <param format="vcf" label="input file with variants to analyze" name="ifile" type="data" /> - <expand macro="seqdict_param" /> - <param help="the sample to perform mutation mapping for" label="mapping 
sample name" name="sample" type="text" /> - <param help="the sample that provides variants present in your original mutant strain or in an ancestor (like the pre-mutagenesis strain); leave blank if not available" label="name of the related parent sample" name="related_parent_sample" type="text" /> - <param help="the sample that provides variants present in the unrelated mapping strain (or in an ancestor of it) used in the mapping cross; leave blank if not available" label="name of the unrelated parent sample" name="unrelated_parent_sample" type="text" /> - <param checked="false" falsevalue="" help="if variant data for either the related or the unrelated parent strain is not available, the tool can try to infer the alleles present in that parent from the allele spectrum found in the mapping sample. Use with caution on carefully filtered variant lists only!" label="Infer alleles for missing parent" name="infer_missing" truevalue="--infer-missing" type="boolean" /> - <expand macro="vaf_unconditional" /> - <param help="You can either choose to produce a tabular per-variant report, which is useful for fast replotting with different plot settings or a vcf-like CloudMap-compatibility file that can be used as input for the CloudMap Hawaiian Variant Mapping tool as an alternative plotting tool." 
label="additional per-variant output file" name="tabfile" type="select"> - <option value="">Do not generate per-variant output</option> - <option value="-t">Tabular per-variant report</option> - <option value="--cloudmap -t">CloudMap compatibility file</option> - </param> - </when> - <when value="rep"> - <param format="tabular" label="input file with variants to analyze" name="ifile" type="data" /> - <expand macro="seqdict_param" /> - <param name="tabfile" type="hidden" value="" /> - <expand macro="hidden_vaf_algo_params" /> - <expand macro="vaf_unconditional" /> - </when> - </conditional> - </when> - </conditional> - </inputs> - - <outputs> - <data format="tabular" label="MiModD ${opt.mode} Mapping - binned variant counts for ${on_string}" name="ofile" /> - <data format="tabular" label="MiModD ${opt.mode} Mapping - per-variant report for ${on_string}" name="tfile"> - <filter>(opt['source']['tabfile'])</filter> - </data> - <data format="pdf" label="MiModD ${opt.mode} Mapping - linkage plots for ${on_string}" name="pfile"> - <filter>(opt['source']['plotopts']['plots'])</filter> - </data> - </outputs> - - <help> -.. class:: infomark - - **What it does** - -This tool is a complete rewrite of and improves the EMS Variant Density and Hawaiian Variant Mapping tools of `CloudMap`_. It is the most downstream tool in `mapping-by-sequencing analysis workflows in MiModD`_. - -It can be used to analyze and visualize the inheritance pattern of variants detected and selected by other MiModD tools or as an alternative (and more versatile) plotting engine for data generated with `CloudMap`_. - -------------- - -**Usage Modes:** - -This tool can be run in one of two different modes depending on the type of mapping analysis that should be performed: - -1) *Simple Variant Density (SVD) Mapping* mode analyzes the density of variants along the reference genome by dividing each chromosome into regions of user-defined size (bins) and counting the variants found in each bin. 
- - All variants listed in the input file are analyzed in this mode, which means that as input you will typically want to use filtered lists of variants (as produced by the VCF Filter tool). - - The aim of SVD analysis is to identify clusters of variants in an outcrossed strain carrying a selectable unknown mutation, which is interpreted as linkage between the corresponding genomic region and the unknown mutation. - - This mode corresponds roughly to EMS Variant Density Mapping in CloudMap. - -2) *Variant Allele Frequency (VAF) Mapping* mode analyzes the inheritance pattern in cross-progeny at sites at which the parents are homozygous for different alleles. - - The aim of VAF analysis is to identify clusters of variants with (near) homozygous inheritance in an F2 (or later generation) population obtained from a cross between a strain carrying a selectable unknown mutation and an unrelated mapping strain. Such a cluster is interpreted as linkage between the corresponding genomic region and the unknown mutation selected for in the F2 generation. - - This mode corresponds roughly to Hawaiian Variant Mapping in CloudMap, but can simultaneously take into account non-reference alleles found in either parent strain (CloudMap users may think of this as a combined Hawaiian Variant and Variant Discovery Mapping analysis). - -------------- - -**Input:** - -Valid inputs for this tool are VCF files (any VCF file in SVD mode, a MiModD-generated multi-sample VCF file in VAF mode) or a CloudMap tabular report file as generated by the Hawaiian Variant Mapping tool. Alternatively, the tool can generate (in both modes) its own tabular report file, which can be used as input instead of the original VCF file when rerunning the tool with different plotting parameters to reduce analysis time. - -..
class:: infomark - - CloudMap-generated tabular input files require, as additional input, a CloudMap-style sequence dictionary (even if the original CloudMap analysis was possible without one) as described in the original CloudMap paper. This file has a simple two-column tab-delimited format, in which each line lists the chromosome name (as it appears in the input VCF file) and the up-rounded length of the chromosome in megabases. - -------------- - -**Output:** - -The tool produces up to three output files: - -1) a default tabular file of binned variant counts that can be used to plot the data with external software such as Excel, - - -2) an optional pdf containing linkage plots, which should look just like the plots produced by CloudMap, but are optimized for file size and display speed and offer more user-configurable parameters and - - -3) an optional tabular per-variant report file, which can be configured to be either a valid input file for the corresponding original CloudMap tool (for users who really, really want to continue using CloudMap for plotting) or to be reusable in fast reruns of the tool (which can be useful to experiment with different plotting parameters). - -------------- - -**Settings:** - -1) Analysis settings - - *bin size to analyze variants in* - determines the width of the regions along each chromosome, in which variants are counted and analyzed together. - - Several bin sizes can be specified and for each size you will get a corresponding report section in the binned variant counts file and a histogram plot in the linkage plots file. 
- - *normalize variant counts to bin-width* - if selected (as per default) the variant counts for different bin sizes are not absolute, but normalized to the bin width - - *sample names (in VAF mode only)* - to analyze inheritance patterns, VAF mode needs information about the relationship between the samples defined in the input VCF file: - - The *mapping sample name* should be set to the name of the sample for which the inheritance pattern is to be analyzed (the pooled progeny population). - - The *name of the related sample* should be that of the parent sample that carried and brought in the unknown mutation to be mapped (or, alternatively, that of a closely related ancestor). - - Finally, the *name of the unrelated sample* should be that of the other parent strain used in the cross. - - At least one of the parent samples MUST be specified, but if the input file contains variant information for both parents, they can be analyzed together for higher mapping accuracy. If you are reanalyzing a tabular report file from a previous tool run or from CloudMap, the association between variants and samples is already incorporated into the input file and cannot be specified again. - -2) Graphical output settings - - .. class:: warningmark - - To be able to generate plots the system running MiModD needs to have the statistical programming environment R and its Python interface rpy2 installed. - - - *y-axes scaling* - if you want to override the defaults - - *x-axis scaling* - choose *preserve relative contig sizes* if you want the largest chromosome to fit the page width and smaller chromosomes to appear according to their relative size or choose *scale each contig to fit the plot width* if all chromosomes should exploit the available space - - *span value to be used in calculating the Loess regression line* - this value determines the degree of smoothing of the regression line through the scatterplot data. 
Information on loess regression and the loess span parameter can be found at http://en.wikipedia.org/wiki/Local_regression. The default is 0.1 as in CloudMap. - - *colors used for plotting* - can be selected freely from the offered palette. For histogram colors, the list of selected colors will be used to provide the colors for the different histograms plotted. If fewer colors than histograms (determined by the number of bin sizes selected) are specified, colors from the list will be recycled. - - -.. _CloudMap: https://usegalaxy.org/u/gm2123/p/cloudmap -.. _mapping-by-sequencing analysis workflows in MiModD: http://mimodd.readthedocs.org/en/latest/cloudmap.html - </help> -</tool>
--- a/mimodd_bitbucket_wrappers/convert.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,170 +0,0 @@ -<tool id="convert" name="Convert" version="0.1.7.3"> - <description>between different sequence data formats</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - #if $str($mode.split_on_rgs) or $str($mode.oformat)=="fastq" or $str($mode.oformat)=="gz": - echo "Your input data is now getting processed by MiModD. The output will be split into several files based on the read groups found in the input.\nThis history item will remain in the busy state until the job is finished.\nAfter the job is showing as finished, Galaxy will start adding the results files to your history one by one.\n\nThis may take a while to complete! \n\nYou should refresh your history to see if new files have arrived.\n\nThis message is for your information only and can be deleted from the history once the job has finished." > $output_split_on_read_groups; - - mkdir converted_data; - #end if - - mimodd convert - - #for $i in $mode.input_list - "${i.file1}" - #if $str($mode.iformat) in ("fastq_pe", "gz_pe"): - "${i.file2}" - #end if - #end for - #if $str($mode.header) != "None": - --header "$(mode.header)" - #end if - - #if $str($outputname) == "None": - --ofile converted_data/read_group - #else - --ofile "$outputname" - #end if - --iformat $(mode.iformat) - --oformat $(mode.oformat) - ${mode.split_on_rgs} - </command> - - <inputs> - <conditional name="mode"> - <param help="Your choice will update the interface to display further choices appropriate for your type of input data." 
label="input file format" name="iformat" type="select"> - <option value="fastq">fastq: single-end (one file)</option> - <option value="fastq_pe">fastq: paired-end (two files)</option> - <option value="gz">gzip compressed fastq: single-end (one file)</option> - <option value="gz_pe">gzip compressed fastq: paired-end (two files)</option> - <option value="sam">sam</option> - <option value="bam">bam</option> - </param> - <when value="fastq"> - <param label="output file format" name="oformat" type="select"> - <option value="sam">sam</option> - <option value="bam">bam</option> - </param> - <repeat default="1" min="1" name="input_list" title="fastq input dataset"> - <param format="fastq" label="inputfile" name="file1" type="data" /> - </repeat> - <param format="sam" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file." label="Use Header File" name="header" type="data" /> - <param name="split_on_rgs" type="hidden" value="" /> - </when> - <when value="fastq_pe"> - <param label="output file format" name="oformat" type="select"> - <option value="sam">sam</option> - <option value="bam">bam</option> - </param> - <repeat default="1" min="1" name="input_list" title="fastq input datasets"> - <param format="fastq" label="inputfile with the first set of reads of paired-end data" name="file1" type="data" /> - <param format="fastq" label="inputfile with the second set of reads of paired-end data" name="file2" type="data" /> - </repeat> - <param format="sam" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file." 
label="Use Header File" name="header" type="data" /> - <param name="split_on_rgs" type="hidden" value="" /> - </when> - <when value="gz"> - <param label="output file format" name="oformat" type="select"> - <option value="sam">sam</option> - <option value="bam">bam</option> - </param> - <repeat default="1" min="1" name="input_list" title="fastq.gz input dataset"> - <param format="data" label="inputfile" name="file1" type="data" /> - </repeat> - <param format="sam" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file." label="Use Header File" name="header" type="data" /> - <param name="split_on_rgs" type="hidden" value="" /> - </when> - <when value="gz_pe"> - <param label="output file format" name="oformat" type="select"> - <option value="sam">sam</option> - <option value="bam">bam</option> - </param> - <repeat default="1" min="1" name="input_list" title="fastq.gz input datasets"> - <param format="data" label="inputfile with the first set of reads of paired-end data" name="file1" type="data" /> - <param format="data" label="inputfile with the second set of reads of paired-end data" name="file2" type="data" /> - </repeat> - <param format="sam" help="A SAM file with header information, as generated, for example, by the NGS Run Annotation Tool, that will be used to attach metainformation to the results file." 
label="Use Header File" name="header" type="data" /> - <param name="split_on_rgs" type="hidden" value="" /> - </when> - <when value="sam"> - <param label="output file format" name="oformat" type="select"> - <option value="bam">bam</option> - <option value="sam">sam</option> - <option value="fastq">fastq</option> - <option value="gz">gzipped fastq</option> - </param> - <repeat default="1" max="1" min="1" name="input_list" title="sam input dataset"> - <param format="sam" label="inputfile" name="file1" type="data" /> - </repeat> - <param name="header" type="hidden" value="None" /> - <param checked="false" falsevalue="" help="If the input file contains reads from different read groups, write them to separate output files; implied automatically for conversions to fastq and gzipped fastq format" label="Split output based on read group IDs" name="split_on_rgs" truevalue="--split-on-rgs" type="boolean" /> - </when> - <when value="bam"> - <param label="output file format" name="oformat" type="select"> - <option value="sam">sam</option> - <option value="bam">bam</option> - <option value="fastq">fastq</option> - <option value="gz">gzipped fastq</option> - </param> - <repeat default="1" max="1" min="1" name="input_list" title="bam input dataset"> - <param format="bam" label="inputfile" name="file1" type="data" /> - </repeat> - <param name="header" type="hidden" value="None" /> - <param checked="false" falsevalue="" help="If the input file contains reads from different read groups, write them to separate output files; implied automatically for conversions to fastq and gzipped fastq format" label="Split output based on read group IDs" name="split_on_rgs" truevalue="--split-on-rgs" type="boolean" /> - </when> - </conditional> - </inputs> - - <outputs> - <data format="bam" label="Converted reads from MiModd ${tool.name} on ${on_string}" name="outputname"> - <change_format> - <when format="sam" input="mode.oformat" value="sam" /> - </change_format> - <filter> - (not 
mode['split_on_rgs'] and mode['oformat'] not in ("fastq", "gz")) - </filter> - </data> - - <data format="txt" label="MiModD ${tool.name} run on ${on_string}" name="output_split_on_read_groups"> - <filter> - (mode['split_on_rgs'] or mode['oformat'] in ("fastq", "gz")) - </filter> - <discover_datasets directory="converted_data" pattern="__designation_and_ext__" visible="true" /> - </data> - </outputs> - -<help> -.. class:: infomark - - **What it does** - -The tool converts between different file formats used for storing next-generation sequencing data. - -As input file types it can handle uncompressed or gzipped fastq, SAM or BAM format, which it can convert to SAM or BAM format. - -**Notes:** - -1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to convert gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. - -2) The tool can convert fastq files representing data from paired-end sequencing runs to appropriate SAM/BAM format provided that the mate information is split over two fastq files in corresponding order. - - **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. - -3) Merging partial fastq (or gzipped fastq) files into a single SAM/BAM file is supported both for single-end and paired-end data. Simply add additional input datasets and select the appropriate files (pairs of files in case of paired-end data). - - Concatenation of SAM/BAM files during conversion is currently not supported. - -4) For input in fastq format a SAM header file providing run metadata **has to be specified**.
The information in this file will be used as the header data of the new SAM/BAM file. You can use the *NGS Run Annotation* tool to generate a new header file for your data. - - For input in SAM/BAM format the tool will simply copy the existing header data to the new file. To modify the header of an existing SAM/BAM file, use the *Reheader BAM file* tool instead. - -.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 -.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy -.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest - -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/covstats.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ -<tool id="coverage_stats" name="Coverage Statistics" version="0.1.7.3"> - <description>Calculate coverage statistics for a BCF file as generated by the Variant Calling tool</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd covstats "$ifile" --ofile "$output_vcf" - </command> - - <inputs> - <param format="bcf" help="Use the Variant Calling tool to generate input for this tool." label="BCF input file" name="ifile" type="data" /> - </inputs> - <outputs> - <data format="tabular" label="Coverage Statistics for ${on_string}" name="output_vcf" /> - </outputs> - -<help> -.. class:: infomark - - **What it does** - -The tool takes as input a BCF file produced by the *Variant Calling* tool, and calculates per-chromosome read coverage from it. - -.. class:: warningmark - - The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. - -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/deletion_predictor.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,65 +0,0 @@ -<tool id="deletion_prediction" name="Deletion Prediction for paired-end data" version="0.1.7.3"> - <description>Predicts deletions in one or more aligned read samples based on coverage of the reference genome and on insert sizes</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd delcall - #for $l in $list_input - "${l.bamfile}" - #end for - "$covfile" -o "$outputfile" - --max-cov "$max_cov" --min-size "$min_size" $include_uncovered $group_by_id --verbose - </command> - - <inputs> - <repeat default="1" min="1" name="list_input" title="Aligned reads input source"> - <param format="bam" label="input BAM file" name="bamfile" type="data" /> - </repeat> - <param format="bcf" help="Use the Variant Calling tool to generate this file." label="BCF variant call file to extract coverage from" name="covfile" type="data" /> - <param checked="false" falsevalue="" help="If selected, reads from different read groups will be treated strictly separate. If turned off, read groups with identical sample names are used together for identifying uncovered regions, but are still treated separately for the prediction of deletions." label="group reads based on read group id only" name="group_by_id" truevalue="-i" type="boolean" /> - <param checked="false" falsevalue="" help="If selected, regions that fulfill the coverage criteria below, but are not statistically significant deletions, will be included in the output." 
label="include low-coverage regions" name="include_uncovered" truevalue="-u" type="boolean" /> - <param help="The maximal coverage at a site allowed to consider it as part of a low-coverage region" label="maximal coverage allowed inside a low-coverage region (default: 0)" name="max_cov" type="integer" value="0" /> - <param help="A low-coverage region must consist of at least this number of consecutive bases below the maximal coverage to consider it in further analyses." label="minimal deletion size (default: 100)" name="min_size" type="integer" value="100" /> - </inputs> - - <outputs> - <data format="gff" name="outputfile" /> - </outputs> - -<help> -.. class:: infomark - - **What it does** - -The tool predicts deletions from paired-end data in a two-step process: - -1) It finds regions of low-coverage, i.e., candidate regions for deletions, by scanning a BCF file produced by the *Variant Calling* tool. - - The *maximal coverage allowed inside a low-coverage region* and the *minimal deletion size* parameters are used at this step to define what is considered a low-coverage region. - - .. class:: warningmark - - The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. - -2) It assesses every low-coverage region statistically for evidence of it being a real deletion. **This step requires paired-end data** since it relies on shifts in the distribution of read pair insert sizes around real deletions. - -By default, the tool only reports Deletions, i.e., the subset of low-coverage regions that pass the statistical test. -If *include low-coverage regions* is selected, regions that failed the test will also be reported. - -With *group reads based on read group id only* selected, as it is by default, grouping of reads into samples is done strictly based on their read group IDs. 
-With the option deselected, grouping is done based on sample names in the first step of the analysis, i.e. the reads of all samples with a shared sample name are used to identify low-coverage regions. -In the second step, however, reads will be regrouped by their read group IDs again, i.e. the statistical assessment for real deletions is always done on a per read group basis. - -**TIP:** -Deselecting *group reads based on read group id only* can be useful, for example, if you have both paired-end and single-end sequencing data for the same sample. - -In this case, the two sets of reads will usually share a common sample name, but differ in their read groups. -With grouping based on sample names, the single-end data can be used together with the paired-end data to identify low-coverage regions, thus increasing overall coverage and reliability of this step. -Still, the assessment of deletions will use only the paired-end data (auto-detecting that the single-end reads do not provide insert size information). - -</help> - -</tool>
--- a/mimodd_bitbucket_wrappers/fileinfo.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,38 +0,0 @@ -<tool id="fileinfo" name="Retrieve File Information" version="0.1.7.3"> - <description>for supported data formats.</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd info "$ifile" -o "$outputfile" --verbose --oformat $oformat - </command> - - <inputs> - <param format="bam,sam,vcf,bcf,fasta" label="input file" name="ifile" type="data" /> - <param label="output format" name="oformat" type="select"> - <option value="txt">text</option> - <option value="html">html</option> - </param> - </inputs> - - <outputs> - <data format="txt" label="Sample Info on ${on_string}" name="outputfile"> - <change_format> - <when format="html" input="oformat" value="html" /> - </change_format> - </data> - </outputs> - -<help> -.. class:: infomark - - **What it does** - -The tool inspects the input file and generates a report summarizing its contents. - -It autodetects and works with most file formats produced by MiModD, i.e., **SAM / BAM, vcf / bcf and fasta**, and produces a standardized report for all of them. - -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/reheader.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,202 +0,0 @@ -<tool id="reheader" name="Reheader BAM file" version="0.1.7.3"> - <description>From a BAM file generate a new file with the original header (if any) replaced or modified by that found in a second SAM file</description> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - #if ($str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form") or $str($co.treat_co) != "ignore": - mimodd header - #if $str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form": - #for $rginfo in $rg.rginfo.rg - #if $str($rginfo.source_id): - --rg-id "${rginfo.source_id}" - #end if - #if $str($rginfo.rg_sm): - --rg-sm "${rginfo.rg_sm}" - #end if - #if $str($rginfo.rg_cn): - --rg-cn "${rginfo.rg_cn}" - #else: - --rg-cn "" - #end if - #if $str($rginfo.rg_ds): - --rg-ds "${rginfo.rg_ds}" - #else: - --rg-ds "" - #end if - #if $str($rginfo.rg_date): - --rg-dt "${rginfo.rg_date}" - #else: - --rg-dt "" - #end if - #if $str($rginfo.rg_lb): - --rg-lb "${rginfo.rg_lb}" - #else: - --rg-lb "" - #end if - #if $str($rginfo.rg_pl): - --rg-pl "${rginfo.rg_pl}" - #else: - --rg-pl "" - #end if - #if $str($rginfo.rg_pi): - --rg-pi "${rginfo.rg_pi}" - #else: - --rg-pi "" - #end if - #if $str($rginfo.rg_pu): - --rg-pu "${rginfo.rg_pu}" - #else: - --rg-pu "" - #end if - #end for - #end if - #if $str($co.treat_co) != "ignore": - --co - #for $comment in $co.coinfo - #if $str($comment.line): - "${comment.line}" - #end if - #end for - #end if - | - #end if - mimodd reheader "$inputfile" --sq ignore - --rg ${rg.treat_rg} - #if $str($rg.treat_rg) != "ignore": - #if $str($rg.rginfo.source) == "from_file": - "${rg.rginfo.data}" - #else: - - - #end if - #for $rgmapping in $rg.rginfo.rg - #if $str($rgmapping.source_id) and $str($rgmapping.rg_id): - "$str($rgmapping.source_id)" : "$str($rgmapping.rg_id)" - #end 
if - #end for - #end if - - --co ${co.treat_co} - #if $str($co.treat_co) != "ignore": - - - #end if - - #set $restr = "" - #for $rename in $rg_renaming - #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '" ') - #end for - #if $restr - --rgm $restr - #end if - - #set $restr = "" - #for $rename in $sq_renaming - #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '" ') - #end for - #if $restr - --sqm $restr - #end if - - -o "$output" - </command> - - <macros> - <import>toolshed_macros.xml</import> - <macro name="getreadgroupinfo"> - <conditional name="rginfo"> - <param help="" label="source of new read-group information" name="source" type="select"> - <option value="from_file">existing SAM file</option> - <option value="from_form">input form</option> - </param> - <when value="from_file"> - <param format="sam" help="use the read group information found in this file" label="read-group template file in SAM format" name="data" type="data" /> - <repeat default="0" help="read-group information found in the input file, by default, gets updated / replaced with information from template file read-groups with matching IDs. Alternatively, you may specify explicit read-group mappings below." 
min="0" name="rg" title="custom read-group mapping"> - <param label="modify input file information for read-group ID (will create the read-group if it does not exist)" name="source_id" type="text" /> - <param label="with template file information for read-group ID" name="rg_id" type="text" /> - </repeat> - </when> - <when value="from_form"> - <repeat default="1" min="1" name="rg" title="new read-group info"> - <param help="required field" label="read-group ID (will create the read-group if it does not exist)" name="source_id" type="text" /> - <param name="rg_id" type="hidden" value="" /> - <param help="required field" label="sample name" name="rg_sm" type="text" /> - <param label="description" name="rg_ds" type="text" /> - <param label="date (YY-MM-DD format) the run was produced" name="rg_date" type="text" /> - <param label="name of sequencing center" name="rg_cn" type="text" /> - <param label="read-group library" name="rg_lb" type="text" /> - <param label="platform/technology used to produce the reads" name="rg_pl" type="text" /> - <param label="predicted median insert size" name="rg_pi" type="text" /> - <param label="platform unit; unique identifier" name="rg_pu" type="text" /> - </repeat> - </when> - </conditional> - </macro> - </macros> - - <inputs> - - <param format="bam" help="the file to reheader." label="input file in BAM format" name="inputfile" type="data" /> - - <conditional name="rg"> - <param help="Replace mode will ignore ALL existing read group information in the input file and use ONLY template information, Update mode will COPY existing input file information and UPDATE it with template information; choose No, ... to leave read-group information alone." label="modify read-group information ?" 
name="treat_rg" type="select"> - <option value="ignore">No, do not change read-groups.</option> - <option value="update">Yes, update existing information</option> - <option value="replace">Yes, replace existing information</option> - </param> - <when value="update"> - <expand macro="getreadgroupinfo" /> - </when> - <when value="replace"> - <expand macro="getreadgroupinfo" /> - </when> - </conditional> - - <conditional name="co"> - <param help="" label="modify comments in the input file ?" name="treat_co" type="select"> - <option value="ignore">No, do not change comments.</option> - <option value="update">Yes, append new comments to existing ones</option> - <option value="replace">Yes, replace all existing comments</option> - </param> - <when value="update"> - <repeat default="0" min="0" name="coinfo" title="comment line"> - <param name="line" size="80" type="text" /> - </repeat> - </when> - <when value="replace"> - <repeat default="0" min="0" name="coinfo" title="comment line"> - <param name="line" size="80" type="text" /> - </repeat> - </when> - </conditional> - - <repeat default="0" help="Warning: changing read-group IDs may increase job runtime substantially." min="0" name="rg_renaming" title="rename read-group"> - <param help="as it appears in the current input file header" label="old name" name="from" size="30" type="text" /> - <param label="new name" name="to" size="30" type="text" /> - </repeat> - - <repeat default="0" help="Warning: changing sequence names may increase job runtime substantially." min="0" name="sq_renaming" title="rename sequence"> - <param help="as it appears in the current input file header" label="old name" name="from" size="30" type="text" /> - <param label="new name" name="to" size="30" type="text" /> - </repeat> - - </inputs> - - <outputs> - <data format="bam" label="(Re)headered bam file from MiModd ${tool.name} on ${on_string}" name="output"> - </data> - </outputs> - -<help> -.. 
class:: infomark - - **What it does** - -The tool generates a copy of the BAM input file with a modified header (i.e., metadata). - -It can update or replace read-group information (i.e., information about the samples in the file), add or replace comment lines, and rename read groups and reference sequences declared in the header. - -The tool ensures that the resulting BAM file is valid and can be further processed by other MiModD tools and standard software like samtools. It aborts with an error message if a valid BAM file cannot be generated with the user-specified settings. - -The template information used to modify or replace the input file metadata is provided through forms or, in the case of read-group information, can be taken from an existing SAM file, such as one generated with the *NGS Run Annotation* tool. - -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/sam_header.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,128 +0,0 @@ -<tool id="ngs_run_annotation" name="NGS Run Annotation" version="0.1.7.3"> - <description>Create a SAM format header from run metadata for sample annotation.</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd header - - --rg-id "$rg_id" - --rg-sm "$rg_sm" - - #if $str($rg_cn): - --rg-cn "$rg_cn" - #end if - #if $str($rg_ds): - --rg-ds "$rg_ds" - #end if - #if $str($rg_date): - --rg-dt "$rg_date" - #end if - #if $str($rg_lb): - --rg-lb "$rg_lb" - #end if - #if $str($rg_pl): - --rg-pl "$rg_pl" - #end if - #if $str($rg_pi): - --rg-pi "$rg_pi" - #end if - #if $str($rg_pu): - --rg-pu "$rg_pu" - #end if - - --ofile "$outputfile" - - </command> - - <inputs> - <param label="read-group ID (required)" name="rg_id" size="80" type="text"> - <sanitizer invalid_char=""> - <valid initial="string.printable"> - <remove value="&quot;" /> - </valid> - <mapping initial="none"> - <add source="&quot;" target="\&quot;" /> - </mapping> - </sanitizer> - </param> - <param label="sample name (required)" name="rg_sm" size="80" type="text"> - <sanitizer invalid_char=""> - <valid initial="string.printable"> - <remove value="&quot;" /> - </valid> - <mapping initial="none"> - <add source="&quot;" target="\&quot;" /> - </mapping> - </sanitizer> - </param> - <param label="description" name="rg_ds" size="80" type="text"> - <sanitizer invalid_char=""> - <valid initial="string.printable"> - <remove value="&quot;" /> - </valid> - <mapping initial="none"> - <add source="&quot;" target="\&quot;" /> - </mapping> - </sanitizer> - </param> - <param label="date (YYYY-MM-DD) the run was produced" name="rg_date" type="text" /> - <param label="name of sequencing center" name="rg_cn" size="80" type="text"> - <sanitizer invalid_char=""> - <valid initial="string.printable"> - <remove 
value="&quot;" /> - </valid> - <mapping initial="none"> - <add source="&quot;" target="\&quot;" /> - </mapping> - </sanitizer> - </param> - <param label="read-group library" name="rg_lb" size="80" type="text"> - <sanitizer invalid_char=""> - <valid initial="string.printable"> - <remove value="&quot;" /> - </valid> - <mapping initial="none"> - <add source="&quot;" target="\&quot;" /> - </mapping> - </sanitizer> - </param> - <param label="platform/technology used to produce the reads" name="rg_pl" type="text" /> - <param label="predicted median insert size" name="rg_pi" type="text" /> - <param label="platform unit; unique identifier" name="rg_pu" size="80" type="text"> - <sanitizer invalid_char=""> - <valid initial="string.printable"> - <remove value="&quot;" /> - </valid> - <mapping initial="none"> - <add source="&quot;" target="\&quot;" /> - </mapping> - </sanitizer> - </param> - </inputs> - - <outputs> - <data format="sam" label="${rg_sm} (${rg_id}) header information from MiModd ${tool.name} on ${on_string}" name="outputfile" /> - </outputs> - -<help> -.. class:: infomark - - **What it does** - -This tool takes the user-provided information about a next-generation sequencing run and constructs a valid header in the SAM file format from it. - -The result file can be used by the tools *Convert* and *Reheader* or in the *SNAP Read Alignment* step to add run metadata to sequenced reads files (or to overwrite pre-existing information). - -**Note:** - -**MiModD requires run metadata for every input file at the Alignment step!** - -**Tip:** - -While you can align reads from fastq files by providing a custom header file directly to the *SNAP Read Alignment* tool, we **recommend** that you first convert all input files to SAM/BAM format with appropriate header information, and archive all datasets in that format, prior to any downstream analysis. Although a bit more time-consuming, this practice protects against information loss and ensures that the input datasets will remain useful for others in the future. 
- -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/snap_caller.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,241 +0,0 @@ -<tool id="read_alignment" name="SNAP Read Alignment" version="0.1.7.3"> - <description>Map sequence reads to a reference genome using SNAP</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd snap-batch -s - ## SNAP calls (considering different cases) - - #for $i in $datasets - "snap ${i.mode_choose.mode} '$ref_genome' - #if $str($i.mode_choose.mode) == "paired" and $str($i.mode_choose.input.iformat) in ("fastq", "gz"): -'${i.mode_choose.input.ifile1}' '${i.mode_choose.input.ifile2}' - #else: -'${i.mode_choose.input.ifile}' - #end if ---ofile '$outputfile' --iformat ${i.mode_choose.input.iformat} --oformat $oformat ---idx-seedsize '$set.seedsize' ---idx-slack '$set.slack' --maxseeds '$set.maxseeds' --maxhits '$set.maxhits' --clipping $set.clipping --maxdist '$set.maxdist' --confdiff '$set.confdiff' --confadapt '$set.confadpt' - #if $i.mode_choose.input.header: ---header '${i.mode_choose.input.header}' - #end if - #if $str($i.mode_choose.mode) == "paired": ---spacing '$set.sp_min' '$set.sp_max' - #end if - #if $str($set.selectivity) != "off": ---selectivity '$set.selectivity' - #end if - #if $str($set.filter_output) != "off": ---filter-output $set.filter_output - #end if - #if $str($set.sort) == "off": ---no-sort - #end if - #if $str($set.mmatch_notation) != "general": --X - #end if - #if $set.discard_overlapping_mates: ---discard-overlapping-mates - ## remove ',' (and possibly adjacent whitespace) and replace with ' ' - '#echo ("' '".join($set.discard_overlapping_mates.replace(" ", "").split(',')))#' - #end if ---verbose -" - #end for - </command> - - <inputs> - ## mandatory arguments (and mode-conditionals) - - <param format="fasta" help="The fasta reference genome that SNAP should align reads 
against." label="reference genome" name="ref_genome" type="data" /> - - <repeat default="1" min="1" name="datasets" title="datasets"> - <conditional name="mode_choose"> - <param help="Reads obtained from single-end sequencing runs should be aligned in 'single' mode, paired-end reads in 'paired' mode. **WARNING**: if the read input file is in SAM/BAM format, the current version of this tool will **not** verify the mode and may produce erroneous alignments with wrong settings!" label="choose mode" name="mode" type="select"> - <option value="single">single-end</option> - <option value="paired">paired-end</option> - </param> - - <when value="single"> - <conditional name="input"> - <param label="input file format" name="iformat" type="select"> - <option value="bam">BAM</option> - <option value="sam">SAM</option> - <option value="gz">gz</option> - <option value="fastq">fastq</option> - </param> - <when value="bam"> - <param format="bam" label="input file" name="ifile" type="data" /> - <param format="sam" label="custom header file" name="header" optional="true" type="data" /> - </when> - <when value="sam"> - <param format="sam" label="input file" name="ifile" type="data" /> - <param format="sam" label="custom header file" name="header" optional="true" type="data" /> - </when> - <when value="gz"> - <param label="input file" name="ifile" type="data" /> - <param format="sam" label="header file" name="header" type="data" /> - </when> - <when value="fastq"> - <param format="fastq" label="input file" name="ifile" type="data" /> - <param format="sam" label="header file" name="header" type="data" /> - </when> - </conditional> - </when> - <when value="paired"> - <conditional name="input"> - <param label="input file format" name="iformat" type="select"> - <option value="bam">BAM</option> - <option value="sam">SAM</option> - <option value="gz">gz</option> - <option value="fastq">fastq</option> - </param> - <when value="bam"> - <param format="bam" label="input file" name="ifile" 
type="data" /> - <param format="sam" label="custom header file" name="header" optional="true" type="data" /> - </when> - <when value="sam"> - <param format="sam" label="input file" name="ifile" type="data" /> - <param format="sam" label="custom header file" name="header" optional="true" type="data" /> - </when> - <when value="fastq"> - <param format="fastq" label="inputfile with the first set of reads of paired-end data" name="ifile1" type="data" /> - <param format="fastq" label="inputfile with the second set of reads of paired-end data" name="ifile2" type="data" /> - <param format="sam" help="required" label="header file" name="header" type="data" /> - </when> - <when value="gz"> - <param label="inputfile with the first set of reads of paired-end data" name="ifile1" type="data" /> - <param label="inputfile with the second set of reads of paired-end data" name="ifile2" type="data" /> - <param format="sam" help="required" label="header file" name="header" type="data" /> - </when> - </conditional> - </when> - </conditional> - </repeat> - - <param label="output file format" name="oformat" type="select"> - <option value="bam">BAM</option> - <option value="sam">SAM</option> - </param> - - ## optional arguments - - <conditional name="set"> - <param help="This section lets you specify the detailed parameter settings for the SNAP aligner. Only change them if you know what you are doing, i.e., read the documentation first." 
label="further parameter settings" name="settings_mode" type="select"> - <option value="default">default settings</option> - <option value="change">change settings</option> - </param> - - ## default settings - - <when value="default"> - <param name="seedsize" type="hidden" value="20" /> - <param name="slack" type="hidden" value="0.3" /> - <param name="sp_min" type="hidden" value="100" /> - <param name="sp_max" type="hidden" value="10000" /> - <param name="maxdist" type="hidden" value="8" /> - <param name="confdiff" type="hidden" value="2" /> - <param name="confadpt" type="hidden" value="7" /> - - <param name="maxseeds" type="hidden" value="25" /> - <param name="maxhits" type="hidden" value="250" /> - <param name="clipping" type="hidden" value="++" /> - - <param name="selectivity" type="hidden" value="off" /> - <param name="filter_output" type="hidden" value="off" /> - <param name="sort" type="hidden" value="0" /> - <param name="mmatch_notation" type="hidden" value="general" /> - <param name="discard_overlapping_mates" type="hidden" value="" /> - </when> - - ## change settings - - <when value="change"> - <param help="Length of the seeds used in the reference genome hash table (SNAP index option -s)." label="seed size (default: 20)" name="seedsize" type="integer" value="20" /> - <param help="Corresponds to the -h option of SNAP index." label="hash table slack size (default: 0.3)" name="slack" type="float" value="0.3" /> - - ## paired-end specific options - <param help="Corresponds to the first value of the SNAP option -s. Affects paired-end data only." label="minimum spacing to allow between paired ends (default: 100)" name="sp_min" type="integer" value="100" /> - <param help="Corresponds to the second value of the SNAP option -s. Affects paired-end data only." 
label="maximum spacing to allow between paired ends (default: 10000)" name="sp_max" type="integer" value="10000" /> - <param display="checkboxes" help="Consider overlapping mate pairs of the given orientation type(s) anomalous and discard them; allowed values: RF, FR, FF, RR; multiple types may be specified as a comma-separated list and ALL can be used as a shortcut for discarding all overlapping mate pairs; leave blank to retain all overlapping pairs. Affects paired-end data only." label="discard overlapping read pairs of type" multiple="true" name="discard_overlapping_mates" type="text" /> - <param help="maximum edit distance allowed per read or pair (SNAP option -d); higher values allow more divergent alignments to be found, but increase the rate of misalignments." label="edit distance (default: 8)" name="maxdist" type="integer" value="8" /> - <param help="Maximum hits to consider per seed (SNAP option -h); don't use a seed region in the alignment process if it matches more than maxhits regions in the reference genome. Higher values reduce the rate of misalignments, but reduce performance." label="maximum hits per seed (default: 250)" name="maxhits" type="integer" value="250" /> - <param help="Confidence threshold (SNAP option -c); the minimum edit distance difference between two alternate alignments required to reject the poorer alignment as suboptimal; higher values increase the rate of ambiguously aligned reads." label="confidence threshold (default: 2)" name="confdiff" type="integer" value="2" /> - <param help="Specifies how many seeds of a read may be ignored (based on the maximum hits value above) before the confidence threshold above gets increased by one for that read; helps fine-tuning alignment accuracy in repetitive regions of the genome." 
label="adaptive confdiff behaviour (default: 7)" name="confadpt" type="integer" value="7" /> - <param help="Number of seeds to use per read (SNAP option -n) when trying to match it to the reference genome; higher numbers will increase the rate of aligned reads and reduce the rate of misalignments, but will reduce performance." label="maximum seeds per read (default: 25)" name="maxseeds" type="integer" value="25" /> - <param help="Specifies from which end of a read low-quality bases should be clipped (SNAP option -Cxx)" label="read clipping (default: from back and front)" name="clipping" type="select"> - <option value="++">from back and front</option> - <option value="x+">from back only</option> - <option value="+x">from front only</option> - <option value="xx">no clipping</option> - </param> - <param help="randomly choose 1/selectivity of the reads to score (SNAP option -S). The tool uses the default of 1 (or a 0 setting) to indicate that all reads should be processed." label="selectivity (default: 1)" name="selectivity" type="integer" value="1" /> - <param help="filter output for certain classes of reads (SNAP option -F)." label="filter output (default: no filtering)" name="filter_output" type="select"> - <option value="off">no filtering</option> - <option value="a">aligned only</option> - <option value="s">single-aligned only</option> - <option value="u">unaligned only</option> - </param> - <param help="Sort the output file by alignment location (SNAP option --so)." label="output sorting (default: sort by read coordinates)" name="sort" type="select"> - <option value="0">sort by read coordinates</option> - <option value="off">no sorting</option> - </param> - <param help="Indicates whether CIGAR strings in the generated SAM/BAM file should use M (alignment match) rather than = and X (sequence (mis-)match). Warning: Downstream variant calling based on samtools currently relies on the old-style M notation!!" 
label="CIGAR symbols for alignment matches/mismatches (default: M notation)" name="mmatch_notation" type="select"> - <option value="general">use M for both matches and mismatches</option> - <option value="differentiate">use = for matches, X for mismatches</option> - </param> - </when> - </conditional> -</inputs> - -<outputs> - <data format="bam" label="Aligned reads from MiModd ${tool.name} on ${on_string}" name="outputfile"> - <change_format> - <when format="sam" input="oformat" value="sam" /> - </change_format> - </data> -</outputs> - -<help> -.. class:: infomark - - **What it does** - -The tool aligns the sequenced reads in an arbitrary number of input datasets against a common reference genome and stores the results in a single, possibly multi-sample output file. It supports a variety of different sequenced reads input formats, i.e., SAM, BAM, fastq and gzipped fastq, and both single-end and paired-end data. - -Internally, the tool uses the ultrafast, hashtable-based aligner SNAP (http://snap.cs.berkeley.edu), hence its name. - -**Notes:** - -1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to align gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. - -2) To use paired-end fastq data with the tool the read mate information needs to be split over two fastq files in corresponding order. - - **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. - -3) The tool supports the alignment of reads from the same sequencing run, but distributed across several input files. 
- - Generally, it expects the reads from each input dataset to belong to one read-group and will abort with an error message if any input dataset declares more than one read group or sample name in its header. Different datasets, however, are allowed to contain reads from the same read-group (as indicated by matching read-group IDs and sample names in their headers), in which case the reads will be combined into one group in the output. - -4) Read-group information is required for every input dataset! - - We generally recommend storing NGS datasets in SAM/BAM format with run metadata stored in the file header. You can use the *NGS Run Annotation* and *Convert* tools to convert data in fastq format to SAM/BAM with added run information. - - While it is not our recommended approach, you can, if you prefer, align reads from fastq files or SAM/BAM files without header read-group information. To do so, you **must** specify a SAM file that provides the missing information in its header along with the input dataset. You can generate a SAM header file with the *NGS Run Annotation* tool. - - Optionally, a SAM header file can also be used to replace existing read-group information in a headered SAM/BAM input file. This can be used to resolve read-group ID conflicts between multiple input files at tool runtime. - -5) The options available under *further parameter settings* can have **big** effects on the alignment quality. You are strongly encouraged to consult the `tool documentation`_ for detailed explanations of the available options. - -6) Currently, you cannot configure aligner-specific options separately for specific input files from within this Galaxy tool. If you need this advanced level of control, you should use the command line tool ``mimodd snap-batch``. - -.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 -.. 
_recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy -.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest -.. _tool documentation: http://mimodd.readthedocs.org/en/latest/tool_doc.html#snap - -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/snp_caller_caller.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,66 +0,0 @@ -<tool id="variant_calling" name="Variant Calling" version="0.1.7.3"> - <description>From a reference and aligned reads generate a BCF file with position-specific variant likelihoods and coverage information</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd varcall - - "$ref_genome" - #for $l in $list_input - "${l.inputfile}" - #end for - --ofile "$output_vcf" - --depth "$depth" - $group_by_id - $no_md5_check - --verbose - --quiet - </command> - - <inputs> - <param format="fasta" label="reference genome" name="ref_genome" type="data" /> - <repeat default="1" min="1" name="list_input" title="Aligned reads input source"> - <param format="bam" label="input file" name="inputfile" type="data" /> - </repeat> - <param checked="false" falsevalue="" help="If selected, this option ensures that only the read group id (but not the sample name) is considered in grouping reads in the input file(s). If turned off, read groups with identical sample names are automatically pooled and analyzed together even if they come from different NGS runs." label="group reads based on read group id only" name="group_by_id" truevalue="-i" type="boolean" /> - <param checked="false" falsevalue="" help="leave turned on to avoid accidental variant calling against a wrong reference genome version (see the tool help below)." label="turn off md5 sum verification" name="no_md5_check" truevalue="-x" type="boolean" /> - <param help="to avoid excessive use of memory" label="maximum per-BAM depth (default: 250)" name="depth" type="integer" value="250" /> - </inputs> - - <outputs> - <data format="bcf" label="Variant Calls from MiModd Variant Calling on ${on_string}" name="output_vcf" /> - </outputs> - -<help> -.. 
class:: infomark - - **What it does** - -The tool transforms the read-centered information of its aligned reads input files into position-centered information. - -**It produces a BCF file that serves as the basis for all further variant analyses with MiModD**. - -**Notes:** - -By default, the tool will check whether the input BAM file(s) provide(s) MD5 checksums for the reference genome sequences used during read alignment (the *SNAP Read Alignment* tool stores these in the BAM file header). If it finds MD5 sums for all sequences, it will compare them to the actual checksums of the sequences in the specified reference genome, -check that every sequence mentioned in any BAM input file has a counterpart with a matching MD5 sum in the reference genome, and abort with an error message if that is not the case. If it finds sequences with matching checksums but different names in the reference genome, it will use the name from the reference genome file in its output. - -This behavior has two benefits: - -1) It protects from accidental variant calling against a wrong reference genome (i.e., a different one than that used during the alignment step), which would result in wrong calls. This is the primary reason why we recommend leaving the check activated. - -2) It provides an opportunity to change sequence names between aligned reads files and variant call files by providing a reference genome file with altered sequence names (but identical sequence data). - -Since there may be rare cases where you *really* want to call variants against a reference genome with different checksums (e.g., you may have edited the reference sequence based on the alignment results), the check can be turned off, but only do this if you know exactly why. - ------------ - -Internally, the tool uses samtools mpileup combined with bcftools to do all per-nucleotide calculations. - -It exposes just a single configuration parameter of these tools: the *maximum per-BAM depth*. 
Through this parameter, the maximum number of reads considered for variant calling at any site can be controlled. Its default value of 250 is taken from *samtools mpileup* and is usually suitable. Note, however, that this is the maximum number of reads per input file, so if you have a large number of samples in one input file, you may need to increase the value so that enough reads are considered per sample. - -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/snpeff_genomes.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,24 +0,0 @@ -<tool id="snpeff_genomes" name="List Installed SnpEff Genomes" version="0.1.7.3"> - <description>Checks the local SnpEff installation to compile a list of currently installed genomes</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd snpeff-genomes -o "$outputfile" - </command> - <outputs> - <data format="tabular" name="outputfile" /> - </outputs> -<help> -.. class:: infomark - -**What it does** - -When executed, this tool searches the host machine's SnpEff installation for properly registered and installed -genome annotation files. The resulting list is added as a plain text file to your history for use with the *Variant Annotation* tool. - -</help> - -</tool>
--- a/mimodd_bitbucket_wrappers/toolshed_macros.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,7 +0,0 @@ -<macros> - <xml name="requirements"> - <requirements> - <requirement type="package" version="0.1.7.3">mimodd</requirement> - </requirements> - </xml> -</macros>
--- a/mimodd_bitbucket_wrappers/varextract.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,101 +0,0 @@ -<tool id="extract_variants" name="Extract Variant Sites" version="0.1.7.3"> - <description>from a BCF file</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd varextract "$ifile" - #if $len($sitesinfo) - -p - #for $source in $sitesinfo - "${source.pre_vcf}" - #end for - #end if - --ofile "$output_vcf" - $keep_alts - --verbose - </command> - - <inputs> - <param format="bcf" help="Use the Variant Calling tool to generate the input for this tool." label="BCF input file" name="ifile" type="data" /> - <repeat default="0" name="sitesinfo" title="include information from pre-calculated vcf file"> - <param format="vcf" label="independently generated vcf file" name="pre_vcf" type="data" /> - </repeat> - <param checked="false" falsevalue="" help="If selected, the VCF output will include ALL sites for which non-reference bases have been observed, i.e., even those not considered allelic sites by the variant caller." label="keep all sites with alternate bases" name="keep_alts" truevalue="-a" type="boolean" /> - </inputs> - <outputs> - <data format="vcf" label="Variants extracted with MiModd from ${on_string}" name="output_vcf" /> - </outputs> - -<help> -.. class:: infomark - - **What it does** - -The tool takes as input a BCF file like the ones produced by the *Variant Calling* tool, extracts just the variant sites from it and reports them in VCF format. - -If the BCF input file specifies multiple samples, sites are included if they qualify as variant sites in at least one sample. - -In a typical analysis workflow, you will use the tool's VCF output as input for the *VCF Filter* tool to cut down the often still impressive list of sites to a subset with relevance to your project. 
- -**Options:** - -1) By default, a variant site is considered to be a position in the genome for which a non-reference allele appears in the inferred genotype of any sample. - - You can select the *keep all sites with alternate bases* option, if instead you want to extract all sites, for which at least one non-reference base has been observed (whether resulting in a non-reference allele call or not). Using this option should rarely be necessary, but can occasionally be helpful for closer inspection of candidate genomic regions. - -2) During the process of variant extraction the tool can take into account genome positions specified in one or more independently generated VCF files. If such additional VCF input is provided, the tool output will contain the samples found in these files as additional samples and sites from the main BCF file will be included if they either qualify as variant sites in at least one sample specified in the BCF or if they are listed in any of the additional VCF files. - - Optional VCF input can be particularly useful in one of the following situations: - - *scenario i* - you have prior information that leads you to think that certain genome positions are of special relevance for your project and, thus, you are interested in the statistics produced by the variant caller for these positions even if they are not considered variant sites. In this case you can use a minimal VCF file to guide the variant extraction process to include these positions. This minimal VCF file needs a minimal header: - - ``##fileformat=VCFv4.2`` - - followed by positional information like in this example:: - - #CHROM POS ID REF ALT QUAL FILTER INFO - chrI 1222 . . . . . . - chrI 2651 . . . . . . - chrI 3659 . . . . . . - chrI 3731 . . . . . . - - , where columns are tab-separated and . serves as a placeholder for missing information. 
- - *scenario ii* - you have actual variant calls from an additional sample, but you do not have access to the original sequenced reads data (if you had, the recommended approach would be to align this data along with your other sequencing data or, at least, to perform the *Variant Calling* step together). - - This situation is often encountered with published datasets. Assume you have obtained a list of known single nucleotide variants (SNVs) found in one particular strain of your favorite model organism and you would like to know which of these SNVs are present in the related strains you have sequenced. You have aligned the sequenced reads from your samples and have used the *Variant Calling* tool, which has generated a BCF file ready for variant extraction. If the SNV list for the previously sequenced strain is in VCF format already, you can now just plug it into the analysis process by specifying it in the tool interface as an *independently generated vcf file*. The resulting vcf output file will contain all SNV sites along with the variant sites found in the BCF alone. You can then proceed to the *VCF Filter* tool to look at the original SNV sites only or to investigate any other interesting subset of sites. If the SNV list is in some other format, you will have to convert it to VCF first. At a minimum, the file must have a ``##fileformat`` header line like the previous example and have the ``REF`` and ``ALT`` column filled in like so:: - - #CHROM POS ID REF ALT QUAL FILTER INFO - chrI 1897409 . A G . . . - chrI 1897492 . C T . . . - chrI 1897616 . C A . . . - chrI 1897987 . A T . . . - chrI 1898185 . C T . . . - chrI 1898715 . G A . . . - chrI 1898729 . T C . . . - chrI 1900288 . T A . . . - - , in which case the tool will assume that the corresponding sample is homozygous for each of the SNVs. 
If you need to distinguish between homozygous and heterozygous SNVs you will have to extend the format to include a format and a sample column with genotype (GT) information like in this example:: - - #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleX - chrI 1897409 . A G . . . GT 1/1 - chrI 1897492 . C T . . . GT 0/1 - chrI 1897616 . C A . . . GT 0/1 - chrI 1897987 . A T . . . GT 0/1 - chrI 1898185 . C T . . . GT 0/1 - chrI 1898715 . G A . . . GT 0/1 - chrI 1898729 . T C . . . GT 0/1 - chrI 1900288 . T A . . . GT 0/1 - - , in which sampleX would be heterozygous for all SNVs except the first. - - .. class:: warningmark - - If the optional VCF input contains INDEL calls, these will be ignored by the tool. - - -</help> -</tool>
--- a/mimodd_bitbucket_wrappers/vcf_filter.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,133 +0,0 @@ -<tool id="vcf_filter" name="VCF Filter" version="0.1.7.3"> - <description>Extracts lines from a vcf variant file based on field-specific filters</description> - <macros> - <import>toolshed_macros.xml</import> - </macros> - <expand macro="requirements" /> - <version_command>mimodd version -q</version_command> - <command> - mimodd vcf-filter - "$inputfile" - -o "$outputfile" - #if len($datasets): - -s - #for $i in $datasets - "$i.sample" - #end for - --gt - #for $i in $datasets - ## remove whitespace from free-text input - "#echo ("".join($i.GT.split()) or "ANY")#" - #echo " " - #end for - --dp - #for $i in $datasets - "$i.DP" - #end for - --gq - #for $i in $datasets - "$i.GQ" - #end for - --af - #for $i in $datasets - "#echo ($i.AF or "::")#" - #end for - #end if - #if len($regions): - -r - #for $i in $regions - #if $i.stop: - "$i.chrom:$i.start-$i.stop" - #else: - "$i.chrom:$i.start" - #end if - #end for - #end if - #if $vfilter: - --vfilter - ## remove ',' and replace with ' ' - "#echo ('" "'.join($vfilter.split(',')))#" - #end if - $vartype - </command> - - <inputs> - <param format="vcf" label="VCF input file" name="inputfile" type="data" /> - <repeat default="0" min="0" name="datasets" title="Sample-specific Filter"> - <param help="name of a sample as it appears in the VCF input file and that indicates the sample that this filter should be applied to." label="sample" name="sample" type="text" /> - <param help="keep only variants for which the genotype of the sample matches the specified pattern; format: x/x where x = 0 is wildtype and x = 1 is mutant. Multiple genotypes can be specified as a comma-separated list." 
label="genotype pattern(s) for the inclusion of variants" name="GT" type="text" /> - <param help="keep only variants with at least this sample-specific coverage at the variant site" label="depth of coverage for the sample at the variant site" name="DP" type="integer" value="0" /> - <param help="keep only variants for which the genotype prediction for the sample has at least this quality" label="genotype quality for the variant in the sample" name="GQ" type="integer" value="0" /> - <param help="expected format: [allele number]:[minimal fraction]:[maximal fraction]; keep only variants for which the fraction of sample-specific reads supporting a given allele number is between minimal and maximal fraction; if allele number is omitted, the filter operates on the most frequent non-reference allele instead" label="allelic fraction filter" name="AF" type="text" /> - </repeat> - <repeat default="0" help="Filter variant sites by their position in the genome. If multiple Region Filters are specified, all variants that fall in ONE of the regions are reported." min="0" name="regions" title="Region Filter"> - <param label="Chromosome" name="chrom" type="text" /> - <param label="Region Start" name="start" type="text" /> - <param label="Region End" name="stop" type="text" /> - </repeat> - <param label="Select the types of variants to include in the output" name="vartype" type="select"> - <option value="">all types of variants</option> - <option value="--no-indels">exclude indels</option> - <option value="--indels-only">only indels</option> - </param> - <param help="Filter output by sample name; only the sample-specific columns with their sample name matching any of the comma separated filters will be retained in the output." label="sample" name="vfilter" type="text" /> - </inputs> - - <outputs> - <data format="vcf" name="outputfile" /> - </outputs> - - <help> -.. 
class:: infomark - - **What it does** - -The tool filters a variant file in VCF format to generate a new VCF file with only a subset of the original variants. - -The following types of variant filters can be set up: - -1) Sample-specific filters: - - Filter variants based on their characteristics in the sequenced reads of a specific sample. Multiple sample-specific filters are combined by logical AND, i.e., only variants that pass ALL sample-specific filters are kept. - -2) Region filters: - - Filter variants based on the genomic region they affect. Multiple region filters are combined by logical OR, i.e., variants passing ANY region filter are kept. - -3) Variant type filter: - - Filter variants by their type, i.e. whether they are single nucleotide variations (SNVs) or indels - -In addition, the *sample* filter can be used to reduce the samples encoded in a multi-sample VCF file to just those specified by the filter. -The *sample* filter is included mainly for compatibility reasons: if an external tool cannot deal with the multisample file format, but instead looks only at the first sample-specific column of the file, you can use the filter to turn the multi-sample file into a single-sample file. Besides, the filter can also be used to change the order of the samples since it will sort the samples in the order specified in the filter field. 
- -**Examples of sample-specific filters:** - -*Simple genotype pattern* - -genotype pattern: 1/1 ==> keep all variants in the vcf input file for which the specified sample's genotype is homozygous mutant - -*Complex genotype pattern* - -genotype pattern: 0/1, 0/0 ==> keep all variants for which the sample's genotype is either heterozygous or homozygous wildtype - -*Multiple sample-specific filters* - -Filter 1: genotype pattern: 0/0, Filter 2: genotype pattern 1/1: -==> keep all variants for which the first sample's genotype is homozygous wildtype **and** the second sample's genotype is homozygous mutant - -*Combining sample-specific filter criteria* - -genotype pattern: 1/1, depth of coverage: 3, genotype quality: 9 -==> keep variants for which the sample's genotype is homozygous mutant **and** for which this genotype assignment is corroborated by a genotype quality score of at least 9 -**and** at least three reads from the sample cover the variant site - -**TIP:** - -As in the example above, genotype quality is typically most useful in combination with a genotype pattern. -It acts then, effectively, to make the genotype filter more stringent. - - - - </help> -</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/reheader.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,202 @@ +<tool id="reheader" name="Reheader BAM file" version="0.1.7.3"> + <description>From a BAM file generate a new file with the original header (if any) replaced or modified by that found in a second SAM file</description> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + #if ($str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form") or $str($co.treat_co) != "ignore": + mimodd header + #if $str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form": + #for $rginfo in $rg.rginfo.rg + #if $str($rginfo.source_id): + --rg-id "${rginfo.source_id}" + #end if + #if $str($rginfo.rg_sm): + --rg-sm "${rginfo.rg_sm}" + #end if + #if $str($rginfo.rg_cn): + --rg-cn "${rginfo.rg_cn}" + #else: + --rg-cn "" + #end if + #if $str($rginfo.rg_ds): + --rg-ds "${rginfo.rg_ds}" + #else: + --rg-ds "" + #end if + #if $str($rginfo.rg_date): + --rg-dt "${rginfo.rg_date}" + #else: + --rg-dt "" + #end if + #if $str($rginfo.rg_lb): + --rg-lb "${rginfo.rg_lb}" + #else: + --rg-lb "" + #end if + #if $str($rginfo.rg_pl): + --rg-pl "${rginfo.rg_pl}" + #else: + --rg-pl "" + #end if + #if $str($rginfo.rg_pi): + --rg-pi "${rginfo.rg_pi}" + #else: + --rg-pi "" + #end if + #if $str($rginfo.rg_pu): + --rg-pu "${rginfo.rg_pu}" + #else: + --rg-pu "" + #end if + #end for + #end if + #if $str($co.treat_co) != "ignore": + --co + #for $comment in $co.coinfo + #if $str($comment.line): + "${comment.line}" + #end if + #end for + #end if + | + #end if + mimodd reheader "$inputfile" --sq ignore + --rg ${rg.treat_rg} + #if $str($rg.treat_rg) != "ignore": + #if $str($rg.rginfo.source) == "from_file": + "${rg.rginfo.data}" + #else: + - + #end if + #for $rgmapping in $rg.rginfo.rg + #if $str($rgmapping.source_id) and $str($rgmapping.rg_id): + "$str($rgmapping.source_id)" : "$str($rgmapping.rg_id)" + #end if + #end for + #end if + 
+ --co ${co.treat_co} + #if $str($co.treat_co) != "ignore": + - + #end if + + #set $restr = "" + #for $rename in $rg_renaming + #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '" ') + #end for + #if $restr + --rgm $restr + #end if + + #set $restr = "" + #for $rename in $sq_renaming + #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '" ') + #end for + #if $restr + --sqm $restr + #end if + + -o "$output" + </command> + + <macros> + <import>toolshed_macros.xml</import> + <macro name="getreadgroupinfo"> + <conditional name="rginfo"> + <param help="" label="source of new read-group information" name="source" type="select"> + <option value="from_file">existing SAM file</option> + <option value="from_form">input form</option> + </param> + <when value="from_file"> + <param format="sam" help="use the read group information found in this file" label="read-group template file in SAM format" name="data" type="data" /> + <repeat default="0" help="read-group information found in the input file, by default, gets updated / replaced with information from template file read-groups with matching IDs. Alternatively, you may specify explicit read-group mappings below." 
min="0" name="rg" title="custom read-group mapping"> + <param label="modify input file information for read-group ID (will create the read-group if it does not exist)" name="source_id" type="text" /> + <param label="with template file information for read-group ID" name="rg_id" type="text" /> + </repeat> + </when> + <when value="from_form"> + <repeat default="1" min="1" name="rg" title="new read-group info"> + <param help="required field" label="read-group ID (will create the read-group if it does not exist)" name="source_id" type="text" /> + <param name="rg_id" type="hidden" value="" /> + <param help="required field" label="sample name" name="rg_sm" type="text" /> + <param label="description" name="rg_ds" type="text" /> + <param label="date (YY-MM-DD format) the run was produced" name="rg_date" type="text" /> + <param label="name of sequencing center" name="rg_cn" type="text" /> + <param label="read-group library" name="rg_lb" type="text" /> + <param label="platform/technology used to produce the reads" name="rg_pl" type="text" /> + <param label="predicted median insert size" name="rg_pi" type="text" /> + <param label="platform unit; unique identifier" name="rg_pu" type="text" /> + </repeat> + </when> + </conditional> + </macro> + </macros> + + <inputs> + + <param format="bam" help="the file to reheader." label="input file in BAM format" name="inputfile" type="data" /> + + <conditional name="rg"> + <param help="Replace mode will ignore ALL existing read group information in the input file and use ONLY template information, Update mode will COPY existing input file information and UPDATE it with template information; choose No, ... to leave read-group information alone." label="modify read-group information ?" 
name="treat_rg" type="select"> + <option value="ignore">No, do not change read-groups.</option> + <option value="update">Yes, update existing information</option> + <option value="replace">Yes, replace existing information</option> + </param> + <when value="update"> + <expand macro="getreadgroupinfo" /> + </when> + <when value="replace"> + <expand macro="getreadgroupinfo" /> + </when> + </conditional> + + <conditional name="co"> + <param help="" label="modify comments in the input file ?" name="treat_co" type="select"> + <option value="ignore">No, do not change comments.</option> + <option value="update">Yes, append new comments to existing ones</option> + <option value="replace">Yes, replace all existing comments</option> + </param> + <when value="update"> + <repeat default="0" min="0" name="coinfo" title="comment line"> + <param name="line" size="80" type="text" /> + </repeat> + </when> + <when value="replace"> + <repeat default="0" min="0" name="coinfo" title="comment line"> + <param name="line" size="80" type="text" /> + </repeat> + </when> + </conditional> + + <repeat default="0" help="Warning: changing read-group IDs may increase job runtime substantially." min="0" name="rg_renaming" title="rename read-group"> + <param help="as it appears in the current input file header" label="old name" name="from" size="30" type="text" /> + <param label="new name" name="to" size="30" type="text" /> + </repeat> + + <repeat default="0" help="Warning: changing sequence names may increase job runtime substantially." min="0" name="sq_renaming" title="rename sequence"> + <param help="as it appears in the current input file header" label="old name" name="from" size="30" type="text" /> + <param label="new name" name="to" size="30" type="text" /> + </repeat> + + </inputs> + + <outputs> + <data format="bam" label="(Re)headered bam file from MiModd ${tool.name} on ${on_string}" name="output"> + </data> + </outputs> + +<help> +.. 
class:: infomark + + **What it does** + +The tool generates a copy of the BAM input file with a modified header (i.e., metadata). + +It can update or replace read-group information (i.e., information about the samples in the file), add or replace comment lines, and rename reference sequences declared in the header. + +The tool ensures that the resulting BAM file is valid and can be further processed by other MiModD tools and standard software like samtools. It aborts with an error message if a valid BAM file cannot be generated with the user-specified settings. + +The template information used to modify or replace the input file metadata is provided through forms or, in the case of read-group information, can be taken from an existing SAM file as can be generated, for example, with the *NGS Run Annotation* tool. + +</help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/sam_header.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,128 @@ +<tool id="ngs_run_annotation" name="NGS Run Annotation" version="0.1.7.3"> + <description>Create a SAM format header from run metadata for sample annotation.</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd header + + --rg-id "$rg_id" + --rg-sm "$rg_sm" + + #if $str($rg_cn): + --rg-cn "$rg_cn" + #end if + #if $str($rg_ds): + --rg-ds "$rg_ds" + #end if + #if $str($rg_date): + --rg-dt "$rg_date" + #end if + #if $str($rg_lb): + --rg-lb "$rg_lb" + #end if + #if $str($rg_pl): + --rg-pl "$rg_pl" + #end if + #if $str($rg_pi): + --rg-pi "$rg_pi" + #end if + #if $str($rg_pu): + --rg-pu "$rg_pu" + #end if + + --ofile "$outputfile" + + </command> + + <inputs> + <param label="read-group ID (required)" name="rg_id" size="80" type="text"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value=""" /> + </valid> + <mapping initial="none"> + <add source=""" target="\"" /> + </mapping> + </sanitizer> + </param> + <param label="sample name (required)" name="rg_sm" size="80" type="text"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value=""" /> + </valid> + <mapping initial="none"> + <add source=""" target="\"" /> + </mapping> + </sanitizer> + </param> + <param label="description" name="rg_ds" size="80" type="text"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value=""" /> + </valid> + <mapping initial="none"> + <add source=""" target="\"" /> + </mapping> + </sanitizer> + </param> + <param label="date (YYYY-MM-DD) the run was produced" name="rg_date" type="text" /> + <param label="name of sequencing center" name="rg_cn" size="80" type="text"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value=""" /> + </valid> + 
<mapping initial="none"> + <add source=""" target="\"" /> + </mapping> + </sanitizer> + </param> + <param label="read-group library" name="rg_lb" size="80" type="text"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value=""" /> + </valid> + <mapping initial="none"> + <add source=""" target="\"" /> + </mapping> + </sanitizer> + </param> + <param label="platform/technology used to produce the reads" name="rg_pl" type="text" /> + <param label="predicted median insert size" name="rg_pi" type="text" /> + <param label="platform unit; unique identifier" name="rg_pu" size="80" type="text"> + <sanitizer invalid_char=""> + <valid initial="string.printable"> + <remove value=""" /> + </valid> + <mapping initial="none"> + <add source=""" target="\"" /> + </mapping> + </sanitizer> + </param> + </inputs> + + <outputs> + <data format="sam" label="${rg_sm} (${rg_id}) header information from MiModd ${tool.name} on ${on_string}" name="outputfile" /> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +This tool takes the user-provided information about a next-generation sequencing run and constructs a valid header in the SAM file format from it. + +The result file can be used by the tools *Convert* and *Reheader* or in the *SNAP Read Alignment* step to add run metadata to sequenced reads files (or to overwrite pre-existing information). + +**Note:** + +**MiModD requires run metadata for every input file at the Alignment step !** + +**Tip:** + +While you can do Alignments from fastq file format by providing a custom header file directly to the *SNAP Read Alignment* tool, we **recommend** that you first convert all input files to SAM/BAM format with appropriate header information, and archive all datasets in that format, prior to any downstream analysis. Although a bit more time-consuming, this practice protects against information loss and ensures that the input datasets will remain useful for others in the future. + +</help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snap_caller.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,241 @@ +<tool id="read_alignment" name="SNAP Read Alignment" version="0.1.7.3"> + <description>Map sequence reads to a reference genome using SNAP</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd snap-batch -s + ## SNAP calls (considering different cases) + + #for $i in $datasets + "snap ${i.mode_choose.mode} '$ref_genome' + #if $str($i.mode_choose.mode) == "paired" and $str($i.mode_choose.input.iformat) in ("fastq", "gz"): +'${i.mode_choose.input.ifile1}' '${i.mode_choose.input.ifile2}' + #else: +'${i.mode_choose.input.ifile}' + #end if +--ofile '$outputfile' --iformat ${i.mode_choose.input.iformat} --oformat $oformat +--idx-seedsize '$set.seedsize' +--idx-slack '$set.slack' --maxseeds '$set.maxseeds' --maxhits '$set.maxhits' --clipping $set.clipping --maxdist '$set.maxdist' --confdiff '$set.confdiff' --confadapt '$set.confadpt' + #if $i.mode_choose.input.header: +--header '${i.mode_choose.input.header}' + #end if + #if $str($i.mode_choose.mode) == "paired": +--spacing '$set.sp_min' '$set.sp_max' + #end if + #if $str($set.selectivity) != "off": +--selectivity '$set.selectivity' + #end if + #if $str($set.filter_output) != "off": +--filter-output $set.filter_output + #end if + #if $str($set.sort) == "off": +--no-sort + #end if + #if $str($set.mmatch_notation) != "general": +-X + #end if + #if $set.discard_overlapping_mates: +--discard-overlapping-mates + ## remove ',' (and possibly adjacent whitespace) and replace with ' ' + '#echo ("' '".join($set.discard_overlapping_mates.replace(" ", "").split(',')))#' + #end if +--verbose +" + #end for + </command> + + <inputs> + ## mandatory arguments (and mode-conditionals) + + <param format="fasta" help="The fasta reference genome that SNAP should align reads against." 
label="reference genome" name="ref_genome" type="data" /> + + <repeat default="1" min="1" name="datasets" title="datasets"> + <conditional name="mode_choose"> + <param help="Reads obtained from single-end sequencing runs should be aligned in 'single' mode, paired-end reads in 'paired' mode. **WARNING**: if the read input file is in SAM/BAM format, the current version of this tool will **not** verify the mode and may produce erroneous alignments with wrong settings!" label="choose mode" name="mode" type="select"> + <option value="single">single-end</option> + <option value="paired">paired-end</option> + </param> + + <when value="single"> + <conditional name="input"> + <param label="input file format" name="iformat" type="select"> + <option value="bam">BAM</option> + <option value="sam">SAM</option> + <option value="gz">gz</option> + <option value="fastq">fastq</option> + </param> + <when value="bam"> + <param format="bam" label="input file" name="ifile" type="data" /> + <param format="sam" label="custom header file" name="header" optional="true" type="data" /> + </when> + <when value="sam"> + <param format="sam" label="input file" name="ifile" type="data" /> + <param format="sam" label="custom header file" name="header" optional="true" type="data" /> + </when> + <when value="gz"> + <param label="input file" name="ifile" type="data" /> + <param format="sam" label="header file" name="header" type="data" /> + </when> + <when value="fastq"> + <param format="fastq" label="input file" name="ifile" type="data" /> + <param format="sam" label="header file" name="header" type="data" /> + </when> + </conditional> + </when> + <when value="paired"> + <conditional name="input"> + <param label="input file format" name="iformat" type="select"> + <option value="bam">BAM</option> + <option value="sam">SAM</option> + <option value="gz">gz</option> + <option value="fastq">fastq</option> + </param> + <when value="bam"> + <param format="bam" label="input file" name="ifile" type="data" /> 
+ <param format="sam" label="custom header file" name="header" optional="true" type="data" /> + </when> + <when value="sam"> + <param format="sam" label="input file" name="ifile" type="data" /> + <param format="sam" label="custom header file" name="header" optional="true" type="data" /> + </when> + <when value="fastq"> + <param format="fastq" label="inputfile with the first set of reads of paired-end data" name="ifile1" type="data" /> + <param format="fastq" label="inputfile with the second set of reads of paired-end data" name="ifile2" type="data" /> + <param format="sam" help="required" label="header file" name="header" type="data" /> + </when> + <when value="gz"> + <param label="inputfile with the first set of reads of paired-end data" name="ifile1" type="data" /> + <param label="inputfile with the second set of reads of paired-end data" name="ifile2" type="data" /> + <param format="sam" help="required" label="header file" name="header" type="data" /> + </when> + </conditional> + </when> + </conditional> + </repeat> + + <param label="output file format" name="oformat" type="select"> + <option value="bam">BAM</option> + <option value="sam">SAM</option> + </param> + + ## optional arguments + + <conditional name="set"> + <param help="This section lets you specify the detailed parameter settings for the SNAP aligner. Only change them if you know what you are doing, i.e., read the documentation first." 
label="further parameter settings" name="settings_mode" type="select"> + <option value="default">default settings</option> + <option value="change">change settings</option> + </param> + + ## default settings + + <when value="default"> + <param name="seedsize" type="hidden" value="20" /> + <param name="slack" type="hidden" value="0.3" /> + <param name="sp_min" type="hidden" value="100" /> + <param name="sp_max" type="hidden" value="10000" /> + <param name="maxdist" type="hidden" value="8" /> + <param name="confdiff" type="hidden" value="2" /> + <param name="confadpt" type="hidden" value="7" /> + + <param name="maxseeds" type="hidden" value="25" /> + <param name="maxhits" type="hidden" value="250" /> + <param name="clipping" type="hidden" value="++" /> + + <param name="selectivity" type="hidden" value="off" /> + <param name="filter_output" type="hidden" value="off" /> + <param name="sort" type="hidden" value="0" /> + <param name="mmatch_notation" type="hidden" value="general" /> + <param name="discard_overlapping_mates" type="hidden" value="" /> + </when> + + ## change settings + + <when value="change"> + <param help="Length of the seeds used in the reference genome hash table (SNAP index option -s)." label="seed size (default: 20)" name="seedsize" type="integer" value="20" /> + <param help="Corresponds to the -h option of SNAP index." label="hash table slack size (default: 0.3)" name="slack" type="float" value="0.3" /> + + ## paired-end specific options + <param help="Corresponds to the first value of the SNAP option -s. Affects paired-end data only." label="minimum spacing to allow between paired ends (default: 100)" name="sp_min" type="integer" value="100" /> + <param help="Corresponds to the second value of the SNAP option -s. Affects paired-end data only." 
label="maximum spacing to allow between paired ends (default: 10000)" name="sp_max" type="integer" value="10000" /> + <param display="checkboxes" help="Consider overlapping mate pairs of the given orientation type(s) anomalous and discard them; allowed values: RF, FR, FF, RR; multiple types may be specified as a comma-separated list and ALL can be used as a shortcut for discarding all overlapping mate pairs; leave blank to retain all overlapping pairs. Affects paired-end data only." label="discard overlapping read pairs of type" multiple="true" name="discard_overlapping_mates" type="text" /> + <param help="maximum edit distance allowed per read or pair (SNAP option -d); higher values allow more divergent alignments to be found, but increase the rate of misalignments." label="edit distance (default: 8)" name="maxdist" type="integer" value="8" /> + <param help="Maximum hits to consider per seed (SNAP option -h); don't use a seed region in the alignment process if it matches more than maxhits regions in the reference genome. Higher values reduce the rate of misalignments, but reduce performance." label="maximum hits per seed (default: 250)" name="maxhits" type="integer" value="250" /> + <param help="Confidence threshold (SNAP option -c); the minimum edit distance difference between two alternate alignments required to reject the poorer alignment as suboptimal; higher values increase the rate of ambiguously aligned reads." label="confidence threshold (default: 2)" name="confdiff" type="integer" value="2" /> + <param help="Specifies how many seeds of a read may be ignored (based on the maximum hits value above) before the confidence threshold above gets increased by one for that read; helps fine-tuning alignment accuracy in repetitive regions of the genome." 
label="adaptive confdiff behaviour (default: 7)" name="confadpt" type="integer" value="7" /> + <param help="Number of seeds to use per read (SNAP option -n) when trying to match it to the reference genome; higher numbers will increase the rate of aligned reads and reduce the rate of misalignments, but will reduce performance." label="maximum seeds per read (default: 25)" name="maxseeds" type="integer" value="25" /> + <param help="Specifies from which end of a read low-quality bases should be clipped (SNAP option -Cxx)." label="read clipping (default: from back and front)" name="clipping" type="select"> + <option value="++">from back and front</option> + <option value="x+">from back only</option> + <option value="+x">from front only</option> + <option value="xx">no clipping</option> + </param> + <param help="Randomly choose 1/selectivity of the reads to score (SNAP option -S). The tool uses the default of 1 (or a 0 setting) to indicate that all reads should be used." label="selectivity (default: 1)" name="selectivity" type="integer" value="1" /> + <param help="Filter output (SNAP option -F) for certain classes of reads." label="filter output (default: no filtering)" name="filter_output" type="select"> + <option value="off">no filtering</option> + <option value="a">aligned only</option> + <option value="s">single-aligned only</option> + <option value="u">unaligned only</option> + </param> + <param help="Sort the output file by alignment location (SNAP option --so)." label="output sorting (default: sort by read coordinates)" name="sort" type="select"> + <option value="0">sort by read coordinates</option> + <option value="off">no sorting</option> + </param> + <param help="Indicates whether CIGAR strings in the generated SAM/BAM file should use M (alignment match) rather than = and X (sequence (mis-)match). Warning: Downstream variant calling based on samtools currently relies on the old-style M notation!!" 
label="CIGAR symbols for alignment matches/mismatches (default: M notation)" name="mmatch_notation" type="select"> + <option value="general">use M for both matches and mismatches</option> + <option value="differentiate">use = for matches, X for mismatches</option> + </param> + </when> + </conditional> +</inputs> + +<outputs> + <data format="bam" label="Aligned reads from MiModd ${tool.name} on ${on_string}" name="outputfile"> + <change_format> + <when format="sam" input="oformat" value="sam" /> + </change_format> + </data> +</outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool aligns the sequenced reads in an arbitrary number of input datasets against a common reference genome and stores the results in a single, possibly multi-sample output file. It supports a variety of different sequenced reads input formats, i.e., SAM, BAM, fastq and gzipped fastq, and both single-end and paired-end data. + +Internally, the tool uses the ultrafast, hashtable-based aligner SNAP (http://snap.cs.berkeley.edu), hence its name. + +**Notes:** + +1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to align gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. + +2) To use paired-end fastq data with the tool the read mate information needs to be split over two fastq files in corresponding order. + + **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. + +3) The tool supports the alignment of reads from the same sequencing run, but distributed across several input files. 
+ + Generally, it expects the reads from each input dataset to belong to one read-group and will abort with an error message if any input dataset declares more than one read group or sample name in its header. Different datasets, however, are allowed to contain reads from the same read-group (as indicated by matching read-group IDs and sample names in their headers), in which case the reads will be combined into one group in the output. + +4) Read-group information is required for every input dataset! + + We generally recommend storing NGS datasets in SAM/BAM format with run metadata stored in the file header. You can use the *NGS Run Annotation* and *Convert* tools to convert data in fastq format to SAM/BAM with added run information. + + While it is not our recommended approach, you can, if you prefer, align reads from fastq files or SAM/BAM files without header read-group information. To do so, you **must** specify a SAM file that provides the missing information in its header along with the input dataset. You can generate a SAM header file with the *NGS Run Annotation* tool. + + Optionally, a SAM header file can also be used to replace existing read-group information in a headered SAM/BAM input file. This can be used to resolve read-group ID conflicts between multiple input files at tool runtime. + +5) The options available under *further parameter settings* can have **big** effects on the alignment quality. You are strongly encouraged to consult the `tool documentation`_ for detailed explanations of the available options. + +6) Currently, you cannot configure aligner-specific options separately for specific input files from within this Galaxy tool. If you need this advanced level of control, you should use the command line tool ``mimodd snap-batch``. + +.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 +.. 
_recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy +.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest +.. _tool documentation: http://mimodd.readthedocs.org/en/latest/tool_doc.html#snap + +</help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snp_caller_caller.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,66 @@ +<tool id="variant_calling" name="Variant Calling" version="0.1.7.3"> + <description>From a reference and aligned reads generate a BCF file with position-specific variant likelihoods and coverage information</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd varcall + + "$ref_genome" + #for $l in $list_input + "${l.inputfile}" + #end for + --ofile "$output_vcf" + --depth "$depth" + $group_by_id + $no_md5_check + --verbose + --quiet + </command> + + <inputs> + <param format="fasta" label="reference genome" name="ref_genome" type="data" /> + <repeat default="1" min="1" name="list_input" title="Aligned reads input source"> + <param format="bam" label="input file" name="inputfile" type="data" /> + </repeat> + <param checked="false" falsevalue="" help="If selected, this option ensures that only the read group id (but not the sample name) is considered in grouping reads in the input file(s). If turned off, read groups with identical sample names are automatically pooled and analyzed together even if they come from different NGS runs." label="group reads based on read group id only" name="group_by_id" truevalue="-i" type="boolean" /> + <param checked="false" falsevalue="" help="leave turned on to avoid accidental variant calling against a wrong reference genome version (see the tool help below)." label="turn off md5 sum verification" name="no_md5_check" truevalue="-x" type="boolean" /> + <param help="to avoid excessive use of memory" label="maximum per-BAM depth (default: 250)" name="depth" type="integer" value="250" /> + </inputs> + + <outputs> + <data format="bcf" label="Variant Calls from MiModd Variant Calling on ${on_string}" name="output_vcf" /> + </outputs> + +<help> +.. 
class:: infomark + + **What it does** + +The tool transforms the read-centered information of its aligned reads input files into position-centered information. + +**It produces a BCF file that serves as the basis for all further variant analyses with MiModD**. + +**Notes:** + +By default, the tool will check whether the input BAM file(s) provide(s) MD5 checksums for the reference genome sequences used during read alignment (the *SNAP Read Alignment* tool stores these in the BAM file header). If it finds MD5 sums for all sequences, it will compare them to the actual checksums of the sequences in the specified reference genome. +It then checks that every sequence mentioned in any BAM input file has a counterpart with a matching MD5 sum in the reference genome and aborts with an error message if that is not the case. If it finds sequences with matching checksums, but different names in the reference genome, it will use the name from the reference genome file in its output. + +This behavior has two benefits: + +1) It protects from accidental variant calling against a wrong reference genome (i.e., a different one than that used during the alignment step), which would result in wrong calls. This is the primary reason why we recommend leaving the check activated. + +2) It provides an opportunity to change sequence names between aligned reads files and variant call files by providing a reference genome file with altered sequence names (but identical sequence data). + +Since there may be rare cases where you *really* want to call variants against a reference genome with different checksums (e.g., you may have edited the reference sequence based on the alignment results), the check can be turned off, but only do this if you know exactly why. + +----------- + +Internally, the tool uses samtools mpileup combined with bcftools to do all per-nucleotide calculations. + +It exposes just a single configuration parameter of these tools - the *maximum per-BAM depth*. 
Through this parameter, the maximum number of reads considered for variant calling at any site can be controlled. Its default value of 250 is taken from *samtools mpileup* and usually suitable. Consider, however, that this gives the maximum read number per input file, so if you have a large number of samples in one input file, it could become necessary to increase the value to get sufficient reads considered per sample. + +</help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snpeff_genomes.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,24 @@ +<tool id="snpeff_genomes" name="List Installed SnpEff Genomes" version="0.1.7.3"> + <description>Checks the local SnpEff installation to compile a list of currently installed genomes</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd snpeff-genomes -o "$outputfile" + </command> + <outputs> + <data format="tabular" name="outputfile" /> + </outputs> +<help> +.. class:: infomark + +**What it does** + +When executed this tool searches the host machine's SnpEff installation for properly registered and installed +genome annotation files. The resulting list is added as a plain text file to your history for use with the *Variant Annotation* Tool. + +</help> + +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/toolshed_macros.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,7 @@ +<macros> + <xml name="requirements"> + <requirements> + <requirement type="package" version="0.1.7.3">mimodd</requirement> + </requirements> + </xml> +</macros>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/varextract.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,101 @@ +<tool id="extract_variants" name="Extract Variant Sites" version="0.1.7.3"> + <description>from a BCF file</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd varextract "$ifile" + #if $len($sitesinfo) + -p + #for $source in $sitesinfo + "${source.pre_vcf}" + #end for + #end if + --ofile "$output_vcf" + $keep_alts + --verbose + </command> + + <inputs> + <param format="bcf" help="Use the Variant Calling tool to generate the input for this tool." label="BCF input file" name="ifile" type="data" /> + <repeat default="0" name="sitesinfo" title="include information from pre-calculated vcf file"> + <param format="vcf" label="independently generated vcf file" name="pre_vcf" type="data" /> + </repeat> + <param checked="false" falsevalue="" help="If selected, the VCF output will include ALL sites for which non-reference bases have been observed, i.e., even those not considered allelic sites by the variant caller." label="keep all sites with alternate bases" name="keep_alts" truevalue="-a" type="boolean" /> + </inputs> + <outputs> + <data format="vcf" label="Variants extracted with MiModd from ${on_string}" name="output_vcf" /> + </outputs> + +<help> +.. class:: infomark + + **What it does** + +The tool takes as input a BCF file like the ones produced by the *Variant Calling* tool, extracts just the variant sites from it and reports them in VCF format. + +If the BCF input file specifies multiple samples, sites are included if they qualify as variant sites in at least one sample. + +In a typical analysis workflow, you will use the tool's VCF output as input for the *VCF Filter* tool to cut down the often still impressive list of sites to a subset with relevance to your project. 
+ +**Options:** + +1) By default, a variant site is considered to be a position in the genome for which a non-reference allele appears in the inferred genotype of any sample. + + You can select the *keep all sites with alternate bases* option if you instead want to extract all sites for which at least one non-reference base has been observed (whether resulting in a non-reference allele call or not). Using this option should rarely be necessary, but could occasionally be helpful for closer inspection of candidate genomic regions. + +2) During the process of variant extraction the tool can take into account genome positions specified in one or more independently generated VCF files. If such additional VCF input is provided, the tool output will contain the samples found in these files as additional samples, and sites from the main BCF file will be included if they either qualify as variant sites in at least one sample specified in the BCF or are listed in any of the additional VCF files. + + Optional VCF input can be particularly useful in one of the following situations: + + *scenario i* - you have prior information that leads you to think that certain genome positions are of special relevance for your project and, thus, you are interested in the statistics produced by the variant caller for these positions even if they are not considered variant sites. In this case you can use a minimal VCF file to guide the variant extraction process to include these positions. This minimal VCF file needs a minimal header: + + ``##fileformat=VCFv4.2`` + + followed by positional information like in this example:: + + #CHROM POS ID REF ALT QUAL FILTER INFO + chrI 1222 . . . . . . + chrI 2651 . . . . . . + chrI 3659 . . . . . . + chrI 3731 . . . . . . + + , where columns are tab-separated and . serves as a placeholder for missing information. 
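A minimal sketch of how such a positions-only VCF file could be generated with Python is shown here. The helper names ``write_minimal_vcf`` and ``minimal_vcf_text`` are hypothetical and not part of MiModD; the sketch only assumes the layout described above (the ``##fileformat=VCFv4.2`` header, the eight standard columns, tab separation and ``.`` as placeholder).

```python
import io

# Hypothetical helpers, not part of MiModD: they write a positions-only
# VCF with the layout described in the help text above.
VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def write_minimal_vcf(positions, handle):
    """Write (chromosome, position) pairs as a minimal VCF to a text handle."""
    handle.write("##fileformat=VCFv4.2\n")
    handle.write("\t".join(VCF_COLUMNS) + "\n")
    for chrom, pos in positions:
        # "." is the placeholder for the six fields that carry no information
        fields = [chrom, str(pos)] + ["."] * 6
        handle.write("\t".join(fields) + "\n")


def minimal_vcf_text(positions):
    """Return the minimal VCF as a string, e.g. for inspection before saving."""
    buffer = io.StringIO()
    write_minimal_vcf(positions, buffer)
    return buffer.getvalue()
```

To save the result, open an output file in text mode and pass the handle to ``write_minimal_vcf`` together with a list such as ``[("chrI", 1222), ("chrI", 2651)]``; the file can then be uploaded to Galaxy as an *independently generated vcf file*.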
+ + *scenario ii* - you have actual variant calls from an additional sample, but you do not have access to the original sequenced reads data (if you had, the recommended approach would be to align this data along with your other sequencing data or, at least, to perform the *Variant Calling* step together). + + This situation is often encountered with published datasets. Assume you have obtained a list of known single nucleotide variants (SNVs) found in one particular strain of your favorite model organism and you would like to know which of these SNVs are present in the related strains you have sequenced. You have aligned the sequenced reads from your samples and have used the *Variant Calling* tool, which has generated a BCF file ready for variant extraction. If the SNV list for the previously sequenced strain is in VCF format already, you can now just plug it into the analysis process by specifying it in the tool interface as an *independently generated vcf file*. The resulting vcf output file will contain all SNV sites along with the variant sites found in the BCF alone. You can then proceed to the *VCF Filter* tool to look at the original SNV sites only or to investigate any other interesting subset of sites. If the SNV list is in some other format, you will have to convert it to VCF first. At a minimum, the file must have a ``##fileformat`` header line like the previous example and have the ``REF`` and ``ALT`` columns filled in like so:: + + #CHROM POS ID REF ALT QUAL FILTER INFO + chrI 1897409 . A G . . . + chrI 1897492 . C T . . . + chrI 1897616 . C A . . . + chrI 1897987 . A T . . . + chrI 1898185 . C T . . . + chrI 1898715 . G A . . . + chrI 1898729 . T C . . . + chrI 1900288 . T A . . . + + , in which case the tool will assume that the corresponding sample is homozygous for each of the SNVs. 
If you need to distinguish between homozygous and heterozygous SNVs you will have to extend the format to include a format and a sample column with genotype (GT) information like in this example:: + + #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleX + chrI 1897409 . A G . . . GT 1/1 + chrI 1897492 . C T . . . GT 0/1 + chrI 1897616 . C A . . . GT 0/1 + chrI 1897987 . A T . . . GT 0/1 + chrI 1898185 . C T . . . GT 0/1 + chrI 1898715 . G A . . . GT 0/1 + chrI 1898729 . T C . . . GT 0/1 + chrI 1900288 . T A . . . GT 0/1 + + , in which sampleX would be heterozygous for all SNVs except the first. + + .. class:: warningmark + + If the optional VCF input contains INDEL calls, these will be ignored by the tool. + + +</help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/vcf_filter.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,133 @@ +<tool id="vcf_filter" name="VCF Filter" version="0.1.7.3"> + <description>Extracts lines from a vcf variant file based on field-specific filters</description> + <macros> + <import>toolshed_macros.xml</import> + </macros> + <expand macro="requirements" /> + <version_command>mimodd version -q</version_command> + <command> + mimodd vcf-filter + "$inputfile" + -o "$outputfile" + #if len($datasets): + -s + #for $i in $datasets + "$i.sample" + #end for + --gt + #for $i in $datasets + ## remove whitespace from free-text input + "#echo ("".join($i.GT.split()) or "ANY")#" + #echo " " + #end for + --dp + #for $i in $datasets + "$i.DP" + #end for + --gq + #for $i in $datasets + "$i.GQ" + #end for + --af + #for $i in $datasets + "#echo ($i.AF or "::")#" + #end for + #end if + #if len($regions): + -r + #for $i in $regions + #if $i.stop: + "$i.chrom:$i.start-$i.stop" + #else: + "$i.chrom:$i.start" + #end if + #end for + #end if + #if $vfilter: + --vfilter + ## remove ',' and replace with ' ' + "#echo ('" "'.join($vfilter.split(',')))#" + #end if + $vartype + </command> + + <inputs> + <param format="vcf" label="VCF input file" name="inputfile" type="data" /> + <repeat default="0" min="0" name="datasets" title="Sample-specific Filter"> + <param help="name of a sample as it appears in the VCF input file and that indicates the sample that this filter should be applied to." label="sample" name="sample" type="text" /> + <param help="keep only variants for which the genotype of the sample matches the specified pattern; format: x/x where x = 0 is wildtype and x = 1 is mutant. Multiple genotypes can be specified as a comma-separated list." 
label="genotype pattern(s) for the inclusion of variants" name="GT" type="text" /> + <param help="keep only variants with at least this sample-specific coverage at the variant site" label="depth of coverage for the sample at the variant site" name="DP" type="integer" value="0" /> + <param help="keep only variants for which the genotype prediction for the sample has at least this quality" label="genotype quality for the variant in the sample" name="GQ" type="integer" value="0" /> + <param help="expected format: [allele number]:[minimal fraction]:[maximal fraction]; keep only variants for which the fraction of sample-specific reads supporting a given allele number is between minimal and maximal fraction; if allele number is omitted, the filter operates on the most frequent non-reference allele instead" label="allelic fraction filter" name="AF" type="text" /> + </repeat> + <repeat default="0" help="Filter variant sites by their position in the genome. If multiple Region Filters are specified, all variants that fall in ONE of the regions are reported." min="0" name="regions" title="Region Filter"> + <param label="Chromosome" name="chrom" type="text" /> + <param label="Region Start" name="start" type="text" /> + <param label="Region End" name="stop" type="text" /> + </repeat> + <param label="Select the types of variants to include in the output" name="vartype" type="select"> + <option value="">all types of variants</option> + <option value="--no-indels">exclude indels</option> + <option value="--indels-only">only indels</option> + </param> + <param help="Filter output by sample name; only the sample-specific columns with their sample name matching any of the comma separated filters will be retained in the output." label="sample" name="vfilter" type="text" /> + </inputs> + + <outputs> + <data format="vcf" name="outputfile" /> + </outputs> + + <help> +.. 
class:: infomark + + **What it does** + +The tool filters a variant file in VCF format to generate a new VCF file with only a subset of the original variants. + +The following types of variant filters can be set up: + +1) Sample-specific filters: + + Filter variants based on their characteristics in the sequenced reads of a specific sample. Multiple sample-specific filters are combined by logical AND, i.e., only variants that pass ALL sample-specific filters are kept. + +2) Region filters: + + Filter variants based on the genomic region they affect. Multiple region filters are combined by logical OR, i.e., variants passing ANY region filter are kept. + +3) Variant type filter: + + Filter variants by their type, i.e. whether they are single nucleotide variations (SNVs) or indels + +In addition, the *sample* filter can be used to reduce the samples encoded in a multi-sample VCF file to just those specified by the filter. +The *sample* filter is included mainly for compatibility reasons: if an external tool cannot deal with the multisample file format, but instead looks only at the first sample-specific column of the file, you can use the filter to turn the multi-sample file into a single-sample file. Besides, the filter can also be used to change the order of the samples since it will sort the samples in the order specified in the filter field. 
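The combination rules stated above (logical AND across sample-specific filters, logical OR across region filters) can be pictured with a toy model. This is an illustration only: the record layout and the function names are assumptions made for the example and do not reflect MiModD internals, and requiring a variant to pass both filter types is likewise an assumption.

```python
# Illustrative only: a toy model of how the filter types combine.
# The record layout and all names are assumptions, not MiModD internals.


def passes_sample_filters(record, sample_filters):
    """Sample-specific filters combine by logical AND: every one must match."""
    return all(
        record["samples"][f["sample"]]["GT"] in f["genotypes"]
        for f in sample_filters
    )


def passes_region_filters(record, regions):
    """Region filters combine by logical OR; no regions means no restriction."""
    if not regions:
        return True
    return any(
        record["chrom"] == chrom and record["pos"] in range(start, stop + 1)
        for chrom, start, stop in regions
    )


def keep(record, sample_filters, regions):
    """Assumed here: a variant is kept only if it passes both filter types."""
    return (passes_sample_filters(record, sample_filters)
            and passes_region_filters(record, regions))
```

For example, a record on ``chrI`` at position 1500 is kept when it is covered by any one of several region filters, but it is dropped as soon as a single sample-specific genotype pattern fails to match.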
+ +**Examples of sample-specific filters:** + +*Simple genotype pattern* + +genotype pattern: 1/1 ==> keep all variants in the vcf input file for which the specified sample's genotype is homozygous mutant + +*Complex genotype pattern* + +genotype pattern: 0/1, 0/0 ==> keep all variants for which the sample's genotype is either heterozygous or homozygous wildtype + +*Multiple sample-specific filters* + +Filter 1: genotype pattern: 0/0, Filter 2: genotype pattern: 1/1 +==> keep all variants for which the first sample's genotype is homozygous wildtype **and** the second sample's genotype is homozygous mutant + +*Combining sample-specific filter criteria* + +genotype pattern: 1/1, depth of coverage: 3, genotype quality: 9 +==> keep variants for which the sample's genotype is homozygous mutant **and** for which this genotype assignment is corroborated by a genotype quality score of at least 9 +**and** at least three reads from the sample cover the variant site + +**TIP:** + +As in the example above, genotype quality is typically most useful in combination with a genotype pattern. +It then effectively makes the genotype filter more stringent. + + + + </help> +</tool>
