# HG changeset patch # User wolma # Date 1490690044 14400 # Node ID d6ec32ce882b888cedcddf3eada0fd476066aefc # Parent 7f70281124394a56bd98258d8cf8527c7e51f1a9 Uploaded diff -r 7f7028112439 -r d6ec32ce882b annotate_variants.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/annotate_variants.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,169 @@ + + Predict the effects of SNPs and indels on known genes in the reference genome using SnpEff + + toolshed_macros.xml + + + mimodd version -q + + mimodd annotate + + "$inputfile" + + #if $str($annotool.name)=='snpeff': + --genome "${annotool.genomeVersion}" + #if $annotool.ori_output: + --snpeff-out "$snpeff_file" + #end if + #if $annotool.stats: + --stats "$summary_file" + #end if + ${annotool.snpeff_settings.chr} ${annotool.snpeff_settings.no_us} ${annotool.snpeff_settings.no_ds} ${annotool.snpeff_settings.no_intron} ${annotool.snpeff_settings.no_intergenic} ${annotool.snpeff_settings.no_utr} + #if $annotool.snpeff_settings.min_cov: + --minC "${annotool.snpeff_settings.min_cov}" + #end if + #if $annotool.snpeff_settings.min_qual: + --minQ "${annotool.snpeff_settings.min_qual}" + #end if + #if $annotool.snpeff_settings.ud: + --ud "${annotool.snpeff_settings.ud}" + #end if + #end if + + --ofile "$outputfile" + #if $str($formatting.oformat) == "text": + --oformat text + #end if + #if $str($formatting.oformat) == "html": + #if $formatting.formatter_file: + --link "${formatting.formatter_file}" + #end if + #if $formatting.species + --species "${formatting.species}" + #end if + #end if + + #if $str($grouping): + --grouping $grouping + #end if + --verbose + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ## default settings for SnpEff + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + (annotool['name']=="snpeff" and annotool['ori_output']) + + + (annotool['name']=="snpeff" and annotool['stats']) + + + + +.. class:: infomark + + **What it does** + +The tool turns a variant list in VCF format into a more readable summary table listing variant sites and effects. + +If installed, the variant annotation tool SnpEff can be used transparently to determine the genomic features, e.g., genes or transcripts, affected by the variants. + +Use of this feature requires that you have an appropriate SnpEff genome file installed on the host machine. You can use the *List installed SnpEff genomes* tool to generate a list of all available SnpEff genomes. +This list can then be used (by selecting the dataset as the *genome list*) to populate the *genome* dropdown menu, from which you can select the SnpEff genome file to be used for the annotation. + +As output file formats HTML or plain text are supported. +In HTML mode, variant positions and/or affected genomic features can be turned into hyperlinks to corresponding views in web-based genome browsers and databases. + +The behavior of this feature depends on: + +1) Recognition of the species that is analyzed + + You can declare the species you are working with using the *Species* text field. + If you are not declaring the species explicitly, but are choosing SnpEff for effect annotation, the tool will usually be able to auto-detect the species from the SnpEff genome you are using. + If no species gets assigned in either way, no hyperlinks will be generated and the html output will look essentially like plain text. 
+ +2) Available hyperlink formatting rules for this species + + When the species has been recognized, the tool checks if you have selected an *optional file with hyperlink formatting instructions*. + If you did and that file contains an entry matching the recognized species, that entry will be used as a template to construct the hyperlinks. + If no matching entry is found in the file, an error will be raised. + + If you did not supply a hyperlink formatting instruction file, the tool will consult an internal lookup table to see if it finds default rules for the construction of the hyperlinks for the species. + If not, no hyperlinks will be generated and the html output will look essentially like plain text. + + **TIP:** + MiModD's internal hyperlink formatting lookup tables are maintained and growing with every new version, but since weblinks are changing frequently as well, it is possible that you will encounter broken hyperlinks for your species of interest. In such a case, you can resort to two things: `tell us about the problem`_ to make sure it gets fixed in the next release and, in the meantime, use a custom file with hyperlink formatting instructions to overwrite the default entry for your species. + +.. _tell us about the problem: mailto:mimodd@googlegroups.com + + diff -r 7f7028112439 -r d6ec32ce882b bamsort.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/bamsort.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,52 @@ + + Sort a BAM file by coordinates (or names) of the mapped reads + + toolshed_macros.xml + + + mimodd version -q + + mimodd sort "$input.ifile" -o "$output" --iformat $input.iformat --oformat $oformat $by_name + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool sorts a BAM file of aligned reads, typically by the reference genome coordinates that the reads have been mapped to. + +Coordinate-sorted input files are expected by most downstream MiModD tools, but note that the *SNAP Read Alignment* produces coordinate-sorted output by default and it is only necessary to sort files that come from other sources or from *SNAP Read Alignment* jobs with a custom sort order. + +The option *Sort by read names instead of coordinates* is useful if you want to re-align coordinate-sorted paired-end data. In *paired-end mode*, the *SNAP Read Alignment* tool expects the reads in the input file to be arranged in read pairs, i.e., the forward read information of a pair must be followed immediately by its reverse mate information, which is typically not the case in coordinate-sorted files. Resorting such files by read names fixes this problem. + + + diff -r 7f7028112439 -r d6ec32ce882b cloudmap.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/cloudmap.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,340 @@ + + Map causative mutations by multi-variant linkage analysis. 
+ + mimodd version -q + + mimodd map ${opt.mode} "${opt.source.ifile}" + #if $str($opt.source.sample): + -m "${opt.source.sample}" + #end if + #if $str($opt.source.related_parent_sample): + -r "${opt.source.related_parent_sample}" + #end if + #if $str($opt.source.unrelated_parent_sample): + -u "${opt.source.unrelated_parent_sample}" + #end if + $opt.source.infer_missing + -o "$ofile" + #if $str($opt.source.seqdict_required.required) == "yes": + -s "${opt.source.seqdict_required.seqdict}" + #end if + $opt.source.norm + #if $len($opt.source.bin_sizes): + --bin-sizes + #for $size in $opt.source.bin_sizes: + "${size.bin_size}" + #end for + #end if + #if $str($opt.source.tabfile): + $str($opt.source.tabfile) $tfile + #end if + #if $str($opt.source.plotopts.plots): + $str($opt.source.plotopts.plots) "$pfile" + $str($opt.source.plotopts.xlim) + #if $str($opt.source.plotopts.hylim): + --ylim-hist $str($opt.source.plotopts.hylim) + #end if + #if $str($opt.source.plotopts.hcols) and $len($opt.source.plotopts.hcols): + --hist-colors + #for $color in $opt.source.plotopts.hcols: + "${color.hcolor}" + #end for + #end if + #if $str($opt.source.plotopts.sylim): + --ylim-scatter $str($opt.source.plotopts.sylim) + #end if + #if $str($opt.source.plotopts.pcol): + --points-color "$str($opt.source.plotopts.pcol)" + #end if + #if $str($opt.source.plotopts.lcol): + --loess-color "$str($opt.source.plotopts.lcol)" + #end if + #if $str($opt.source.plotopts.span): + --loess-span "$str($opt.source.plotopts.span)" + #end if + #end if + + + + + toolshed_macros.xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + (opt['source']['tabfile']) + + + (opt['source']['plotopts']['plots']) + + + + +.. class:: infomark + + **What it does** + +This tool is a complete rewrite of and improves the EMS Variant Density and Hawaiian Variant Mapping tools of `CloudMap`_. It is the most downstream tool in `mapping-by-sequencing analysis workflows in MiModD`_. + +It can be used to analyze and visualize the inheritance pattern of variants detected and selected by other MiModD tools or as an alternative (and more versatile) plotting engine for data generated with `CloudMap`_. + +------------- + +**Usage Modes:** + +This tool can be run in one of two different modes depending on the type of mapping analysis that should be performed: + +1) *Simple Variant Density (SVD) Mapping* mode analyzes the density of variants along the reference genome by dividing each chromosome into regions of user-defined size (bins) and counting the variants found in each bin. + + All variants listed in the input file are analyzed in this mode, which means that as input you will typically want to use filtered lists of variants (as produced by the VCF Filter tool). + + The aim of SVD analysis is to identify clusters of variants in an outcrossed strain carrying a selectable unknown mutation, which is interpreted as linkage between the corresponding genomic region and the unknown mutation. + + This mode corresponds roughly to EMS Variant Density Mapping in CloudMap. + +2) *Variant Allele Frequency (VAF) Mapping** mode analyzes the inheritance pattern in cross-progeny at sites, at which the parents are homozygous for different alleles. 
+ + The aim of VAF analysis is to identify clusters of variants with (near) homozygous inheritance in an F2 (or later generation) population obtained from a cross between a strain carrying a selectable unknown mutation and an unrelated mapping strain. Such a cluster is interpreted as linkage between the corresponding genomic region and the unknown mutation selected for in the F2 generation. + + This mode corresponds roughly to Hawaiian Variant Mapping in CloudMap, but can simultaneously take into account non-reference alleles found in either parent strain (CloudMap users may think of this as a combined Hawaiian Variant and Variant Discovery Mapping analysis). + +------------- + +**Input:** + +Valid inputs for this tool are VCF files (any VCF file in SVD mode, a MiModD-generated multi-sample VCF file in VAF mode) or a CloudMap tabular report file as generated by the Hawaiian Variant Mapping tool. Alternatively, the tool can generate (in both modes) its own tabular report file, which can be used as input instead of the original VCF file when rerunning the tool with different plotting parameters to reduce analysis time. + +.. class:: infomark + + CloudMap-generated tabular input files require, as additional input, a CloudMap-style sequence dictionary (even if the original CloudMap analysis was possible without one) as described in the original CloudMap paper. This file has a simple two-column tab-delimited format, in which each line lists the chromosome name (as it appears in the input VCF file) and the up-rounded length of the chromosome in megabases. + +------------- + +**Output:** + +The tool produces up to three output files: + +1) a default tabular file of binned variant counts that can be used to plot the data with external software such as Excel, + + +2) an optional PDF containing linkage plots, which should look just like the plots produced by CloudMap, but are optimized for file size and display speed and offer more user-configurable parameters, and + + +3) an optional tabular per-variant report file, which can be configured to be either a valid input file for the corresponding original CloudMap tool (for users who really, really want to continue using CloudMap for plotting) or to be reusable in fast reruns of the tool (which can be useful to experiment with different plotting parameters). + +------------- + +**Settings:** + +1) Analysis settings + + *bin size to analyze variants in* - determines the width of the regions along each chromosome in which variants are counted and analyzed together. + + Several bin sizes can be specified, and for each size you will get a corresponding report section in the binned variant counts file and a histogram plot in the linkage plots file. + + *normalize variant counts to bin-width* - if selected (the default), the variant counts for different bin sizes are not absolute, but normalized to the bin width. + + *sample names (in VAF mode only)* - to analyze inheritance patterns, VAF mode needs information about the relationship between the samples defined in the input VCF file: + + The *mapping sample name* should be set to the name of the sample for which the inheritance pattern is to be analyzed (the pooled progeny population). + + The *name of the related sample* should be that of the parent sample that carried and brought in the unknown mutation to be mapped (or, alternatively, that of a closely related ancestor). + + Finally, the *name of the unrelated sample* should be that of the other parent strain used in the cross.
+ + At least one of the parent samples MUST be specified, but if the input file contains variant information for both parents, they can be analyzed together for higher mapping accuracy. If you are reanalyzing a tabular report file from a previous tool run or from CloudMap, the association between variants and samples is already incorporated into the input file and cannot be specified again. + +2) Graphical output settings + + .. class:: warningmark + + To be able to generate plots the system running MiModD needs to have the statistical programming environment R and its Python interface rpy2 installed. + + + *y-axes scaling* - if you want to override the defaults + + *x-axis scaling* - choose *preserve relative contig sizes* if you want the largest chromosome to fit the page width and smaller chromosomes to appear according to their relative size or choose *scale each contig to fit the plot width* if all chromosomes should exploit the available space + + *span value to be used in calculating the Loess regression line* - this value determines the degree of smoothing of the regression line through the scatterplot data. Information on loess regression and the loess span parameter can be found at http://en.wikipedia.org/wiki/Local_regression. The default is 0.1 as in CloudMap. + + *colors used for plotting* - can be selected freely from the offered palette. For histogram colors, the list of selected colors will be used to provide the colors for the different histograms plotted. If less colors than histograms (determined by the number of bin sizes selected) are specified, colors from the list will be recycled. + + +.. _CloudMap: https://usegalaxy.org/u/gm2123/p/cloudmap +.. _mapping-by-sequencing analysis workflows in MiModD: http://mimodd.readthedocs.org/en/latest/cloudmap.html + + diff -r 7f7028112439 -r d6ec32ce882b convert.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/convert.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,170 @@ + + between different sequence data formats + + toolshed_macros.xml + + + mimodd version -q + + #if $str($mode.split_on_rgs) or $str($mode.oformat)=="fastq" or $str($mode.oformat)=="gz": + echo "Your input data is now getting processed by MiModD. The output will be split into several files based on the read groups found in the input.\nThis history item will remain in the busy state until the job is finished.\nAfter the job is showing as finished, Galaxy will start adding the results files to your history one by one.\n\nThis may take a while to complete! \n\nYou should refresh your history to see if new files have arrived.\n\nThis message is for your information only and can be deleted from the history once the job has finished." > $output_split_on_read_groups; + + mkdir converted_data; + #end if + + mimodd convert + + #for $i in $mode.input_list + "${i.file1}" + #if $str($mode.iformat) in ("fastq_pe", "gz_pe"): + "${i.file2}" + #end if + #end for + #if $str($mode.header) != "None": + --header "$(mode.header)" + #end if + + #if $str($outputname) == "None": + --ofile converted_data/read_group + #else + --ofile "$outputname" + #end if + --iformat $(mode.iformat) + --oformat $(mode.oformat) + ${mode.split_on_rgs} + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + (not mode['split_on_rgs'] and mode['oformat'] not in ("fastq", "gz")) + + + + + + (mode['split_on_rgs'] or mode['oformat'] in ("fastq", "gz")) + + + + + + +.. 
class:: infomark + + **What it does** + +The tool converts between different file formats used for storing next-generation sequencing data. + +As input, it accepts uncompressed or gzipped fastq, SAM or BAM files, all of which it can convert to SAM or BAM format. + +**Notes:** + +1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to convert gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. + +2) The tool can convert fastq files representing data from paired-end sequencing runs to appropriate SAM/BAM format, provided that the mate information is split over two fastq files in corresponding order. + + **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. + +3) Merging partial fastq (or gzipped fastq) files into a single SAM/BAM file is supported both for single-end and paired-end data. Simply add additional input datasets and select the appropriate files (pairs of files in case of paired-end data). + + Concatenation of SAM/BAM files during conversion is currently not supported. + +4) For input in fastq format, a SAM header file providing run metadata **has to be specified**. The information in this file will be used as the header data of the new SAM/BAM file. You can use the *NGS Run Annotation* tool to generate a new header file for your data. + + For input in SAM/BAM format, the tool will simply copy the existing header data to the new file. To modify the header of an existing SAM/BAM file, use the *Reheader BAM file* tool instead. + +.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 +.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy +.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest + + + diff -r 7f7028112439 -r d6ec32ce882b covstats.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/covstats.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,31 @@ + + Calculate coverage statistics for a BCF file as generated by the Variant Calling tool + + toolshed_macros.xml + + + mimodd version -q + + mimodd covstats "$ifile" --ofile "$output_vcf" + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool takes as input a BCF file produced by the *Variant Calling* tool and calculates per-chromosome read coverage from it. + +.. class:: warningmark + + The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites.
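For orientation, here is a minimal command-line sketch of the call this wrapper assembles; the file names are hypothetical placeholders, and only options that appear in the wrapper itself are used::

    # per-chromosome coverage statistics from a Variant Calling BCF
    # (the input must retain records for non-variant sites)
    mimodd covstats sample_calls.bcf --ofile coverage_stats.txt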
+ + + diff -r 7f7028112439 -r d6ec32ce882b deletion_predictor.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/deletion_predictor.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,65 @@ + + Predicts deletions in one or more aligned read samples based on coverage of the reference genome and on insert sizes + + toolshed_macros.xml + + + mimodd version -q + + mimodd delcall + #for $l in $list_input + "${l.bamfile}" + #end for + "$covfile" -o "$outputfile" + --max-cov "$max_cov" --min-size "$min_size" $include_uncovered $group_by_id --verbose + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool predicts deletions from paired-end data in a two-step process: + +1) It finds regions of low-coverage, i.e., candidate regions for deletions, by scanning a BCF file produced by the *Variant Calling* tool. + + The *maximal coverage allowed inside a low-coverage region* and the *minimal deletion size* parameters are used at this step to define what is considered a low-coverage region. + + .. class:: warningmark + + The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. + +2) It assesses every low-coverage region statistically for evidence of it being a real deletion. **This step requires paired-end data** since it relies on shifts in the distribution of read pair insert sizes around real deletions. + +By default, the tool only reports Deletions, i.e., the subset of low-coverage regions that pass the statistical test. +If *include low-coverage regions* is selected, regions that failed the test will also be reported. + +With *group reads based on read group id only* selected, as it is by default, grouping of reads into samples is done strictly based on their read group IDs. +With the option deselected, grouping is done based on sample names in the first step of the analysis, i.e. the reads of all samples with a shared sample name are used to identify low-coverage regions. +In the second step, however, reads will be regrouped by their read group IDs again, i.e. the statistical assessment for real deletions is always done on a per read group basis. + +**TIP:** +Deselecting *group reads based on read group id only* can be useful, for example, if you have both paired-end and single-end sequencing data for the same sample. + +In this case, the two sets of reads will usually share a common sample name, but differ in their read groups. +With grouping based on sample names, the single-end data can be used together with the paired-end data to identify low-coverage regions, thus increasing overall coverage and reliability of this step. +Still, the assessment of deletions will use only the paired-end data (auto-detecting that the single-end reads do not provide insert size information). + + + + diff -r 7f7028112439 -r d6ec32ce882b fileinfo.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/fileinfo.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,38 @@ + + for supported data formats. + + toolshed_macros.xml + + + mimodd version -q + + mimodd info "$ifile" -o "$outputfile" --verbose --oformat $oformat + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool inspects the input file and generates a report summarizing its contents. 
+ +It autodetects and works with most file formats produced by MiModD, i.e., **SAM / BAM, vcf / bcf and fasta**, and produces a standardized report for all of them. + + + diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/annotate_variants.xml --- a/mimodd_bitbucket_wrappers/annotate_variants.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,169 +0,0 @@ - - Predict the effects of SNPs and indels on known genes in the reference genome using SnpEff - - toolshed_macros.xml - - - mimodd version -q - - mimodd annotate - - "$inputfile" - - #if $str($annotool.name)=='snpeff': - --genome "${annotool.genomeVersion}" - #if $annotool.ori_output: - --snpeff-out "$snpeff_file" - #end if - #if $annotool.stats: - --stats "$summary_file" - #end if - ${annotool.snpeff_settings.chr} ${annotool.snpeff_settings.no_us} ${annotool.snpeff_settings.no_ds} ${annotool.snpeff_settings.no_intron} ${annotool.snpeff_settings.no_intergenic} ${annotool.snpeff_settings.no_utr} - #if $annotool.snpeff_settings.min_cov: - --minC "${annotool.snpeff_settings.min_cov}" - #end if - #if $annotool.snpeff_settings.min_qual: - --minQ "${annotool.snpeff_settings.min_qual}" - #end if - #if $annotool.snpeff_settings.ud: - --ud "${annotool.snpeff_settings.ud}" - #end if - #end if - - --ofile "$outputfile" - #if $str($formatting.oformat) == "text": - --oformat text - #end if - #if $str($formatting.oformat) == "html": - #if $formatting.formatter_file: - --link "${formatting.formatter_file}" - #end if - #if $formatting.species - --species "${formatting.species}" - #end if - #end if - - #if $str($grouping): - --grouping $grouping - #end if - --verbose - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ## default settings for SnpEff - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - (annotool['name']=="snpeff" and annotool['ori_output']) - - - (annotool['name']=="snpeff" and annotool['stats']) - - - - -.. class:: infomark - - **What it does** - -The tool turns a variant list in VCF format into a more readable summary table listing variant sites and effects. - -If installed, the variant annotation tool SnpEff can be used transparently to determine the genomic features, e.g., genes or transcripts, affected by the variants. - -Use of this feature requires that you have an appropriate SnpEff genome file installed on the host machine. You can use the *List installed SnpEff genomes* tool to generate a list of all available SnpEff genomes. -This list can then be used (by selecting the dataset as the *genome list*) to populate the *genome* dropdown menu, from which you can select the SnpEff genome file to be used for the annotation. - -As output file formats HTML or plain text are supported. -In HTML mode, variant positions and/or affected genomic features can be turned into hyperlinks to corresponding views in web-based genome browsers and databases. - -The behavior of this feature depends on: - -1) Recognition of the species that is analyzed - - You can declare the species you are working with using the *Species* text field. - If you are not declaring the species explicitly, but are choosing SnpEff for effect annotation, the tool will usually be able to auto-detect the species from the SnpEff genome you are using. - If no species gets assigned in either way, no hyperlinks will be generated and the html output will look essentially like plain text. 
- -2) Available hyperlink formatting rules for this species - - When the species has been recognized, the tool checks if you have selected an *optional file with hyperlink formatting instructions*. - If you did and that file contains an entry matching the recognized species, that entry will be used as a template to construct the hyperlinks. - If no matching entry is found in the file, an error will be raised. - - If you did not supply a hyperlink formatting instruction file, the tool will consult an internal lookup table to see if it finds default rules for the construction of the hyperlinks for the species. - If not, no hyperlinks will be generated and the html output will look essentially like plain text. - - **TIP:** - MiModD's internal hyperlink formatting lookup tables are maintained and growing with every new version, but since weblinks are changing frequently as well, it is possible that you will encounter broken hyperlinks for your species of interest. In such a case, you can resort to two things: `tell us about the problem`_ to make sure it gets fixed in the next release and, in the meantime, use a custom file with hyperlink formatting instructions to overwrite the default entry for your species. - -.. _tell us about the problem: mailto:mimodd@googlegroups.com - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/bamsort.xml --- a/mimodd_bitbucket_wrappers/bamsort.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,52 +0,0 @@ - - Sort a BAM file by coordinates (or names) of the mapped reads - - toolshed_macros.xml - - - mimodd version -q - - mimodd sort "$input.ifile" -o "$output" --iformat $input.iformat --oformat $oformat $by_name - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -.. class:: infomark - - **What it does** - -The tool sorts a BAM file of aligned reads, typically by the reference genome coordinates that the reads have been mapped to. - -Coordinate-sorted input files are expected by most downstream MiModD tools, but note that the *SNAP Read Alignment* produces coordinate-sorted output by default and it is only necessary to sort files that come from other sources or from *SNAP Read Alignment* jobs with a custom sort order. - -The option *Sort by read names instead of coordinates* is useful if you want to re-align coordinate-sorted paired-end data. In *paired-end mode*, the *SNAP Read Alignment* tool expects the reads in the input file to be arranged in read pairs, i.e., the forward read information of a pair must be followed immediately by its reverse mate information, which is typically not the case in coordinate-sorted files. Resorting such files by read names fixes this problem. - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/cloudmap.xml --- a/mimodd_bitbucket_wrappers/cloudmap.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,340 +0,0 @@ - - Map causative mutations by multi-variant linkage analysis. 
- - mimodd version -q - - mimodd map ${opt.mode} "${opt.source.ifile}" - #if $str($opt.source.sample): - -m "${opt.source.sample}" - #end if - #if $str($opt.source.related_parent_sample): - -r "${opt.source.related_parent_sample}" - #end if - #if $str($opt.source.unrelated_parent_sample): - -u "${opt.source.unrelated_parent_sample}" - #end if - $opt.source.infer_missing - -o "$ofile" - #if $str($opt.source.seqdict_required.required) == "yes": - -s "${opt.source.seqdict_required.seqdict}" - #end if - $opt.source.norm - #if $len($opt.source.bin_sizes): - --bin-sizes - #for $size in $opt.source.bin_sizes: - "${size.bin_size}" - #end for - #end if - #if $str($opt.source.tabfile): - $str($opt.source.tabfile) $tfile - #end if - #if $str($opt.source.plotopts.plots): - $str($opt.source.plotopts.plots) "$pfile" - $str($opt.source.plotopts.xlim) - #if $str($opt.source.plotopts.hylim): - --ylim-hist $str($opt.source.plotopts.hylim) - #end if - #if $str($opt.source.plotopts.hcols) and $len($opt.source.plotopts.hcols): - --hist-colors - #for $color in $opt.source.plotopts.hcols: - "${color.hcolor}" - #end for - #end if - #if $str($opt.source.plotopts.sylim): - --ylim-scatter $str($opt.source.plotopts.sylim) - #end if - #if $str($opt.source.plotopts.pcol): - --points-color "$str($opt.source.plotopts.pcol)" - #end if - #if $str($opt.source.plotopts.lcol): - --loess-color "$str($opt.source.plotopts.lcol)" - #end if - #if $str($opt.source.plotopts.span): - --loess-span "$str($opt.source.plotopts.span)" - #end if - #end if - - - - - toolshed_macros.xml - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - (opt['source']['tabfile']) - - - (opt['source']['plotopts']['plots']) - - - - -.. class:: infomark - - **What it does** - -This tool is a complete rewrite of and improves the EMS Variant Density and Hawaiian Variant Mapping tools of `CloudMap`_. It is the most downstream tool in `mapping-by-sequencing analysis workflows in MiModD`_. - -It can be used to analyze and visualize the inheritance pattern of variants detected and selected by other MiModD tools or as an alternative (and more versatile) plotting engine for data generated with `CloudMap`_. - -------------- - -**Usage Modes:** - -This tool can be run in one of two different modes depending on the type of mapping analysis that should be performed: - -1) *Simple Variant Density (SVD) Mapping* mode analyzes the density of variants along the reference genome by dividing each chromosome into regions of user-defined size (bins) and counting the variants found in each bin. - - All variants listed in the input file are analyzed in this mode, which means that as input you will typically want to use filtered lists of variants (as produced by the VCF Filter tool). - - The aim of SVD analysis is to identify clusters of variants in an outcrossed strain carrying a selectable unknown mutation, which is interpreted as linkage between the corresponding genomic region and the unknown mutation. - - This mode corresponds roughly to EMS Variant Density Mapping in CloudMap. - -2) *Variant Allele Frequency (VAF) Mapping** mode analyzes the inheritance pattern in cross-progeny at sites, at which the parents are homozygous for different alleles. 
- - The aim of VAF analysis is to identify clusters of variants with (near) homozygous inheritance in a F2 (or later generation) population obtained from a cross between a strain carrying a selectable unknown mutation and an unrelated mapping strain. Such a cluster is interpreted as linkage between the corresponding genomic region and the unknown mutation selected for in the F2 generation. - - This mode corresponds roughly to Hawaiian Variant Mapping in CloudMap, but can simultaneously take into account non-reference alleles found in either parent strain (CloudMap users may think of this as a combined Hawaiian Variant and Variant Discovery Mapping analysis). - -------------- - -**Input:** - -Valid input for this tool are VCF files (any VCF file in SVD mode, a MiModD-generated multi-sample VCF file in VAF mode) or a CloudMap tabular report file as generated by the Hawaiian Variant Mapping tool. Alternatively, the tool can generate (in both modes) its own tabular report file, which can be used as input instead of the original VCF file when rerunning the tool with different plotting parameters to reduce analysis time. - -.. class:: infomark - - CloudMap-generated tabular input files require, as additional input, a CloudMap-style sequence dictionary (even if the original CloudMap analysis was possible without one) as described in the original CloudMap paper. This file has a simple two-column tab-delimited format, in which each line lists the chromosome name (as it appears in the input VCF file) and the up-rounded length of the chromosome in megabases. - -------------- - -**Output:** - -The tool produces up to three output files: - -1) a default tabular file of binned variant counts that can be used to plot the data with external software such as Excel, - - -2) an optional pdf containing linkage plots, which should look just like the plots produced by CloudMap, but are optimized for file size and display speed and offer more user-configurable parameters and - - -3) an optional tabular per-variant report file, which can be configured to be either a valid input file for the corresponding original CloudMap tool (for users who really, really want to continue using CloudMap for plotting) or to be reusable in fast reruns of the tool (which can be useful to experiment with different plotting parameters). - -------------- - -**Settings:** - -1) Analysis settings - - *bin size to analyze variants in* - determines the width of the regions along each chromosome, in which variants are counted and analyzed together. - - Several bin sizes can be specified and for each size you will get a corresponding report section in the binned variant counts file and a histogram plot in the linkage plots file. - - *normalize variant counts to bin-width* - if selected (as per default) the variant counts for different bin sizes are not absolute, but normalized to the bin width - - *sample names (in VAF mode only)* - to analyze inheritance patterns, VAF mode needs information about the relationship between the samples defined in the input VCF file: - - The *mapping sample name* should be set to the name of the sample for which the inheritance pattern is to be analyzed (the pooled progeny population). - - The *name of the related sample* should be that of the parent sample that carried and brought in the unknown mutation to be mapped (or, alternatively, that of a closely related ancestor). - - Finally, the *name of the unrelated sample* should be that of the other parent strain used in the cross. 
- - At least one of the parent samples MUST be specified, but if the input file contains variant information for both parents, they can be analyzed together for higher mapping accuracy. If you are reanalyzing a tabular report file from a previous tool run or from CloudMap, the association between variants and samples is already incorporated into the input file and cannot be specified again. - -2) Graphical output settings - - .. class:: warningmark - - To be able to generate plots the system running MiModD needs to have the statistical programming environment R and its Python interface rpy2 installed. - - - *y-axes scaling* - if you want to override the defaults - - *x-axis scaling* - choose *preserve relative contig sizes* if you want the largest chromosome to fit the page width and smaller chromosomes to appear according to their relative size or choose *scale each contig to fit the plot width* if all chromosomes should exploit the available space - - *span value to be used in calculating the Loess regression line* - this value determines the degree of smoothing of the regression line through the scatterplot data. Information on loess regression and the loess span parameter can be found at http://en.wikipedia.org/wiki/Local_regression. The default is 0.1 as in CloudMap. - - *colors used for plotting* - can be selected freely from the offered palette. For histogram colors, the list of selected colors will be used to provide the colors for the different histograms plotted. If less colors than histograms (determined by the number of bin sizes selected) are specified, colors from the list will be recycled. - - -.. _CloudMap: https://usegalaxy.org/u/gm2123/p/cloudmap -.. _mapping-by-sequencing analysis workflows in MiModD: http://mimodd.readthedocs.org/en/latest/cloudmap.html - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/convert.xml --- a/mimodd_bitbucket_wrappers/convert.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,170 +0,0 @@ - - between different sequence data formats - - toolshed_macros.xml - - - mimodd version -q - - #if $str($mode.split_on_rgs) or $str($mode.oformat)=="fastq" or $str($mode.oformat)=="gz": - echo "Your input data is now getting processed by MiModD. The output will be split into several files based on the read groups found in the input.\nThis history item will remain in the busy state until the job is finished.\nAfter the job is showing as finished, Galaxy will start adding the results files to your history one by one.\n\nThis may take a while to complete! \n\nYou should refresh your history to see if new files have arrived.\n\nThis message is for your information only and can be deleted from the history once the job has finished." 
> $output_split_on_read_groups; - - mkdir converted_data; - #end if - - mimodd convert - - #for $i in $mode.input_list - "${i.file1}" - #if $str($mode.iformat) in ("fastq_pe", "gz_pe"): - "${i.file2}" - #end if - #end for - #if $str($mode.header) != "None": - --header "$(mode.header)" - #end if - - #if $str($outputname) == "None": - --ofile converted_data/read_group - #else - --ofile "$outputname" - #end if - --iformat $(mode.iformat) - --oformat $(mode.oformat) - ${mode.split_on_rgs} - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - (not mode['split_on_rgs'] and mode['oformat'] not in ("fastq", "gz")) - - - - - - (mode['split_on_rgs'] or mode['oformat'] in ("fastq", "gz")) - - - - - - -.. class:: infomark - - **What it does** - -The tool converts between different file formats used for storing next-generation sequencing data. - -As input file types it can handle uncompressed or gzipped fastq, SAM or BAM format, which it can convert to SAM or BAM format. - -**Notes:** - -1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to convert gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. - -2) The tool can convert fastq files representing data from paired-end sequencing runs to appropriate SAM/BAM format provided that the mate information is split over two fastq files in corresponding order. - - **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. - -3) Merging partial fastq (or gzipped fastq) files into a single SAM/BAM file is supported both for single-end and paired-end data. Simply add additional input datasets and select the appropriate files (pairs of files in case of paired-end data). - - Concatenation of SAM/BAM file during conversion is currently not supported. - -4) For input in fastq format a SAM header file providing run metadata **has to be specified**. The information in this file will be used as the header data of the new SAM/BAM file. You can use the *NGS Run Annotation* tool to generate a new header file for your data. - - For input in SAM/BAM format the tool will simply copy the existing header data to the new file. To modify the header of an existing SAM/BAM file, use the *Reheader BAM file* tool instead. - -.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 -.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy -.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/covstats.xml --- a/mimodd_bitbucket_wrappers/covstats.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,31 +0,0 @@ - - Calculate coverage statistics for a BCF file as generated by the Variant Calling tool - - toolshed_macros.xml - - - mimodd version -q - - mimodd covstats "$ifile" --ofile "$output_vcf" - - - - - - - - - - -.. 
class:: infomark - - **What it does** - -The tool takes as input a BCF file produced by the *Variant Calling* tool, and calculates per-chromosome read coverage from it. - -.. class:: warningmark - - The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/deletion_predictor.xml --- a/mimodd_bitbucket_wrappers/deletion_predictor.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,65 +0,0 @@ - - Predicts deletions in one or more aligned read samples based on coverage of the reference genome and on insert sizes - - toolshed_macros.xml - - - mimodd version -q - - mimodd delcall - #for $l in $list_input - "${l.bamfile}" - #end for - "$covfile" -o "$outputfile" - --max-cov "$max_cov" --min-size "$min_size" $include_uncovered $group_by_id --verbose - - - - - - - - - - - - - - - - - - -.. class:: infomark - - **What it does** - -The tool predicts deletions from paired-end data in a two-step process: - -1) It finds regions of low-coverage, i.e., candidate regions for deletions, by scanning a BCF file produced by the *Variant Calling* tool. - - The *maximal coverage allowed inside a low-coverage region* and the *minimal deletion size* parameters are used at this step to define what is considered a low-coverage region. - - .. class:: warningmark - - The tool treats genome positions missing from the BCF input as zero coverage, so it is safe to use ONLY with BCF files produced by the *Variant Calling* tool or through other commands that keep the information for all sites. - -2) It assesses every low-coverage region statistically for evidence of it being a real deletion. **This step requires paired-end data** since it relies on shifts in the distribution of read pair insert sizes around real deletions. - -By default, the tool only reports Deletions, i.e., the subset of low-coverage regions that pass the statistical test. -If *include low-coverage regions* is selected, regions that failed the test will also be reported. - -With *group reads based on read group id only* selected, as it is by default, grouping of reads into samples is done strictly based on their read group IDs. -With the option deselected, grouping is done based on sample names in the first step of the analysis, i.e. the reads of all samples with a shared sample name are used to identify low-coverage regions. -In the second step, however, reads will be regrouped by their read group IDs again, i.e. the statistical assessment for real deletions is always done on a per read group basis. - -**TIP:** -Deselecting *group reads based on read group id only* can be useful, for example, if you have both paired-end and single-end sequencing data for the same sample. - -In this case, the two sets of reads will usually share a common sample name, but differ in their read groups. -With grouping based on sample names, the single-end data can be used together with the paired-end data to identify low-coverage regions, thus increasing overall coverage and reliability of this step. -Still, the assessment of deletions will use only the paired-end data (auto-detecting that the single-end reads do not provide insert size information). 
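For orientation, a minimal sketch of the delcall invocation assembled by the wrapper above, assuming hypothetical file names and threshold values; only flags that appear in the wrapper are used::

    # predict deletions from paired-end alignments plus a coverage BCF:
    # regions of at least 100 bp with at most 2 reads of coverage are
    # treated as candidate deletions before the statistical test
    mimodd delcall reads_pe.bam sample_calls.bcf -o deletions.txt --max-cov 2 --min-size 100 --verbose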
- - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/fileinfo.xml --- a/mimodd_bitbucket_wrappers/fileinfo.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,38 +0,0 @@ - - for supported data formats. - - toolshed_macros.xml - - - mimodd version -q - - mimodd info "$ifile" -o "$outputfile" --verbose --oformat $oformat - - - - - - - - - - - - - - - - - - - -.. class:: infomark - - **What it does** - -The tool inspects the input file and generates a report summarizing its contents. - -It autodetects and works with most file formats produced by MiModD, i.e., **SAM / BAM, vcf / bcf and fasta**, and produces a standardized report for all of them. - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/reheader.xml --- a/mimodd_bitbucket_wrappers/reheader.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,202 +0,0 @@ - - From a BAM file generate a new file with the original header (if any) replaced or modified by that found in a second SAM file - - mimodd version -q - - #if ($str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form") or $str($co.treat_co) != "ignore": - mimodd header - #if $str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form": - #for $rginfo in $rg.rginfo.rg - #if $str($rginfo.source_id): - --rg-id "${rginfo.source_id}" - #end if - #if $str($rginfo.rg_sm): - --rg-sm "${rginfo.rg_sm}" - #end if - #if $str($rginfo.rg_cn): - --rg-cn "${rginfo.rg_cn}" - #else: - --rg-cn "" - #end if - #if $str($rginfo.rg_ds): - --rg-ds "${rginfo.rg_ds}" - #else: - --rg-ds "" - #end if - #if $str($rginfo.rg_date): - --rg-dt "${rginfo.rg_date}" - #else: - --rg-dt "" - #end if - #if $str($rginfo.rg_lb): - --rg-lb "${rginfo.rg_lb}" - #else: - --rg-lb "" - #end if - #if $str($rginfo.rg_pl): - --rg-pl "${rginfo.rg_pl}" - #else: - --rg-pl "" - #end if - #if $str($rginfo.rg_pi): - --rg-pi "${rginfo.rg_pi}" - #else: - --rg-pi "" - #end if - #if $str($rginfo.rg_pu): - --rg-pu "${rginfo.rg_pu}" - #else: - --rg-pu "" - #end if - #end for - #end if - #if $str($co.treat_co) != "ignore": - --co - #for $comment in $co.coinfo - #if $str($comment.line): - "${comment.line}" - #end if - #end for - #end if - | - #end if - mimodd reheader "$inputfile" --sq ignore - --rg ${rg.treat_rg} - #if $str($rg.treat_rg) != "ignore": - #if $str($rg.rginfo.source) == "from_file": - "${rg.rginfo.data}" - #else: - - - #end if - #for $rgmapping in $rg.rginfo.rg - #if $str($rgmapping.source_id) and $str($rgmapping.rg_id): - "$str($rgmapping.source_id)" : "$str($rgmapping.rg_id)" - #end if - #end for - #end if - - --co ${co.treat_co} - #if $str($co.treat_co) != "ignore": - - - #end if - - #set $restr = "" - #for $rename in $rg_renaming - #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '" ') - #end for - #if $restr - --rgm $restr - #end if - - #set $restr = "" - #for $rename in $sq_renaming - #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '" ') - #end for - #if $restr - --sqm $restr - #end if - - -o "$output" - - - - toolshed_macros.xml - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -.. class:: infomark - - **What it does** - -The tool generates a copy of the BAM input file with a modified header (i.e., metadata). 
- -It can update or replace read-group information (i.e., information about the samples in the file), add or replace comment lines, and rename reference sequences declared in the header. - -The tool ensures that the resulting BAM file is valid and can be further processed by other MiModD tools and standard software like samtools. It aborts with an error message if a valid BAM file cannot be generated with the user-specified settings. - -The template information used to modify or replace the input file metadata is provided through forms or, in the case of read-group information, can be taken from an existing SAM file as can be generated, for example, with the *NGS Run Annotation* tool. - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/sam_header.xml --- a/mimodd_bitbucket_wrappers/sam_header.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,128 +0,0 @@ - - Create a SAM format header from run metadata for sample annotation. - - toolshed_macros.xml - - - mimodd version -q - - mimodd header - - --rg-id "$rg_id" - --rg-sm "$rg_sm" - - #if $str($rg_cn): - --rg-cn "$rg_cn" - #end if - #if $str($rg_ds): - --rg-ds "$rg_ds" - #end if - #if $str($rg_date): - --rg-dt "$rg_date" - #end if - #if $str($rg_lb): - --rg-lb "$rg_lb" - #end if - #if $str($rg_pl): - --rg-pl "$rg_pl" - #end if - #if $str($rg_pi): - --rg-pi "$rg_pi" - #end if - #if $str($rg_pu): - --rg-pu "$rg_pu" - #end if - - --ofile "$outputfile" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -.. class:: infomark - - **What it does** - -This tool takes the user-provided information about a next-generation sequencing run and constructs a valid header in the SAM file format from it. - -The result file can be used by the tools *Convert* and *Reheader* or in the *SNAP Read Alignment* step to add run metadata to sequenced reads files (or to overwrite pre-existing information). - -**Note:** - -**MiModD requires run metadata for every input file at the Alignment step !** - -**Tip:** - -While you can do Alignments from fastq file format by providing a custom header file directly to the *SNAP Read Alignment* tool, we **recommend** you to first convert all input files to and archive all datasets in SAM/BAM format with appropriate header information prior to any downstream analysis. Although a bit more time-consuming, this practice protects against information loss and ensures that the input datasets will remain useful for others in the future. 
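As an illustration, a minimal sketch of the underlying header call with hypothetical run metadata; only --rg-* options wired up in the wrapper above are used::

    # create a SAM header declaring a single read group for one sequencing run
    mimodd header --rg-id run1 --rg-sm mutant_line_1 --rg-lb lib1 --rg-pl illumina --ofile run1_header.sam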
- - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/snap_caller.xml --- a/mimodd_bitbucket_wrappers/snap_caller.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,241 +0,0 @@ - - Map sequence reads to a reference genome using SNAP - - toolshed_macros.xml - - - mimodd version -q - - mimodd snap-batch -s - ## SNAP calls (considering different cases) - - #for $i in $datasets - "snap ${i.mode_choose.mode} '$ref_genome' - #if $str($i.mode_choose.mode) == "paired" and $str($i.mode_choose.input.iformat) in ("fastq", "gz"): -'${i.mode_choose.input.ifile1}' '${i.mode_choose.input.ifile2}' - #else: -'${i.mode_choose.input.ifile}' - #end if ---ofile '$outputfile' --iformat ${i.mode_choose.input.iformat} --oformat $oformat ---idx-seedsize '$set.seedsize' ---idx-slack '$set.slack' --maxseeds '$set.maxseeds' --maxhits '$set.maxhits' --clipping $set.clipping --maxdist '$set.maxdist' --confdiff '$set.confdiff' --confadapt '$set.confadpt' - #if $i.mode_choose.input.header: ---header '${i.mode_choose.input.header}' - #end if - #if $str($i.mode_choose.mode) == "paired": ---spacing '$set.sp_min' '$set.sp_max' - #end if - #if $str($set.selectivity) != "off": ---selectivity '$set.selectivity' - #end if - #if $str($set.filter_output) != "off": ---filter-output $set.filter_output - #end if - #if $str($set.sort) == "off": ---no-sort - #end if - #if $str($set.mmatch_notation) != "general": --X - #end if - #if $set.discard_overlapping_mates: ---discard-overlapping-mates - ## remove ',' (and possibly adjacent whitespace) and replace with ' ' - '#echo ("' '".join($set.discard_overlapping_mates.replace(" ", "").split(',')))#' - #end if ---verbose -" - #end for - - - - ## mandatory arguments (and mode-conditionals) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ## optional arguments - - - - - - - - ## default settings - - - - - - - - - - - - - - - - - - - - - - ## change settings - - - - - - ## paired-end specific options - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -.. class:: infomark - - **What it does** - -The tool aligns the sequenced reads in an arbitrary number of input datasets against a common reference genome and stores the results in a single, possibly multi-sample output file. It supports a variety of different sequenced reads input formats, i.e., SAM, BAM, fastq and gzipped fastq, and both single-end and paired-end data. - -Internally, the tool uses the ultrafast, hashtable-based aligner SNAP (http://snap.cs.berkeley.edu), hence its name. - -**Notes:** - -1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to align gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. - -2) To use paired-end fastq data with the tool the read mate information needs to be split over two fastq files in corresponding order. - - **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. - -3) The tool supports the alignment of reads from the same sequencing run, but distributed across several input files. 
- - Generally, it expects the reads from each input dataset to belong to one read-group and will abort with an error message if any input dataset declares more than one read group or sample names in its header. Different datasets, however, are allowed to contain reads from the same read-group (as indicated by matching read-group IDs and sample names in their headers), in which case the reads will be combined into one group in the output. - -4) Read-group information is required for every input dataset! - - We generally recommend to store NGS datasets in SAM/BAM format with run metadata stored in the file header. You can use the *NGS Run Annotation* and *Convert* tools to convert data in fastq format to SAM/BAM with added run information. - - While it is not our recommended approach, you can, if you prefer it, align reads from fastq files or SAM/BAM files without header read-group information. To do so, you **must** specify a SAM file that provides the missing information in its header along with the input dataset. You can generate a SAM header file with the *NGS Run Annotation* tool. - - Optionally, a SAM header file can also be used to replace existing read-group information in a headered SAM/BAM input file. This can be used to resolve read-group ID conflicts between multiple input files at tool runtime. - -5) The options available under *further parameter settings* can have **big** effects on the alignment quality. You are strongly encouraged to consult the `tool documentation`_ for detailed explanations of the available options. - -6) Currently, you cannot configure aligner-specific options separately for specific input files from within this Galaxy tool. If you need this advanced level of control, you should use the command line tool ``mimodd snap-batch``. - -.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 -.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy -.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest -.. _tool documentation: http://mimodd.readthedocs.org/en/latest/tool_doc.html#snap - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/snp_caller_caller.xml --- a/mimodd_bitbucket_wrappers/snp_caller_caller.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,66 +0,0 @@ - - From a reference and aligned reads generate a BCF file with position-specific variant likelihoods and coverage information - - toolshed_macros.xml - - - mimodd version -q - - mimodd varcall - - "$ref_genome" - #for $l in $list_input - "${l.inputfile}" - #end for - --ofile "$output_vcf" - --depth "$depth" - $group_by_id - $no_md5_check - --verbose - --quiet - - - - - - - - - - - - - - - - - -.. class:: infomark - - **What it does** - -The tool transforms the read-centered information of its aligned reads input files into position-centered information. - -**It produces a BCF file that serves as the basis for all further variant analyses with MiModD**. - -**Notes:** - -By default, the tool will check whether the input BAM file(s) provide(s) MD5 checksums for the reference genome sequences used during read alignment (the *SNAP Read Alignment* tool stores these in the BAM file header). 
If it finds MD5 sums for all sequences, it will compare them to the actual checksums of the sequences in the specified reference genome and -check that every sequence mentioned in any BAM input file has a counterpart with a matching MD5 sum in the reference genome, aborting with an error message if that is not the case. If it finds sequences with matching checksums but different names in the reference genome, it will use the name from the reference genome file in its output. - -This behavior has two benefits: - -1) It protects against accidental variant calling against a wrong reference genome (i.e., a different one than that used during the alignment step), which would result in wrong calls. This is the primary reason why we recommend leaving the check activated. - -2) It provides an opportunity to change sequence names between aligned reads files and variant call files by providing a reference genome file with altered sequence names (but identical sequence data). - -Since there may be rare cases where you *really* want to call variants against a reference genome with different checksums (e.g., you may have edited the reference sequence based on the alignment results), the check can be turned off, but do so only if you know exactly why. - ------------ - -Internally, the tool uses samtools mpileup combined with bcftools to do all per-nucleotide calculations. - -It exposes just a single configuration parameter of these tools - the *maximum per-BAM depth*. Through this parameter, the maximum number of reads considered for variant calling at any site can be controlled. Its default value of 250 is taken from *samtools mpileup* and is usually suitable. Consider, however, that this gives the maximum read number per input file, so if you have a large number of samples in one input file, it could become necessary to increase the value to get sufficient reads considered per sample. - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/snpeff_genomes.xml --- a/mimodd_bitbucket_wrappers/snpeff_genomes.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,24 +0,0 @@ - - Checks the local SnpEff installation to compile a list of currently installed genomes - - toolshed_macros.xml - - - mimodd version -q - - mimodd snpeff-genomes -o "$outputfile" - - - - - -.. class:: infomark - -**What it does** - -When executed, this tool searches the host machine's SnpEff installation for properly registered and installed -genome annotation files. The resulting list is added as a plain text file to your history for use with the *Variant Annotation* Tool. - - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/toolshed_macros.xml --- a/mimodd_bitbucket_wrappers/toolshed_macros.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,7 +0,0 @@ - - - - mimodd - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/varextract.xml --- a/mimodd_bitbucket_wrappers/varextract.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,101 +0,0 @@ - - from a BCF file - - toolshed_macros.xml - - - mimodd version -q - - mimodd varextract "$ifile" - #if $len($sitesinfo) - -p - #for $source in $sitesinfo - "${source.pre_vcf}" - #end for - #end if - --ofile "$output_vcf" - $keep_alts - --verbose - - - - - - - - - - - - - - -.. 
class:: infomark - - **What it does** - -The tool takes as input a BCF file like the ones produced by the *Variant Calling* tool, extracts just the variant sites from it and reports them in VCF format. - -If the BCF input file specifies multiple samples, sites are included if they qualify as variant sites in at least one sample. - -In a typical analysis workflow, you will use the tool's VCF output as input for the *VCF Filter* tool to cut down the often still impressive list of sites to a subset with relevance to your project. - -**Options:** - -1) By default, a variant site is considered to be a position in the genome for which a non-reference allele appears in the inferred genotype of any sample. - - You can select the *keep all sites with alternate bases* option if, instead, you want to extract all sites for which at least one non-reference base has been observed (whether resulting in a non-reference allele call or not). Using this option should rarely be necessary, but it can occasionally be helpful for closer inspection of candidate genomic regions. - -2) During the process of variant extraction, the tool can take into account genome positions specified in one or more independently generated VCF files. If such additional VCF input is provided, the tool output will contain the samples found in these files as additional samples, and sites from the main BCF file will be included if they either qualify as variant sites in at least one sample specified in the BCF or if they are listed in any of the additional VCF files. - - Optional VCF input can be particularly useful in one of the following situations: - - *scenario i* - you have prior information that leads you to think that certain genome positions are of special relevance for your project and, thus, you are interested in the statistics produced by the variant caller for these positions even if they are not considered variant sites. In this case you can use a minimal VCF file to guide the variant extraction process to include these positions. This minimal VCF file needs a minimal header: - - ``##fileformat=VCFv4.2`` - - followed by positional information like in this example:: - - #CHROM POS ID REF ALT QUAL FILTER INFO - chrI 1222 . . . . . . - chrI 2651 . . . . . . - chrI 3659 . . . . . . - chrI 3731 . . . . . . - - , where columns are tab-separated and a dot (``.``) serves as a placeholder for missing information. - - *scenario ii* - you have actual variant calls from an additional sample, but you do not have access to the original sequenced reads data (if you had, the recommended approach would be to align this data along with your other sequencing data or, at least, to perform the *Variant Calling* step together). - - This situation is often encountered with published datasets. Assume you have obtained a list of known single nucleotide variants (SNVs) found in one particular strain of your favorite model organism and you would like to know which of these SNVs are present in the related strains you have sequenced. You have aligned the sequenced reads from your samples and have used the *Variant Calling* tool, which has generated a BCF file ready for variant extraction. If the SNV list for the previously sequenced strain is in VCF format already, you can now just plug it into the analysis process by specifying it in the tool interface as an *independently generated vcf file*. The resulting VCF output file will contain all SNV sites along with the variant sites found in the BCF alone. 
You can then proceed to the *VCF Filter* tool to look at the original SNV sites only or to investigate any other interesting subset of sites. If the SNV list is in some other format, you will have to convert it to VCF first. At a minimum, the file must have a ``##fileformat`` header line like the previous example and have the ``REF`` and ``ALT`` column filled in like so:: - - #CHROM POS ID REF ALT QUAL FILTER INFO - chrI 1897409 . A G . . . - chrI 1897492 . C T . . . - chrI 1897616 . C A . . . - chrI 1897987 . A T . . . - chrI 1898185 . C T . . . - chrI 1898715 . G A . . . - chrI 1898729 . T C . . . - chrI 1900288 . T A . . . - - , in which case the tool will assume that the corresponding sample is homozygous for each of the SNVs. If you need to distinguish between homozygous and heterozygous SNVs, you will have to extend the format to include a FORMAT and a sample column with genotype (GT) information like in this example:: - - #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleX - chrI 1897409 . A G . . . GT 1/1 - chrI 1897492 . C T . . . GT 0/1 - chrI 1897616 . C A . . . GT 0/1 - chrI 1897987 . A T . . . GT 0/1 - chrI 1898185 . C T . . . GT 0/1 - chrI 1898715 . G A . . . GT 0/1 - chrI 1898729 . T C . . . GT 0/1 - chrI 1900288 . T A . . . GT 0/1 - - , in which sampleX would be heterozygous for all SNVs except the first. - - .. class:: warningmark - - If the optional VCF input contains INDEL calls, these will be ignored by the tool. - - - - diff -r 7f7028112439 -r d6ec32ce882b mimodd_bitbucket_wrappers/vcf_filter.xml --- a/mimodd_bitbucket_wrappers/vcf_filter.xml Tue Mar 28 04:28:19 2017 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,133 +0,0 @@ - - Extracts lines from a VCF variant file based on field-specific filters - - toolshed_macros.xml - - - mimodd version -q - - mimodd vcf-filter - "$inputfile" - -o "$outputfile" - #if len($datasets): - -s - #for $i in $datasets - "$i.sample" - #end for - --gt - #for $i in $datasets - ## remove whitespace from free-text input - "#echo ("".join($i.GT.split()) or "ANY")#" - #echo " " - #end for - --dp - #for $i in $datasets - "$i.DP" - #end for - --gq - #for $i in $datasets - "$i.GQ" - #end for - --af - #for $i in $datasets - "#echo ($i.AF or "::")#" - #end for - #end if - #if len($regions): - -r - #for $i in $regions - #if $i.stop: - "$i.chrom:$i.start-$i.stop" - #else: - "$i.chrom:$i.start" - #end if - #end for - #end if - #if $vfilter: - --vfilter - ## remove ',' and replace with ' ' - "#echo ('" "'.join($vfilter.split(',')))#" - #end if - $vartype - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -.. class:: infomark - - **What it does** - -The tool filters a variant file in VCF format to generate a new VCF file with only a subset of the original variants. - -The following types of variant filters can be set up: - -1) Sample-specific filters: - - Filter variants based on their characteristics in the sequenced reads of a specific sample. Multiple sample-specific filters are combined by logical AND, i.e., only variants that pass ALL sample-specific filters are kept. - -2) Region filters: - - Filter variants based on the genomic region they affect. Multiple region filters are combined by logical OR, i.e., variants passing ANY region filter are kept. - -3) Variant type filter: - - Filter variants by their type, i.e., whether they are single nucleotide variations (SNVs) or indels. - -In addition, the *sample* filter can be used to reduce the samples encoded in a multi-sample VCF file to just those specified by the filter. 
-The *sample* filter is included mainly for compatibility reasons: if an external tool cannot deal with the multi-sample file format, but instead looks only at the first sample-specific column of the file, you can use the filter to turn the multi-sample file into a single-sample file. Besides, the filter can also be used to change the order of the samples since it will sort the samples in the order specified in the filter field. - -**Examples of sample-specific filters:** - -*Simple genotype pattern* - -genotype pattern: 1/1 ==> keep all variants in the vcf input file for which the specified sample's genotype is homozygous mutant - -*Complex genotype pattern* - -genotype pattern: 0/1, 0/0 ==> keep all variants for which the sample's genotype is either heterozygous or homozygous wildtype - -*Multiple sample-specific filters* - -Filter 1: genotype pattern: 0/0, Filter 2: genotype pattern 1/1: -==> keep all variants for which the first sample's genotype is homozygous wildtype **and** the second sample's genotype is homozygous mutant - -*Combining sample-specific filter criteria* - -genotype pattern: 1/1, depth of coverage: 3, genotype quality: 9 -==> keep variants for which the sample's genotype is homozygous mutant **and** for which this genotype assignment is corroborated by a genotype quality score of at least 9 -**and** at least three reads from the sample cover the variant site - -**TIP:** - -As in the example above, genotype quality is typically most useful in combination with a genotype pattern. -It then acts, effectively, to make the genotype filter more stringent. - - - - - diff -r 7f7028112439 -r d6ec32ce882b reheader.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/reheader.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,202 @@ + + From a BAM file generate a new file with the original header (if any) replaced or modified by that found in a second SAM file + + mimodd version -q + + #if ($str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form") or $str($co.treat_co) != "ignore": + mimodd header + #if $str($rg.treat_rg) != "ignore" and $str($rg.rginfo.source) == "from_form": + #for $rginfo in $rg.rginfo.rg + #if $str($rginfo.source_id): + --rg-id "${rginfo.source_id}" + #end if + #if $str($rginfo.rg_sm): + --rg-sm "${rginfo.rg_sm}" + #end if + #if $str($rginfo.rg_cn): + --rg-cn "${rginfo.rg_cn}" + #else: + --rg-cn "" + #end if + #if $str($rginfo.rg_ds): + --rg-ds "${rginfo.rg_ds}" + #else: + --rg-ds "" + #end if + #if $str($rginfo.rg_date): + --rg-dt "${rginfo.rg_date}" + #else: + --rg-dt "" + #end if + #if $str($rginfo.rg_lb): + --rg-lb "${rginfo.rg_lb}" + #else: + --rg-lb "" + #end if + #if $str($rginfo.rg_pl): + --rg-pl "${rginfo.rg_pl}" + #else: + --rg-pl "" + #end if + #if $str($rginfo.rg_pi): + --rg-pi "${rginfo.rg_pi}" + #else: + --rg-pi "" + #end if + #if $str($rginfo.rg_pu): + --rg-pu "${rginfo.rg_pu}" + #else: + --rg-pu "" + #end if + #end for + #end if + #if $str($co.treat_co) != "ignore": + --co + #for $comment in $co.coinfo + #if $str($comment.line): + "${comment.line}" + #end if + #end for + #end if + | + #end if + mimodd reheader "$inputfile" --sq ignore + --rg ${rg.treat_rg} + #if $str($rg.treat_rg) != "ignore": + #if $str($rg.rginfo.source) == "from_file": + "${rg.rginfo.data}" + #else: + - + #end if + #for $rgmapping in $rg.rginfo.rg + #if $str($rgmapping.source_id) and $str($rgmapping.rg_id): + "$str($rgmapping.source_id)" : "$str($rgmapping.rg_id)" + #end if + #end for + #end if + + --co ${co.treat_co} + #if $str($co.treat_co) != "ignore": + - + 
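+ ## Note: the bare '-' arguments above direct 'mimodd reheader' to read the
+ ## corresponding header data from standard input, i.e., from the piped
+ ## 'mimodd header' call at the start of this command (an assumption based
+ ## on the pipe construct shown here, not on separately documented behavior).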
#end if + + #set $restr = "" + #for $rename in $rg_renaming + #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '" ') + #end for + #if $restr + --rgm $restr + #end if + + #set $restr = "" + #for $rename in $sq_renaming + #set $restr = $restr + ($str($rename.from) and $str($rename.to) and '"' + $str($rename.from) + '" : "' + $str($rename.to) + '" ') + #end for + #if $restr + --sqm $restr + #end if + + -o "$output" + + + + toolshed_macros.xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool generates a copy of the BAM input file with a modified header (i.e., metadata). + +It can update or replace read-group information (i.e., information about the samples in the file), add or replace comment lines, and rename reference sequences declared in the header. + +The tool ensures that the resulting BAM file is valid and can be further processed by other MiModD tools and standard software like samtools. It aborts with an error message if a valid BAM file cannot be generated with the user-specified settings. + +The template information used to modify or replace the input file metadata is provided through forms or, in the case of read-group information, can be taken from an existing SAM file, such as one generated with the *NGS Run Annotation* tool. + + + diff -r 7f7028112439 -r d6ec32ce882b sam_header.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/sam_header.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,128 @@ + + Create a SAM format header from run metadata for sample annotation. + + toolshed_macros.xml + + + mimodd version -q + + mimodd header + + --rg-id "$rg_id" + --rg-sm "$rg_sm" + + #if $str($rg_cn): + --rg-cn "$rg_cn" + #end if + #if $str($rg_ds): + --rg-ds "$rg_ds" + #end if + #if $str($rg_date): + --rg-dt "$rg_date" + #end if + #if $str($rg_lb): + --rg-lb "$rg_lb" + #end if + #if $str($rg_pl): + --rg-pl "$rg_pl" + #end if + #if $str($rg_pi): + --rg-pi "$rg_pi" + #end if + #if $str($rg_pu): + --rg-pu "$rg_pu" + #end if + + --ofile "$outputfile" + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +This tool takes the user-provided information about a next-generation sequencing run and constructs a valid header in the SAM file format from it. + +The result file can be used by the tools *Convert* and *Reheader* or in the *SNAP Read Alignment* step to add run metadata to sequenced reads files (or to overwrite pre-existing information). + +**Note:** + +**MiModD requires run metadata for every input file at the Alignment step!** + +**Tip:** + +While you can run alignments from fastq format by providing a custom header file directly to the *SNAP Read Alignment* tool, we **recommend** that you first convert all input files to SAM/BAM format with appropriate header information, and archive all datasets in that format, prior to any downstream analysis. Although a bit more time-consuming, this practice protects against information loss and ensures that the input datasets will remain useful for others in the future. 
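+
+For command-line users: the form fields above map directly onto options of the ``mimodd header`` command that this tool runs. A minimal sketch of an equivalent invocation (the run ID, sample name, platform, and file name values are invented for illustration)::
+
+  mimodd header --rg-id run1 --rg-sm sampleX --rg-pl ILLUMINA --ofile run1_header.sam
+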
+ + + diff -r 7f7028112439 -r d6ec32ce882b snap_caller.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snap_caller.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,241 @@ + + Map sequence reads to a reference genome using SNAP + + toolshed_macros.xml + + + mimodd version -q + + mimodd snap-batch -s + ## SNAP calls (considering different cases) + + #for $i in $datasets + "snap ${i.mode_choose.mode} '$ref_genome' + #if $str($i.mode_choose.mode) == "paired" and $str($i.mode_choose.input.iformat) in ("fastq", "gz"): +'${i.mode_choose.input.ifile1}' '${i.mode_choose.input.ifile2}' + #else: +'${i.mode_choose.input.ifile}' + #end if +--ofile '$outputfile' --iformat ${i.mode_choose.input.iformat} --oformat $oformat +--idx-seedsize '$set.seedsize' +--idx-slack '$set.slack' --maxseeds '$set.maxseeds' --maxhits '$set.maxhits' --clipping $set.clipping --maxdist '$set.maxdist' --confdiff '$set.confdiff' --confadapt '$set.confadpt' + #if $i.mode_choose.input.header: +--header '${i.mode_choose.input.header}' + #end if + #if $str($i.mode_choose.mode) == "paired": +--spacing '$set.sp_min' '$set.sp_max' + #end if + #if $str($set.selectivity) != "off": +--selectivity '$set.selectivity' + #end if + #if $str($set.filter_output) != "off": +--filter-output $set.filter_output + #end if + #if $str($set.sort) == "off": +--no-sort + #end if + #if $str($set.mmatch_notation) != "general": +-X + #end if + #if $set.discard_overlapping_mates: +--discard-overlapping-mates + ## remove ',' (and possibly adjacent whitespace) and replace with ' ' + '#echo ("' '".join($set.discard_overlapping_mates.replace(" ", "").split(',')))#' + #end if +--verbose +" + #end for + + + + ## mandatory arguments (and mode-conditionals) + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + ## optional arguments + + + + + + + + ## default settings + + + + + + + + + + + + + + + + + + + + + + ## change settings + + + + + + ## paired-end specific options + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool aligns the sequenced reads in an arbitrary number of input datasets against a common reference genome and stores the results in a single, possibly multi-sample output file. It supports a variety of different sequenced reads input formats, i.e., SAM, BAM, fastq and gzipped fastq, and both single-end and paired-end data. + +Internally, the tool uses the ultrafast, hashtable-based aligner SNAP (http://snap.cs.berkeley.edu), hence its name. + +**Notes:** + +1) In its standard configuration Galaxy will decompress any .gz files during their upload, so the option to align gzipped fastq input is useful only with customized Galaxy instances or by using linked files as explained in our `recipe for using gzipped fastq files in Galaxy`_ from the `MiModD user guide`_. + +2) To use paired-end fastq data with the tool the read mate information needs to be split over two fastq files in corresponding order. + + **TIP:** If your paired-end data is arranged differently, you may look into the *fastq splitter* and *fastq de-interlacer* tools for Galaxy from the `Fastq Manipulation category`_ of the Galaxy Tool Shed to see if they can convert your files to the expected format. + +3) The tool supports the alignment of reads from the same sequencing run, but distributed across several input files. 
+ + Generally, it expects the reads from each input dataset to belong to one read-group and will abort with an error message if any input dataset declares more than one read-group or sample name in its header. Different datasets, however, are allowed to contain reads from the same read-group (as indicated by matching read-group IDs and sample names in their headers), in which case the reads will be combined into one group in the output. + +4) Read-group information is required for every input dataset! + + We generally recommend storing NGS datasets in SAM/BAM format with run metadata stored in the file header. You can use the *NGS Run Annotation* and *Convert* tools to convert data in fastq format to SAM/BAM with added run information. + + While it is not our recommended approach, you can, if you prefer, align reads from fastq files or SAM/BAM files without header read-group information. To do so, you **must** specify a SAM file that provides the missing information in its header along with the input dataset. You can generate a SAM header file with the *NGS Run Annotation* tool. + + Optionally, a SAM header file can also be used to replace existing read-group information in a headered SAM/BAM input file. This can be used to resolve read-group ID conflicts between multiple input files at tool runtime. + +5) The options available under *further parameter settings* can have **big** effects on the alignment quality. You are strongly encouraged to consult the `tool documentation`_ for detailed explanations of the available options. + +6) Currently, you cannot configure aligner-specific options separately for specific input files from within this Galaxy tool. If you need this advanced level of control, you should use the command line tool ``mimodd snap-batch``. + +.. _Fastq Manipulation category: https://toolshed.g2.bx.psu.edu/repository/browse_repositories_in_category?id=310ff67d4caf6531 +.. _recipe for using gzipped fastq files in Galaxy: http://mimodd.readthedocs.org/en/latest/recipes.html#use-gzipped-fastq-files-in-galaxy +.. _MiModD user guide: http://mimodd.readthedocs.org/en/latest +.. _tool documentation: http://mimodd.readthedocs.org/en/latest/tool_doc.html#snap + + + diff -r 7f7028112439 -r d6ec32ce882b snp_caller_caller.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snp_caller_caller.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,66 @@ + + From a reference and aligned reads generate a BCF file with position-specific variant likelihoods and coverage information + + toolshed_macros.xml + + + mimodd version -q + + mimodd varcall + + "$ref_genome" + #for $l in $list_input + "${l.inputfile}" + #end for + --ofile "$output_vcf" + --depth "$depth" + $group_by_id + $no_md5_check + --verbose + --quiet + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool transforms the read-centered information of its aligned reads input files into position-centered information. + +**It produces a BCF file that serves as the basis for all further variant analyses with MiModD**. + +**Notes:** + +By default, the tool will check whether the input BAM file(s) provide(s) MD5 checksums for the reference genome sequences used during read alignment (the *SNAP Read Alignment* tool stores these in the BAM file header). 
If it finds MD5 sums for all sequences, it will compare them to the actual checksums of the sequences in the specified reference genome and +check that every sequence mentioned in any BAM input file has a counterpart with a matching MD5 sum in the reference genome, aborting with an error message if that is not the case. If it finds sequences with matching checksums but different names in the reference genome, it will use the name from the reference genome file in its output. + +This behavior has two benefits: + +1) It protects against accidental variant calling against a wrong reference genome (i.e., a different one than that used during the alignment step), which would result in wrong calls. This is the primary reason why we recommend leaving the check activated. + +2) It provides an opportunity to change sequence names between aligned reads files and variant call files by providing a reference genome file with altered sequence names (but identical sequence data). + +Since there may be rare cases where you *really* want to call variants against a reference genome with different checksums (e.g., you may have edited the reference sequence based on the alignment results), the check can be turned off, but do so only if you know exactly why. + +----------- + +Internally, the tool uses samtools mpileup combined with bcftools to do all per-nucleotide calculations. + +It exposes just a single configuration parameter of these tools - the *maximum per-BAM depth*. Through this parameter, the maximum number of reads considered for variant calling at any site can be controlled. Its default value of 250 is taken from *samtools mpileup* and is usually suitable. Consider, however, that this gives the maximum read number per input file, so if you have a large number of samples in one input file, it could become necessary to increase the value to get sufficient reads considered per sample. + + + diff -r 7f7028112439 -r d6ec32ce882b snpeff_genomes.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/snpeff_genomes.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,24 @@ + + Checks the local SnpEff installation to compile a list of currently installed genomes + + toolshed_macros.xml + + + mimodd version -q + + mimodd snpeff-genomes -o "$outputfile" + + + + + +.. class:: infomark + +**What it does** + +When executed, this tool searches the host machine's SnpEff installation for properly registered and installed +genome annotation files. The resulting list is added as a plain text file to your history for use with the *Variant Annotation* Tool. + + + + diff -r 7f7028112439 -r d6ec32ce882b toolshed_macros.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/toolshed_macros.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,7 @@ + + + + mimodd + + + diff -r 7f7028112439 -r d6ec32ce882b varextract.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/varextract.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,101 @@ + + from a BCF file + + toolshed_macros.xml + + + mimodd version -q + + mimodd varextract "$ifile" + #if $len($sitesinfo) + -p + #for $source in $sitesinfo + "${source.pre_vcf}" + #end for + #end if + --ofile "$output_vcf" + $keep_alts + --verbose + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool takes as input a BCF file like the ones produced by the *Variant Calling* tool, extracts just the variant sites from it and reports them in VCF format. + +If the BCF input file specifies multiple samples, sites are included if they qualify as variant sites in at least one sample. 
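+
+For reference, the wrapper assembles a single ``mimodd varextract`` call from the settings in this form. A minimal sketch of an equivalent command line (file names invented for illustration)::
+
+  mimodd varextract variant_calls.bcf -p known_sites.vcf --ofile variants.vcf --verbose
+
+where ``-p`` supplies the optional, independently generated VCF input described under *Options* below.
+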
+ +In a typical analysis workflow, you will use the tool's VCF output as input for the *VCF Filter* tool to cut down the often still impressive list of sites to a subset with relevance to your project. + +**Options:** + +1) By default, a variant site is considered to be a position in the genome for which a non-reference allele appears in the inferred genotype of any sample. + + You can select the *keep all sites with alternate bases* option if, instead, you want to extract all sites for which at least one non-reference base has been observed (whether resulting in a non-reference allele call or not). Using this option should rarely be necessary, but it can occasionally be helpful for closer inspection of candidate genomic regions. + +2) During the process of variant extraction, the tool can take into account genome positions specified in one or more independently generated VCF files. If such additional VCF input is provided, the tool output will contain the samples found in these files as additional samples, and sites from the main BCF file will be included if they either qualify as variant sites in at least one sample specified in the BCF or if they are listed in any of the additional VCF files. + + Optional VCF input can be particularly useful in one of the following situations: + + *scenario i* - you have prior information that leads you to think that certain genome positions are of special relevance for your project and, thus, you are interested in the statistics produced by the variant caller for these positions even if they are not considered variant sites. In this case you can use a minimal VCF file to guide the variant extraction process to include these positions. This minimal VCF file needs a minimal header: + + ``##fileformat=VCFv4.2`` + + followed by positional information like in this example:: + + #CHROM POS ID REF ALT QUAL FILTER INFO + chrI 1222 . . . . . . + chrI 2651 . . . . . . + chrI 3659 . . . . . . + chrI 3731 . . . . . . + + , where columns are tab-separated and a dot (``.``) serves as a placeholder for missing information. + + *scenario ii* - you have actual variant calls from an additional sample, but you do not have access to the original sequenced reads data (if you had, the recommended approach would be to align this data along with your other sequencing data or, at least, to perform the *Variant Calling* step together). + + This situation is often encountered with published datasets. Assume you have obtained a list of known single nucleotide variants (SNVs) found in one particular strain of your favorite model organism and you would like to know which of these SNVs are present in the related strains you have sequenced. You have aligned the sequenced reads from your samples and have used the *Variant Calling* tool, which has generated a BCF file ready for variant extraction. If the SNV list for the previously sequenced strain is in VCF format already, you can now just plug it into the analysis process by specifying it in the tool interface as an *independently generated vcf file*. The resulting VCF output file will contain all SNV sites along with the variant sites found in the BCF alone. You can then proceed to the *VCF Filter* tool to look at the original SNV sites only or to investigate any other interesting subset of sites. If the SNV list is in some other format, you will have to convert it to VCF first. 
At a minimum, the file must have a ``##fileformat`` header line like the previous example and have the ``REF`` and ``ALT`` column filled in like so:: + + #CHROM POS ID REF ALT QUAL FILTER INFO + chrI 1897409 . A G . . . + chrI 1897492 . C T . . . + chrI 1897616 . C A . . . + chrI 1897987 . A T . . . + chrI 1898185 . C T . . . + chrI 1898715 . G A . . . + chrI 1898729 . T C . . . + chrI 1900288 . T A . . . + + , in which case the tool will assume that the corresponding sample is homozygous for each of the SNVs. If you need to distinguish between homozygous and heterozygous SNVs, you will have to extend the format to include a FORMAT and a sample column with genotype (GT) information like in this example:: + + #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleX + chrI 1897409 . A G . . . GT 1/1 + chrI 1897492 . C T . . . GT 0/1 + chrI 1897616 . C A . . . GT 0/1 + chrI 1897987 . A T . . . GT 0/1 + chrI 1898185 . C T . . . GT 0/1 + chrI 1898715 . G A . . . GT 0/1 + chrI 1898729 . T C . . . GT 0/1 + chrI 1900288 . T A . . . GT 0/1 + + , in which sampleX would be heterozygous for all SNVs except the first. + + .. class:: warningmark + + If the optional VCF input contains INDEL calls, these will be ignored by the tool. + + + + diff -r 7f7028112439 -r d6ec32ce882b vcf_filter.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/vcf_filter.xml Tue Mar 28 04:34:04 2017 -0400 @@ -0,0 +1,133 @@ + + Extracts lines from a VCF variant file based on field-specific filters + + toolshed_macros.xml + + + mimodd version -q + + mimodd vcf-filter + "$inputfile" + -o "$outputfile" + #if len($datasets): + -s + #for $i in $datasets + "$i.sample" + #end for + --gt + #for $i in $datasets + ## remove whitespace from free-text input + "#echo ("".join($i.GT.split()) or "ANY")#" + #echo " " + #end for + --dp + #for $i in $datasets + "$i.DP" + #end for + --gq + #for $i in $datasets + "$i.GQ" + #end for + --af + #for $i in $datasets + "#echo ($i.AF or "::")#" + #end for + #end if + #if len($regions): + -r + #for $i in $regions + #if $i.stop: + "$i.chrom:$i.start-$i.stop" + #else: + "$i.chrom:$i.start" + #end if + #end for + #end if + #if $vfilter: + --vfilter + ## remove ',' and replace with ' ' + "#echo ('" "'.join($vfilter.split(',')))#" + #end if + $vartype + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. class:: infomark + + **What it does** + +The tool filters a variant file in VCF format to generate a new VCF file with only a subset of the original variants. + +The following types of variant filters can be set up: + +1) Sample-specific filters: + + Filter variants based on their characteristics in the sequenced reads of a specific sample. Multiple sample-specific filters are combined by logical AND, i.e., only variants that pass ALL sample-specific filters are kept. + +2) Region filters: + + Filter variants based on the genomic region they affect. Multiple region filters are combined by logical OR, i.e., variants passing ANY region filter are kept. + +3) Variant type filter: + + Filter variants by their type, i.e., whether they are single nucleotide variations (SNVs) or indels. + +In addition, the *sample* filter can be used to reduce the samples encoded in a multi-sample VCF file to just those specified by the filter. +The *sample* filter is included mainly for compatibility reasons: if an external tool cannot deal with the multi-sample file format, but instead looks only at the first sample-specific column of the file, you can use the filter to turn the multi-sample file into a single-sample file. 
Besides, the filter can also be used to change the order of the samples since it will sort the samples in the order specified in the filter field. + +**Examples of sample-specific filters:** + +*Simple genotype pattern* + +genotype pattern: 1/1 ==> keep all variants in the vcf input file for which the specified sample's genotype is homozygous mutant + +*Complex genotype pattern* + +genotype pattern: 0/1, 0/0 ==> keep all variants for which the sample's genotype is either heterozygous or homozygous wildtype + +*Multiple sample-specific filters* + +Filter 1: genotype pattern: 0/0, Filter 2: genotype pattern 1/1: +==> keep all variants for which the first sample's genotype is homozygous wildtype **and** the second sample's genotype is homozygous mutant + +*Combining sample-specific filter criteria* + +genotype pattern: 1/1, depth of coverage: 3, genotype quality: 9 +==> keep variants for which the sample's genotype is homozygous mutant **and** for which this genotype assignment is corroborated by a genotype quality score of at least 9 +**and** at least three reads from the sample cover the variant site + +**TIP:** + +As in the example above, genotype quality is typically most useful in combination with a genotype pattern. +It then acts, effectively, to make the genotype filter more stringent. + + + + +
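+
+For reference, the combined filter from the last example corresponds to a ``mimodd vcf-filter`` command line along these lines (input, output, and sample names invented for illustration)::
+
+  mimodd vcf-filter variants.vcf -o filtered.vcf -s sampleX --gt 1/1 --dp 3 --gq 9
+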