# HG changeset patch
# User wolma
# Date 1407857175 14400
# Node ID 7da2c9654a8368a9c76c5efda05487db6e72d376
Uploaded
diff -r 000000000000 -r 7da2c9654a83 annotate_variants.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/annotate_variants.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,165 @@
+
+ Predict the effects of SNPs and indels on known genes in the reference genome using SnpEff
+
+ python3
+ MiModD
+
+
+ mimodd annotate
+
+ $inputfile
+
+ #if $str($annotool.name)=='snpeff':
+ --genome ${annotool.genomeVersion}
+ #if $annotool.ori_output:
+ --snpeff_out $snpeff_file
+ #end if
+ #if $annotool.stats:
+ --stats $summary_file
+ #end if
+ ${annotool.snpeff_settings.chr} ${annotool.snpeff_settings.no_us} ${annotool.snpeff_settings.no_ds} ${annotool.snpeff_settings.no_intron} ${annotool.snpeff_settings.no_intergenic} ${annotool.snpeff_settings.no_utr}
+ #if $annotool.snpeff_settings.min_cov:
+ --minC ${annotool.snpeff_settings.min_cov}
+ #end if
+ #if $annotool.snpeff_settings.min_qual:
+ --minQ ${annotool.snpeff_settings.min_qual}
+ #end if
+ #if $annotool.snpeff_settings.ud:
+ --ud ${annotool.snpeff_settings.ud}
+ #end if
+ #end if
+
+ --ofile $outputfile
+ #if $str($formatting.oformat) == "text":
+ --oformat text
+ #end if
+ #if $str($formatting.oformat) == "html":
+ #if $formatting.formatter_file:
+ --link ${formatting.formatter_file}
+ #end if
+ #if $formatting.species:
+ --species ${formatting.species}
+ #end if
+ #end if
+
+ #if $str($grouping):
+ --grouping $grouping
+ #end if
+ --verbose
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ## default settings for SnpEff
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ (annotool['name']=="snpeff" and annotool['ori_output'])
+
+
+ (annotool['name']=="snpeff" and annotool['stats'])
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool turns a variant list in VCF format into a more readable summary table listing variant sites and effects.
+
+If installed, the variant annotation tool SnpEff can be used transparently to determine the genomic features, e.g., genes or transcripts, affected by the variants.
+
+Use of this feature requires that you have an appropriate SnpEff genome file installed on the host machine. You can use the *List installed SnpEff genomes* tool to generate a list of all available SnpEff genomes.
+This list can then be used (by selecting the dataset as the *genome list*) to populate the *genome* dropdown menu, from which you can select the SnpEff genome file to be used for the annotation.
+
+HTML and plain text are supported as output formats.
+In HTML mode, variant positions and/or affected genomic features can be turned into hyperlinks to corresponding views in web-based genome browsers.
+
+The behavior of this feature depends on:
+
+1) Recognition of the species that is analyzed
+
+ You can declare the species you are working with using the *Species* text field.
+ If you do not declare the species explicitly but choose SnpEff for effect annotation, the tool can usually auto-detect the species from the SnpEff genome you are using.
+ If no species gets assigned either way, no hyperlinks will be generated and the HTML output will look essentially like plain text.
+
+2) Available hyperlink formatting rules for this species
+
+ When the species has been recognized, the tool checks if you have selected an *optional file with hyperlink formatting instructions*.
+ If you did and that file contains an entry matching the recognized species, that entry will be used as a template to construct the hyperlinks.
+ If no matching entry is found in the file, an error will be raised.
+
+ If you did not supply a hyperlink formatting instruction file, the tool will consult an internal lookup table to see if it finds default rules for the construction of the hyperlinks for the species.
+ If not, no hyperlinks will be generated and the HTML output will look essentially like plain text.
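
The species/hyperlink fallback described above can be sketched as follows. This is an illustrative model only, not MiModD's actual code; the function name, rule dictionaries, and template value are hypothetical.

```python
def link_template(species, user_rules=None, builtin_rules=None):
    """Return a hyperlink template, or None for plain-text-like HTML output."""
    if species is None:           # species neither declared nor auto-detected
        return None
    if user_rules is not None:    # user-supplied formatting instruction file
        if species not in user_rules:
            # per the help text: a missing entry in a supplied file is an error
            raise KeyError("no formatting entry for %s" % species)
        return user_rules[species]
    # otherwise fall back to the internal lookup table (may yield None)
    return (builtin_rules or {}).get(species)

assert link_template(None) is None
assert link_template("c_elegans", builtin_rules={"c_elegans": "tmpl"}) == "tmpl"
```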
+
+
+
+
diff -r 000000000000 -r 7da2c9654a83 bamsort.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/bamsort.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,40 @@
+
+ Sort a BAM file by coordinates (or names) of the mapped reads
+
+ python3
+ MiModD
+
+
+ mimodd sort $inputfile -o $output --oformat $oformat $by_name
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool sorts a BAM file of aligned reads, typically by the reference genome coordinates that the reads have been mapped to.
+
+Coordinate-sorted input files are expected by the downstream MiModD tools *Variant Calling and Coverage Analysis* and *Deletion prediction*.
+
+Note, however, that the *SNAP Read Alignment* produces coordinate-sorted output by default and it is only necessary to sort files that come from other sources or from *SNAP Read Alignment* jobs with a custom sort order.
+
+The option *Sort by read names instead of coordinates* is useful if you want to re-align coordinate-sorted paired-end data. In *paired-end mode*, the *SNAP Read Alignment* tool expects the reads in the input file to be arranged in read pairs, i.e., the forward read information of a pair must be followed immediately by its reverse mate information, which is typically not the case in coordinate-sorted files. Resorting such files by read names fixes this problem.
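
The re-pairing effect of a name sort can be sketched in a few lines. This is an illustrative model with made-up records, not MiModD's implementation:

```python
# Coordinate-sorted paired-end records: (read name, chromosome, position).
# Mates of a pair are separated by other reads, which breaks SNAP's
# paired-end input expectation.
records = [
    ("frag2", "chr1", 100), ("frag1", "chr1", 150),
    ("frag1", "chr2", 300), ("frag2", "chr2", 900),
]
# Sorting by read name makes the mates of each pair adjacent again.
by_name = sorted(records, key=lambda rec: rec[0])
assert [r[0] for r in by_name] == ["frag1", "frag1", "frag2", "frag2"]
```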
+
+
+
+
diff -r 000000000000 -r 7da2c9654a83 convert.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/convert.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,124 @@
+
+ between different sequence data formats
+
+ python3
+ MiModD
+
+
+ mimodd convert
+
+ #for $i in $mode.input_list
+ ${i.file1}
+ #if $str($mode.iformat) in ("fastq_pe", "gz_pe"):
+ ${i.file2}
+ #end if
+ #end for
+ #if $str($header) != "None":
+ --header $header
+ #end if
+ --output $outputname
+ --iformat $(mode.iformat)
+ --oformat $(mode.oformat)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool converts between different file formats used for storing next-generation sequencing data.
+
+As input file types, it can handle uncompressed or gzipped fastq, SAM or BAM format, which it can convert to SAM or BAM format.
+
+**Notes:**
+
+1) In its standard configuration, Galaxy will decompress any .gz files during their upload, effectively preventing the use of gzipped fastq files.
+
+2) The tool can convert fastq files representing data from paired-end sequencing runs to appropriate SAM/BAM format provided that the mate information is split over two fastq files in corresponding order.
+
+ **TIP:** If your paired-end data is arranged differently, you may look into the *FASTQ splitter* and *FASTQ de-interlacer* tools to see if they convert your files to the right format.
+
+3) Specifying a SAM header file to use in the conversion is highly recommended as this will add sequencing run metadata to the results file, which is the main purpose of storing unaligned NGS data in SAM/BAM format.
+
+ See the help on the *NGS Run Annotation* tool for information on how to generate a new header file.
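
Note 2 above relies on the two fastq files listing mates in corresponding order, so records can be paired positionally. A minimal sketch of that pairing (illustrative only, with made-up reads):

```python
import itertools

def fastq_records(lines):
    """Group flat fastq lines into 4-line records (name, seq, plus, qual)."""
    it = iter(lines)
    while chunk := list(itertools.islice(it, 4)):
        yield tuple(chunk)

# Forward mates in file 1, reverse mates in file 2, same record order.
file1 = ["@frag1/1", "ACGT", "+", "IIII", "@frag2/1", "TTTT", "+", "IIII"]
file2 = ["@frag1/2", "CCCC", "+", "IIII", "@frag2/2", "GGGG", "+", "IIII"]

# zip() pairs record i of file 1 with record i of file 2.
pairs = list(zip(fastq_records(file1), fastq_records(file2)))
assert pairs[0][0][0] == "@frag1/1" and pairs[0][1][0] == "@frag1/2"
```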
+
+
+
+
+
diff -r 000000000000 -r 7da2c9654a83 deletion_predictor.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/deletion_predictor.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,60 @@
+
+ Predicts deletions in one or more aligned read samples based on coverage of the reference genome and on insert sizes
+
+ python3
+ MiModD
+
+
+ mimodd delcall
+ #for $l in $list_input
+ ${l.bamfile}
+ #end for
+ $covfile -o $outputfile
+ --max_cov $max_cov --min_size $min_size $include_uncovered $group_by_id --verbose
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool predicts deletions from paired-end data in a two-step process.
+
+First, it finds regions of low coverage, i.e., candidate regions for deletions, by scanning a coverage file as produced by the *Variant Calling and Coverage Analysis* tool.
+The *maximal coverage allowed inside a low-coverage region* and the *minimal deletion size* parameters are used at this step to define what is considered a low-coverage region.
+
+Second, the tool assesses every low-coverage region statistically for evidence of it being a real deletion.
+This step requires paired-end data since it relies on shifts in the distribution of read pair insert sizes around real deletions.
+
+By default, the tool only reports deletions, i.e., the fraction of low-coverage regions that pass the statistical test.
+If *include low-coverage regions* is selected, regions that failed the test will also be reported.
+
+With *group reads based on read group id only* selected, as it is by default, grouping of reads into samples is done strictly based on their read group IDs.
+With the option deselected, grouping is done based on sample names in the first step of the analysis, i.e., the reads of all samples with a shared sample name are used to identify low-coverage regions.
+In the second step, however, reads will be regrouped by their read group IDs again, i.e., the statistical assessment for real deletions is always done on a per-read-group basis.
+
+**TIP:**
+Deselecting *group reads based on read group id only* can be useful, for example, if you have both paired-end and single-end sequencing data for the same sample.
+
+In this case, the two sets of reads will usually share a common sample name, but differ in their read groups.
+With grouping based on sample names, the single-end data can be used together with the paired-end data to identify low-coverage regions, thus increasing overall coverage and reliability of this step.
+Still, the assessment of deletions will use only the paired-end data (auto-detecting that the single-end reads do not provide insert size information).
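
The first step described above (finding low-coverage candidate regions) can be sketched as a simple scan over per-base coverage. This is an assumed reading of the two parameters, not MiModD's actual implementation:

```python
def low_coverage_regions(coverage, max_cov, min_size):
    """Return (start, end) half-open stretches where every base has
    coverage <= max_cov and the stretch spans at least min_size bases."""
    regions, start = [], None
    for pos, cov in enumerate(coverage):
        if cov <= max_cov:
            if start is None:
                start = pos          # a low-coverage stretch begins
        elif start is not None:
            if pos - start >= min_size:
                regions.append((start, pos))
            start = None             # stretch ended (too short stretches dropped)
    if start is not None and len(coverage) - start >= min_size:
        regions.append((start, len(coverage)))
    return regions

cov = [30, 2, 1, 0, 0, 1, 25, 30, 0, 30]
# positions 1-5 stay <= 2 and span 5 bases; the lone 0 at position 8 is
# shorter than min_size and is not reported
assert low_coverage_regions(cov, max_cov=2, min_size=3) == [(1, 6)]
```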
+
+
+
+
diff -r 000000000000 -r 7da2c9654a83 reheader.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/reheader.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,33 @@
+
+ From a BAM file generate a new file with the original header (if any) replaced by that found in a second SAM file
+
+ python3
+ MiModD
+
+
+ mimodd reheader $template $input -o $output --verbose
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool replaces the header of the input BAM file (i.e., its metadata) with that found in the template SAM file and writes the result to a new BAM file.
+
+Typically, you will generate the header template file with the *NGS Run Annotation* tool, but any SAM file with header information can be used instead.
+
+
+
+
+
diff -r 000000000000 -r 7da2c9654a83 sam_header.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/sam_header.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,130 @@
+
+ Create a SAM format header from run metadata for sample annotation.
+
+ python3
+ MiModD
+
+
+ mimodd header
+
+ --rg_id "$rg_id"
+ --rg_sm "$rg_sm"
+
+ #if $str($rg_cn):
+ --rg_cn "$rg_cn"
+ #end if
+ #if $str($rg_ds):
+ --rg_ds "$rg_ds"
+ #end if
+ #if $str($anno) and $str($month) and $str($day):
+ --rg_dt "$anno-$month-$day"
+ #end if
+ #if $str($rg_lb):
+ --rg_lb "$rg_lb"
+ #end if
+ #if $str($rg_pl):
+ --rg_pl "$rg_pl"
+ #end if
+ #if $str($rg_pi):
+ --rg_pi "$rg_pi"
+ #end if
+ #if $str($rg_pu):
+ --rg_pu "$rg_pu"
+ #end if
+
+ --outputfile $outputfile
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+This tool takes the user-provided information about a next-generation sequencing run and constructs a valid header in the SAM file format from it.
+
+The result file can be used by the tools *Convert* and *Reheader* or in the *SNAP Read Alignment* step to add run metadata to sequenced reads files (or to overwrite pre-existing information).
+
+**Note:**
+
+**MiModD requires run metadata for every input file at the Alignment step!**
+
+**Tip:**
+
+While you can run alignments from fastq files by providing a custom header file directly to the *SNAP Read Alignment* tool, the **recommended approach** is to first convert all input files to SAM/BAM format with appropriate header information and to archive all datasets in that format prior to any downstream analysis. Although a bit more time-consuming, this practice protects against information loss and ensures that the input datasets will remain useful for others in the future.
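
The header this tool builds from run metadata follows the SAM specification's `@RG` (read group) tags. A minimal sketch of such construction (hypothetical helper, not MiModD's code; only `ID` and `SM` are treated as mandatory, mirroring the required fields above):

```python
def sam_header(rg_id, rg_sm, **optional):
    """Build a minimal SAM header with one @RG line.
    Keyword arguments map to optional @RG tags, e.g. dt="2014-08-12" -> DT."""
    tags = [f"ID:{rg_id}", f"SM:{rg_sm}"]
    tags += [f"{key.upper()}:{value}" for key, value in optional.items()]
    return "@HD\tVN:1.4\n@RG\t" + "\t".join(tags)

header = sam_header("run1", "sampleA", dt="2014-08-12", pl="ILLUMINA")
assert "@RG\tID:run1\tSM:sampleA\tDT:2014-08-12\tPL:ILLUMINA" in header
```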
+
+
+
+
diff -r 000000000000 -r 7da2c9654a83 sampleinfo.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/sampleinfo.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,29 @@
+
+ for supported data formats.
+
+ python3
+ MiModD
+
+
+ mimodd info $ifile -o $outputfile --verbose
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool inspects the input file and writes a report about the samples (and read groups) encoded in it.
+
+It works with all file formats used and supported by MiModD that contain sample metadata, i.e., headered SAM/BAM files, VCF files with sample information, and the cov files produced during Coverage Analysis.
+
+
+
diff -r 000000000000 -r 7da2c9654a83 seqdict.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/seqdict.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,30 @@
+
+ for use with CloudMap.
+
+ python3
+ MiModD
+
+
+ mimodd cm_seqdict $ifile -o $outputfile
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The purpose of this tool is solely to provide compatibility with the external **CloudMap** *Variant Discovery Mapping* and *Hawaiian Variant Mapping* tools.
+
+From a VCF file, the tool extracts the chromosome names and sizes and reports them in the **CloudMap** *species configuration file* format.
+Such a file is required as input to the **CloudMap** mapping tools if you are working with a species other than the natively supported ones (i.e., other than C. elegans or A. thaliana for the current version of CloudMap).
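
Chromosome names and sizes are available in the `##contig` lines of a VCF header, so the extraction step can be sketched like this (illustrative only; the contig values are made up and the CloudMap output format is not reproduced here):

```python
import re

vcf_header = [
    "##fileformat=VCFv4.1",
    "##contig=<ID=chrI,length=15072423>",
    "##contig=<ID=chrII,length=15279345>",
]

contigs = {}
for line in vcf_header:
    # ##contig meta-lines carry ID and length per the VCF specification
    m = re.match(r"##contig=<ID=([^,>]+),length=(\d+)>", line)
    if m:
        contigs[m.group(1)] = int(m.group(2))

assert contigs == {"chrI": 15072423, "chrII": 15279345}
```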
+
+
+
diff -r 000000000000 -r 7da2c9654a83 snap_caller.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/snap_caller.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,221 @@
+
+ Map sequence reads to a reference genome using SNAP
+
+ python3
+ MiModD
+
+
+ mimodd snap_batch -s
+ ## SNAP calls (considering different cases)
+
+ #for $i in $datasets
+ "snap ${i.mode_choose.mode} $ref_genome
+ #if $str($i.mode_choose.mode) == "paired" and $str($i.mode_choose.input.iformat) in ("fastq", "gz"):
+${i.mode_choose.input.ifile1} ${i.mode_choose.input.ifile2}
+ #else:
+${i.mode_choose.input.ifile}
+ #end if
+--outputfile $outputfile --iformat ${i.mode_choose.input.iformat} --oformat $oformat
+--idx_seedsize $set.seedsize
+--idx_slack $set.slack --maxseeds $set.maxseeds --maxhits $set.maxhits --clipping=$set.clipping --maxdist $set.maxdist --confdiff $set.confdiff
+ #if $i.mode_choose.input.header:
+--header ${i.mode_choose.input.header}
+ #end if
+ #if $str($i.mode_choose.mode) == "paired":
+--spacing $set.sp_min $set.sp_max
+ #end if
+ #if $str($set.selectivity) != "off":
+--selectivity $set.selectivity
+ #end if
+ #if $str($set.filter_output) != "off":
+--filter_output $set.filter_output
+ #end if
+ #if $str($set.sort) != "off":
+--sort $set.sort
+ #end if
+ #if $str($set.mmatch_notation) == "general":
+-M
+ #end if
+--max_mate_overlap $set.max_mate_overlap
+--verbose
+"
+ #end for
+
+
+
+ ## mandatory arguments (and mode-conditionals)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ## optional arguments
+
+
+
+
+
+
+
+ ## default settings
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ## change settings
+
+
+
+
+
+ ## paired-end specific options
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool aligns the sequenced reads in an arbitrary number of input files against a common reference genome and stores the results in a single, possibly multi-sample output file.
+
+It does so by using the ultrafast, hashtable-based aligner SNAP, but unless you want to change aligner-specific options you do not have to know anything about this implementation detail.
+
+**Notes:**
+
+1) The tool requires that each input file contains adequate header information (i.e., metadata about the read groups and samples it encodes). The *custom header file* is offered as an **optional choice** for input files that **may** already contain such header information, but you **must** specify it if your input file does not provide the information. You **can** also provide a header file for an input file that has header information, in which case the custom header will overwrite the existing header of the input file.
+
+2) Currently, you cannot configure aligner-specific options separately for specific input files from within this Galaxy tool. If you need this advanced level of control, you should use the command line tool ``mimodd snap_batch``.
+
+
+
+
+
+
+
+
+
+
+
+
diff -r 000000000000 -r 7da2c9654a83 snp_caller_caller.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/snp_caller_caller.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,99 @@
+
+ Predict SNPs and indels in one or more aligned read samples and calculate the coverage of every base in the reference genome using samtools/bcftools
+
+ python3
+ MiModD
+
+
+ mimodd varcall
+
+ $ref_genome
+ #for $l in $list_input
+ ${l.inputfile}
+ #end for
+ --output_vcf=$output_vcf
+ #if $cov:
+ --output_cov=$output_cov
+ #end if
+ #if $cstats:
+ --cstats=$output_stats
+ #end if
+ --depth=$depth
+ #if $sites.use_sites:
+ --sites=$sites.sitelist
+ --output_sites=$output_sites
+ #end if
+ $group_by_id
+ --verbose
+ --quiet
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ## default settings
+
+
+
+
+ ## change settings
+
+
+
+
+
+
+
+
+
+
+
+ cov
+
+
+ cstats
+
+
+
+ sites['use_sites']
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool calls variants (SNPs and indels) with respect to the reference genome from the aligned reads in the input files.
+
+It produces up to three output files:
+
+1) The *variant sites file* is in vcf format and includes one line for every genomic position at which a variant is found.
+
+ When the input files hold aligned reads from more than one sample, a variant detected in one sample is enough for inclusion.
+ The sample-specific information in the last columns of the output file will provide detailed information about the genotype likelihoods for each sample.
+
+2) The optional *coverage file* reports the depth of coverage for each sample per base across the entire reference genome.
+
+ This file is required by the *Deletion Prediction* tool.
+
+3) The optional *position-specified sites file* is in vcf format again. If *report on sites specified by positions* was selected, it will have one line per user-defined genomic position independent of whether that position is included also in the variant sites vcf file or not.
+
+ **TIP:** This file is what you will need for the **Cloudmap** *Hawaiian Variant Mapping* tool.
+
+**Note:**
+
+The tool uses samtools mpileup and bcftools for variant calling, but exposes just a single configuration parameter of these tools: the *maximum per-BAM depth*. This parameter controls the maximum number of reads considered for variant calling at any site. Its default value of 250 is taken from *samtools mpileup* and is usually suitable. Note, however, that it caps the read number per input file, so if one input file holds a large number of samples, you may need to increase the value to get sufficient reads considered per sample.
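
The note above amounts to simple arithmetic: because the depth cap applies per input file, the reads available per sample shrink as more samples share one file. A back-of-envelope sketch with assumed numbers:

```python
# --depth caps reads per input file, not per sample
max_depth = 250          # the tool's default, from samtools mpileup
samples_in_file = 5      # hypothetical multi-sample BAM

# expected upper bound on reads considered per sample at any site
per_sample = max_depth // samples_in_file
assert per_sample == 50  # with many samples, consider raising the depth cap
```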
+
+
+
diff -r 000000000000 -r 7da2c9654a83 snpeff_genomes.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/snpeff_genomes.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,23 @@
+
+ Checks the local SnpEff installation to compile a list of currently installed genomes
+
+ python3
+ MiModD
+
+
+ mimodd snpeff_genomes -o $outputfile
+
+
+
+
+
+.. class:: infomark
+
+**What it does**
+
+When executed, this tool searches the host machine's SnpEff installation for properly registered and installed genome annotation files. The resulting list is added as a plain-text file to your history for use with the *Variant Annotation* tool.
+
+
+
+
diff -r 000000000000 -r 7da2c9654a83 vcf_filter.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/vcf_filter.xml Tue Aug 12 11:26:15 2014 -0400
@@ -0,0 +1,127 @@
+
+ extracts lines from a vcf variant file based on field-specific filters
+
+ python3
+ MiModD
+
+
+ mimodd vcf_filter
+ $inputfile
+ -o $outputfile
+ #if len($datasets):
+ -s
+ #for $i in $datasets
+ $i.sample
+ #end for
+ --gt
+ #for $i in $datasets
+ ## remove whitespace from free-text input
+ #echo ("".join($i.GT.split()) or "ANY")
+ #echo " "
+ #end for
+ --dp
+ #for $i in $datasets
+ $i.DP
+ #end for
+ --gq
+ #for $i in $datasets
+ $i.GQ
+ #end for
+ #end if
+ #if len($regions):
+ -r
+ #for $i in $regions
+ #if $i.stop:
+ $i.chrom:$i.start-$i.stop
+ #else:
+ $i.chrom:$i.start
+ #end if
+ #end for
+ #end if
+ #if $vfilter:
+ --v_filter
+ ## remove ',' (and possibly adjacent whitespace) and replace with ' '
+ #echo (" ".join("".join($vfilter.split()).split(',')))
+ #end if
+ $vartype
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+ **What it does**
+
+The tool filters a variant file in VCF format to generate a new VCF file with only a subset of the original variants.
+
+The following types of variant filters can be set up:
+
+1) Sample-specific filters:
+
+ Filter variants based on their characteristics in the sequenced reads of a specific sample. Multiple sample-specific filters are combined by logical AND, i.e., only variants that pass ALL sample-specific filters are kept.
+
+2) Region filters:
+
+ Filter variants based on the genomic region they affect. Multiple region filters are combined by logical OR, i.e., variants passing ANY region filter are kept.
+
+3) Variant type filter:
+
+ Filter variants by their type, i.e., whether they are single nucleotide variations (SNVs) or indels.
+
+In addition, the *sample* filter can be used to reduce the samples encoded in a multi-sample VCF file to just those specified by the filter.
+The *sample* filter is included for compatibility reasons: if an external tool cannot deal with the multi-sample file format but looks only at the first sample-specific column of the file, you can use the filter to turn the multi-sample file into a single-sample file.
+
+**Examples of sample-specific filters:**
+
+*Simple genotype pattern*
+
+genotype pattern: 1/1 ==> keep all variants in the vcf input file for which the specified sample's genotype is homozygous mutant
+
+*Complex genotype pattern*
+
+genotype pattern: 0/1, 0/0 ==> keep all variants for which the sample's genotype is either heterozygous or homozygous wildtype
+
+*Multiple sample-specific filters*
+
+Filter 1: genotype pattern: 0/0, Filter 2: genotype pattern 1/1:
+==> keep all variants for which the first sample's genotype is homozygous wildtype **and** the second sample's genotype is homozygous mutant
+
+*Combining sample-specific filter criteria*
+
+genotype pattern: 1/1, depth of coverage: 3, genotype quality: 9
+==> keep variants for which the sample's genotype is homozygous mutant **and** for which this genotype assignment is corroborated by a genotype quality score of at least 9
+**and** at least three reads from the sample cover the variant site
+
+**TIP:**
+
+As in the example above, genotype quality is typically most useful in combination with a genotype pattern.
+It acts then, effectively, to make the genotype filter more stringent.
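
The combination rules above (OR within one sample's comma-separated genotype pattern, AND across sample-specific filters) can be sketched as follows. This is an illustrative model, not MiModD's code; sample names and genotypes are made up:

```python
def passes(variant_gts, filters):
    """variant_gts: sample name -> called genotype for one variant.
    filters: sample name -> set of accepted genotypes (OR within a sample).
    All sample-specific filters must pass (AND across samples)."""
    return all(variant_gts[sample] in allowed
               for sample, allowed in filters.items())

# e.g. keep variants homozygous mutant in "mutant" AND either heterozygous
# or homozygous wildtype in "parent" (pattern "0/1, 0/0")
filters = {"mutant": {"1/1"}, "parent": {"0/1", "0/0"}}

assert passes({"mutant": "1/1", "parent": "0/1"}, filters)
assert not passes({"mutant": "0/1", "parent": "0/0"}, filters)
```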
+
+
+
+
+