view picard_EstimLibComplexity.xml @ 143:b2ca2d181fb4 draft

fixed downsample sam in picard1106 (accept bam)
author Rayan Chikhi <chikhi@psu.edu>
date Mon, 16 Jun 2014 17:38:15 -0400
parents 6ad4014eddd7
children
line wrap: on
line source

<tool name="Estimate Library Complexity" id="rgEstLibComp" version="1.106.0">
  <description>estimates library complexity from sequence of read pairs alone</description>
  <requirements><requirement type="package" version="1.106.0">picard</requirement></requirements>
  <command interpreter="python">
   picard_wrapper.py -i "${input_file}" -n "${out_prefix}" --tmpdir "${__new_file_path__}" --minid "${minIDbases}"
   --maxdiff "${maxDiff}" --minmeanq "${minMeanQ}" --readregex "${readRegex}" --optdupdist "${optDupeDist}"
   -j "\$JAVA_JAR_PATH/EstimateLibraryComplexity.jar" -d "${html_file.files_path}" -t "${html_file}"
  </command>
  <inputs>
    <param format="bam,sam" name="input_file" type="data" label="SAM/BAM dataset"
      help="If empty, upload or import a SAM/BAM dataset."/>
    <param name="out_prefix" value="Library Complexity" type="text"
      label="Title for the output file" help="Use this remind you what the job was for." size="80" />
    <param name="minIDbases" value="5" type="integer"  label="Minimum identical bases at starts of reads for grouping" size="5" 
      help="Total_reads / 4^max_id_bases reads will be compared at a time. Lower numbers = more accurate results and exponentially more time/memory." />
     <param name="maxDiff" value="0.03" type="float"
      label="Maximum difference rate for identical reads" size="5" 
      help="The maximum rate of differences between two reads to call them identical" />
     <param name="minMeanQ" value="20" type="integer"
      label="Minimum percentage" size="5" 
      help="The minimum mean quality of bases in a read pair. Lower average quality reads filtered out from all calculations" />
     <param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" type="text" size="120"
      label="Regular expression that can be used to parse read names in the incoming SAM file" 
      help="Names are parsed to extract: tile/region, x coordinate and y coordinate, to estimate optical duplication rate" >
      <sanitizer>
        <valid initial="string.printable">
         <remove value="&apos;"/>
        </valid>
        <mapping initial="none">
          <add source="&apos;" target="__sq__"/>
        </mapping>
      </sanitizer>
     </param>
     <param name="optDupeDist" value="100" type="text"
      label="The maximum offset between two duplicte clusters in order to consider them optical duplicates." size="5" 
      help="e.g. 5-10 pixels. Later Illumina software versions multiply pixel values by 10, in which case 50-100" />

  </inputs>
  <outputs>
    <data format="html" name="html_file" label="${out_prefix}_lib_complexity.html"/>
  </outputs>
  <tests>
    <test>
      <param name="input_file" value="picard_input_tiny.sam" />
      <param name="out_prefix" value="Library Complexity" />
      <param name="minIDbases" value="5" />
      <param name="maxDiff" value="0.03" />
      <param name="minMeanQ" value="20" />
      <param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" />
      <param name="optDupeDist" value="100" />      
      <output name="html_file" file="picard_output_estlibcomplexity_tinysam.html" ftype="html" lines_diff="30" />
    </test>
  </tests>
  <help>

.. class:: infomark

**Purpose**

Attempts to estimate library complexity from sequence alone. 
Does so by sorting all reads by the first N bases (5 by default) of each read and then 
comparing reads with the first N bases identical to each other for duplicates. Reads are considered to be 
duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).

Reads of poor quality are filtered out so as to provide a more accurate estimate. 
The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than 
MIN_MEAN_QUALITY across either the first or second read.

The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes these in the 
calculation of library size. Also, since there is no alignment to screen out technical reads one 
further filter is applied on the data. After examining all reads a histogram is built of 
[#reads in duplicate set -> #of duplicate sets]; all bins that contain exactly one duplicate set are
then removed from the histogram as outliers before library size is estimated.

**Picard documentation**

This is a Galaxy wrapper for EstimateLibraryComplexity, a part of the external package Picard-tools_.

 .. _Picard-tools: http://www.google.com/search?q=picard+samtools

-----

.. class:: infomark

**Inputs, outputs, and parameters**

Picard documentation says (reformatted for Galaxy):

.. csv-table::
   :header-rows: 1

    Option	Description
    "INPUT=File","One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped. This option may be specified 0 or more times."
    "OUTPUT=File","Output file to writes per-library metrics to. Required."
    "MIN_IDENTICAL_BASES=Integer","The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU. Default value: 5."
    "MAX_DIFF_RATE=Double","The maximum rate of differences between two reads to call them identical. Default value: 0.03. "
    "MIN_MEAN_QUALITY=Integer","The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations. Default value: 20."
    "READ_NAME_REGEX=String","Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. The regular expression should contain three capture groups for the three variables, in order. Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to clear the default value."
    "OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer","The maximum offset between two duplicte clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal. Default value: 100"
    "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false. This option can be set to 'null' to clear the default value. "

.. class:: warningmark

**Warning on SAM/BAM quality**

Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
to be the only way to deal with SAM/BAM that cannot be parsed.

.. class:: infomark

**Note on the Regular Expression**

(from the Picard docs)
This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file. 
These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. 
The regular expression should contain three capture groups for the three variables, in order. 
Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.


  </help>
</tool>