# HG changeset patch
# User devteam
# Date 1393372485 18000
# Node ID 5fe20cda6a51012c98226ed680f51c7de43fe3cf
# Parent a3068d7de91d4d2cf7aaceacc49aabd539c54aa7
Uploaded

diff -r a3068d7de91d -r 5fe20cda6a51 picard_MarkDuplicates.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_MarkDuplicates.xml Tue Feb 25 18:54:45 2014 -0500
@@ -0,0 +1,131 @@
+
+  locates duplicate molecules
+
+   picard_wrapper.py -i "${input_file}" -n "${out_prefix}" --tmpdir "${__new_file_path__}" -o "${out_file}"
+   --remdups "${remDups}" --assumesorted "${assumeSorted}" --readregex "${readRegex}" --optdupdist "${optDupeDist}"
+   -j "\$JAVA_JAR_PATH/MarkDuplicates.jar" -d "${html_file.files_path}" -t "${html_file}" -e "${input_file.ext}"
+
+  picard
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Marks all duplicate reads in a provided SAM or BAM file and either removes them or flags them.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for MarkDuplicates, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+-----
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+Picard documentation says (reformatted for Galaxy):
+
+.. csv-table:: Mark Duplicates docs
+    :header-rows: 1
+
+    Option,Description
+    "INPUT=File","The input SAM or BAM file to analyze. Must be coordinate sorted. Required."
+    "OUTPUT=File","The output file to write marked records to. Required."
+    "METRICS_FILE=File","File to write duplication metrics to. Required."
+    "REMOVE_DUPLICATES=Boolean","If true, do not write duplicates to the output file instead of writing them with appropriate flags set. Default value: false."
+    "ASSUME_SORTED=Boolean","If true, assume that the input file is coordinate sorted, even if the header says otherwise. Default value: false."
+    "MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=Integer","This option is obsolete. ReadEnds will always be spilled to disk. Default value: 50000."
+    "MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=Integer","Maximum number of file handles to keep open when spilling read ends to disk."
+    "READ_NAME_REGEX=String","Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate."
+    "OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer","The maximum offset between two duplicate clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal. Default value: 100"
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+.. class:: infomark
+
+**Note on the Regular Expression**
+
+(from the Picard docs)
+This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. The regular expression should contain three capture groups for the three variables, in order. Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).
+
+Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged unless the remove duplicates option is selected. Removing duplicates outright is appropriate in some cases, but please do so only if you really understand what you are doing.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+