# HG changeset patch
# User devteam
# Date 1351003775 14400
# Node ID ff4ec13e496e5a4afafd5c0616975be534e526ae
Uploaded tarball to repository
diff -r 000000000000 -r ff4ec13e496e picard_AddOrReplaceReadGroups.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_AddOrReplaceReadGroups.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,205 @@
+
+ picard
+
+ picard_wrapper.py
+ --input="$inputFile"
+ --rg-lb="$rglb"
+ --rg-pl="$rgpl"
+ --rg-pu="$rgpu"
+ --rg-sm="$rgsm"
+ --rg-id="$rgid"
+ --rg-opts=${readGroupOpts.rgOpts}
+ #if $readGroupOpts.rgOpts == "full"
+ --rg-cn="$readGroupOpts.rgcn"
+ --rg-ds="$readGroupOpts.rgds"
+ #end if
+ --output-format=$outputFormat
+ --output=$outFile
+ -j "\$JAVA_JAR_PATH/AddOrReplaceReadGroups.jar"
+ --tmpdir "${__new_file_path__}"
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Add or Replace Read Groups in an input BAM or SAM file.
+
+**Read Groups are Important!**
+
+Many downstream analysis tools (GATK, for example) require BAM datasets to contain read groups. Even if you are not going to use GATK, setting read groups correctly from the start will greatly simplify later analysis. The table below explains the read group fields; it is taken from the GATK FAQ webpage:
+
+.. csv-table::
+ :header-rows: 1
+
+ Tag,Importance,Definition,Meaning
+ "ID","Required","Read group identifier. Each @RG line must have a unique ID. The value of ID is used in the RG tags of alignment records. Must be unique among all read groups in header section. Read group IDs may be modified when merging SAM files in order to handle collisions.","Ideally, this should be a globally unique identifier across all sequencing data in the world, such as the Illumina flowcell + lane name and number. Will be referenced by each read with the RG:Z field, allowing tools to determine the read group information associated with each read, including the sample from which the read came. Also, a read group is effectively treated as a separate run of the NGS instrument in tools like base quality score recalibration (a GATK component) -- all reads within a read group are assumed to come from the same instrument run and to therefore share the same error model."
+ "SM","Required. As important as ID.","Sample. Use pool name where a pool is being sequenced.","The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample. Therefore it's critical that the SM field be correctly specified, especially when using multi-sample tools like the Unified Genotyper (a GATK component)."
+ "PL","Important. Not currently used in the GATK, but was in the past, and may return. The only way to know the sequencing technology used to generate the sequencing data.","Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.","It's a good idea to use this field."
+ "LB","Essential for MarkDuplicates","DNA preparation library identifier","MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes."
+
+**Example of Read Group usage**
+
+Suppose we have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, we would create 12 BAM files, with the following @RG fields in the header::
+
+ Dad's data:
+ @RG ID:FLOWCELL1.LANE1 PL:illumina LB:LIB-DAD-1 SM:DAD PI:200
+ @RG ID:FLOWCELL1.LANE2 PL:illumina LB:LIB-DAD-1 SM:DAD PI:200
+ @RG ID:FLOWCELL1.LANE3 PL:illumina LB:LIB-DAD-2 SM:DAD PI:400
+ @RG ID:FLOWCELL1.LANE4 PL:illumina LB:LIB-DAD-2 SM:DAD PI:400
+
+ Mom's data:
+ @RG ID:FLOWCELL1.LANE5 PL:illumina LB:LIB-MOM-1 SM:MOM PI:200
+ @RG ID:FLOWCELL1.LANE6 PL:illumina LB:LIB-MOM-1 SM:MOM PI:200
+ @RG ID:FLOWCELL1.LANE7 PL:illumina LB:LIB-MOM-2 SM:MOM PI:400
+ @RG ID:FLOWCELL1.LANE8 PL:illumina LB:LIB-MOM-2 SM:MOM PI:400
+
+ Kid's data:
+ @RG ID:FLOWCELL2.LANE1 PL:illumina LB:LIB-KID-1 SM:KID PI:200
+ @RG ID:FLOWCELL2.LANE2 PL:illumina LB:LIB-KID-1 SM:KID PI:200
+ @RG ID:FLOWCELL2.LANE3 PL:illumina LB:LIB-KID-2 SM:KID PI:400
+ @RG ID:FLOWCELL2.LANE4 PL:illumina LB:LIB-KID-2 SM:KID PI:400
+
+Note the hierarchical relationship from read groups (unique for each lane) to libraries (each sequenced on two lanes) to samples (each spanning four lanes, two lanes per library).
+
+**Picard documentation**
+
+This is a Galaxy wrapper for AddOrReplaceReadGroups, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+------
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+Either a sam file or a bam file must be supplied. If a bam file is used, it must
+be coordinate-sorted. Galaxy currently coordinate-sorts all bam files.
+
+The output file is either bam (the default) or sam, according to user selection,
+and contains the same information as the input file except for the appropriate
+additional (or modified) read group tags. Bam is recommended since it is smaller.
+
+From the Picard documentation.
+
+AddOrReplaceReadGroups REQUIRED parameters::
+
+ Option (Type) Description
+
+ RGLB=String Read Group Library
+ RGPL=String Read Group platform (e.g. illumina, solid)
+ RGPU=String Read Group platform unit (e.g. run barcode)
+ RGSM=String Read Group sample name
+ RGID=String Read Group ID; Default value: null (empty)
+
+AddOrReplaceReadGroups OPTIONAL parameters::
+
+ Option (Type) Description
+
+ RGCN=String Read Group sequencing center name; Default value: null (empty)
+ RGDS=String Read Group description; Default value: null (empty)
+
+Galaxy automatically sets one further AddOrReplaceReadGroups parameter,
+SORT_ORDER, to coordinate.
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
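The read group hierarchy described in the AddOrReplaceReadGroups help above (one read group per lane, two lanes per library, two libraries per sample) can be sketched in Python. This is only an illustration and not part of the wrapper: the helper names `trio_read_groups` and `format_rg` and the dict layout are assumptions made for the example.

```python
from collections import defaultdict

def trio_read_groups():
    """Rebuild the 12 @RG records of the MOM/DAD/KID example (hypothetical helper)."""
    # (sample, flowcell, first lane), as laid out in the help text.
    layout = [("DAD", "FLOWCELL1", 1), ("MOM", "FLOWCELL1", 5), ("KID", "FLOWCELL2", 1)]
    groups = []
    for sample, flowcell, first_lane in layout:
        lane = first_lane
        # In the example, library 1 carries PI:200 and library 2 carries PI:400.
        for lib_index, insert_size in ((1, 200), (2, 400)):
            for _ in range(2):  # each library is run on two lanes
                groups.append({
                    "ID": "%s.LANE%d" % (flowcell, lane),
                    "PL": "illumina",
                    "LB": "LIB-%s-%d" % (sample, lib_index),
                    "SM": sample,
                    "PI": insert_size,
                })
                lane += 1
    return groups

def format_rg(rg):
    """Render one read group as a tab-separated @RG header line."""
    return "\t".join(["@RG"] + ["%s:%s" % (k, rg[k]) for k in ("ID", "PL", "LB", "SM", "PI")])

groups = trio_read_groups()
lanes_per_library = defaultdict(int)
lanes_per_sample = defaultdict(int)
for rg in groups:
    lanes_per_library[rg["LB"]] += 1
    lanes_per_sample[rg["SM"]] += 1

for rg in groups:
    print(format_rg(rg))
```

Counting lanes per `LB` and per `SM` makes the hierarchy explicit: every library maps to two read groups and every sample to four.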
diff -r 000000000000 -r ff4ec13e496e picard_BamIndexStats.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_BamIndexStats.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,118 @@
+
+ picard
+
+ picard_wrapper.py
+ --input "$input_file"
+ --bai-file "$input_file.metadata.bam_index"
+ -t "$htmlfile"
+ -d "$htmlfile.files_path"
+ -j "\$JAVA_JAR_PATH/BamIndexStats.jar"
+ --tmpdir "${__new_file_path__}"
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Generate Bam Index Stats for a provided BAM file.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for BamIndexStats, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+------
+
+.. class:: infomark
+
+**Inputs and outputs**
+
+The only input is the BAM file you wish to obtain statistics for, which is required.
+Note that it must be coordinate-sorted. Galaxy currently coordinate-sorts all BAM files.
+
+This tool outputs an HTML file that contains links to the actual metrics results, as well
+as a log file with info on the exact command run.
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+------
+
+**Example**
+
+Given a BAM file created from the following::
+
+ @HD VN:1.0 SO:coordinate
+ @SQ SN:chr1 LN:101
+ @SQ SN:chr7 LN:404
+ @SQ SN:chr8 LN:202
+ @SQ SN:chr10 LN:303
+ @SQ SN:chr14 LN:505
+ @RG ID:0 SM:Hi,Mom!
+ @RG ID:1 SM:samplesample DS:ClearDescription
+ @PG ID:1 PN:Hey! VN:2.0
+ @CO Just a generic comment to make the header longer
+ read1 83 chr7 1 255 101M = 302 201 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))II'I*/)-I*-)I.-)I)I),/-II..)./.,.).*II,I.II-)III0*IIIIIIII/32/,01460II/6/*0*/2/283//36868/I RG:Z:0
+ read2 89 chr7 1 255 101M * 0 0 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))II'I*/)-I*-)I.-)I)I),/-II..)./.,.).*II,I.II-)III0*IIIIIIII/32/,01460II/6/*0*/2/283//36868/I RG:Z:0
+ read3 83 chr7 1 255 101M = 302 201 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))II'I*/)-I*-)I.-)I)I),/-II..)./.,.).*II,I.II-)III0*IIIIIIII/32/,01460II/6/*0*/2/283//36868/I RG:Z:0
+ read4 147 chr7 16 255 101M = 21 -96 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))II'I*/)-I*-)I.-)I)I),/-II..)./.,.).*II,I.II-)III0*IIIIIIII/32/,01460II/6/*0*/2/283//36868/I RG:Z:0
+ read5 99 chr7 21 255 101M = 16 96 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))II'I*/)-I*-)I.-)I)I),/-II..)./.,.).*II,I.II-)III0*IIIIIIII/32/,01460II/6/*0*/2/283//36868/I RG:Z:0
+ read6 163 chr7 302 255 101M = 1 -201 NCGCGGCATCNCGATTTCTTTCCGCAGCTAACCTCCCGACAGATCGGCAGCGCGTCGTGTAGGTTATTATGGTACATCTTGTCGTGCGGCNAGAGCATACA I/15445666651/566666553+2/14/I/555512+3/)-'/-I-'*+))*''13+3)'//++''/'))/3+I*5++)I'2+I+/*I-II*)I-./1'1 RG:Z:0
+ read7 163 chr7 302 255 10M1D10M5I76M = 1 -201 NCGCGGCATCNCGATTTCTTTCCGCAGCTAACCTCCCGACAGATCGGCAGCGCGTCGTGTAGGTTATTATGGTACATCTTGTCGTGCGGCNAGAGCATACA I/15445666651/566666553+2/14/I/555512+3/)-'/-I-'*+))*''13+3)'//++''/'))/3+I*5++)I'2+I+/*I-II*)I-./1'1 RG:Z:0
+ read8 165 * 0 0 * chr7 1 0 NCGCGGCATCNCGATTTCTTTCCGCAGCTAACCTCCCGACAGATCGGCAGCGCGTCGTGTAGGTTATTATGGTACATCTTGTCGTGCGGCNAGAGCATACA I/15445666651/566666553+2/14/I/555512+3/)-'/-I-'*+))*''13+3)'//++''/'))/3+I*5++)I'2+I+/*I-II*)I-./1'1 RG:Z:0
+
+The following metrics file will be produced::
+
+ chr1 length= 101 Aligned= 0 Unaligned= 0
+ chr7 length= 404 Aligned= 7 Unaligned= 0
+ chr8 length= 202 Aligned= 0 Unaligned= 0
+ chr10 length= 303 Aligned= 0 Unaligned= 0
+ chr14 length= 505 Aligned= 0 Unaligned= 0
+ NoCoordinateCount= 1
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
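The figures in the BamIndexStats example above can be reproduced with a simplified tally over the FLAG and RNAME columns. This is a sketch under stated assumptions: Picard obtains these counts from the BAM index rather than by scanning records, and the helper name `index_stats` is hypothetical.

```python
from collections import OrderedDict

UNMAPPED_FLAG = 0x4  # SAM flag bit: the read itself is unmapped

def index_stats(references, records):
    """Tally reads per reference from (flag, rname) pairs (hypothetical helper)."""
    stats = OrderedDict((name, {"length": length, "aligned": 0, "unaligned": 0})
                        for name, length in references)
    no_coordinate = 0
    for flag, rname in records:
        if rname == "*":
            no_coordinate += 1  # read is not placed on any reference
        elif flag & UNMAPPED_FLAG:
            stats[rname]["unaligned"] += 1  # unmapped, but placed on a reference
        else:
            stats[rname]["aligned"] += 1
    return stats, no_coordinate

references = [("chr1", 101), ("chr7", 404), ("chr8", 202), ("chr10", 303), ("chr14", 505)]
# FLAG and RNAME columns of read1..read8 from the example above.
records = [(83, "chr7"), (89, "chr7"), (83, "chr7"), (147, "chr7"),
           (99, "chr7"), (163, "chr7"), (163, "chr7"), (165, "*")]

stats, no_coordinate = index_stats(references, records)
for name, s in stats.items():
    print("%s length= %d Aligned= %d Unaligned= %d"
          % (name, s["length"], s["aligned"], s["unaligned"]))
print("NoCoordinateCount= %d" % no_coordinate)
```

Only read8 has the unmapped bit set and no reference name, which is why it is the single entry in NoCoordinateCount.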
diff -r 000000000000 -r ff4ec13e496e picard_FastqToSam.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_FastqToSam.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,145 @@
+
+ creates an unaligned BAM file
+ picard
+
+ java -XX:DefaultMaxRAMFraction=1 -XX:+UseParallelGC
+ -jar "\$JAVA_JAR_PATH/FastqToSam.jar"
+ FASTQ="${input_fastq1}"
+ #if str( $input_fastq2) != "None":
+ FASTQ2="${input_fastq2}"
+ #end if
+ QUALITY_FORMAT="${ dict( fastqsanger='Standard', fastqcssanger='Standard', fastqillumina='Illumina', fastqsolexa='Solexa' )[ $input_fastq1.ext ] }" ##Solexa, Illumina, Standard
+ OUTPUT="${output_bam}"
+ READ_GROUP_NAME="${read_group_name}"
+ SAMPLE_NAME="${sample_name}"
+ #if $param_type.param_type_selector == "advanced":
+ #if str( $param_type.library_name ) != "":
+ LIBRARY_NAME="${param_type.library_name}"
+ #end if
+ #if str( $param_type.platform_unit ) != "":
+ PLATFORM_UNIT="${param_type.platform_unit}"
+ #end if
+ #if str( $param_type.platform ) != "":
+ PLATFORM="${param_type.platform}"
+ #end if
+ #if str( $param_type.sequencing_center ) != "":
+ SEQUENCING_CENTER="${param_type.sequencing_center}"
+ #end if
+ #if str( $param_type.predicted_insert_size ) != "":
+ PREDICTED_INSERT_SIZE="${param_type.predicted_insert_size}"
+ #end if
+ #if str( $param_type.description.value ) != "":
+ DESCRIPTION="${param_type.description}"
+ #end if
+ #if str( $param_type.run_date ) != "":
+ RUN_DATE="${param_type.run_date}"
+ #end if
+ #if str( $param_type.min_q ) != "":
+ MIN_Q="${param_type.min_q}"
+ #end if
+ #if str( $param_type.max_q ) != "":
+ MAX_Q="${param_type.max_q}"
+ #end if
+ SORT_ORDER="${param_type.sort_order}"
+ #else:
+ SORT_ORDER=coordinate ##unsorted, queryname, coordinate; always use coordinate
+ #end if
+ 2>&1
+ || echo "Error running Picard FastqToSam" >&2
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+**What it does**
+
+Picard: FastqToSam converts FASTQ files to unaligned BAM files.
+
+------
+
+Please cite the website "http://picard.sourceforge.net".
+
+------
+
+
+**Input formats**
+
+FastqToSam accepts FASTQ input files. If using paired-end data, you should select two FASTQ files.
+
+------
+
+**Outputs**
+
+The output is in BAM format, see http://samtools.sourceforge.net for more details.
+
+-------
+
+**FastqToSam settings**
+
+This is a list of FastqToSam options::
+
+ READ_GROUP_NAME=String Read group name Default value: A. This option can be set to 'null' to clear the default value.
+ SAMPLE_NAME=String Sample name to insert into the read group header Required.
+ LIBRARY_NAME=String The library name to place into the LB attribute in the read group header Default value: null.
+ PLATFORM_UNIT=String The platform unit (often run_barcode.lane) to insert into the read group header Default value: null.
+ PLATFORM=String The platform type (e.g. illumina, solid) to insert into the read group header Default value: null.
+ SEQUENCING_CENTER=String The sequencing center from which the data originated Default value: null.
+ PREDICTED_INSERT_SIZE=Integer Predicted median insert size, to insert into the read group header Default value: null.
+ DESCRIPTION=String Inserted into the read group header Default value: null.
+
+
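The FastqToSam command template above maps the Galaxy datatype of the first FASTQ input to a Picard QUALITY_FORMAT and emits each advanced option only when it was filled in. A minimal Python sketch of that logic follows; the function name `fastq_to_sam_args` and the `advanced` dict are assumptions for illustration, not the wrapper's actual interface.

```python
# Galaxy datatype extension -> Picard QUALITY_FORMAT, as in the command template.
QUALITY_FORMATS = {
    "fastqsanger": "Standard",
    "fastqcssanger": "Standard",
    "fastqillumina": "Illumina",
    "fastqsolexa": "Solexa",
}

def fastq_to_sam_args(fastq1, ext, output_bam, read_group_name, sample_name,
                      fastq2=None, advanced=None):
    """Return KEY=VALUE arguments for FastqToSam (hypothetical helper)."""
    args = ["FASTQ=%s" % fastq1]
    if fastq2:
        args.append("FASTQ2=%s" % fastq2)
    args += [
        "QUALITY_FORMAT=%s" % QUALITY_FORMATS[ext],
        "OUTPUT=%s" % output_bam,
        "READ_GROUP_NAME=%s" % read_group_name,
        "SAMPLE_NAME=%s" % sample_name,
    ]
    if advanced is None:
        # Basic mode: the template always requests coordinate order.
        args.append("SORT_ORDER=coordinate")
        return args
    # Advanced mode: emit only the options the user actually filled in.
    for key in ("LIBRARY_NAME", "PLATFORM_UNIT", "PLATFORM", "SEQUENCING_CENTER",
                "PREDICTED_INSERT_SIZE", "DESCRIPTION", "RUN_DATE", "MIN_Q", "MAX_Q"):
        value = advanced.get(key, "")
        if str(value) != "":
            args.append("%s=%s" % (key, value))
    args.append("SORT_ORDER=%s" % advanced["SORT_ORDER"])
    return args
```

Each option is tested against its own value, so MAX_Q is emitted exactly when a maximum quality was supplied, independently of MIN_Q.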
diff -r 000000000000 -r ff4ec13e496e picard_ReorderSam.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_ReorderSam.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,166 @@
+
+ picard
+
+ picard_wrapper.py
+ --input=$inputFile
+ #if $source.indexSource == "built-in"
+ --ref="${source.ref.fields.path}"
+ #else
+ --ref-file=$refFile
+ --species-name=$source.speciesName
+ --build-name=$source.buildName
+ --trunc-names=$source.truncateSeqNames
+ #end if
+ --allow-inc-dict-concord=$allowIncDictConcord
+ --allow-contig-len-discord=$allowContigLenDiscord
+ --output-format=$outputFormat
+ --output=$outFile
+ --tmpdir "${__new_file_path__}"
+ -j "\$JAVA_JAR_PATH/ReorderSam.jar"
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Reorder SAM/BAM to match contig ordering in a particular reference file. Note that this is
+not the same as sorting as done by the SortSam tool, which sorts by either coordinate
+values or query name. The ordering in ReorderSam is based on exact name matching of
+contigs/chromosomes. Reads that are mapped to a contig that is not in the new reference file are
+not included in the output.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for ReorderSam, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+------
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+For the file that needs to be reordered, either a sam file or a bam file must be supplied.
+If a bam file is used, it must be coordinate-sorted. A reference file is also required,
+so either a fasta file should be supplied or a built-in reference can be selected.
+
+The output contains the same reads as the input file but the reads have been rearranged so
+they appear in the same order as the provided reference file. The tool will output either
+bam (the default) or sam, according to user selection. Bam is recommended since it is smaller.
+
+The only extra parameters that can be set are flags for allowing incomplete dict concordance
+and allowing contig length discordance. If incomplete dict concordance is allowed, only a
+partial overlap of the bam contigs with the new reference sequence contigs is required. By
+default it is off, requiring a corresponding contig in the new reference for each read contig.
+If contig length discordance is allowed, contig names that are the same between a read and the
+new reference contig are allowed even if they have different lengths. This is usually not a
+good idea, unless you know exactly what you're doing. It's off by default.
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
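The ReorderSam behaviour described in the help above (order by contig name, drop reads on contigs missing from the new reference, and the two concordance flags) can be sketched as follows. This illustrates the documented rules only and is not Picard's implementation; `reorder_reads` and the tuple layout are hypothetical, and unmapped reads are not handled.

```python
def reorder_reads(reads, old_contigs, new_contigs,
                  allow_incomplete_concordance=False,
                  allow_length_discordance=False):
    """Reorder (contig, position, name) reads to follow new_contigs (hypothetical helper).

    old_contigs and new_contigs are lists of (name, length) in dictionary order.
    """
    new_lengths = dict(new_contigs)
    for name, length in old_contigs:
        if name not in new_lengths:
            # By default every input contig must exist in the new reference.
            if not allow_incomplete_concordance:
                raise ValueError("contig %s is missing from the new reference" % name)
        elif new_lengths[name] != length and not allow_length_discordance:
            raise ValueError("contig %s has a different length in the new reference" % name)
    rank = {name: i for i, (name, _) in enumerate(new_contigs)}
    # Reads on contigs absent from the new reference are dropped.
    kept = [r for r in reads if r[0] in rank]
    # sorted() is stable, so reads keep their relative order within a contig.
    return sorted(kept, key=lambda r: rank[r[0]])
```

Matching is by exact contig name, so `chr1` and `1` would count as different contigs in this sketch, as they do in the help text's description.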
diff -r 000000000000 -r ff4ec13e496e picard_ReplaceSamHeader.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_ReplaceSamHeader.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,115 @@
+
+ picard
+
+ picard_wrapper.py
+ --input "$inputFile"
+ -o $outFile
+ --header-file $headerFile
+ --output-format $outputFormat
+ -j "\$JAVA_JAR_PATH/ReplaceSamHeader.jar"
+ --tmpdir "${__new_file_path__}"
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Replace the header of a sam or bam file with the header from another file. The tool
+does no significant validation, so it is up to the user to make sure that the new
+header is relevant to the records and contains every element they refer to.
+
+Replace the SAMFileHeader in a SAM file with the given header. Validation is
+minimal. It is up to the user to ensure that all the elements referred to in the
+SAMRecords are present in the new header. Sort order of the two input files must
+be the same.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for ReplaceSamHeader, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+------
+
+.. class:: infomark
+
+**Inputs and outputs**
+
+Either a sam file or a bam file is required as the file whose header will be replaced.
+The header file is also required and can also be either sam or bam (it does not have
+to be the same type as the other file). In both cases, if a bam file is used, it must
+be coordinate-sorted. Galaxy currently coordinate-sorts all bam files.
+
+The tool will output either bam (the default) or sam. Bam is recommended since it is smaller.
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
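Because ReplaceSamHeader validates so little, it can help to check beforehand that the new header defines every reference sequence and read group the records use. The sketch below is one possible pre-check, not something the wrapper or Picard performs; `missing_header_elements` and the `(rname, rg_id)` record layout are assumptions for the example.

```python
def missing_header_elements(header_lines, records):
    """Return reference and read group names used by records but absent from the header.

    header_lines are tab-separated SAM header lines; records are (rname, rg_id)
    pairs, where "*" or None means no reference and None means no read group.
    """
    sequences, read_groups = set(), set()
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        # Collect TAG:VALUE pairs; free-text fields without a colon are skipped.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@SQ" and "SN" in tags:
            sequences.add(tags["SN"])
        elif fields[0] == "@RG" and "ID" in tags:
            read_groups.add(tags["ID"])
    missing_refs = sorted({r for r, _ in records if r not in ("*", None)} - sequences)
    missing_rgs = sorted({g for _, g in records if g is not None} - read_groups)
    return missing_refs, missing_rgs
```

Two empty lists mean the header covers the records as far as @SQ and @RG are concerned; the sort-order requirement still has to be checked separately.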
diff -r 000000000000 -r ff4ec13e496e picard_SamToFastq.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_SamToFastq.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,189 @@
+
+ creates a FASTQ file
+ picard
+
+ picard_SamToFastq_wrapper.py
+ -p '
+ java -XX:DefaultMaxRAMFraction=1 -XX:+UseParallelGC
+ -jar "\$JAVA_JAR_PATH/SamToFastq.jar"
+ INPUT="${input_sam}"
+ VALIDATION_STRINGENCY="LENIENT"
+ RE_REVERSE=${re_reverse}
+ INCLUDE_NON_PF_READS=${include_non_pf_reads}
+ #if str( $clipping_attribute ):
+ CLIPPING_ATTRIBUTE="${clipping_attribute}"
+ #end if
+ #if str( $clipping_action ):
+ CLIPPING_ACTION="${clipping_action}"
+ #end if
+ #if str( $read1_trim ):
+ READ1_TRIM="${read1_trim}"
+ #end if
+ #if str( $read1_max_bases_to_write ):
+ READ1_MAX_BASES_TO_WRITE="${read1_max_bases_to_write}"
+ #end if
+ INCLUDE_NON_PRIMARY_ALIGNMENTS=${include_non_primary_alignments}
+
+ #if str( $output_per_read_group_selector ) == 'per_sam_file':
+ ##OUTPUT_PER_RG=false
+ FASTQ="${output_fastq1}"
+
+ #if str( $single_paired_end_type.single_paired_end_type_selector ) == 'paired':
+ SECOND_END_FASTQ="${output_fastq2}"
+ #if str( $single_paired_end_type.read2_trim ):
+ READ2_TRIM="${single_paired_end_type.read2_trim}"
+ #end if
+ #if str( $single_paired_end_type.read2_max_bases_to_write ):
+ READ2_MAX_BASES_TO_WRITE="${single_paired_end_type.read2_max_bases_to_write}"
+ #end if
+ #end if
+ '
+ #else:
+ OUTPUT_PER_RG=true
+ #if str( $single_paired_end_type.single_paired_end_type_selector ) == 'paired':
+ '
+ --read_group_file_2 "${output_fastq2}"
+ --file_id_2 "${output_fastq2.id}"
+ -p '
+ #if str( $single_paired_end_type.read2_trim ):
+ READ2_TRIM="${single_paired_end_type.read2_trim}"
+ #end if
+ #if str( $single_paired_end_type.read2_max_bases_to_write ):
+ READ2_MAX_BASES_TO_WRITE="${single_paired_end_type.read2_max_bases_to_write}"
+ #end if
+ #end if
+ '
+ --read_group_file_1 "${output_fastq1}"
+ --new_files_path "${$__new_file_path__}"
+ --file_id_1 "${output_fastq1.id}"
+ #end if
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ single_paired_end_type['single_paired_end_type_selector'] == 'paired'
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+**What it does**
+
+Picard: SamToFastq converts SAM files to FASTQ files.
+
+Extracts read sequences and qualities from the input SAM/BAM file and writes them into the output file in Sanger fastq format. In the RC mode (default is True), if the read is aligned and the alignment is to the reverse strand on the genome, the read's sequence from the input SAM file will be reverse-complemented prior to writing it to fastq, in order to correctly restore the original read sequence as it was generated by the sequencer.
+
+------
+
+Please cite the website "http://picard.sourceforge.net".
+
+------
+
+
+**Input formats**
+
+SamToFastq accepts SAM input files, see http://samtools.sourceforge.net for more details.
+
+------
+
+**Outputs**
+
+The output is in FASTQ format. For paired-end data, two FASTQ files are created.
+
+-------
+
+**SamToFastq settings**
+
+This is a list of SamToFastq options::
+
+ INPUT=File Input SAM/BAM file to extract reads from Required.
+ FASTQ=File Output fastq file (single-end fastq or, if paired, first end of the pair fastq). Required. Cannot be used in conjunction with option(s) OUTPUT_PER_RG (OPRG)
+ SECOND_END_FASTQ=File Output fastq file (if paired, second end of the pair fastq). Default value: null. Cannot be used in conjunction with option(s) OUTPUT_PER_RG (OPRG)
+ OUTPUT_PER_RG=Boolean Output a fastq file per read group (two fastq files per read group if the group is paired). Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false} Cannot be used in conjunction with option(s) SECOND_END_FASTQ (F2) FASTQ (F)
+ OUTPUT_DIR=File Directory in which to output the fastq file(s). Used only when OUTPUT_PER_RG is true. Default value: null.
+ RE_REVERSE=Boolean Re-reverse bases and qualities of reads with negative strand flag set before writing them to fastq Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
+ INCLUDE_NON_PF_READS=Boolean Include non-PF reads from the SAM file into the output FASTQ files. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
+ CLIPPING_ATTRIBUTE=String The attribute that stores the position at which the SAM record should be clipped Default value: null.
+ CLIPPING_ACTION=String The action that should be taken with clipped reads: 'X' means the reads and qualities should be trimmed at the clipped position; 'N' means the bases should be changed to Ns in the clipped region; and any integer means that the base qualities should be set to that value in the clipped region. Default value: null.
+ READ1_TRIM=Integer The number of bases to trim from the beginning of read 1. Default value: 0. This option can be set to 'null' to clear the default value.
+ READ1_MAX_BASES_TO_WRITE=Integer The maximum number of bases to write from read 1 after trimming. If there are fewer than this many bases left after trimming, all will be written. If this value is null then all bases left after trimming will be written. Default value: null.
+ READ2_TRIM=Integer The number of bases to trim from the beginning of read 2. Default value: 0. This option can be set to 'null' to clear the default value.
+ READ2_MAX_BASES_TO_WRITE=Integer The maximum number of bases to write from read 2 after trimming. If there are fewer than this many bases left after trimming, all will be written. If this value is null then all bases left after trimming will be written. Default value: null.
+ INCLUDE_NON_PRIMARY_ALIGNMENTS=Boolean If true, include non-primary alignments in the output. Support of non-primary alignments in SamToFastq is not comprehensive, so there may be exceptions if this is set to true and there are paired reads with non-primary alignments. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
+
+
+
+
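The RE_REVERSE behaviour described in the SamToFastq help above amounts to reverse-complementing the bases and reversing the qualities of any read whose reverse-strand flag is set. A minimal sketch, assuming plain ACGTN sequences; the helper names `original_read` and `fastq_record` are hypothetical and this is not Picard's code.

```python
REVERSE_STRAND_FLAG = 0x10  # SAM flag bit: read is aligned to the reverse strand

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def original_read(flag, seq, qual, re_reverse=True):
    """Return (seq, qual) as the sequencer produced them (hypothetical helper)."""
    if re_reverse and flag & REVERSE_STRAND_FLAG:
        seq = "".join(_COMPLEMENT[base] for base in reversed(seq))
        qual = qual[::-1]  # qualities are reversed, not complemented
    return seq, qual

def fastq_record(name, seq, qual):
    """Format one four-line FASTQ record."""
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual)
```

For instance, flag 83 includes the reverse-strand bit while flag 99 does not, so only a read with flag 83 would be changed by this function.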
diff -r 000000000000 -r ff4ec13e496e picard_SamToFastq_wrapper.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_SamToFastq_wrapper.py Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,93 @@
+#!/usr/bin/env python
+#Dan Blankenberg
+
+"""
+A wrapper script for running the Picard SamToFastq command. Allows parsing read groups into separate files.
+"""
+
+import sys, optparse, os, tempfile, subprocess, shutil
+
+CHUNK_SIZE = 2**20 #1mb
+
+
+def cleanup_before_exit( tmp_dir ):
+    if tmp_dir and os.path.exists( tmp_dir ):
+        shutil.rmtree( tmp_dir )
+
+def open_file_from_option( filename, mode = 'rb' ):
+    if filename:
+        return open( filename, mode = mode )
+    return None
+
+def __main__():
+    #Parse Command Line
+    parser = optparse.OptionParser()
+    parser.add_option( '-p', '--pass_through', dest='pass_through_options', action='append', type="string", help='These options are passed through directly to PICARD, without any modification.' )
+    parser.add_option( '-1', '--read_group_file_1', dest='read_group_file_1', action='store', type="string", default=None, help='Read Group 1 output file, when using multiple readgroups' )
+    parser.add_option( '-2', '--read_group_file_2', dest='read_group_file_2', action='store', type="string", default=None, help='Read Group 2 output file, when using multiple readgroups and paired end' )
+    parser.add_option( '', '--stdout', dest='stdout', action='store', type="string", default=None, help='If specified, the output of stdout will be written to this file.' )
+    parser.add_option( '', '--stderr', dest='stderr', action='store', type="string", default=None, help='If specified, the output of stderr will be written to this file.' )
+    parser.add_option( '-n', '--new_files_path', dest='new_files_path', action='store', type="string", default=None, help='new_files_path')
+    parser.add_option( '-i', '--file_id_1', dest='file_id_1', action='store', type="string", default=None, help='file_id_1')
+    parser.add_option( '-f', '--file_id_2', dest='file_id_2', action='store', type="string", default=None, help='file_id_2')
+    (options, args) = parser.parse_args()
+
+    tmp_dir = tempfile.mkdtemp( prefix='tmp-picard-' )
+    if options.pass_through_options:
+        cmd = ' '.join( options.pass_through_options )
+    else:
+        cmd = ''
+    if options.new_files_path is not None:
+        print 'Creating FASTQ files by Read Group'
+        assert None not in [ options.read_group_file_1, options.new_files_path, options.file_id_1 ], 'When using read group aware, you need to specify --read_group_file_1, --read_group_file_2 (when paired end), --new_files_path, and --file_id'
+        cmd = '%s OUTPUT_DIR="%s"' % ( cmd, tmp_dir)
+    #set up stdout and stderr output options
+    stdout = open_file_from_option( options.stdout, mode = 'wb' )
+    if stdout is None:
+        stdout = sys.stdout
+    stderr = open_file_from_option( options.stderr, mode = 'wb' )
+    #if no stderr file is specified, we'll use our own
+    if stderr is None:
+        stderr = tempfile.NamedTemporaryFile( prefix="picard-stderr-", dir=tmp_dir )
+
+    proc = subprocess.Popen( args=cmd, stdout=stdout, stderr=stderr, shell=True, cwd=tmp_dir )
+    return_code = proc.wait()
+
+    if return_code:
+        stderr_target = sys.stderr
+    else:
+        stderr_target = sys.stdout
+    stderr.flush()
+    stderr.seek(0)
+    while True:
+        chunk = stderr.read( CHUNK_SIZE )
+        if chunk:
+            stderr_target.write( chunk )
+        else:
+            break
+    stderr.close()
+    #if rg aware, put files where they belong
+    if options.new_files_path is not None:
+        fastq_1_name = options.read_group_file_1
+        fastq_2_name = options.read_group_file_2
+        file_id_1 = options.file_id_1
+        file_id_2 = options.file_id_2
+        if file_id_2 is None:
+            file_id_2 = file_id_1
+        for filename in sorted( os.listdir( tmp_dir ) ):
+            if filename.endswith( '_1.fastq' ):
+                if fastq_1_name:
+                    shutil.move( os.path.join( tmp_dir, filename ), fastq_1_name )
+                    fastq_1_name = None
+                else:
+                    shutil.move( os.path.join( tmp_dir, filename ), os.path.join( options.new_files_path, 'primary_%s_%s - 1_visible_fastqsanger' % ( file_id_1, filename[:-len( '_1.fastq' )] ) ) )
+            elif filename.endswith( '_2.fastq' ):
+                if fastq_2_name:
+                    shutil.move( os.path.join( tmp_dir, filename ), fastq_2_name )
+                    fastq_2_name = None
+                else:
+                    shutil.move( os.path.join( tmp_dir, filename ), os.path.join( options.new_files_path, 'primary_%s_%s - 2_visible_fastqsanger' % ( file_id_2, filename[:-len( '_2.fastq' )] ) ) )
+
+    cleanup_before_exit( tmp_dir )
+
+if __name__=="__main__": __main__()
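The read-group-aware branch at the end of picard_SamToFastq_wrapper.py decides where each per-read-group FASTQ file goes: the first `_1.fastq` and `_2.fastq` files fill the declared outputs, and every later file is renamed so Galaxy can pick it up as an extra dataset. The sketch below isolates that naming logic as a pure function in Python 3; `plan_moves` is a hypothetical helper that returns the planned moves instead of calling `shutil.move`.

```python
import os

def plan_moves(filenames, new_files_path, fastq_1_name, file_id_1,
               fastq_2_name=None, file_id_2=None):
    """Return (source name, destination path) pairs for per-read-group FASTQ files."""
    if file_id_2 is None:
        file_id_2 = file_id_1
    # Per mate number: [primary output path or None, dataset id].
    targets = {"1": [fastq_1_name, file_id_1], "2": [fastq_2_name, file_id_2]}
    moves = []
    for filename in sorted(filenames):
        for end, target in targets.items():
            suffix = "_%s.fastq" % end
            if not filename.endswith(suffix):
                continue
            if target[0]:
                # The first file for this mate fills the declared output dataset.
                dest, target[0] = target[0], None
            else:
                read_group = filename[:-len(suffix)]
                dest = os.path.join(
                    new_files_path,
                    "primary_%s_%s - %s_visible_fastqsanger" % (target[1], read_group, end))
            moves.append((filename, dest))
    return moves
```

Files that match neither suffix, such as the temporary stderr file, are ignored, as in the wrapper.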
diff -r 000000000000 -r ff4ec13e496e picard_wrapper.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/picard_wrapper.py Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,776 @@
+#!/usr/bin/env python
+"""
+Originally written by Kelly Vincent
+pretty output and additional picard wrappers by Ross Lazarus for rgenetics
+Runs all available wrapped Picard tools.
+usage: picard_wrapper.py [options]
+code Ross wrote licensed under the LGPL
+see http://www.gnu.org/copyleft/lesser.html
+"""
+
+import optparse, os, sys, subprocess, tempfile, shutil, time, logging
+
+galhtmlprefix = """
+
+
+
+
+
+
+
+
+
+
+"""
+galhtmlattr = """Galaxy tool %s run at %s """
+galhtmlpostfix = """
\n"""
+
+
+def stop_err( msg ):
+    sys.stderr.write( '%s\n' % msg )
+    sys.exit()
+
+
+def timenow():
+    """return current time as a string
+    """
+    return time.strftime('%d/%m/%Y %H:%M:%S', time.localtime(time.time()))
+
+
+class PicardBase():
+ """
+ simple base class with some utilities for Picard
+ adapted and merged with Kelly Vincent's code april 2011 Ross
+ lots of changes...
+ """
+
+ def __init__(self, opts=None,arg0=None):
+ """ common stuff needed at init for a picard tool
+ """
+ assert opts <> None, 'PicardBase needs opts at init'
+ self.opts = opts
+ if self.opts.outdir == None:
+ self.opts.outdir = os.getcwd() # fixmate has no html file eg so use temp dir
+ assert self.opts.outdir <> None,'## PicardBase needs a temp directory if no output directory passed in'
+ self.picname = self.baseName(opts.jar)
+ if self.picname.startswith('picard'):
+ self.picname = opts.picard_cmd # special case for some tools like replaceheader?
+ self.progname = self.baseName(arg0)
+ self.version = '0.002'
+ self.delme = [] # list of files to destroy
+ self.title = opts.title
+ self.inputfile = opts.input
+ try:
+ os.makedirs(opts.outdir)
+ except:
+ pass
+ try:
+ os.makedirs(opts.tmpdir)
+ except:
+ pass
+ self.log_filename = os.path.join(self.opts.outdir,'%s.log' % self.picname)
+ self.metricsOut = os.path.join(opts.outdir,'%s.metrics.txt' % self.picname)
+ self.setLogging(logfname=self.log_filename)
+
+ def baseName(self,name=None):
+ return os.path.splitext(os.path.basename(name))[0]
+
+ def setLogging(self,logfname="picard_wrapper.log"):
+ """setup a logger
+ """
+ logging.basicConfig(level=logging.INFO,
+ filename=logfname,
+ filemode='a')
+
+
+ def readLarge(self,fname=None):
+ """ read a potentially huge file.
+ """
+ try:
+ # get stderr, allowing for case where it's very large
+ tmp = open( fname, 'rb' )
+ s = ''
+ buffsize = 1048576
+ try:
+ while True:
+ more = tmp.read( buffsize )
+ if len(more) > 0:
+ s += more
+ else:
+ break
+ except OverflowError:
+ pass
+ tmp.close()
+ except Exception, e:
+ stop_err( 'Read Large Exception : %s' % str( e ) )
+ return s
+
+ def runCL(self,cl=None,output_dir=None):
+ """ construct and run a command line
+ we have galaxy's temp path as opts.tmpdir so don't really need isolation
+ sometimes stdout is needed as the output - ugly hacks to deal with potentially vast artifacts
+ """
+ assert cl <> None, 'PicardBase runCL needs a command line as cl'
+ if output_dir == None:
+ output_dir = self.opts.outdir
+ if type(cl) == type([]):
+ cl = ' '.join(cl)
+ fd,templog = tempfile.mkstemp(dir=output_dir,suffix='rgtempRun.txt')
+ tlf = os.fdopen(fd,'wb') # reuse the descriptor from mkstemp instead of leaking it
+ fd,temperr = tempfile.mkstemp(dir=output_dir,suffix='rgtempErr.txt')
+ tef = os.fdopen(fd,'wb') # same for the stderr capture file
+ process = subprocess.Popen(cl, shell=True, stderr=tef, stdout=tlf, cwd=output_dir)
+ rval = process.wait()
+ tlf.close()
+ tef.close()
+ stderrs = self.readLarge(temperr)
+ stdouts = self.readLarge(templog)
+ if rval > 0:
+ s = '## executing %s returned status %d and stderr: \n%s\n' % (cl,rval,stderrs)
+ stdouts = '%s\n%s' % (stdouts,stderrs)
+ else:
+ s = '## executing %s returned status %d and nothing on stderr\n' % (cl,rval)
+ logging.info(s)
+ os.unlink(templog) # always
+ os.unlink(temperr) # always
+ return s, stdouts, rval # sometimes s is an output
+
+ def runPic(self, jar, cl):
+ """
+ cl should be everything after the jar file name in the command
+ """
+ runme = ['java -Xmx%s' % self.opts.maxjheap]
+ runme.append(" -Djava.io.tmpdir='%s' " % self.opts.tmpdir)
+ runme.append('-jar %s' % jar)
+ runme += cl
+ s,stdouts,rval = self.runCL(cl=runme, output_dir=self.opts.outdir)
+ return stdouts,rval
+
+ def samToBam(self,infile=None,outdir=None):
+ """
+ use samtools view to convert sam to bam
+ """
+ fd,tempbam = tempfile.mkstemp(dir=outdir,suffix='rgutilsTemp.bam')
+ cl = ['samtools view -h -b -S -o ',tempbam,infile]
+ tlog,stdouts,rval = self.runCL(cl,outdir)
+ return tlog,tempbam,rval
+
+ def sortSam(self, infile=None,outfile=None,outdir=None):
+ """
+ """
+ print '## sortSam got infile=%s,outfile=%s,outdir=%s' % (infile,outfile,outdir)
+ cl = ['samtools sort',infile,outfile]
+ tlog,stdouts,rval = self.runCL(cl,outdir)
+ return tlog
+
+ def cleanup(self):
+ for fname in self.delme:
+ try:
+ os.unlink(fname)
+ except:
+ pass
+
+ def prettyPicout(self,transpose,maxrows):
+ """organize picard outputs into a report html page
+ """
+ res = []
+ try:
+ r = open(self.metricsOut,'r').readlines()
+ except:
+ r = []
+ if len(r) > 0:
+ res.append('Picard on line resources
\n')
+ if transpose:
+ res.append('Picard output (transposed to make it easier to see)\n')
+ else:
+ res.append('Picard output\n')
+ res.append('
\n')
+ dat = []
+ heads = []
+ lastr = len(r) - 1
+ # special case for estimate library complexity hist
+ thist = False
+ for i,row in enumerate(r):
+ if row.strip() > '':
+ srow = row.split('\t')
+ if row.startswith('#'):
+ heads.append(row.strip()) # want strings
+ else:
+ dat.append(srow) # want lists
+ if row.startswith('## HISTOGRAM'):
+ thist = True
+ if len(heads) > 0:
+ hres = ['
%s
' % (i % 2,x) for i,x in enumerate(heads)]
+ res += hres
+ heads = []
+ if len(dat) > 0:
+ if transpose and not thist:
+ tdat = map(None,*dat) # transpose an arbitrary list of lists
+ tdat = ['
%s
%s
\n' % ((i+len(heads)) % 2,x[0],x[1]) for i,x in enumerate(tdat)]
+ else:
+ tdat = ['\t'.join(x).strip() for x in dat] # back to strings :(
+ tdat = ['
%s
\n' % ((i+len(heads)) % 2,x) for i,x in enumerate(tdat)]
+ res += tdat
+ dat = []
+ res.append('
\n')
+ return res
+
+ def fixPicardOutputs(self,transpose,maxloglines):
+ """
+ picard produces long hard to read tab header files
+ make them available but present them transposed for readability
+ """
+ logging.shutdown()
+ self.cleanup() # remove temp files stored in delme
+ rstyle=""""""
+ res = [rstyle,]
+ res.append(galhtmlprefix % self.progname)
+ res.append(galhtmlattr % (self.picname,timenow()))
+ flist = [x for x in os.listdir(self.opts.outdir) if not x.startswith('.')]
+ pdflist = [x for x in flist if os.path.splitext(x)[-1].lower() == '.pdf']
+ if len(pdflist) > 0: # assumes all pdfs come with thumbnail .jpgs
+ for p in pdflist:
+ pbase = os.path.splitext(p)[0] # removes .pdf
+ imghref = '%s.jpg' % pbase
+ mimghref = '%s-0.jpg' % pbase # multiple pages pdf -> multiple thumbnails without asking!
+ if mimghref in flist:
+ imghref=mimghref # only one for thumbnail...it's a multi page pdf
+ res.append('
')
+ if llen > maxloglines:
+ rlog.append('\n## WARNING - %d log lines truncated - %s contains entire output' % (llen - maxloglines,self.log_filename,self.log_filename))
+ res += rlog
+ else:
+ res.append("### Odd, Picard left no log file %s - must have really barfed badly?\n" % self.log_filename)
+ res.append('The freely available Picard software \n')
+ res.append( 'generated all outputs reported here running as a Galaxy tool')
+ res.append(galhtmlpostfix)
+ outf = open(self.opts.htmlout,'w')
+ outf.write(''.join(res))
+ outf.write('\n')
+ outf.close()
+
+ def makePicInterval(self,inbed=None,outf=None):
+ """
+ picard wants bait and target files to have the same header length as the incoming bam/sam
+ a meaningful (ie accurate) representation will fail because of this - so this hack
+ it would be far better to be able to supply the original bed untouched
+ Additional checking added Ross Lazarus Dec 2011 to deal with two 'bug' reports on the list
+ """
+ assert inbed <> None
+ bed = open(inbed,'r').readlines()
+ sbed = [x.split('\t') for x in bed] # lengths MUST be 5
+ lens = [len(x) for x in sbed]
+ strands = [x[3] for x in sbed if len(x) > 3 and not x[3] in ['+','-']] # short rows are reported by the field count checks below
+ maxl = max(lens)
+ minl = min(lens)
+ e = []
+ if maxl <> minl:
+ e.append("## Input error: Inconsistent field count in %s - please read the documentation on bait/target format requirements, fix and try again" % inbed)
+ if maxl <> 5:
+ e.append("## Input error: %d fields found in %s, 5 required - please read the warning and documentation on bait/target format requirements, fix and try again" % (maxl,inbed))
+ if len(strands) > 0:
+ e.append("## Input error: Fourth column in %s is not the required strand (+ or -) - please read the warning and documentation on bait/target format requirements, fix and try again" % (inbed))
+ if len(e) > 0: # write to stderr and quit
+ print >> sys.stderr, '\n'.join(e)
+ sys.exit(1)
+ thead = os.path.join(self.opts.outdir,'tempSamHead.txt')
+ if self.opts.datatype == 'sam':
+ cl = ['samtools view -H -S',self.opts.input,'>',thead]
+ else:
+ cl = ['samtools view -H',self.opts.input,'>',thead]
+ self.runCL(cl=cl,output_dir=self.opts.outdir)
+ head = open(thead,'r').readlines()
+ s = '## got %d rows of header\n' % (len(head))
+ logging.info(s)
+ o = open(outf,'w')
+ o.write(''.join(head))
+ o.write(''.join(bed))
+ o.close()
+ return outf
+
+ def cleanSam(self, insam=None, newsam=None, picardErrors=[],outformat=None):
+ """
+ interesting problem - if paired, must remove mate pair of errors too or we have a new set of errors after cleaning - missing mate pairs!
+ Do the work of removing all the error sequences
+ pysam is cool
+ infile = pysam.Samfile( "-", "r" )
+ outfile = pysam.Samfile( "-", "w", template = infile )
+ for s in infile: outfile.write(s)
+
+ errors from ValidateSamFile.jar look like
+ WARNING: Record 32, Read name SRR006041.1202260, NM tag (nucleotide differences) is missing
+ ERROR: Record 33, Read name SRR006041.1042721, Empty sequence dictionary.
+ ERROR: Record 33, Read name SRR006041.1042721, RG ID on SAMRecord not found in header: SRR006041
+
+ """
+ assert os.path.isfile(insam), 'rgPicardValidate cleansam needs an input sam file - cannot find %s' % insam
+ assert newsam <> None, 'rgPicardValidate cleansam needs an output new sam file path'
+ removeNames = [x.split(',')[1].replace(' Read name ','') for x in picardErrors if len(x.split(',')) > 2]
+ remDict = dict(zip(removeNames,range(len(removeNames))))
+ import pysam # local import: the module-level imports do not include pysam
+ infile = pysam.Samfile(insam,'rb')
+ info = 'found %d error sequences in picardErrors, %d unique' % (len(removeNames),len(remDict))
+ if len(removeNames) > 0:
+ outfile = pysam.Samfile(newsam,'wb',template=infile) # template must be an open file
+ i = 0
+ j = 0
+ for row in infile:
+ dropme = remDict.get(row.qname,None) # keep if None
+ if dropme is None: # index 0 is a valid hit, so test for None rather than falsiness
+ outfile.write(row)
+ j += 1
+ else: # discard
+ i += 1
+ info = '%s\n%s' % (info, 'Discarded %d lines writing %d to %s from %s' % (i,j,newsam,insam))
+ outfile.close()
+ infile.close()
+ else: # we really want a nullop or a simple pointer copy
+ infile.close()
+ if newsam:
+ shutil.copy(insam,newsam)
+ logging.info(info)
+
+
+
+def __main__():
+ doFix = False # tools returning htmlfile don't need this
+ doTranspose = True # default
+ maxloglines = 100 # default
+ #Parse Command Line
+ op = optparse.OptionParser()
+ # All tools
+ op.add_option('-i', '--input', dest='input', help='Input SAM or BAM file' )
+ op.add_option('-e', '--inputext', default=None)
+ op.add_option('-o', '--output', default=None)
+ op.add_option('-n', '--title', default="Pick a Picard Tool")
+ op.add_option('-t', '--htmlout', default=None)
+ op.add_option('-d', '--outdir', default=None)
+ op.add_option('-x', '--maxjheap', default='4g')
+ op.add_option('-b', '--bisulphite', default='false')
+ op.add_option('-s', '--sortorder', default='query')
+ op.add_option('','--tmpdir', default='/tmp')
+ op.add_option('-j','--jar',default='')
+ op.add_option('','--picard-cmd',default=None)
+ # Many tools
+ op.add_option( '', '--output-format', dest='output_format', help='Output format' )
+ op.add_option( '', '--bai-file', dest='bai_file', help='The path to the index file for the input bam file' )
+ op.add_option( '', '--ref', dest='ref', help='Built-in reference with fasta and dict file', default=None )
+ # CreateSequenceDictionary
+ op.add_option( '', '--ref-file', dest='ref_file', help='Fasta to use as reference', default=None )
+ op.add_option( '', '--species-name', dest='species_name', help='Species name to use in creating dict file from fasta file' )
+ op.add_option( '', '--build-name', dest='build_name', help='Name of genome assembly to use in creating dict file from fasta file' )
+ op.add_option( '', '--trunc-names', dest='trunc_names', help='Truncate sequence names at first whitespace from fasta file' )
+ # MarkDuplicates
+ op.add_option( '', '--remdups', default='true', help='Remove duplicates from output file' )
+ op.add_option( '', '--optdupdist', default="100", help='Maximum pixels between two identical sequences in order to consider them optical duplicates.' )
+ # CollectInsertSizeMetrics
+ op.add_option('', '--taillimit', default="0")
+ op.add_option('', '--histwidth', default="0")
+ op.add_option('', '--minpct', default="0.01")
+ op.add_option('', '--malevel', default='')
+ op.add_option('', '--deviations', default="0.0")
+ # CollectAlignmentSummaryMetrics
+ op.add_option('', '--maxinsert', default="20")
+ op.add_option('', '--adaptors', default='')
+ # FixMateInformation and validate
+ # CollectGcBiasMetrics
+ op.add_option('', '--windowsize', default='100')
+ op.add_option('', '--mingenomefrac', default='0.00001')
+ # AddOrReplaceReadGroups
+ op.add_option( '', '--rg-opts', dest='rg_opts', help='Specify extra (optional) arguments with full, otherwise preSet' )
+ op.add_option( '', '--rg-lb', dest='rg_library', help='Read Group Library' )
+ op.add_option( '', '--rg-pl', dest='rg_platform', help='Read Group platform (e.g. illumina, solid)' )
+ op.add_option( '', '--rg-pu', dest='rg_plat_unit', help='Read Group platform unit (eg. run barcode) ' )
+ op.add_option( '', '--rg-sm', dest='rg_sample', help='Read Group sample name' )
+ op.add_option( '', '--rg-id', dest='rg_id', help='Read Group ID' )
+ op.add_option( '', '--rg-cn', dest='rg_seq_center', help='Read Group sequencing center name' )
+ op.add_option( '', '--rg-ds', dest='rg_desc', help='Read Group description' )
+ # ReorderSam
+ op.add_option( '', '--allow-inc-dict-concord', dest='allow_inc_dict_concord', help='Allow incomplete dict concordance' )
+ op.add_option( '', '--allow-contig-len-discord', dest='allow_contig_len_discord', help='Allow contig length discordance' )
+ # ReplaceSamHeader
+ op.add_option( '', '--header-file', dest='header_file', help='sam or bam file from which header will be read' )
+
+ op.add_option('','--assumesorted', default='true')
+ op.add_option('','--readregex', default="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*")
+ #estimatelibrarycomplexity
+ op.add_option('','--minid', default="5")
+ op.add_option('','--maxdiff', default="0.03")
+ op.add_option('','--minmeanq', default="20")
+ #hsmetrics
+ op.add_option('','--baitbed', default=None)
+ op.add_option('','--targetbed', default=None)
+ #validate
+ op.add_option('','--ignoreflags', action='append', type="string")
+ op.add_option('','--maxerrors', default=None)
+ op.add_option('','--datatype', default=None)
+ op.add_option('','--bamout', default=None)
+ op.add_option('','--samout', default=None)
+
+ opts, args = op.parse_args()
+ opts.sortme = opts.assumesorted == 'false'
+ assert opts.input <> None
+ # need to add
+ # instance that does all the work
+ pic = PicardBase(opts,sys.argv[0])
+
+ tmp_dir = opts.outdir
+ haveTempout = False # we use this where sam output is an option
+ rval = 0
+ stdouts = 'Not run yet'
+ # set ref and dict files to use (create if necessary)
+ ref_file_name = opts.ref
+ if opts.ref_file <> None:
+ csd = 'CreateSequenceDictionary'
+ realjarpath = os.path.split(opts.jar)[0]
+ jarpath = os.path.join(realjarpath,'%s.jar' % csd) # for refseq
+ tmp_ref_fd, tmp_ref_name = tempfile.mkstemp( dir=opts.tmpdir , prefix = pic.picname)
+ ref_file_name = '%s.fasta' % tmp_ref_name
+ # build dict
+ dict_file_name = '%s.dict' % tmp_ref_name
+ os.symlink( opts.ref_file, ref_file_name )
+ cl = ['REFERENCE=%s' % ref_file_name]
+ cl.append('OUTPUT=%s' % dict_file_name)
+ cl.append('URI=%s' % os.path.basename( opts.ref_file ))
+ cl.append('TRUNCATE_NAMES_AT_WHITESPACE=%s' % opts.trunc_names)
+ if opts.species_name:
+ cl.append('SPECIES=%s' % opts.species_name)
+ if opts.build_name:
+ cl.append('GENOME_ASSEMBLY=%s' % opts.build_name)
+ pic.delme.append(dict_file_name)
+ pic.delme.append(ref_file_name)
+ pic.delme.append(tmp_ref_name)
+ stdouts,rval = pic.runPic(jarpath, cl)
+ # run relevant command(s)
+
+ # define temporary output
+ # if output is sam, it must have that extension, otherwise bam will be produced
+ # specify sam or bam file with extension
+ if opts.output_format == 'sam':
+ suff = '.sam'
+ else:
+ suff = ''
+ tmp_fd, tempout = tempfile.mkstemp( dir=opts.tmpdir, suffix=suff )
+
+ cl = ['VALIDATION_STRINGENCY=LENIENT',]
+
+ if pic.picname == 'AddOrReplaceReadGroups':
+ # sort order to match Galaxy's default
+ cl.append('SORT_ORDER=coordinate')
+ # input
+ cl.append('INPUT=%s' % opts.input)
+ # outputs
+ cl.append('OUTPUT=%s' % tempout)
+ # required read groups
+ cl.append('RGLB="%s"' % opts.rg_library)
+ cl.append('RGPL="%s"' % opts.rg_platform)
+ cl.append('RGPU="%s"' % opts.rg_plat_unit)
+ cl.append('RGSM="%s"' % opts.rg_sample)
+ if opts.rg_id:
+ cl.append('RGID="%s"' % opts.rg_id)
+ # optional read groups
+ if opts.rg_seq_center:
+ cl.append('RGCN="%s"' % opts.rg_seq_center)
+ if opts.rg_desc:
+ cl.append('RGDS="%s"' % opts.rg_desc)
+ stdouts,rval = pic.runPic(opts.jar, cl)
+ haveTempout = True
+
+ elif pic.picname == 'BamIndexStats':
+ tmp_fd, tmp_name = tempfile.mkstemp( dir=tmp_dir )
+ tmp_bam_name = '%s.bam' % tmp_name
+ tmp_bai_name = '%s.bai' % tmp_bam_name
+ os.symlink( opts.input, tmp_bam_name )
+ os.symlink( opts.bai_file, tmp_bai_name )
+ cl.append('INPUT=%s' % ( tmp_bam_name ))
+ pic.delme.append(tmp_bam_name)
+ pic.delme.append(tmp_bai_name)
+ pic.delme.append(tmp_name)
+ stdouts,rval = pic.runPic( opts.jar, cl )
+ f = open(pic.metricsOut,'a')
+ f.write(stdouts) # got this on stdout from runCl
+ f.write('\n')
+ f.close()
+ doTranspose = False # but not transposed
+
+ elif pic.picname == 'EstimateLibraryComplexity':
+ cl.append('I=%s' % opts.input)
+ cl.append('O=%s' % pic.metricsOut)
+ if float(opts.minid) > 0:
+ cl.append('MIN_IDENTICAL_BASES=%s' % opts.minid)
+ if float(opts.maxdiff) > 0.0:
+ cl.append('MAX_DIFF_RATE=%s' % opts.maxdiff)
+ if float(opts.minmeanq) > 0:
+ cl.append('MIN_MEAN_QUALITY=%s' % opts.minmeanq)
+ if opts.readregex > '':
+ cl.append('READ_NAME_REGEX="%s"' % opts.readregex)
+ if float(opts.optdupdist) > 0:
+ cl.append('OPTICAL_DUPLICATE_PIXEL_DISTANCE=%s' % opts.optdupdist)
+ stdouts,rval = pic.runPic(opts.jar, cl)
+
+ elif pic.picname == 'CollectAlignmentSummaryMetrics':
+ # Why do we do this fakefasta thing?
+ # Because we need NO fai to be available or picard barfs unless it matches the input data.
+ # why? Dunno Seems to work without complaining if the .bai file is AWOL....
+ fakefasta = os.path.join(opts.outdir,'%s_fake.fasta' % os.path.basename(ref_file_name))
+ try:
+ os.symlink(ref_file_name,fakefasta)
+ except:
+ s = '## unable to symlink %s to %s - different devices? Will shutil.copy' % (ref_file_name,fakefasta)
+ logging.info(s)
+ shutil.copy(ref_file_name,fakefasta)
+ pic.delme.append(fakefasta)
+ cl.append('ASSUME_SORTED=true')
+ adaptlist = opts.adaptors.split(',')
+ adaptorseqs = ['ADAPTER_SEQUENCE=%s' % x for x in adaptlist]
+ cl += adaptorseqs
+ cl.append('IS_BISULFITE_SEQUENCED=%s' % opts.bisulphite)
+ cl.append('MAX_INSERT_SIZE=%s' % opts.maxinsert)
+ cl.append('OUTPUT=%s' % pic.metricsOut)
+ cl.append('R=%s' % fakefasta)
+ cl.append('TMP_DIR=%s' % opts.tmpdir)
+ if not opts.assumesorted.lower() == 'true': # we need to sort input
+ sortedfile = '%s.sorted' % os.path.basename(opts.input)
+ if opts.datatype == 'sam': # need to work with a bam
+ tlog,tempbam,trval = pic.samToBam(opts.input,opts.outdir)
+ pic.delme.append(tempbam)
+ try:
+ tlog = pic.sortSam(tempbam,sortedfile,opts.outdir)
+ except:
+ print '## exception on sorting sam file %s' % opts.input
+ else: # is already bam
+ try:
+ tlog = pic.sortSam(opts.input,sortedfile,opts.outdir)
+ except : # bug - [bam_sort_core] not being ignored - TODO fixme
+ print '## exception %s on sorting bam file %s' % (sys.exc_info()[0],opts.input)
+ cl.append('INPUT=%s.bam' % os.path.abspath(os.path.join(opts.outdir,sortedfile)))
+ pic.delme.append(os.path.join(opts.outdir,sortedfile))
+ else:
+ cl.append('INPUT=%s' % os.path.abspath(opts.input))
+ stdouts,rval = pic.runPic(opts.jar, cl)
+
+
+ elif pic.picname == 'CollectGcBiasMetrics':
+ assert os.path.isfile(ref_file_name),'PicardGC needs a reference sequence - cannot read %s' % ref_file_name
+ # sigh. Why do we do this fakefasta thing? Because we need NO fai to be available or picard barfs unless it has the same length as the input data.
+ # why? Dunno
+ fakefasta = os.path.join(opts.outdir,'%s_fake.fasta' % os.path.basename(ref_file_name))
+ try:
+ os.symlink(ref_file_name,fakefasta)
+ except:
+ s = '## unable to symlink %s to %s - different devices? Falling back to shutil.copy' % (ref_file_name,fakefasta)
+ logging.info(s)
+ shutil.copy(ref_file_name,fakefasta)
+ pic.delme.append(fakefasta)
+ x = 'rgPicardGCBiasMetrics'
+ pdfname = '%s.pdf' % x
+ jpgname = '%s.jpg' % x
+ tempout = os.path.join(opts.outdir,'rgPicardGCBiasMetrics.out')
+ temppdf = os.path.join(opts.outdir,pdfname)
+ cl.append('R=%s' % fakefasta)
+ cl.append('WINDOW_SIZE=%s' % opts.windowsize)
+ cl.append('MINIMUM_GENOME_FRACTION=%s' % opts.mingenomefrac)
+ cl.append('INPUT=%s' % opts.input)
+ cl.append('OUTPUT=%s' % tempout)
+ cl.append('TMP_DIR=%s' % opts.tmpdir)
+ cl.append('CHART_OUTPUT=%s' % temppdf)
+ cl.append('SUMMARY_OUTPUT=%s' % pic.metricsOut)
+ stdouts,rval = pic.runPic(opts.jar, cl)
+ if os.path.isfile(temppdf):
+ cl2 = ['convert','-resize x400',temppdf,os.path.join(opts.outdir,jpgname)] # make the jpg for fixPicardOutputs to find
+ s,stdouts,rval = pic.runCL(cl=cl2,output_dir=opts.outdir)
+ else:
+ s='### runGC: Unable to find pdf %s - please check the log for the causal problem\n' % temppdf
+ lf = open(pic.log_filename,'a')
+ lf.write(s)
+ lf.write('\n')
+ lf.close()
+
+ elif pic.picname == 'CollectInsertSizeMetrics':
+ """
+ picard_wrapper.py -i "$input_file" -n "$out_prefix" --tmpdir "${__new_file_path__}" --deviations "$deviations"
+ --histwidth "$histWidth" --minpct "$minPct" --malevel "$malevel"
+ -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/picard/CollectInsertSizeMetrics.jar" -d "$html_file.files_path" -t "$html_file"
+
+ """
+ isPDF = 'InsertSizeHist.pdf'
+ pdfpath = os.path.join(opts.outdir,isPDF)
+ histpdf = 'InsertSizeHist.pdf'
+ cl.append('I=%s' % opts.input)
+ cl.append('O=%s' % pic.metricsOut)
+ cl.append('HISTOGRAM_FILE=%s' % histpdf)
+ #if opts.taillimit <> '0': # this was deprecated although still mentioned in the docs at 1.56
+ # cl.append('TAIL_LIMIT=%s' % opts.taillimit)
+ if opts.histwidth <> '0':
+ cl.append('HISTOGRAM_WIDTH=%s' % opts.histwidth)
+ if float( opts.minpct) > 0.0:
+ cl.append('MINIMUM_PCT=%s' % opts.minpct)
+ if float(opts.deviations) > 0.0:
+ cl.append('DEVIATIONS=%s' % opts.deviations)
+ if opts.malevel:
+ malists = opts.malevel.split(',')
+ malist = ['METRIC_ACCUMULATION_LEVEL=%s' % x for x in malists]
+ cl += malist
+ stdouts,rval = pic.runPic(opts.jar, cl)
+ if os.path.exists(pdfpath): # automake thumbnail - will be added to html
+ cl2 = ['mogrify', '-format jpg -resize x400 %s' % pdfpath]
+ pic.runCL(cl=cl2,output_dir=opts.outdir)
+ else:
+ s = 'Unable to find expected pdf file %s \n' % pdfpath
+ s += 'This always happens if single ended data was provided to this tool,\n'
+ s += 'so please double check that your input data really is paired-end NGS data. \n'
+ s += 'If your input was paired data this may be a bug worth reporting to the galaxy-bugs list\n '
+ logging.info(s)
+ if len(stdouts) > 0:
+ logging.info(stdouts)
+
+ elif pic.picname == 'MarkDuplicates':
+ # assume sorted even if header says otherwise
+ cl.append('ASSUME_SORTED=%s' % (opts.assumesorted))
+ # input
+ cl.append('INPUT=%s' % opts.input)
+ # outputs
+ cl.append('OUTPUT=%s' % opts.output)
+ cl.append('METRICS_FILE=%s' % pic.metricsOut )
+ # remove or mark duplicates
+ cl.append('REMOVE_DUPLICATES=%s' % opts.remdups)
+ # the regular expression to be used to parse reads in incoming SAM file
+ cl.append('READ_NAME_REGEX="%s"' % opts.readregex)
+ # maximum offset between two duplicate clusters
+ cl.append('OPTICAL_DUPLICATE_PIXEL_DISTANCE=%s' % opts.optdupdist)
+ stdouts,rval = pic.runPic(opts.jar, cl)
+
+ elif pic.picname == 'FixMateInformation':
+ cl.append('I=%s' % opts.input)
+ cl.append('O=%s' % tempout)
+ cl.append('SORT_ORDER=%s' % opts.sortorder)
+ stdouts,rval = pic.runPic(opts.jar,cl)
+ haveTempout = True
+
+ elif pic.picname == 'ReorderSam':
+ # input
+ cl.append('INPUT=%s' % opts.input)
+ # output
+ cl.append('OUTPUT=%s' % tempout)
+ # reference
+ cl.append('REFERENCE=%s' % ref_file_name)
+ # incomplete dict concordance
+ if opts.allow_inc_dict_concord == 'true':
+ cl.append('ALLOW_INCOMPLETE_DICT_CONCORDANCE=true')
+ # contig length discordance
+ if opts.allow_contig_len_discord == 'true':
+ cl.append('ALLOW_CONTIG_LENGTH_DISCORDANCE=true')
+ stdouts,rval = pic.runPic(opts.jar, cl)
+ haveTempout = True
+
+ elif pic.picname == 'ReplaceSamHeader':
+ cl.append('INPUT=%s' % opts.input)
+ cl.append('OUTPUT=%s' % tempout)
+ cl.append('HEADER=%s' % opts.header_file)
+ stdouts,rval = pic.runPic(opts.jar, cl)
+ haveTempout = True
+
+ elif pic.picname == 'CalculateHsMetrics':
+ maxloglines = 100
+ baitfname = os.path.join(opts.outdir,'rgPicardHsMetrics.bait')
+ targetfname = os.path.join(opts.outdir,'rgPicardHsMetrics.target')
+ baitf = pic.makePicInterval(opts.baitbed,baitfname)
+ if opts.targetbed == opts.baitbed: # same file sometimes
+ targetf = baitf
+ else:
+ targetf = pic.makePicInterval(opts.targetbed,targetfname)
+ cl.append('BAIT_INTERVALS=%s' % baitf)
+ cl.append('TARGET_INTERVALS=%s' % targetf)
+ cl.append('INPUT=%s' % os.path.abspath(opts.input))
+ cl.append('OUTPUT=%s' % pic.metricsOut)
+ cl.append('TMP_DIR=%s' % opts.tmpdir)
+ stdouts,rval = pic.runPic(opts.jar,cl)
+
+ elif pic.picname == 'ValidateSamFile':
+ import pysam
+ doTranspose = False
+ sortedfile = os.path.join(opts.outdir,'rgValidate.sorted')
+ stf = open(pic.log_filename,'w')
+ tlog = None
+ if opts.datatype == 'sam': # need to work with a bam
+ tlog,tempbam,rval = pic.samToBam(opts.input,opts.outdir)
+ try:
+ tlog = pic.sortSam(tempbam,sortedfile,opts.outdir)
+ except:
+ print '## exception on sorting sam file %s' % opts.input
+ else: # is already bam
+ try:
+ tlog = pic.sortSam(opts.input,sortedfile,opts.outdir)
+ except: # bug - [bam_sort_core] not being ignored - TODO fixme
+ print '## exception on sorting bam file %s' % opts.input
+ if tlog:
+ print '##tlog=',tlog
+ stf.write(tlog)
+ stf.write('\n')
+ sortedfile = '%s.bam' % sortedfile # samtools does that
+ cl.append('O=%s' % pic.metricsOut)
+ cl.append('TMP_DIR=%s' % opts.tmpdir)
+ cl.append('I=%s' % sortedfile)
+ opts.maxerrors = '99999999'
+ cl.append('MAX_OUTPUT=%s' % opts.maxerrors)
+ if opts.ignoreflags and opts.ignoreflags[0] <> 'None': # picard error values to ignore; the option is None when never passed
+ igs = ['IGNORE=%s' % x for x in opts.ignoreflags if x <> 'None']
+ cl.append(' '.join(igs))
+ if opts.bisulphite.lower() <> 'false':
+ cl.append('IS_BISULFITE_SEQUENCED=true')
+ if opts.ref <> None or opts.ref_file <> None:
+ cl.append('R=%s' % ref_file_name)
+ stdouts,rval = pic.runPic(opts.jar,cl)
+ if opts.datatype == 'sam':
+ pic.delme.append(tempbam)
+ newsam = opts.output
+ outformat = 'bam'
+ pe = open(pic.metricsOut,'r').readlines()
+ pic.cleanSam(insam=sortedfile, newsam=newsam, picardErrors=pe,outformat=outformat)
+ pic.delme.append(sortedfile) # not wanted
+ stf.close()
+ pic.cleanup()
+ else:
+ print >> sys.stderr,'picard_wrapper.py got an unknown tool name - %s' % pic.picname
+ sys.exit(1)
+ if haveTempout:
+ # Some Picard tools produced a potentially intermediate bam file.
+ # Either just move to final location or create sam
+ if os.path.exists(tempout):
+ shutil.move(tempout, os.path.abspath(opts.output))
+ if opts.htmlout <> None or doFix: # return a pretty html page
+ pic.fixPicardOutputs(transpose=doTranspose,maxloglines=maxloglines)
+ if rval <> 0:
+ print >> sys.stderr, '## exit code=%d; stdout=%s' % (rval,stdouts)
+ # signal failure
+if __name__=="__main__": __main__()
+
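Note on `runCL` in the file above: it joins the command into a single string, runs it with `shell=True`, and collects stdout and stderr through temporary files. The block below is a minimal Python 3 sketch of the same capture pattern without a shell. It is not part of this changeset; `run_command` is a hypothetical name, and passing the command as a list assumes callers no longer rely on shell features such as the `>` redirection used in `makePicInterval`.

```python
import subprocess
import sys
import tempfile


def run_command(args, cwd=None):
    """Run args (a list, no shell) and return (stdout, stderr, returncode).

    Output is routed through temporary files, as runCL does, so that very
    large output never has to fit in a pipe buffer.
    """
    with tempfile.TemporaryFile() as out, tempfile.TemporaryFile() as err:
        proc = subprocess.run(args, stdout=out, stderr=err, cwd=cwd)
        # The child wrote through the shared descriptors, so rewind first.
        out.seek(0)
        err.seek(0)
        stdout = out.read().decode(errors="replace")
        stderr = err.read().decode(errors="replace")
    return stdout, stderr, proc.returncode
```

A call such as `run_command([sys.executable, "-c", "print('ok')"])` returns the captured text together with the exit status, which the caller can check with `!= 0` so that negative (signal) statuses are also treated as failures.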
diff -r 000000000000 -r ff4ec13e496e rgPicardASMetrics.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/rgPicardASMetrics.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,162 @@
+
+
+ picard_wrapper.py -i "$input_file" -d "$html_file.files_path" -t "$html_file"
+ --assumesorted "$sorted" -b "$bisulphite" --adaptors "$adaptors" --maxinsert "$maxinsert" -n "$out_prefix" --datatype "$input_file.ext"
+ -j \$JAVA_JAR_PATH/CollectAlignmentSummaryMetrics.jar --tmpdir "${__new_file_path__}"
+#if $genomeSource.refGenomeSource == "history":
+ --ref-file "$genomeSource.ownFile"
+#else
+ --ref "${genomeSource.index.fields.path}"
+#end if
+
+ picard
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Summary**
+
+This Galaxy tool uses Picard to report high-level measures of alignment based on a provided sam or bam file.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for CollectAlignmentSummaryMetrics, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+-----
+
+.. class:: infomark
+
+**Syntax**
+
+- **Input** - SAM/BAM format aligned short read data in your current history
+- **Title** - the title to use for all output files from this job - use it for high level metadata
+- **Reference Genome** - Galaxy (and Picard) needs to know which genomic reference was used to generate alignments within the input SAM/BAM dataset. Here you have three choices:
+
+ - *Assigned data genome/build* - a genome specified for this dataset. If your SAM/BAM dataset has an assigned reference genome, it will be displayed below this dropdown. If it does not, use one of the following two options.
+ - *Select a different built-in genome* - this option will list all reference genomes presently cached at this instance of Galaxy.
+ - *Select a reference genome from history* - alternatively, you can upload your own version of the reference genome into your history and use it with this option. This is not advisable for large, human-sized genomes. If your genome is large, contact the Galaxy team using the "Help" link at the top of the interface and provide exact details on where we can download the sequences you would like to use as the reference. We will then install them as part of the locally cached genomic references.
+
+- **Assume Sorted** - saves sorting time - but only if true!
+- **Bisulphite data** - see Picard documentation http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSummaryMetrics
+- **Maximum acceptable insertion length** - see Picard documentation at http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSummaryMetrics
+
+-----
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+The Picard documentation (reformatted for Galaxy) says:
+
+.. csv-table::
+ :header-rows: 1
+
+ Option,Description
+ "INPUT=File","SAM or BAM file. Required."
+ "OUTPUT=File","File to write alignment summary metrics to. Required."
+ "REFERENCE_SEQUENCE=File","Reference sequence file. Required."
+ "ASSUME_SORTED=Boolean","If true (default), unsorted SAM/BAM files will be considered coordinate sorted."
+ "MAX_INSERT_SIZE=Integer","Paired end reads above this insert size will be considered chimeric along with inter-chromosomal pairs. Default value: 100000."
+ "ADAPTER_SEQUENCE=String","This option may be specified 0 or more times. "
+ "IS_BISULFITE_SEQUENCED=Boolean","Whether the SAM or BAM file consists of bisulfite sequenced reads. Default value: false. "
+ "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created."
+
+The output produced by the tool has the following columns::
+
+ 1. CATEGORY: One of UNPAIRED (for a fragment run), FIRST_OF_PAIR when metrics are for only the first read in a paired run, SECOND_OF_PAIR when the metrics are for only the second read in a paired run, or PAIR when the metrics are aggregated for both first and second reads in a pair.
+ 2. TOTAL_READS: The total number of reads including all PF and non-PF reads. When CATEGORY equals PAIR this value will be 2x the number of clusters.
+ 3. PF_READS: The number of PF reads where PF is defined as passing Illumina's filter.
+ 4. PCT_PF_READS: The percentage of reads that are PF (PF_READS / TOTAL_READS)
+ 5. PF_NOISE_READS: The number of PF reads that are marked as noise reads. A noise read is one which is composed entirely of A bases and/or N bases. These reads are marked as they are usually artifactual and are of no use in downstream analysis.
+ 6. PF_READS_ALIGNED: The number of PF reads that were aligned to the reference sequence. This includes reads that aligned with low quality (i.e. their alignments are ambiguous).
+ 7. PCT_PF_READS_ALIGNED: The percentage of PF reads that aligned to the reference sequence. PF_READS_ALIGNED / PF_READS
+ 8. PF_HQ_ALIGNED_READS: The number of PF reads that were aligned to the reference sequence with a mapping quality of Q20 or higher signifying that the aligner estimates a 1/100 (or smaller) chance that the alignment is wrong.
+ 9. PF_HQ_ALIGNED_BASES: The number of bases aligned to the reference sequence in reads that were mapped at high quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when either mixed read lengths are present or many reads are aligned with gaps.
+ 10. PF_HQ_ALIGNED_Q20_BASES: The subset of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.
+ 11. PF_HQ_MEDIAN_MISMATCHES: The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED_READS).
+ 12. PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads.
+ 13. MEAN_READ_LENGTH: The mean read length of the set of reads examined. When looking at the data for a single lane with equal length reads this number is just the read length. When looking at data for merged lanes with differing read lengths this is the mean read length of all reads.
+ 14. READS_ALIGNED_IN_PAIRS: The number of aligned reads whose mate pair was also aligned to the reference.
+ 15. PCT_READS_ALIGNED_IN_PAIRS: The percentage of reads whose mate pair was also aligned to the reference. READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED
+ 16. BAD_CYCLES: The number of instrument cycles in which 80% or more of base calls were no-calls.
+ 17. STRAND_BALANCE: The number of PF reads aligned to the positive strand of the genome divided by the number of PF reads aligned to the genome.
+ 18. PCT_CHIMERAS: The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes.
+ 19. PCT_ADAPTER: The percentage of PF reads that are unaligned and match to a known adapter sequence right from the start of the read.
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+
+
+
+
diff -r 000000000000 -r ff4ec13e496e rgPicardFixMate.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/rgPicardFixMate.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,107 @@
+
+ for paired data
+
+ picard_wrapper.py -i "$input_file" -o "$out_file" --tmpdir "${__new_file_path__}" -n "$out_prefix"
+ --output-format "$outputFormat" -j "\$JAVA_JAR_PATH/FixMateInformation.jar" --sortorder "$sortOrder"
+
+ picard
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Ensure that all mate-pair information is in sync between each read and its mate pair.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for FixMateInformation, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+.. class:: warningmark
+
+**Useful for paired data only**
+
+This tool is unlikely to do anything helpful for single-end sequence data.
+Galaxy does not currently distinguish paired from single-ended SAM/BAM, so make sure
+the data you choose are valid (paired-end) SAM or BAM data, unless you trust this
+tool not to harm your data.
+
+-----
+
+.. class:: infomark
+
+**Syntax**
+
+- **Input** - paired-read SAM/BAM format aligned short read data in your current history
+- **Sort order** - can be used to adjust the ordering of reads
+- **Title** - the title to use for all output files from this job - use it for high level metadata
+- **Output Format** - either SAM or compressed as BAM
+
+-----
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+.. csv-table::
+ :header-rows: 1
+
+
+ Option,Description
+ "INPUT=File","The input file to fix. This option may be specified 0 or more times."
+ "OUTPUT=File","The output file to write to"
+ "SORT_ORDER=SortOrder","Optional sort order if the OUTPUT file should be sorted differently than the INPUT file. Default value: null. Possible values: {unsorted, queryname, coordinate}"
+ "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false"
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+
+
+
+
+
diff -r 000000000000 -r ff4ec13e496e rgPicardGCBiasMetrics.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/rgPicardGCBiasMetrics.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,150 @@
+
+
+ picard_wrapper.py -i "$input_file" -d "$html_file.files_path" -t "$html_file"
+ --windowsize "$windowsize" --mingenomefrac "$mingenomefrac" -n "$out_prefix" --tmpdir "${__new_file_path__}"
+ -j \$JAVA_JAR_PATH/CollectGcBiasMetrics.jar
+#if $genomeSource.refGenomeSource == "history":
+ --ref-file "${genomeSource.ownFile}"
+#else:
+ --ref "${genomeSource.index.fields.path}"
+#end if
+
+ picard
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Summary**
+
+This Galaxy tool uses Picard to report detailed metrics about reads that fall within windows of a certain GC bin on the reference genome.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for CollectGcBiasMetrics, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+-----
+
+.. class:: infomark
+
+**Syntax**
+
+- **Input** - SAM/BAM format aligned short read data in your current history
+- **Title** - the title to use for all output files from this job - use it for high level metadata
+- **Reference Genome** - Galaxy (and Picard) needs to know which genomic reference was used to generate alignments within the input SAM/BAM dataset. Here you have three choices:
+
+ - *Assigned data genome/build* - a genome specified for this dataset. If your SAM/BAM dataset has an assigned reference genome, it will be displayed below this dropdown. If it does not, use one of the following two options.
+ - *Select a different built-in genome* - this option will list all reference genomes presently cached at this instance of Galaxy.
+ - *Select a reference genome from history* - alternatively you can upload your own version of the reference genome into your history and use it with this option. This is, however, not advisable for large, human-sized genomes. If your genome is large, contact the Galaxy team using the "Help" link at the top of the interface and provide exact details on where we can download the sequences you would like to use as the reference. We will then install them as part of the locally cached genomic references.
+
+- **Window Size** - see Picard documentation at http://picard.sourceforge.net/command-line-overview.shtml#CollectGCBiasMetrics
+- **Minimum Genome Fraction** See Picard documentation at http://picard.sourceforge.net/command-line-overview.shtml#CollectGCBiasMetrics
+
+-----
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+The Picard documentation (reformatted for Galaxy) says:
+
+.. csv-table::
+ :header-rows: 1
+
+ Option,Description
+ "REFERENCE_SEQUENCE=File","The reference sequence fasta file. Required."
+ "INPUT=File","The BAM or SAM file containing aligned reads. Required."
+ "OUTPUT=File","The text file to write the metrics table to. Required."
+ "CHART_OUTPUT=File","The PDF file to render the chart to. Required."
+ "SUMMARY_OUTPUT=File","The text file to write summary metrics to. Default value: null."
+ "WINDOW_SIZE=Integer","The size of windows on the genome that are used to bin reads. Default value: 100."
+ "MINIMUM_GENOME_FRACTION=Double","For summary metrics, exclude GC windows that include less than this fraction of the genome. Default value: 1.0E-5."
+ "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false."
+
+The output produced by the tool has the following columns::
+
+ 1. GC: The G+C content of the reference sequence represented by this bin. Values are from 0% to 100%
+ 2. WINDOWS: The number of windows on the reference genome that have this G+C content.
+ 3. READ_STARTS: The number of reads whose start position is at the start of a window of this GC.
+ 4. MEAN_BASE_QUALITY: The mean quality (determined via the error rate) of all bases of all reads that are assigned to windows of this GC.
+ 5. NORMALIZED_COVERAGE: The ratio of "coverage" in this GC bin vs. the mean coverage of all GC bins. A number of 1 represents mean coverage, a number less than one represents lower than mean coverage (e.g. 0.5 means half as much coverage as average) while a number greater than one represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average).
+ 6. ERROR_BAR_WIDTH: The radius of error bars in this bin based on the number of observations made. For example if the normalized coverage is 0.75 and the error bar width is 0.1 then the error bars would be drawn from 0.65 to 0.85.
+
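As a rough illustration of the NORMALIZED_COVERAGE and ERROR_BAR_WIDTH columns described above, here is a minimal Python sketch. It assumes that "mean coverage" means total read starts divided by total windows across all bins; the function names and the input layout are hypothetical and are not taken from Picard.

```python
def normalized_coverage(bins):
    """Compute per-bin normalized coverage.

    `bins` maps a GC percentage to (windows, read_starts).
    Assumption: mean coverage is total read starts per window over all bins.
    """
    total_windows = sum(windows for windows, _ in bins.values())
    total_reads = sum(reads for _, reads in bins.values())
    mean_per_window = total_reads / total_windows
    result = {}
    for gc, (windows, reads) in bins.items():
        if windows == 0:
            continue  # no windows with this GC content, nothing to report
        result[gc] = (reads / windows) / mean_per_window
    return result


def error_bar_range(coverage, width):
    """Return the (low, high) ends of the error bar around a coverage value."""
    return coverage - width, coverage + width


# Hypothetical counts: GC% -> (windows, read starts)
example_bins = {30: (100, 50), 50: (200, 200), 70: (100, 150)}
print(normalized_coverage(example_bins))
print(error_bar_range(0.75, 0.1))
```

In this made-up input the 30% GC bin receives half the average number of reads per window and the 70% bin receives one and a half times the average.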
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+
+
diff -r 000000000000 -r ff4ec13e496e rgPicardHsMetrics.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/rgPicardHsMetrics.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,156 @@
+
+ for targeted resequencing data
+
+
+ picard_wrapper.py -i "$input_file" -d "$html_file.files_path" -t "$html_file" --datatype "$input_file.ext"
+ --baitbed "$bait_bed" --targetbed "$target_bed" -n "$out_prefix" --tmpdir "${__new_file_path__}"
+ -j "\$JAVA_JAR_PATH/CalculateHsMetrics.jar"
+
+
+ picard
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Summary**
+
+Calculates a set of Hybrid Selection specific metrics from an aligned SAM or BAM file.
+
+.. class:: warningmark
+
+**WARNING about bait and target files**
+
+Picard is very fussy about the bait and target file format. If these are not exactly right, it will fail with an error something like::
+
+ Exception in thread "main" net.sf.picard.PicardException: Invalid interval record contains 6 fields: chr1 45787123 45787316 CASO_22G_25063 1000 +
+
+If you see an error like that from this tool, please do NOT report it to any of the Galaxy mailing lists, as it is not a bug!
+It means you must reformat your bait and target files. Unfortunately, Galaxy cannot do that for you automatically.
+
+The required definition is described in the documentation at http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_command-line_arguments
+and the sample provided looks like this::
+
+ chr1 1104841 1104940 + target_1
+ chr1 1105283 1105599 + target_2
+ chr1 1105712 1105860 + target_3
+ chr1 1105960 1106119 + target_4
+
+So your bait and target files MUST have 5 tab-delimited columns (chr, start, end, strand and name) in exactly that order.
+Note that the Picard-mandated SAM header described in the documentation linked above is automagically added by the tool in Galaxy.
+
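Before uploading bait or target files, it may help to check each record against the five-column layout described above. The following is a minimal Python sketch under that assumption; `check_interval_line` is a hypothetical helper, not part of Picard or Galaxy, and it checks only the column count and basic field types.

```python
def check_interval_line(line):
    """Return (ok, reason) for one bait/target record.

    Assumes five tab-delimited columns in this order:
    chr, start, end, strand, name.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return False, "expected 5 fields, found %d" % len(fields)
    chrom, start, end, strand, name = fields
    if not (start.isdigit() and end.isdigit()):
        return False, "start and end must be non-negative integers"
    if int(start) > int(end):
        return False, "start is greater than end"
    if strand not in ("+", "-"):
        return False, "strand must be + or -"
    return True, "ok"


# A record in the expected layout, and the six-field record from the error above
good = "chr1\t1104841\t1104940\t+\ttarget_1"
bad = "chr1\t45787123\t45787316\tCASO_22G_25063\t1000\t+"
print(check_interval_line(good))
print(check_interval_line(bad))
```

The second record is rejected because it has six fields, which is the same condition Picard complains about in the exception shown above.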
+.. class:: infomark
+
+**Picard documentation**
+
+This is a Galaxy wrapper for CalculateHsMetrics.jar, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+-----
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+Picard documentation says (reformatted for Galaxy):
+
+Calculates a set of Hybrid Selection specific metrics from an aligned SAM or BAM file.
+
+.. csv-table::
+ :header-rows: 1
+
+ "Option", "Description"
+ "BAIT_INTERVALS=File","An interval list file that contains the locations of the baits used. Required."
+ "TARGET_INTERVALS=File","An interval list file that contains the locations of the targets. Required."
+ "INPUT=File","An aligned SAM or BAM file. Required."
+ "OUTPUT=File","The output file to write the metrics to. Required. Cannot be used in conjunction with option(s) METRICS_FILE (M)"
+ "METRICS_FILE=File","Legacy synonym for OUTPUT, should not be used. Required. Cannot be used in conjunction with option(s) OUTPUT (O)"
+ "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false"
+
+HsMetrics
+
+ The set of metrics captured that are specific to a hybrid selection analysis.
+
+Output Column Definitions::
+
+ 1. BAIT_SET: The name of the bait set used in the hybrid selection.
+ 2. GENOME_SIZE: The number of bases in the reference genome used for alignment.
+ 3. BAIT_TERRITORY: The number of bases which have one or more baits on top of them.
+ 4. TARGET_TERRITORY: The unique number of target bases in the experiment where target is usually exons etc.
+ 5. BAIT_DESIGN_EFFICIENCY: Target territory / bait territory. 1 == perfectly efficient, 0.5 = half of baited bases are not target.
+ 6. TOTAL_READS: The total number of reads in the SAM or BAM file examined.
+ 7. PF_READS: The number of reads that pass the vendor's filter.
+ 8. PF_UNIQUE_READS: The number of PF reads that are not marked as duplicates.
+ 9. PCT_PF_READS: PF reads / total reads. The percent of reads passing filter.
+ 10. PCT_PF_UQ_READS: PF Unique Reads / Total Reads.
+ 11. PF_UQ_READS_ALIGNED: The number of PF unique reads that are aligned with mapping score > 0 to the reference genome.
+ 12. PCT_PF_UQ_READS_ALIGNED: PF Reads Aligned / PF Reads.
+ 13. PF_UQ_BASES_ALIGNED: The number of bases in the PF aligned reads that are mapped to a reference base. Accounts for clipping and gaps.
+ 14. ON_BAIT_BASES: The number of PF aligned bases that mapped to a baited region of the genome.
+ 15. NEAR_BAIT_BASES: The number of PF aligned bases that mapped to within a fixed interval of a baited region, but not on a baited region.
+ 16. OFF_BAIT_BASES: The number of PF aligned bases that mapped neither on nor near a bait.
+ 17. ON_TARGET_BASES: The number of PF aligned bases that mapped to a targeted region of the genome.
+ 18. PCT_SELECTED_BASES: On+Near Bait Bases / PF Bases Aligned.
+ 19. PCT_OFF_BAIT: The percentage of aligned PF bases that mapped neither on nor near a bait.
+ 20. ON_BAIT_VS_SELECTED: The percentage of on+near bait bases that are on as opposed to near.
+ 21. MEAN_BAIT_COVERAGE: The mean coverage of all baits in the experiment.
+ 22. MEAN_TARGET_COVERAGE: The mean coverage of targets that received at least coverage depth = 2 at one base.
+ 23. PCT_USABLE_BASES_ON_BAIT: The number of aligned, de-duped, on-bait bases out of the PF bases available.
+ 24. PCT_USABLE_BASES_ON_TARGET: The number of aligned, de-duped, on-target bases out of the PF bases available.
+ 25. FOLD_ENRICHMENT: The fold by which the baited region has been amplified above genomic background.
+ 26. ZERO_CVG_TARGETS_PCT: The number of targets that did not reach coverage=2 over any base.
+ 27. FOLD_80_BASE_PENALTY: The fold over-coverage necessary to raise 80% of bases in "non-zero-cvg" targets to the mean coverage level in those targets.
+ 28. PCT_TARGET_BASES_2X: The percentage of ALL target bases achieving 2X or greater coverage.
+ 29. PCT_TARGET_BASES_10X: The percentage of ALL target bases achieving 10X or greater coverage.
+ 30. PCT_TARGET_BASES_20X: The percentage of ALL target bases achieving 20X or greater coverage.
+ 31. PCT_TARGET_BASES_30X: The percentage of ALL target bases achieving 30X or greater coverage.
+ 32. HS_LIBRARY_SIZE: The estimated number of unique molecules in the selected part of the library.
+ 33. HS_PENALTY_10X: The "hybrid selection penalty" incurred to get 80% of target bases to 10X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 10X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 10 * HS_PENALTY_10X.
+ 34. HS_PENALTY_20X: The "hybrid selection penalty" incurred to get 80% of target bases to 20X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 20X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 20 * HS_PENALTY_20X.
+ 35. HS_PENALTY_30X: The "hybrid selection penalty" incurred to get 80% of target bases to 30X. This metric should be interpreted as: if I have a design with 10 megabases of target, and want to get 30X coverage I need to sequence until PF_ALIGNED_BASES = 10^7 * 30 * HS_PENALTY_30X.
+
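The HS_PENALTY columns above are meant to be multiplied by the target size and the desired coverage. A small Python sketch of that arithmetic follows; the function name and the penalty value are hypothetical and serve only to show how the numbers combine.

```python
def required_aligned_bases(target_bases, coverage, hs_penalty):
    """Estimate the PF aligned bases needed to reach a given coverage.

    target_bases: size of the target territory in bases
    coverage:     desired depth (10, 20 or 30 for the reported penalties)
    hs_penalty:   the matching HS_PENALTY_<coverage>X value from the metrics
    """
    return target_bases * coverage * hs_penalty


# 10 megabases of target, 20X desired, and a made-up HS_PENALTY_20X of 4.5
needed = required_aligned_bases(10_000_000, 20, 4.5)
print(needed)  # 900000000.0
```

With these illustrative inputs, roughly 900 million PF aligned bases would be needed, compared with 200 million if the selection were perfectly efficient (penalty of 1).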
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears to be the only way to deal with SAM/BAM that cannot be parsed.
+
+
+
+
diff -r 000000000000 -r ff4ec13e496e rgPicardInsertSize.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/rgPicardInsertSize.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,97 @@
+
+ for PAIRED data
+ picard
+
+ picard_wrapper.py -i "$input_file" -n "$out_prefix" --tmpdir "${__new_file_path__}" --deviations "$deviations"
+ --histwidth "$histWidth" --minpct "$minPct" --malevel "$malevel"
+ -j "\$JAVA_JAR_PATH/CollectInsertSizeMetrics.jar" -d "$html_file.files_path" -t "$html_file"
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Reads a SAM or BAM file and describes the distribution
+of insert size (excluding duplicates) with metrics and a histogram plot.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for CollectInsertSizeMetrics, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+.. class:: warningmark
+
+**Useful for paired data only**
+
+This tool works for paired data only and can be expected to fail for single end data.
+
+-----
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+Picard documentation says (reformatted for Galaxy):
+
+.. csv-table::
+ :header-rows: 1
+
+ Option,Description
+ "INPUT=File","SAM or BAM file. Required."
+ "OUTPUT=File","File to write insert size metrics to. Required."
+ "HISTOGRAM_FILE=File","File to write insert size histogram chart to. Required."
+ "TAIL_LIMIT=Integer","When calculating mean and stdev stop when the bins in the tail of the distribution contain fewer than mode/TAIL_LIMIT items. This also limits how much data goes into each data category of the histogram."
+ "HISTOGRAM_WIDTH=Integer","Explicitly sets the histogram width, overriding the TAIL_LIMIT option. Also, when calculating mean and stdev, only bins less than or equal to HISTOGRAM_WIDTH will be included."
+ "MINIMUM_PCT=Float","When generating the histogram, discard any data categories (out of FR, TANDEM, RF) that have fewer than this percentage of overall reads. (Range: 0 to 1) Default value: 0.01."
+ "STOP_AFTER=Integer","Stop after processing N reads, mainly for debugging. Default value: 0."
+ "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false."
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+
+
diff -r 000000000000 -r ff4ec13e496e rgPicardLibComplexity.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/rgPicardLibComplexity.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,123 @@
+
+ picard
+
+ picard_wrapper.py -i "$input_file" -n "$out_prefix" --tmpdir "${__new_file_path__}" --minid "$minIDbases"
+ --maxdiff "$maxDiff" --minmeanq "$minMeanQ" --readregex "$readRegex" --optdupdist "$optDupeDist"
+ -j "\$JAVA_JAR_PATH/EstimateLibraryComplexity.jar" -d "$html_file.files_path" -t "$html_file"
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Attempts to estimate library complexity from sequence alone.
+Does so by sorting all reads by the first N bases (5 by default) of each read and then
+comparing reads with the first N bases identical to each other for duplicates. Reads are considered to be
+duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).
+
+Reads of poor quality are filtered out so as to provide a more accurate estimate.
+The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than
+MIN_MEAN_QUALITY across either the first or second read.
+
+The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes these in the
+calculation of library size. Also, since there is no alignment to screen out technical reads one
+further filter is applied on the data. After examining all reads a histogram is built of
+[#reads in duplicate set -> #of duplicate sets]; all bins that contain exactly one duplicate set are
+then removed from the histogram as outliers before library size is estimated.
+
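The grouping step described above can be sketched in a few lines of Python. This is a simplified illustration, not Picard's implementation: it works on single reads rather than read pairs, skips the quality filter and optical-duplicate handling, and compares each read only against the first member of a candidate set. The function names are hypothetical.

```python
from collections import defaultdict


def mismatch_rate(a, b):
    """Fraction of mismatching positions between two ungapped reads."""
    length = min(len(a), len(b))
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / length


def group_duplicates(reads, n=5, max_diff_rate=0.03):
    """Group reads into duplicate sets.

    Reads are bucketed by their first n bases; within a bucket, a read joins
    a set when its mismatch rate against the set's first read is at most
    max_diff_rate. Reads with a no-call (N) in the first n bases are dropped.
    """
    buckets = defaultdict(list)
    for read in reads:
        prefix = read[:n]
        if "N" in prefix:
            continue
        buckets[prefix].append(read)

    duplicate_sets = []
    for bucket in buckets.values():
        sets_in_bucket = []
        for read in bucket:
            for dup_set in sets_in_bucket:
                if mismatch_rate(read, dup_set[0]) <= max_diff_rate:
                    dup_set.append(read)
                    break
            else:
                # No existing set was close enough, so start a new one
                sets_in_bucket.append([read])
        duplicate_sets.extend(sets_in_bucket)
    return duplicate_sets


reads = [
    "ACGTA" + "C" * 35,
    "ACGTA" + "C" * 34 + "T",  # one mismatch in 40 bases (rate 0.025)
    "ACGTA" + "G" * 35,        # same prefix, but far too many mismatches
    "NCGTA" + "C" * 35,        # no-call in the prefix, filtered out
    "TTGCA" + "C" * 35,
]
for dup_set in group_duplicates(reads):
    print(len(dup_set), dup_set[0][:10])
```

Bucketing by prefix keeps the number of pairwise comparisons manageable, which is the trade-off the MIN_IDENTICAL_BASES option controls.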
+**Picard documentation**
+
+This is a Galaxy wrapper for EstimateLibraryComplexity, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+-----
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+Picard documentation says (reformatted for Galaxy):
+
+.. csv-table::
+ :header-rows: 1
+
+ Option,Description
+ "INPUT=File","One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped. This option may be specified 0 or more times."
+ "OUTPUT=File","Output file to write per-library metrics to. Required."
+ "MIN_IDENTICAL_BASES=Integer","The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU. Default value: 5."
+ "MAX_DIFF_RATE=Double","The maximum rate of differences between two reads to call them identical. Default value: 0.03. "
+ "MIN_MEAN_QUALITY=Integer","The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations. Default value: 20."
+ "READ_NAME_REGEX=String","Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. The regular expression should contain three capture groups for the three variables, in order. Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to clear the default value."
+ "OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer","The maximum offset between two duplicate clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal. Default value: 100"
+ "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false. This option can be set to 'null' to clear the default value. "
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+.. class:: infomark
+
+**Note on the Regular Expression**
+
+(from the Picard docs)
+This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file.
+These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size.
+The regular expression should contain three capture groups for the three variables, in order.
+Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.
+
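To see what the default expression captures, here is a minimal Python sketch. It assumes that Python's `re` module treats this particular pattern the same way Picard's Java regex engine does, which holds for the simple character classes used here. The read name `MACHINE1:3:12:345:678#0` is a made-up example and `parse_read_name` is a hypothetical helper.

```python
import re

# Default READ_NAME_REGEX as quoted in the Picard documentation above
DEFAULT_READ_NAME_REGEX = r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*"


def parse_read_name(name, pattern=DEFAULT_READ_NAME_REGEX):
    """Return (tile, x, y) as integers, or None if the name does not match."""
    match = re.match(pattern, name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


print(parse_read_name("MACHINE1:3:12:345:678#0"))  # (12, 345, 678)
print(parse_read_name("random_phiX_region_0"))     # None
```

Read names that do not follow the colon-separated layout, such as the simulated names in the test data shipped with these tools, yield no coordinates, so a custom expression would be needed for them.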
+
+
+
+
+
diff -r 000000000000 -r ff4ec13e496e rgPicardMarkDups.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/rgPicardMarkDups.xml Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,130 @@
+
+
+ picard_wrapper.py -i "$input_file" -n "$out_prefix" --tmpdir "${__new_file_path__}" -o "$out_file"
+ --remdups "$remDups" --assumesorted "$assumeSorted" --readregex "$readRegex" --optdupdist "$optDupeDist"
+ -j "\$JAVA_JAR_PATH/MarkDuplicates.jar" -d "$html_file.files_path" -t "$html_file" -e "$input_file.ext"
+
+ picard
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+.. class:: infomark
+
+**Purpose**
+
+Marks all duplicate reads in a provided SAM or BAM file and either removes them or flags them.
+
+**Picard documentation**
+
+This is a Galaxy wrapper for MarkDuplicates, a part of the external package Picard-tools_.
+
+ .. _Picard-tools: http://www.google.com/search?q=picard+samtools
+
+-----
+
+.. class:: infomark
+
+**Inputs, outputs, and parameters**
+
+Picard documentation says (reformatted for Galaxy):
+
+.. csv-table:: Mark Duplicates docs
+ :header-rows: 1
+
+ Option,Description
+ "INPUT=File","The input SAM or BAM file to analyze. Must be coordinate sorted. Required."
+ "OUTPUT=File","The output file to write marked records to. Required."
+ "METRICS_FILE=File","File to write duplication metrics to. Required."
+ "REMOVE_DUPLICATES=Boolean","If true, do not write duplicates to the output file instead of writing them with appropriate flags set. Default value: false."
+ "ASSUME_SORTED=Boolean","If true, assume that the input file is coordinate sorted, even if the header says otherwise. Default value: false."
+ "MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=Integer","This option is obsolete. ReadEnds will always be spilled to disk. Default value: 50000."
+ "MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=Integer","Maximum number of file handles to keep open when spilling read ends to disk."
+ "READ_NAME_REGEX=String","Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. "
+ "OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer","The maximum offset between two duplicate clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal. Default value: 100"
+
+.. class:: warningmark
+
+**Warning on SAM/BAM quality**
+
+Many SAM/BAM files produced externally and uploaded to Galaxy do not fully conform to SAM/BAM specifications. Galaxy deals with this by using the **LENIENT**
+flag when it runs Picard, which allows reads to be discarded if they're empty or don't map. This appears
+to be the only way to deal with SAM/BAM that cannot be parsed.
+
+.. class:: infomark
+
+**Note on the Regular Expression**
+
+(from the Picard docs)
+This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. The regular expression should contain three capture groups for the three variables, in order. Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.
+
+Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules. All records are then written to the output file with the duplicate records flagged, unless the remove duplicates option is selected. Removing duplicates is occasionally appropriate, but please only do so if you really understand what you are doing.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff -r 000000000000 -r ff4ec13e496e test-data/bfast_out1.sam
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/bfast_out1.sam Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,23 @@
+@HD VN:0.1.2 SO:unsorted GO:none
+@SQ SN:phiX174 LN:5386
+@PG ID:bfast VN:0.6.4d
+random_phiX_region_0 0 phiX174 553 255 50M * 0 0 TTGAGGCTTGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_1 0 phiX174 3693 255 50M * 0 0 GTTAGTGCTGAGGTTGACTTAGTTCATCAGCAAACGCAGAATCAGCGGTA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_2 0 phiX174 375 255 50M * 0 0 AATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_3 0 phiX174 3168 255 50M * 0 0 GGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_4 0 phiX174 5254 255 50M * 0 0 ACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGAC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_5 0 phiX174 5066 255 50M * 0 0 AGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACTTCCCA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_6 0 phiX174 1226 255 50M * 0 0 CACGTTTATGGTGAACAGTGGATTAAGTTCATGAAGGATGGTGTTAATGC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_7 0 phiX174 1096 255 50M * 0 0 AACTACTCCGGTTATCGCTGGCGACTCCTTCGAGATGGACGCCGTTGGCG ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_8 0 phiX174 535 255 50M * 0 0 CTCGTGCTCGTCGCTGCGTTGAGGCTTGCGTTTATGGTACGCTGGACTTT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_9 0 phiX174 3669 255 50M * 0 0 CAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAGTTCA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_10 0 phiX174 4887 255 50M * 0 0 TACAGTATGCCCATCGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_11 0 phiX174 1849 255 50M * 0 0 TATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_12 0 phiX174 4145 255 50M * 0 0 AGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_13 0 phiX174 1853 255 50M * 0 0 TTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGACTAAAGAGATTCA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_14 0 phiX174 2800 255 50M * 0 0 CCGGGCAATAACGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGC ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2300 NM:i:1 NH:i:1 IH:i:1 HI:i:1 MD:Z:11T38 XA:i:0
+random_phiX_region_15 0 phiX174 1910 255 50M * 0 0 AACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_16 0 phiX174 3366 255 50M * 0 0 GCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_17 0 phiX174 2165 255 50M * 0 0 CATGATTATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAG ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_18 0 phiX174 2051 255 50M * 0 0 TGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
+random_phiX_region_19 0 phiX174 5099 255 50M * 0 0 GCTGTCGCTACTTCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ PG:Z:bfast AS:i:2500 NM:i:0 NH:i:1 IH:i:1 HI:i:1 MD:Z:50 XA:i:0
diff -r 000000000000 -r ff4ec13e496e test-data/bwa_wrapper_in2.fastqsanger
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/bwa_wrapper_in2.fastqsanger Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,120 @@
+@seq1/1
+GGACTCAGATAGTAATCC
++
+II#IIIIIII$5+.(9II
+@seq2/1
+ATTCGACCTATCCTTGCG
++
+IIIIIIIIIIIIIIIIII
+@seq3/1
+GTAACAAAGTTTGGATTG
++
+IIIIIIIIIIIIIIIIII
+@seq4/1
+AGCCGCTCGTCTTTTATG
++
+IIIIIIIIIIIIIIIIII
+@seq5/1
+CAGTTATATGGCTTTTGG
++
+IIIIIIIIIIIIIIIIII
+@seq6/1
+AGGCGCTCGTCTTGGTAT
++
+IIIIIIIIIIIIIIIIII
+@seq7/1
+TGTAGGTGGTCAACCAAT
++
+IIIIIIIIIIIIIIIIII
+@seq8/1
+ACACCCGTCCTTTACGTC
++
+IIIIIIIIIIIIIIIIII
+@seq9/1
+GCCGCTATTCAGGTTGTT
++
+IIIIIIIIIIIIIIIIII
+@seq10/1
+ATTCTTTCTTTTCGTATC
++
+IIIIIIIIIIIIIIIIII
+@seq11/1
+GCATTTCTACTCCTTCTC
++
+II#IIIIIII$5+.(9II
+@seq12/1
+CGCGCTTCGATAAAAATG
++
+IIIIIIIIIIIIIIIIII
+@seq13/1
+ATTTCTACTCTTTCTCAT
++
+IIIIIIIIIIIIIIIIII
+@seq14/1
+CCCTTTTGAATGTCACGC
++
+IIIIIIIIIIIIIIIIII
+@seq15/1
+CCAACTTACCAAGGTGGG
++
+IIIIIIIIIIIIIIIIII
+@seq16/1
+TCAGGGTATTAAAAGAGA
++
+IIIIIIIIIIIIIIIIII
+@seq17/1
+GTGATGTGCTTGCTACCG
++
+IIIIIIIIIIIIIIIIII
+@seq18/1
+TCAATCCCCCATGCTTGG
++
+IIIIIIIIIIIIIIIIII
+@seq19/1
+TTCCTGCGCTTAATGCTT
++
+IIIIIIIIIIIIIIIIII
+@seq20/1
+CTTATTACCATTTCAACT
++
+IIIIIIIIIIIIIIIIII
+@seq21/1
+CTGATACCAATAAAACCC
++
+II#IIIIIII$5+.(9II
+@seq22/1
+AATCAAACTTACCAAGGG
++
+IIIIIIIIIIIIIIIIII
+@seq23/1
+TGTGCTTCCCCAACTTGA
++
+IIIIIIIIIIIIIIIIII
+@seq24/1
+TTTCTCAATCCCCAATGC
++
+IIIIIIIIIIIIIIIIII
+@seq25/1
+TTGCTACTGACCGCTCTT
++
+IIIIIIIIIIIIIIIIII
+@seq26/1
+CCGCGTGAAATTTCTATG
++
+IIIIIIIIIIIIIIIIII
+@seq27/1
+CGCTAATCAAGTTGTTTC
++
+IIIIIIIIIIIIIIIIII
+@seq28/1
+AAAGAGATTATTTGTCGG
++
+IIIIIIIIIIIIIIIIII
+@seq29/1
+CAAATTAATGCGCGCTTC
++
+IIIIIIIIIIIIIIIIII
+@seq30/1
+ATCCCCTATGCTTGGCTT
++
+IIIIIIIIIIIIIIIIII
diff -r 000000000000 -r ff4ec13e496e test-data/bwa_wrapper_in3.fastqsanger
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/bwa_wrapper_in3.fastqsanger Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,120 @@
+@seq1/2
+ACGCTCCTTTAAAATATC
++
+IIIII$%*$G$A31I&&B
+@seq2/2
+CAGCTCGAGAAGCTCTTA
++
+IIIIIIIIIIIIIIIIII
+@seq3/2
+CTACTGACCGCTCTCGTG
++
+IIIIIIIIIIIIIIIIII
+@seq4/2
+TAGGTGGTCAACCATTTT
++
+IIIIIIIIIIIIIIIIII
+@seq5/2
+TTTCTATGTGGCTTAATA
++
+IIIIIIIIIIIIIIIIII
+@seq6/2
+GTAGGTGGTCAACAATTT
++
+IIIIIIIIIIIIIIIIII
+@seq7/2
+TTTAATTGCAGGGGCTTC
++
+IIIIIIIIIIIIIIIIII
+@seq8/2
+ATGCGCTCTATTCTCTGG
++
+IIIIIIIIIIIIIIIIII
+@seq9/2
+TTCTGTTGGTGCTGATAT
++
+IIIIIIIIIIIIIIIIII
+@seq10/2
+AGGGCGTTGAGTTCGATA
++
+IIIIIIIIIIIIIIIIII
+@seq11/2
+ATCCCCAATGCTTGGCTT
++
+IIIII$%*$G$A31I&&B
+@seq12/2
+GGATTGGCGTTTCCAACC
++
+IIIIIIIIIIIIIIIIII
+@seq13/2
+CCCCAATCCTTGCCTTCC
++
+IIIAAIIIIIIIIIIIII
+@seq14/2
+TGATATTTTGACTTTGAG
++
+IIIIIIIIIIIIIIIIII
+@seq15/2
+TTACGAAACGCGACGCCG
++
+IIIIIIIIIIIIIIIIII
+@seq16/2
+TTATTTTTCTCCAGCCAC
++
+IIIIIIIIIIIIIIIIII
+@seq17/2
+AAACAATACTTTAGGCAT
++
+IIIIIIIIIIIIIIIIII
+@seq18/2
+CCGTTCCATAAGCAGATG
++
+IIIIIIIIIIIIIIIIII
+@seq19/2
+GAGCGTCCTGGTGCTGAT
++
+IIIIIIIIIIIIIIIIII
+@seq20/2
+ACTCCGGTTATCGCTGGC
++
+IIIIIIIIIIIIIIIIII
+@seq21/2
+TAAGCATTTGGTTCAGGG
++
+IIIII$%*$G$A31I&&B
+@seq22/2
+GTTACGACGCGACGCCGT
++
+IIIIIIIIIIIIIIIIII
+@seq23/2
+TTTAATAACCCTATAGAC
++
+IIIIIIIIIIIIIIIIII
+@seq24/2
+CTTGGCTTCCCTAAGCAG
++
+IIIIIIIIIIIIIIIIII
+@seq25/2
+CGTGCTCGTTGCTGCGTT
++
+IIIIIIIIIIIIIIIIII
+@seq26/2
+AAGGATGTTTTCCGTTCT
++
+IIIIIIIIIIIIIIIIII
+@seq27/2
+TGTTTGGTGCTGATATTG
++
+IIIIIIIIIIIIIIIIII
+@seq28/2
+TCCAGCCACTAAAGTGAG
++
+IIIIIIIIIIIIIIIIII
+@seq29/2
+GATAATGATTGGGGTATC
++
+IIIIIIIIIIIIIIIIII
+@seq30/2
+ACCATAAGCAGATGGATA
++
+IIIIIIIIIIIIIIIIII
diff -r 000000000000 -r ff4ec13e496e test-data/bwa_wrapper_out3.sam
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/bwa_wrapper_out3.sam Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,63 @@
+@SQ SN:phiX174 LN:5386
+@RG ID:abcdefg LB:lib-mom-A PL:ILLUMINA SM:mom DS:descrip DT:2010-11-01 PI:400
+@PG ID:bwa PN:bwa VN:0.5.9-r16
+seq1 113 phiX174 340 37 18M = 322 -18 GGATTACTATCTGAGTCC II9(.+5$IIIIIII#II RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq1 177 phiX174 322 25 18M = 340 18 GATATTTTAAAGGAGCGT B&&I13A$G$*%$IIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:2C8A6
+seq2 65 phiX174 141 37 18M = 159 18 ATTCGACCTATCCTTGCG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq2 129 phiX174 159 37 18M = 141 -18 CAGCTCGAGAAGCTCTTA IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq3 65 phiX174 505 37 18M = 523 18 GTAACAAAGTTTGGATTG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq3 129 phiX174 523 37 18M = 505 -18 CTACTGACCGCTCTCGTG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq4 69 phiX174 945 0 * = 945 0 AGCCGCTCGTCTTTTATG IIIIIIIIIIIIIIIIII RG:Z:abcdefg
+seq4 137 phiX174 945 23 18M = 945 0 TAGGTGGTCAACCATTTT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:23 AM:i:0 X0:i:1 X1:i:1 XM:i:1 XO:i:0 XG:i:0 MD:Z:12A5 XA:Z:phiX174,+945,17M1S,2;
+seq5 65 phiX174 4985 37 18M = 5003 18 CAGTTATATGGCTTTTGG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:13G4
+seq5 129 phiX174 5003 37 18M = 4985 -18 TTTCTATGTGGCTTAATA IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:13A4
+seq6 65 phiX174 925 37 11M1D7M = 944 19 AGGCGCTCGTCTTGGTAT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 MD:Z:11^T7
+seq6 129 phiX174 944 37 18M = 925 -19 GTAGGTGGTCAACAATTT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq7 65 phiX174 943 25 18M = 960 17 TGTAGGTGGTCAACCAAT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:25 AM:i:25 X0:i:1 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:14A1T1 XA:Z:phiX174,+943,13M1I4M,2;
+seq7 129 phiX174 960 37 18M = 943 -17 TTTAATTGCAGGGGCTTC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq8 69 phiX174 1715 0 * = 1715 0 ACACCCGTCCTTTACGTC IIIIIIIIIIIIIIIIII RG:Z:abcdefg
+seq8 137 phiX174 1715 37 18M = 1715 0 ATGCGCTCTATTCTCTGG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:10A7
+seq9 65 phiX174 2596 37 18M = 2613 17 GCCGCTATTCAGGTTGTT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:7A10
+seq9 129 phiX174 2613 37 18M = 2596 -17 TTCTGTTGGTGCTGATAT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq10 65 phiX174 4149 25 18M = 4168 19 ATTCTTTCTTTTCGTATC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:5G11G0
+seq10 129 phiX174 4168 37 18M = 4149 -19 AGGGCGTTGAGTTCGATA IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq11 65 phiX174 4072 37 18M = 4091 19 GCATTTCTACTCCTTCTC II#IIIIIII$5+.(9II RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:12T5
+seq11 129 phiX174 4091 37 18M = 4072 -19 ATCCCCAATGCTTGGCTT IIIII$%*$G$A31I&&B RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq12 65 phiX174 5349 37 18M = 5365 16 CGCGCTTCGATAAAAATG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq12 129 phiX174 5365 25 18M = 5349 -16 GGATTGGCGTTTCCAACC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0T9A7
+seq13 65 phiX174 4074 37 18M = 4093 19 ATTTCTACTCTTTCTCAT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:17A0
+seq13 129 phiX174 4093 25 18M = 4074 -19 CCCCAATCCTTGCCTTCC IIIAAIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:7G4G5
+seq14 65 phiX174 3998 37 18M = 4016 18 CCCTTTTGAATGTCACGC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:5C12
+seq14 129 phiX174 4016 37 3M1D15M = 3998 -18 TGATATTTTGACTTTGAG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 MD:Z:3^T15
+seq15 65 phiX174 5198 37 18M = 5216 18 CCAACTTACCAAGGTGGG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:13C4
+seq15 129 phiX174 5216 37 5M2I11M = 5198 -18 TTACGAAACGCGACGCCG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:16
+seq16 65 phiX174 2880 37 10M1I7M = 2897 17 TCAGGGTATTAAAAGAGA IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:1 XG:i:1 MD:Z:5T11
+seq16 129 phiX174 2897 37 18M = 2880 -17 TTATTTTTCTCCAGCCAC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:6G11
+seq17 65 phiX174 3034 37 18M = 3053 19 GTGATGTGCTTGCTACCG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq17 129 phiX174 3053 25 18M = 3034 -19 AAACAATACTTTAGGCAT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0T9G7
+seq18 73 phiX174 4088 37 18M = 4088 0 TCAATCCCCCATGCTTGG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:9A8
+seq18 133 phiX174 4088 0 * = 4088 0 CCGTTCCATAAGCAGATG IIIIIIIIIIIIIIIIII RG:Z:abcdefg
+seq19 65 phiX174 3304 37 18M = 3324 20 TTCCTGCGCTTAATGCTT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:6A11
+seq19 129 phiX174 3324 37 18M = 3304 -20 GAGCGTCCTGGTGCTGAT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:6G11
+seq20 65 phiX174 1082 37 18M = 1100 18 CTTATTACCATTTCAACT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq20 129 phiX174 1100 37 18M = 1082 -18 ACTCCGGTTATCGCTGGC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq21 65 phiX174 1344 23 18M = 1363 19 CTGATACCAATAAAACCC II#IIIIIII$5+.(9II RG:Z:abcdefg XT:A:U NM:i:1 SM:i:23 AM:i:23 X0:i:1 X1:i:1 XM:i:1 XO:i:0 XG:i:0 MD:Z:15T2 XA:Z:phiX174,+1344,15M1D3M,2;
+seq21 129 phiX174 1363 37 18M = 1344 -19 TAAGCATTTGGTTCAGGG IIIII$%*$G$A31I&&B RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:23 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:10T7
+seq22 69 phiX174 5215 0 * = 5215 0 AATCAAACTTACCAAGGG IIIIIIIIIIIIIIIIII RG:Z:abcdefg
+seq22 137 phiX174 5215 37 18M = 5215 0 GTTACGACGCGACGCCGT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq23 65 phiX174 4289 37 18M = 4308 19 TGTGCTTCCCCAACTTGA IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:6C11
+seq23 129 phiX174 4308 25 18M = 4289 -19 TTTAATAACCCTATAGAC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:0A8A8
+seq24 65 phiX174 4084 37 18M = 4101 17 TTTCTCAATCCCCAATGC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq24 129 phiX174 4101 37 18M = 4084 -17 CTTGGCTTCCCTAAGCAG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:10A7
+seq25 65 phiX174 520 37 18M = 537 17 TTGCTACTGACCGCTCTT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:17C0
+seq25 129 phiX174 537 37 18M = 520 -17 CGTGCTCGTTGCTGCGTT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:9C8
+seq26 65 phiX174 1976 37 18M = 1994 18 CCGCGTGAAATTTCTATG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq26 129 phiX174 1994 37 18M = 1976 -18 AAGGATGTTTTCCGTTCT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:18
+seq27 65 phiX174 2598 37 18M = 2614 16 CGCTAATCAAGTTGTTTC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:9G8
+seq27 129 phiX174 2614 37 3M1D15M = 2598 -16 TGTTTGGTGCTGATATTG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:1 XG:i:1 MD:Z:1C1^G15
+seq28 65 phiX174 2890 25 18M = 2906 16 AAAGAGATTATTTGTCGG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:2 SM:i:25 AM:i:25 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:16T0C0
+seq28 129 phiX174 2906 37 18M = 2890 -16 TCCAGCCACTAAAGTGAG IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:25 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:10T7
+seq29 73 phiX174 5339 37 18M = 5339 0 CAAATTAATGCGCGCTTC IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:6T11
+seq29 133 phiX174 5339 0 * = 5339 0 GATAATGATTGGGGTATC IIIIIIIIIIIIIIIIII RG:Z:abcdefg
+seq30 65 phiX174 4091 37 18M = 4108 17 ATCCCCTATGCTTGGCTT IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:6A11
+seq30 129 phiX174 4108 37 18M = 4091 -17 ACCATAAGCAGATGGATA IIIIIIIIIIIIIIIIII RG:Z:abcdefg XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:0T17
diff -r 000000000000 -r ff4ec13e496e test-data/phiX.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/phiX.fasta Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,79 @@
+>phiX174
+GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
+GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
+ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG
+TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA
+GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC
+TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT
+TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT
+CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT
+TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG
+TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC
+GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA
+CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCAGAAGGAG
+TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT
+AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC
+CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA
+TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC
+TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA
+CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA
+GGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTT
+GGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACA
+ACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGC
+TCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTT
+TCTCGCCAAATGACGACTTCTACCACATCTATTGACATTATGGGTCTGCAAGCTGCTTATGCTAATTTGC
+ATACTGACCAAGAACGTGATTACTTCATGCAGCGTTACCGTGATGTTATTTCTTCATTTGGAGGTAAAAC
+CTCTTATGACGCTGACAACCGTCCTTTACTTGTCATGCGCTCTAATCTCTGGGCATCTGGCTATGATGTT
+GATGGAACTGACCAAACGTCGTTAGGCCAGTTTTCTGGTCGTGTTCAACAGACCTATAAACATTCTGTGC
+CGCGTTTCTTTGTTCCTGAGCATGGCACTATGTTTACTCTTGCGCTTGTTCGTTTTCCGCCTACTGCGAC
+TAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGACTTATACCGATATTGCTGGCGACCCTGTTTTG
+TATGGCAACTTGCCGCCGCGTGAAATTTCTATGAAGGATGTTTTCCGTTCTGGTGATTCGTCTAAGAAGT
+TTAAGATTGCTGAGGGTCAGTGGTATCGTTATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTCTTGA
+AGGCTTCCCATTCATTCAGGAACCGCCTTCTGGTGATTTGCAAGAACGCGTACTTATTCGCCACCATGAT
+TATGACCAGTGTTTCCAGTCCGTTCAGTTGTTGCAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTTT
+ATCGCAATCTGCCGACCACTCGCGATTCAATCATGACTTCGTGATAAAAGATTGAGTGTGAGGTTATAAC
+GCCGAAGCGGTAAAAATTTTAATTTTTGCCGCTGAGGGGTTGACCAAGCGAAGCGCGGTAGGTTTTCTGC
+TTAGGAGTTTAATCATGTTTCAGACTTTTATTTCTCGCCATAATTCAAACTTTTTTTCTGATAAGCTGGT
+TCTCACTTCTGTTACTCCAGCTTCTTCGGCACCTGTTTTACAGACACCTAAAGCTACATCGTCAACGTTA
+TATTTTGATAGTTTGACGGTTAATGCTGGTAATGGTGGTTTTCTTCATTGCATTCAGATGGATACATCTG
+TCAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGC
+CTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTCCGACTACCCTCCCGACTGCCTATGATGTTTATCCTTTG
+AATGGTCGCCATGATGGTGGTTATTATACCGTCAAGGACTGTGTGACTATTGACGTCCTTCCCCGTACGC
+CGGGCAATAATGTTTATGTTGGTTTCATGGTTTGGTCTAACTTTACCGCTACTAAATGCCGCGGATTGGT
+TTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTG
+CTATTGCTGGCGGTATTGCTTCTGCTCTTGCTGGTGGCGCCATGTCTAAATTGTTTGGAGGCGGTCAAAA
+AGCCGCCTCCGGTGGCATTCAAGGTGATGTGCTTGCTACCGATAACAATACTGTAGGCATGGGTGATGCT
+GGTATTAAATCTGCCATTCAAGGCTCTAATGTTCCTAACCCTGATGAGGCCGCCCCTAGTTTTGTTTCTG
+GTGCTATGGCTAAAGCTGGTAAAGGACTTCTTGAAGGTACGTTGCAGGCTGGCACTTCTGCCGTTTCTGA
+TAAGTTGCTTGATTTGGTTGGACTTGGTGGCAAGTCTGCCGCTGATAAAGGAAAGGATACTCGTGATTAT
+CTTGCTGCTGCATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGCTGATGCTTCCTCTGCTGGTATGG
+TTGACGCCGGATTTGAGAATCAAAAAGAGCTTACTAAAATGCAACTGGACAATCAGAAAGAGATTGCCGA
+GATGCAAAATGAGACTCAAAAAGAGATTGCTGGCATTCAGTCGGCGACTTCACGCCAGAATACGAAAGAC
+CAGGTATATGCACAAAATGAGATGCTTGCTTATCAACAGAAGGAGTCTACTGCTCGCGTTGCGTCTATTA
+TGGAAAACACCAATCTTTCCAAGCAACAGCAGGTTTCCGAGATTATGCGCCAAATGCTTACTCAAGCTCA
+AACGGCTGGTCAGTATTTTACCAATGACCAAATCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGAC
+TTAGTTCATCAGCAAACGCAGAATCAGCGGTATGGCTCTTCTCATATTGGCGCTACTGCAAAGGATATTT
+CTAATGTCGTCACTGATGCTGCTTCTGGTGTGGTTGATATTTTTCATGGTATTGATAAAGCTGTTGCCGA
+TACTTGGAACAATTTCTGGAAAGACGGTAAAGCTGATGGTATTGGCTCTAATTTGTCTAGGAAATAACCG
+TCAGGATTGACACCCTCCCAATTGTATGTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATGGTTCGTT
+CTTATTACCCTTCTGAATGTCACGCTGATTATTTTGACTTTGAGCGTATCGAGGCTCTTAAACCTGCTAT
+TGAGGCTTGTGGCATTTCTACTCTTTCTCAATCCCCAATGCTTGGCTTCCATAAGCAGATGGATAACCGC
+ATCAAGCTCTTGGAAGAGATTCTGTCTTTTCGTATGCAGGGCGTTGAGTTCGATAATGGTGATATGTATG
+TTGACGGCCATAAGGCTGCTTCTGACGTTCGTGATGAGTTTGTATCTGTTACTGAGAAGTTAATGGATGA
+ATTGGCACAATGCTACAATGTGCTCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGG
+GACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAGTTTTGCCGCAAGCTGGCTGCTGAACGCC
+CTCTTAAGGATATTCGCGATGAGTATAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATT
+GCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTACTATTCAGCGTTTGATGAATGCAATGCGACAG
+GCTCATGCTGATGGTTGGTTTATCGTTTTTGACACTCTCACGTTGGCTGACGACCGATTAGAGGCGTTTT
+ATGATAATCCCAATGCTTTGCGTGACTATTTTCGTGATATTGGTCGTATGGTTCTTGCTGCCGAGGGTCG
+CAAGGCTAATGATTCACACGCCGACTGCTATCAGTATTTTTGTGTGCCTGAGTATGGTACAGCTAATGGC
+CGTCTTCATTTCCATGCGGTGCATTTTATGCGGACACTTCCTACAGGTAGCGTTGACCCTAATTTTGGTC
+GTCGGGTACGCAATCGCCGCCAGTTAAATAGCTTGCAAAATACGTGGCCTTATGGTTACAGTATGCCCAT
+CGCAGTTCGCTACACGCAGGACGCTTTTTCACGTTCTGGTTGGTTGTGGCCTGTTGATGCTAAAGGTGAG
+CCGCTTAAAGCTACCAGTTATATGGCTGTTGGTTTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATA
+TGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTCGCTACT
+TCCCAAGAAGCTGTTCAGAATCAGAATGAGCCGCAACTTCGGGATGAAAATGCTCACAATGACAAATCTG
+TCCACGGAGTGCTTAATCCAACTTACCAAGCTGGGTTACGACGCGACGCCGTTCAACCAGATATTGAAGC
+AGAACGCAAAAAGAGAGATGAGATTGAGGCTGGGAAAAGTTACTGTAGCCGACGTTTTGGCGGCGCAACC
+TGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCA
+
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_input1.bam
Binary file test-data/picard_ARRG_input1.bam has changed
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_input1.sam
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_ARRG_input1.sam Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,25 @@
+@HD VN:1.0 SO:coordinate
+@SQ SN:chr1 LN:10001
+@SQ SN:chr2 LN:100001
+@SQ SN:chr3 LN:10001
+@SQ SN:chr4 LN:1001
+@RG ID:rg1 SM:s1
+@RG ID:rg2 SM:s3
+bar:record:4 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg1
+bar:record:6 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg2
+bar:record:1 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg1
+bar:record:3 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg2
+bar:record:1 141 chr1 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg1
+bar:record:7 77 chr1 20 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg2
+bar:record:8 77 chr1 30 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg2
+bar:record:4 141 chr1 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg1
+bar:record:5 77 chr1 40 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg2
+bar:record:6 141 chr1 50 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg2
+bar:record:2 77 chr2 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg1
+bar:record:2 141 chr2 30 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg2
+bar:record:3 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg1
+bar:record:8 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg1
+bar:record:5 141 chr3 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg1
+bar:record:9 77 chr4 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:rg1
+bar:record:7 141 chr4 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg1
+bar:record:9 141 chr4 60 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:rg1
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_input2.sam
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_ARRG_input2.sam Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,23 @@
+@HD VN:1.0 SO:coordinate
+@SQ SN:chr1 LN:10001
+@SQ SN:chr2 LN:100001
+@SQ SN:chr3 LN:10001
+@SQ SN:chr4 LN:1001
+bar:record:4 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:6 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:1 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:3 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:1 141 chr1 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
+bar:record:7 77 chr1 20 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:8 77 chr1 30 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:4 141 chr1 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
+bar:record:5 77 chr1 40 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:6 141 chr1 50 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
+bar:record:2 77 chr2 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:2 141 chr2 30 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
+bar:record:3 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
+bar:record:8 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
+bar:record:5 141 chr3 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
+bar:record:9 77 chr4 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111
+bar:record:7 141 chr4 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
+bar:record:9 141 chr4 60 0 * * 0 0 CCCCCCCCCCCCC 2222222222222
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_output1.sam
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_ARRG_output1.sam Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,24 @@
+@HD VN:1.0 SO:coordinate
+@SQ SN:chr1 LN:10001
+@SQ SN:chr2 LN:100001
+@SQ SN:chr3 LN:10001
+@SQ SN:chr4 LN:1001
+@RG ID:one PL:illumina PU:peaewe LB:lib SM:sam1
+bar:record:4 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:6 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:1 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:3 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:1 141 chr1 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
+bar:record:7 77 chr1 20 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:8 77 chr1 30 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:4 141 chr1 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
+bar:record:5 77 chr1 40 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:6 141 chr1 50 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
+bar:record:2 77 chr2 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:2 141 chr2 30 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
+bar:record:3 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
+bar:record:8 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
+bar:record:5 141 chr3 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
+bar:record:9 77 chr4 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:one
+bar:record:7 141 chr4 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
+bar:record:9 141 chr4 60 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:one
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_output2.bam
Binary file test-data/picard_ARRG_output2.bam has changed
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_output2.sam
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_ARRG_output2.sam Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,24 @@
+@HD VN:1.0 SO:coordinate
+@SQ SN:chr1 LN:10001
+@SQ SN:chr2 LN:100001
+@SQ SN:chr3 LN:10001
+@SQ SN:chr4 LN:1001
+@RG ID:M5 PL:IL PU:PLAT LB:LIB DS:description with spaces SM:smp CN:FamousCenter
+bar:record:4 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:6 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:1 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:3 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:1 141 chr1 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
+bar:record:7 77 chr1 20 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:8 77 chr1 30 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:4 141 chr1 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
+bar:record:5 77 chr1 40 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:6 141 chr1 50 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
+bar:record:2 77 chr2 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:2 141 chr2 30 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
+bar:record:3 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
+bar:record:8 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
+bar:record:5 141 chr3 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
+bar:record:9 77 chr4 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M5
+bar:record:7 141 chr4 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
+bar:record:9 141 chr4 60 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M5
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_output3.bam
Binary file test-data/picard_ARRG_output3.bam has changed
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_output3.bam.bai
Binary file test-data/picard_ARRG_output3.bam.bai has changed
diff -r 000000000000 -r ff4ec13e496e test-data/picard_ARRG_output3.sam
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_ARRG_output3.sam Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,24 @@
+@HD VN:1.0 SO:coordinate
+@SQ SN:chr1 LN:10001
+@SQ SN:chr2 LN:100001
+@SQ SN:chr3 LN:10001
+@SQ SN:chr4 LN:1001
+@RG ID:M6 PL:IL PU:PLAT LB:LIB SM:smp1
+bar:record:4 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:6 77 chr1 1 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:1 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:3 77 chr1 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:1 141 chr1 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
+bar:record:7 77 chr1 20 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:8 77 chr1 30 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:4 141 chr1 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
+bar:record:5 77 chr1 40 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:6 141 chr1 50 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
+bar:record:2 77 chr2 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:2 141 chr2 30 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
+bar:record:3 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
+bar:record:8 141 chr3 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
+bar:record:5 141 chr3 40 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
+bar:record:9 77 chr4 10 0 * * 0 0 AAAAAAAAAAAAA 1111111111111 RG:Z:M6
+bar:record:7 141 chr4 20 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
+bar:record:9 141 chr4 60 0 * * 0 0 CCCCCCCCCCCCC 2222222222222 RG:Z:M6
diff -r 000000000000 -r ff4ec13e496e test-data/picard_BIS_input1.bam
Binary file test-data/picard_BIS_input1.bam has changed
diff -r 000000000000 -r ff4ec13e496e test-data/picard_BIS_input1.sam
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_BIS_input1.sam Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,18 @@
+@HD VN:1.0 SO:coordinate
+@SQ SN:chr1 LN:101
+@SQ SN:chr7 LN:404
+@SQ SN:chr8 LN:202
+@SQ SN:chr10 LN:303
+@SQ SN:chr14 LN:505
+@RG ID:0 SM:Hi,Mom!
+@RG ID:1 SM:samplesample DS:ClearDescription
+@PG ID:1 PN:Hey! VN:2.0
+@CO Just a generic comment to make the header longer
+both_reads_align_clip_marked 83 chr7 1 255 101M = 302 201 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))&&'&*/)-&*-)&.-)&)&),/-&&..)./.,.).*&&,&.&&-)&&&0*&&&&&&&&/32/,01460&&/6/*0*/2/283//36868/& RG:Z:0
+both_reads_present_only_first_aligns 89 chr7 1 255 101M * 0 0 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))&&'&*/)-&*-)&.-)&)&),/-&&..)./.,.).*&&,&.&&-)&&&0*&&&&&&&&/32/,01460&&/6/*0*/2/283//36868/& RG:Z:0
+read_2_too_many_gaps 83 chr7 1 255 101M = 302 201 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))&&'&*/)-&*-)&.-)&)&),/-&&..)./.,.).*&&,&.&&-)&&&0*&&&&&&&&/32/,01460&&/6/*0*/2/283//36868/& RG:Z:0
+both_reads_align_clip_adapter 147 chr7 16 255 101M = 21 -96 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))&&'&*/)-&*-)&.-)&)&),/-&&..)./.,.).*&&,&.&&-)&&&0*&&&&&&&&/32/,01460&&/6/*0*/2/283//36868/& RG:Z:0
+both_reads_align_clip_adapter 99 chr7 21 255 101M = 16 96 CAACAGAAGCNGGNATCTGTGTTTGTGTTTCGGATTTCCTGCTGAANNGNTTNTCGNNTCNNNNNNNNATCCCGATTTCNTTCCGCAGCTNACCTCCCAAN )'.*.+2,))&&'&*/)-&*-)&.-)&)&),/-&&..)./.,.).*&&,&.&&-)&&&0*&&&&&&&&/32/,01460&&/6/*0*/2/283//36868/& RG:Z:0
+both_reads_align_clip_marked 163 chr7 302 255 101M = 1 -201 NCGCGGCATCNCGATTTCTTTCCGCAGCTAACCTCCCGACAGATCGGCAGCGCGTCGTGTAGGTTATTATGGTACATCTTGTCGTGCGGCNAGAGCATACA &/15445666651/566666553+2/14/&/555512+3/)-'/-&-'*+))*''13+3)'//++''/'))/3+&*5++)&'2+&+/*&-&&*)&-./1'1 RG:Z:0
+read_2_too_many_gaps 163 chr7 302 255 10M1D10M5I76M = 1 -201 NCGCGGCATCNCGATTTCTTTCCGCAGCTAACCTCCCGACAGATCGGCAGCGCGTCGTGTAGGTTATTATGGTACATCTTGTCGTGCGGCNAGAGCATACA &/15445666651/566666553+2/14/&/555512+3/)-'/-&-'*+))*''13+3)'//++''/'))/3+&*5++)&'2+&+/*&-&&*)&-./1'1 RG:Z:0
+both_reads_present_only_first_aligns 165 * 0 0 * chr7 1 0 NCGCGGCATCNCGATTTCTTTCCGCAGCTAACCTCCCGACAGATCGGCAGCGCGTCGTGTAGGTTATTATGGTACATCTTGTCGTGCGGCNAGAGCATACA &/15445666651/566666553+2/14/&/555512+3/)-'/-&-'*+))*''13+3)'//++''/'))/3+&*5++)&'2+&+/*&-&&*)&-./1'1 RG:Z:0
diff -r 000000000000 -r ff4ec13e496e test-data/picard_BIS_output1.txt
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_BIS_output1.txt Tue Oct 23 10:49:35 2012 -0400
@@ -0,0 +1,39 @@
+
+
+
+
+
+
+
+
+
+
+
+Galaxy tool BamIndexStats run at 12/05/2011 14:18:06 The following output files were created (click the filename to view/download a copy):
+Galaxy tool CollectAlignmentSummaryMetrics run at 11/11/2011 08:07:27 The following output files were created (click the filename to view/download a copy):
+Galaxy tool CollectAlignmentSummaryMetrics run at 11/11/2011 08:07:10 The following output files were created (click the filename to view/download a copy):
+ Log of activity
+
+ ## executing java -Xmx2g -jar /share/shared/relul.galaxy/tool-data/shared/jars/MarkDuplicates.jar I= /share/shared/relul.galaxy/database/files/000/dataset_57.dat O= /share/shared/relul.galaxy/database/files/000/dataset_99.dat M= rgPicardMarkDupsMetrics.txt READ_NAME_REGEX="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 returned status 1 and log (stdout/stderr) records:
+
+[Fri Nov 19 18:25:23 EST 2010] net.sf.picard.sam.MarkDuplicates INPUT=/share/shared/relul.galaxy/database/files/000/dataset_57.dat OUTPUT=/share/shared/relul.galaxy/database/files/000/dataset_99.dat METRICS_FILE=rgPicardMarkDupsMetrics.txt READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 TMP_DIR=/tmp/relul VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
+
+INFO 2010-11-19 18:25:23 MarkDuplicates Start of doWork freeMemory: 8645600; totalMemory: 9109504; maxMemory: 1908932608
+
+INFO 2010-11-19 18:25:23 MarkDuplicates Reading input file and constructing read end information.
+
+INFO 2010-11-19 18:25:23 MarkDuplicates Will retain up to 7575129 data points before spilling to disk.
+
+[Fri Nov 19 18:25:23 EST 2010] net.sf.picard.sam.MarkDuplicates done.
+
+Runtime.totalMemory()=130351104
+
+Exception in thread "main" net.sf.picard.PicardException: /share/shared/relul.galaxy/database/files/000/dataset_57.dat is not coordinate sorted.
+
+ at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:248)
+
+ at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:109)
+
+ at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:165)
+
+ at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:93)
+
+
+
+
+
+
+
+
+Note: The freely available
+ Picard software
+ generated all outputs reported here. These third party tools were orchestrated by the Galaxy
+ rgPicardMarkDups.py wrapper and this command line from the Galaxy form:
+
+Galaxy Rgenetics tool output rgPicardValidate.py run at 19/04/2011 11:19:17 Running this Galaxy tool produced the following output files (click the filename to view/download a copy).
+['WARNING: Record 1, Read name both_reads_align_clip_marked, NM tag (nucleotide differences) is missing\n']
+
+['WARNING: Record 2, Read name both_reads_present_only_first_aligns, NM tag (nucleotide differences) is missing\n']
+
+['WARNING: Record 3, Read name read_2_too_many_gaps, NM tag (nucleotide differences) is missing\n']
+
+['ERROR: Record 4, Read name both_reads_align_clip_adapter, The record is out of [queryname] order, prior read name [read_2_too_many_gaps], prior coodinates [1:1]\n']
+
+['WARNING: Record 4, Read name both_reads_align_clip_adapter, NM tag (nucleotide differences) is missing\n']
+
+['WARNING: Record 5, Read name both_reads_align_clip_adapter, NM tag (nucleotide differences) is missing\n']
+
+['WARNING: Record 6, Read name both_reads_align_clip_marked, NM tag (nucleotide differences) is missing\n']
+
+['WARNING: Record 7, Read name read_2_too_many_gaps, NM tag (nucleotide differences) is missing\n']
+
+['ERROR: Record 8, Read name both_reads_present_only_first_aligns, The record is out of [queryname] order, prior read name [read_2_too_many_gaps], prior coodinates [1:302]\n']
+
+Picard log
+
+## executing samtools sort /udd/rerla/rgalaxy/database/job_working_directory/98/dataset_100_files/tmpELItj4rgSortBamTemp.bam /udd/rerla/rgalaxy/database/job_working_directory/98/dataset_100_files/rgcleansam.sorted returned status 0. Nothing appeared on stderr/stdout
+
+rectory/98/dataset_100_files/rgPicardValidate.out IGNORE=INVALID_TAG_NM MAX_OUTPUT=100 TMP_DIR=/tmp returned status 1 and log (stdout/stderr) records:
+[Tue Apr 19 11:19:17 EDT 2011] net.sf.picard.sam.ValidateSamFile INPUT=/udd/rerla/rgalaxy/database/job_working_directory/98/dataset_100_files/rgcleansam.sorted.bam OUTPUT=/udd/rerla/rgalaxy/database/job_working_directory/98/dataset_100_files/rgPicardValidate.out IGNORE=[INVALID_TAG_NM] MAX_OUTPUT=100 REFERENCE_SEQUENCE=/share/shared/data/hg18/hg18.fasta TMP_DIR=/tmp MODE=VERBOSE IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
+[Tue Apr 19 11:19:17 EDT 2011] net.sf.picard.sam.ValidateSamFile done.
+Runtime.totalMemory()=9109504
+
+
+
+The freely available Picard software
+generated all outputs reported here, using this command line:
+