Mercurial > repos > fubar > htseq_bams_to_count_matrix

--- a/htseqsams2mx.xml	Wed Apr 29 12:06:59 2015 -0400
+++ b/htseqsams2mx.xml	Mon May 04 22:47:20 2015 -0400
@@ -1,19 +1,19 @@
 <tool id="htseqsams2mxlocal" name="SAM/BAM to count matrix" version="0.5">
   <description>using HTSeq code</description>
+  <requirements>
+      <requirement type="package" version="0.7.6">pysam</requirement>
+      <requirement type="package" version="1.2.1">matplotlib</requirement>
+      <requirement type="package" version="0.5.4p3">htseq</requirement>
+  </requirements>
   <stdio>
    <regex match=".*" source="both" level="warning" description="chatter from HTSeq:"/>
   </stdio>
-  <requirements>
-      <requirement type="package" version="0.7.6">pysam</requirement>
-      <requirement type="package" version="1.2.1">matplotlib</requirement>
-      <requirement type="package" version="0.5.4p3">htseq</requirement>
-  </requirements>
   <command interpreter="python">
     htseqsams2mx.py -g "$gfffile" -o "$outfile" -m "$model" --id_attribute "$id_attr" --feature_type "$feature_type"
-    --mapqMin $mapqMin
+    --mapqMin $mapqMin
     #for $s in $samfiles:
       #if $s.ext != 'data':
-        --samf "'${s}','${s.name}','${s.ext}','${s.metadata.bam_index}'"
+        --samf "'${s}','${s.name}','${s.ext}','${s.metadata.bam_index}'"
       #end if
     #end for
     #if $filter_extras:
@@ -22,8 +22,8 @@
   </command>
   <inputs>
     <param format="gtf" name="gfffile" type="data" label="Gene model (GFF) file to count reads over from your current history" size="100" />
-    <param name="mapqMin" label="Filter reads with mapq below than this value"
-    help="0 to count any mapping quality read. Otherwise only reads at or above specified mapq will be counted"
+    <param name="mapqMin" label="Filter reads with mapq below than this value"
+    help="0 to count any mapping quality read. Otherwise only reads at or above specified mapq will be counted"
     type="integer" value="5"/>
     <param name="title" label="Name for this job's output file" type="text" size="80" value="bams to DGE count matrix"/>
     <param name="stranded" value="false" type="boolean" label="Reads are stranded - use strand in counting" display="checkbox"
@@ -33,27 +33,27 @@
         <option value="union" selected="true">union</option>
         <option value="intersection-strict">intersection-strict</option>
         <option value="intersection-nonempty">intersection-nonempty</option>
-    </param>
+    </param>
     <param name="id_attr" type="select" label="GTF attribute to output as the name for each contig - see HTSeq docs"
         help="If in doubt, use gene name or if you need the id in your GTF, gene id">
         <option value="gene_name" selected="true">gene name</option>
         <option value="gene_id">gene id</option>
         <option value="transcript_id">transcript id</option>
         <option value="transcript_name">transcript name</option>
-    </param>
+    </param>
     <param name="feature_type" type="select" label="GTF feature type for counting reads over the supplied gene model- see HTSeq docs"
         help="GTF feature type to count over - exon is a good choice with gene name as the contig to count over">
         <option value="exon" selected="true">exon</option>
         <option value="CDS">CDS</option>
         <option value="UTR">UTR</option>
         <option value="transcript">transcript</option>
-    </param>
+    </param>
     <param name="filter_extras" type="select" label="Filter any read with one or more flags"
         help="eg the XS tag created by bowtie for multiple reads" optional="true" mutliple="true">
         <option value="">None</option>
         <option value="XS">XS:i > 0 - More than one mapping position Bowtie</option>
         <option value="XS:A">Might be useful for tophat</option>
-    </param>
+    </param>

     <param name="samfiles" type="data" label="bam/sam file from your history" format="sam,bam" size="100" multiple="true"/>
   </inputs>
@@ -80,36 +80,36 @@

 Counts reads in multiple sam/bam format mapped files and generates a matrix ideal for edgeR and other count based tools
 It uses HTSeq to count your sam reads over a gene model supplied as a GTF file
-The output is a tabular text (columnar - spreadsheet) file containing the
+The output is a tabular text (columnar - spreadsheet) file containing the
 count matrix for downstream processing. Each row contains the counts from each sample for each
-of the non-emtpy GTF input file contigs matching the GTF attribute choice above.
-You probably want to use gene level GTF output attribute and count reads that overlap
+of the non-emtpy GTF input file contigs matching the GTF attribute choice above.
+You probably want to use gene level GTF output attribute and count reads that overlap
 GTF exons for RNA-seq. Or you can count over exons by using transcript level output names or ids. Etc.

 ----

 **Author's plea on replicates**

-If you want to interpret the downstream p values in terms of rejecting or accepting the null hypothesis
+If you want to interpret the downstream p values in terms of rejecting or accepting the null hypothesis
 under random sampling with replacement from the universe of possible biological/experimental replicates from which your data was derived,
-which is what published p values are often assumed to do, then you need biological
-(or for cell culture material experimental) replicates.
+which is what published p values are often assumed to do, then you need biological
+(or for cell culture material experimental) replicates.

-Using technical or no replicates means the downstream p values are not interpretable the way most people would assume
+Using technical or no replicates means the downstream p values are not interpretable the way most people would assume
 they are - ie as the probability of obtaining a result as or more extreme as your experimental data
 in millions of experiments conducted using the same methods under the null hypothesis.

-There is no way around this and it is scientific fraud to ignore this issue and publish bogus p values derived from
-technical or no replicates without making the lack of biological or experimental error in the p value calculations
+There is no way around this and it is scientific fraud to ignore this issue and publish bogus p values derived from
+technical or no replicates without making the lack of biological or experimental error in the p value calculations
 clear to your readers so they can adjust their expectations. However, the buck stops here at higher level inference.
 If you have no replicates, you must not use this tool as the p values are uninterpretable. So there.

-See your stats 101 notes on the central limit theorem and test statistics for a refresher or talk to a
+See your stats 101 notes on the central limit theorem and test statistics for a refresher or talk to a
 statistician if this makes no sense please.

 **Attribution**

-This Galaxy tool relies on HTSeq_ from http://www-huber.embl.de/users/anders/HTSeq/doc/index.html
+This Galaxy tool relies on HTSeq_ from http://www-huber.embl.de/users/anders/HTSeq/doc/index.html
 for the tricky work of counting. That code includes the following attribution:

 ## Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
@@ -121,7 +121,7 @@

 Otherwise, all code and documentation comprising this tool including the requirement
 for more than one sample bam
-was written by Ross Lazarus and is
+was written by Ross Lazarus and is
 licensed to you under the LGPL_ like other rgenetics artefacts

 Sorry, I don't use readgroups so had no reason to code read groups. Contributions welcome. Send code
@@ -129,5 +129,6 @@
 .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 .. _HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/index.html
   </help>
-
+  <citations>
+  </citations>
 </tool>