diff kraken-filter.xml @ 2:317726be0703 draft

planemo upload for repository https://github.com/galaxyproject/tools-devteam/blob/master/tool_collections/kraken/kraken_filter/ commit cb6ebb843c71dcfc73aa05cc616f8e3229170108-dirty
author devteam
date Wed, 15 Jul 2015 15:22:22 -0400
parents f093ba52debe
children 7fb926851f66
--- a/kraken-filter.xml	Tue May 19 16:42:21 2015 -0400
+++ b/kraken-filter.xml	Wed Jul 15 15:22:22 2015 -0400
@@ -1,6 +1,6 @@
-<tool id="kraken-filter" name="Filter Kraken" version="1.0.0">
+<tool id="kraken-filter" name="Kraken-filter" version="1.1.0">
     <description>
-        by confidence score
+        filter classification by confidence score
     </description>
     <macros>
         <import>macros.xml</import>
@@ -12,8 +12,8 @@
         ]]>
     </command>
     <inputs>
-        <param format="tabular" label="Kraken classified output" name="input" type="data" />
-        <param label="Confidence threshold" max="1" min="0" name="threshold" type="float" value="0" />
+        <param format="tabular" label="Kraken output" name="input" type="data" help="Select taxonomy classification produced by kraken"/>
+        <param label="Confidence threshold" max="1" min="0" name="threshold" type="float" value="0" help="--threshold; A number between 0 and 1; default=0"/>
         <expand macro="input_database" />
     </inputs>
     <outputs>
@@ -22,18 +22,27 @@
     <help>
 <![CDATA[
 
-***Note that the database used must be the same as the one used to generate
-the output file, or the report script may encounter problems.***
+.. class:: warningmark
+
+**Note**: the database used must be the same as the one used in the original Kraken run
 
-A sequence label's score is a fraction C/Q, where C is the number of k-mers mapped to LCA values in the clade rooted at the label, and Q is the number of k-mers in the sequence that lack an ambiguous nucleotide (i.e., they were queried against the database). Consider the example of the LCA mappings in Kraken's output given earlier:
+-----
+
+**What it does**
+
+At present, we have not yet developed a confidence score with a solid probabilistic interpretation for Kraken. However, we have developed a simple scoring scheme that has yielded good results for us, and we've made that available in the kraken-filter script. The approach we use allows a user to specify a threshold score in the [0,1] interval; the ``kraken-filter`` script then will adjust labels up the tree until the label's score (described below) meets or exceeds that threshold. If a label at the root of the taxonomic tree would not have a score exceeding the threshold, the sequence is called unclassified by ``kraken-filter``.
 
-"562:13 561:4 A:31 0:1 562:3" would indicate that:
+A sequence label's score is a fraction C/Q, where C is the number of k-mers mapped to LCA values in the clade rooted at the label, and Q is the number of k-mers in the sequence that lack an ambiguous nucleotide (i.e., they were queried against the database). Consider the example of the LCA mappings in Kraken's output::
+
+ 562:13 561:4 A:31 0:1 562:3 
+
+would indicate that::
 
-    the first 13 k-mers mapped to taxonomy ID #562
-    the next 4 k-mers mapped to taxonomy ID #561
-    the next 31 k-mers contained an ambiguous nucleotide
-    the next k-mer was not in the database
-    the last 3 k-mers mapped to taxonomy ID #562
+ the first 13 k-mers mapped to taxonomy ID #562
+ the next 4 k-mers mapped to taxonomy ID #561
+ the next 31 k-mers contained an ambiguous nucleotide
+ the next k-mer was not in the database
+ the last 3 k-mers mapped to taxonomy ID #562
 
 In this case, ID #561 is the parent node of #562. Here, a label of #562 for this sequence would have a score of C/Q = (13+3)/(13+4+1+3) = 16/21. A label of #561 would have a score of C/Q = (13+4+3)/(13+4+1+3) = 20/21. If a user specified a threshold over 16/21, kraken-filter would adjust the original label from #562 to #561; if the threshold was greater than 20/21, the sequence would become unclassified.
     ]]>
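
Note on the scoring example in the help text above: the C/Q arithmetic can be checked with a short Python sketch. This is not the kraken-filter implementation, only an illustration of the rule as the help text describes it. The function names (`parse_lca`, `in_clade`, `score`, `filter_label`) and the three-node `parent` map (with #1 as the root directly above #561) are hypothetical and exist only for this example; a real run would use the full taxonomy of the Kraken database.

```python
from fractions import Fraction

def parse_lca(lca_string):
    """Split Kraken's LCA mapping field into (taxid, k-mer count) pairs.
    'A' marks k-mers with an ambiguous nucleotide, '0' marks k-mers not in the database."""
    pairs = []
    for token in lca_string.split():
        taxid, count = token.split(":")
        pairs.append((taxid, int(count)))
    return pairs

def in_clade(taxid, root, parent):
    """Return True if taxid is root itself or a descendant of root."""
    while taxid is not None:
        if taxid == root:
            return True
        taxid = parent.get(taxid)
    return False

def score(label, pairs, parent):
    """C/Q: C counts k-mers mapped inside the clade rooted at label,
    Q counts every k-mer that was queried (everything except 'A')."""
    q = sum(n for t, n in pairs if t != "A")
    c = sum(n for t, n in pairs
            if t not in ("A", "0") and in_clade(t, label, parent))
    return Fraction(c, q) if q else Fraction(0)

def filter_label(label, pairs, parent, threshold):
    """Walk the label up the tree until its score meets or exceeds the threshold.
    Returns '0' (unclassified) if even the root falls short."""
    while label is not None:
        if score(label, pairs, parent) >= threshold:
            return label
        label = parent.get(label)
    return "0"

# Hypothetical toy taxonomy: child -> parent. Only 562 -> 561 comes from the help text.
parent = {"562": "561", "561": "1"}
pairs = parse_lca("562:13 561:4 A:31 0:1 562:3")

print(score("562", pairs, parent))  # 16/21
print(score("561", pairs, parent))  # 20/21
print(filter_label("562", pairs, parent, 0.8))
```

With a threshold of 0.8 (above 16/21, below 20/21) the sketch moves the label from #562 to #561, and with a threshold above 20/21 it returns "0", matching the behaviour described in the help text.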