--- a/README.md	Tue May 19 16:41:06 2015 -0400
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,88 +0,0 @@
-Introduction
-============
-
-[Kraken] is a taxonomic sequence classifier that assigns taxonomic
-labels to short DNA reads. It does this by examining the $k$-mers
-within a read and querying a database with those $k$-mers. This database
-contains a mapping of every $k$-mer in [Kraken]'s genomic library to the
-lowest common ancestor (LCA) in a taxonomic tree of all genomes that
-contain that $k$-mer. The set of LCA taxa that correspond to the $k$-mers
-in a read are then analyzed to create a single taxonomic label for the
-read; this label can be any of the nodes in the taxonomic tree.
-[Kraken] is designed to be rapid, sensitive, and highly precise. Our
-tests on various real and simulated data have shown [Kraken] to have
-sensitivity slightly lower than Megablast with precision being slightly
-higher. On a set of simulated 100 bp reads, [Kraken] processed over 1.3
-million reads per minute on a single core in normal operation, and over
-4.1 million reads per minute in quick operation.
-
-The latest released version of Kraken will be available at the
-[Kraken website], and the latest updates to the Kraken source code
-are available at the [Kraken GitHub repository].
-
-If you use [Kraken] in your research, please cite the [Kraken paper].
-Thank you!
-
-[Kraken]:                     http://ccb.jhu.edu/software/kraken/
-[Kraken website]:             http://ccb.jhu.edu/software/kraken/
-[Kraken paper]:               http://genomebiology.com/2014/15/3/R46
-[Kraken GitHub repository]:   https://github.com/DerrickWood/kraken
-
-
-System Requirements
-===================
-
-Note: Users concerned about the disk or memory requirements should
-read the paragraph about MiniKraken, below.
-
-* **Disk space**: Construction of Kraken's standard database will require at
-    least 160 GB of disk space. Customized databases may require
-    more or less space.  Disk space used is linearly proportional
-    to the number of distinct $k$-mers; as of Feb. 2015, Kraken's
-    default database contains just under 6 billion (6e9) distinct $k$-mers.
-
-    In addition, the disk used to store the database should be
-    locally-attached storage. Storing the database on a network
-    filesystem (NFS) partition can cause Kraken's operation to be
-    very slow, or to be stopped completely. As NFS accesses are
-    much slower than local disk accesses, both preloading and database
-    building will be slowed by use of NFS.
-
-* **Memory**: To run efficiently, Kraken requires enough free memory to
-    hold the database in RAM. While this can be accomplished using a
-    ramdisk, Kraken supplies a utility for loading the database into
-    RAM via the OS cache. The default database size is 75 GB (as of
-    Feb. 2015), and so you will need at least that much RAM if you want
-    to build or run with the default database.
-
-* **Dependencies**: Kraken currently makes extensive use of Linux utilities
-    such as sed, find, and wget. Many scripts are written using the
-    Bash shell, and the main scripts are written using Perl. Core
-    programs needed to build the database and run the classifier are
-    written in C++, and need to be compiled using g++.  Multithreading
-    is handled using OpenMP.  Downloads of NCBI data are performed by
-    wget and in some cases, by rsync.  Most Linux systems that have any
-    sort of development package installed will have all of the above
-    listed programs and libraries available.
-
-    Finally, if you want to build your own database, you will need to
-    install the [Jellyfish] $k$-mer counter.  Note that Kraken only
-    supports use of Jellyfish version 1.  Jellyfish version 2 is not
-    yet compatible with Kraken.
-
-* **Network connectivity**: Kraken's standard database build and download
-    commands expect unfettered FTP and rsync access to the NCBI FTP
-    server. If you're working behind a proxy, you may need to set
-    certain environment variables (such as `ftp_proxy` or `RSYNC_PROXY`)
-    in order to get these commands to work properly.
-
-* **MiniKraken**: To allow users with low-memory computing environments to
-    use Kraken, we supply a reduced standard database that can be
-    downloaded from the Kraken web site. When Kraken is run with a
-    reduced database, we call it MiniKraken.
-
-    The database we make available is only 4 GB in size, and should
-    run well on computers with as little as 8 GB of RAM. Disk space
-    required for this database is also only 4 GB.
-
-[Jellyfish]:  http://www.cbcb.umd.edu/software/jellyfish/
\ No newline at end of file
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README.rst	Wed Jul 15 14:59:31 2015 -0400
@@ -0,0 +1,92 @@
+Introduction
+============
+
+`Kraken <http://ccb.jhu.edu/software/kraken/>`__ is a taxonomic sequence
+classifier that assigns taxonomic labels to short DNA reads. It does
+this by examining the :math:`k`-mers within a read and querying a
+database with those :math:`k`-mers. This database contains a mapping of
+every :math:`k`-mer in
+`Kraken <http://ccb.jhu.edu/software/kraken/>`__'s genomic library to
+the lowest common ancestor (LCA) in a taxonomic tree of all genomes that
+contain that :math:`k`-mer. The set of LCA taxa that correspond to the
+:math:`k`-mers in a read are then analyzed to create a single taxonomic
+label for the read; this label can be any of the nodes in the taxonomic
+tree. `Kraken <http://ccb.jhu.edu/software/kraken/>`__ is designed to be
+rapid, sensitive, and highly precise. Our tests on various real and
+simulated data have shown
+`Kraken <http://ccb.jhu.edu/software/kraken/>`__ to have sensitivity
+slightly lower than Megablast with precision being slightly higher. On a
+set of simulated 100 bp reads,
+`Kraken <http://ccb.jhu.edu/software/kraken/>`__ processed over 1.3
+million reads per minute on a single core in normal operation, and over
+4.1 million reads per minute in quick operation.
+
+The latest released version of Kraken will be available at the `Kraken
+website <http://ccb.jhu.edu/software/kraken/>`__, and the latest updates
+to the Kraken source code are available at the `Kraken GitHub
+repository <https://github.com/DerrickWood/kraken>`__.
+
+If you use `Kraken <http://ccb.jhu.edu/software/kraken/>`__ in your
+research, please cite the `Kraken
+paper <http://genomebiology.com/2014/15/3/R46>`__. Thank you!
+
+System Requirements
+===================
+
+Note: Users concerned about the disk or memory requirements should read
+the paragraph about MiniKraken, below.
+
+-  **Disk space**: Construction of Kraken's standard database will
+   require at least 160 GB of disk space. Customized databases may
+   require more or less space. Disk space used is linearly proportional
+   to the number of distinct :math:`k`-mers; as of Feb. 2015, Kraken's
+   default database contains just under 6 billion (6e9) distinct
+   :math:`k`-mers.
+
+   In addition, the disk used to store the database should be
+   locally-attached storage. Storing the database on a network
+   filesystem (NFS) partition can cause Kraken's operation to be very
+   slow, or to be stopped completely. As NFS accesses are much slower
+   than local disk accesses, both preloading and database building will
+   be slowed by use of NFS.
+
+-  **Memory**: To run efficiently, Kraken requires enough free memory to
+   hold the database in RAM. While this can be accomplished using a
+   ramdisk, Kraken supplies a utility for loading the database into RAM
+   via the OS cache. The default database size is 75 GB (as of Feb.
+   2015), and so you will need at least that much RAM if you want to
+   build or run with the default database.
+
+-  **Dependencies**: Kraken currently makes extensive use of Linux
+   utilities such as sed, find, and wget. Many scripts are written using
+   the Bash shell, and the main scripts are written using Perl. Core
+   programs needed to build the database and run the classifier are
+   written in C++, and need to be compiled using g++. Multithreading is
+   handled using OpenMP. Downloads of NCBI data are performed by wget
+   and in some cases, by rsync. Most Linux systems that have any sort of
+   development package installed will have all of the above listed
+   programs and libraries available.
+
+   Finally, if you want to build your own database, you will need to
+   install the
+   `Jellyfish <http://www.cbcb.umd.edu/software/jellyfish/>`__
+   :math:`k`-mer counter. Note that Kraken only supports use of
+   Jellyfish version 1. Jellyfish version 2 is not yet compatible with
+   Kraken.
+
+-  **Network connectivity**: Kraken's standard database build and
+   download commands expect unfettered FTP and rsync access to the NCBI
+   FTP server. If you're working behind a proxy, you may need to set
+   certain environment variables (such as ``ftp_proxy`` or
+   ``RSYNC_PROXY``) in order to get these commands to work properly.
+
+-  **MiniKraken**: To allow users with low-memory computing environments
+   to use Kraken, we supply a reduced standard database that can be
+   downloaded from the Kraken web site. When Kraken is run with a
+   reduced database, we call it MiniKraken.
+
+   The database we make available is only 4 GB in size, and should run
+   well on computers with as little as 8 GB of RAM. Disk space required
+   for this database is also only 4 GB.
+
+
--- a/kraken.xml	Tue May 19 16:41:06 2015 -0400
+++ b/kraken.xml	Wed Jul 15 14:59:31 2015 -0400
@@ -1,7 +1,7 @@
 <?xml version="1.0"?>
-<tool id="kraken" name="Kraken" version="1.0.0">
+<tool id="kraken" name="Kraken" version="1.1.0">
     <description>
-        assign taxonomic labels to short DNA reads
+        assign taxonomic labels to sequencing reads
     </description>
     <macros>
         <import>macros.xml</import>
@@ -10,17 +10,49 @@
         <![CDATA[
         @SET_DATABASE_PATH@ &&
         kraken --threads \${GALAXY_SLOTS:-1} @INPUT_DATABASE@
+
+        #if $input_sequences.is_of_type( 'fastq' ):
+            --fastq-input
+        #else:
+            --fasta-input
+        #end if
+
+        ${only_classified_output}
+
+        #if str( $quick_operation.quick ) == "yes":
+            --quick
+            --min-hits ${quick_operation.min_hits}
+
+        #end if
+
         "$input_sequences"
+
         #if $split_reads:
             --classified-out "${classified_out}" --unclassified-out "${unclassified_out}"
         #end if
-        --output "${output}" &&
-        kraken-translate --db ${kraken_database.fields.name} "${output}" > "${translated}"
+        --output "${output}"
+        ##kraken-translate --db ${kraken_database.fields.name} "${output}" > "${translated}"
         ]]>
     </command>
     <inputs>
-        <param format="fasta,fastq,fastqsanger" label="Input sequences" name="input_sequences" type="data" />
-        <param label="Output classified and unclassified reads" name="split_reads" type="boolean" />
+        <param format="fasta,fastq" label="Input sequences" name="input_sequences" type="data" help="FASTA or FASTQ datasets"/>
+        <param label="Output classified and unclassified reads?" name="split_reads" type="boolean" help="Sets --unclassified-out and --classified-out"/>
+
+        <conditional name="quick_operation">
+            <param name="quick" type="select" label="Enable quick operation?" help="--quick; Rather than searching all k-mers in a sequence, stop classification after a specified number of database hits">
+                <option value="yes">Yes</option>
+                <option selected="True" value="no">No</option>
+            </param>
+            <when value="yes">
+                <param name="min_hits" type="integer" value="1" label="Number of hits required for classification" help="--min-hits; min-hits will allow you to require multiple hits before declaring a sequence classified, which can be especially useful with custom databases when testing to see if sequences either do or do not belong to a particular genome; default=1"/>
+            </when>
+            <when value="no">
+                <!-- Do absolutely nothing -->
+            </when>
+        </conditional>
+
+        <param name="only_classified_output" type="boolean" checked="False" truevalue="--only-classified-output" falsevalue="" label="Print no Kraken output for unclassified sequences" help="--only-classified-output"/>
+
         <expand macro="input_database" />
     </inputs>
     <outputs>
@@ -30,57 +62,43 @@
         <data format="tabular" label="${tool.name} on ${on_string}: Unclassified reads" name="unclassified_out">
             <filter>(split_reads)</filter>
         </data>
-        <data format="tabular" label="${tool.name} on ${on_string}: Histogram" name="histogram">
-            <filter>(draw_histogram)</filter>
-        </data>
         <data format="tabular" label="${tool.name} on ${on_string}: Classification" name="output" />
-        <data format="tabular" label="${tool.name} on ${on_string}: Translated classification" name="translated" />
+        <!--<data format="tabular" label="${tool.name} on ${on_string}: Translated classification" name="translated" />-->
     </outputs>
     <help>
         <![CDATA[
-        **What it does**
-
-        Kraken is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. It does this by examining the k-mers within a read and querying a database with those k-mers. This database contains a mapping of every k-mer in Kraken's genomic library to the lowest common ancestor (LCA) in a taxonomic tree of all genomes that contain that k-mer. The set of LCA taxa that correspond to the k-mers in a read are then analyzed to create a single taxonomic label for the read; this label can be any of the nodes in the taxonomic tree. Kraken is designed to be rapid, sensitive, and highly precise. Our tests on various real and simulated data have shown Kraken to have sensitivity slightly lower than Megablast with precision being slightly higher. On a set of simulated 100 bp reads, Kraken processed over 1.3 million reads per minute on a single core in normal operation, and over 4.1 million reads per minute in quick operation.
+**What it does**

-        **Usage**
-
-        Kraken classifies a set of sequences (reads) with the commands below:
-
-        kraken --db $DBNAME sequences.fa > sequences.kraken
+Kraken is a taxonomic sequence classifier that assigns taxonomic labels to short DNA reads. It does this by examining the k-mers within a read and querying a database with those k-mers. This database contains a mapping of every k-mer in Kraken's genomic library to the lowest common ancestor (LCA) in a taxonomic tree of all genomes that contain that k-mer. The set of LCA taxa that correspond to the k-mers in a read are then analyzed to create a single taxonomic label for the read; this label can be any of the nodes in the taxonomic tree. Kraken is designed to be rapid, sensitive, and highly precise.

-        or
-
-        kraken --db $DBNAME sequences.fq > sequences.kraken
-
+-----

-        -DBNAME is the name of the Kraken Database to be used.
+**Kraken options**

-        -sequences.fa or sequences.fq is the FASTA or FASTQ input file containing the desired sequences for classification.
-
-        -sequences.kraken is the generated output.
-
+The Galaxy version of Kraken implements the following options::

-
-        **Options**
-
-        The kraken program allows several different sequencing modifiers (parameters):
-
-        **Multithreading:** Use the --threads NUM switch to use multiple threads.
-
-        **Sequence filtering:** Classified or unclassified sequences can be sent to a file for later processing, using the --classified-out and --unclassified-out switches, respectively.
-
-
+
+  --fasta-input             Input is FASTA format
+  --fastq-input             Input is FASTQ format
+  --quick                   Quick operation (use first hit or hits)
+  --min-hits NUM            In quick op., number of hits req'd for classification
+                            NOTE: this is ignored if --quick is not specified
+  --unclassified-out        Print unclassified sequences to filename
+  --classified-out          Print classified sequences to filename

-        **Output Format**
+  --only-classified-output  Print no Kraken output for unclassified sequences
+
+------

-        Each sequence classified by Kraken results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:
+**Output Format**

-        1. "C"/"U": one letter code indicating that the sequence was either classified or unclassified.
-        2. The sequence ID, obtained from the FASTA/FASTQ header.
-        3. The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified.
-        4. The length of the sequence in bp.
+Each sequence classified by Kraken results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are::

-        5. A space-delimited list indicating the LCA mapping of each k-mer in the sequence. For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
+    1. "C"/"U": one letter code indicating that the sequence was either classified or unclassified.
+    2. The sequence ID, obtained from the FASTA/FASTQ header.
+    3. The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified.
+    4. The length of the sequence in bp.
+    5. A space-delimited list indicating the LCA mapping of each k-mer in the sequence. For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
             a) the first 13 k-mers mapped to taxonomy ID #562
             b) the next 4 k-mers mapped to taxonomy ID #561
             c) the next 31 k-mers contained an ambiguous nucleotide
@@ -92,4 +110,4 @@
     <expand macro="stdio" />
     <expand macro="version_command" />
     <expand macro="citations" />
-</tool>
\ No newline at end of file
+</tool>
--- a/tool_dependencies.xml	Tue May 19 16:41:06 2015 -0400
+++ b/tool_dependencies.xml	Wed Jul 15 14:59:31 2015 -0400
@@ -1,6 +1,6 @@
 <?xml version="1.0"?>
 <tool_dependency>
   <package name="kraken" version="0.10.5">
-      <repository changeset_revision="e79fee8f87fa" name="package_kraken_0_10_5" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu" />
+      <repository changeset_revision="3525db901c16" name="package_kraken_0_10_5" owner="iuc" toolshed="https://testtoolshed.g2.bx.psu.edu" />
     </package>
 </tool_dependency>
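
Note on the output format: the help text in kraken.xml describes each output line as five tab-delimited fields, the last being a space-delimited LCA mapping such as "562:13 561:4 A:31 0:1 562:3". A minimal Python sketch of how a downstream script might parse such a line is shown here. The function names, dictionary keys, read ID, and sequence length in the sample line are illustrative assumptions, not part of Kraken or of this changeset.

```python
from typing import Dict, List, Tuple


def parse_lca_field(field: str) -> List[Tuple[str, int]]:
    """Split the LCA-mapping field into (label, k-mer count) pairs.

    Labels are kept as strings because, per the help text, a label may be
    a taxonomy ID, "0" (no database hit), or "A" (ambiguous nucleotide).
    """
    pairs = []
    for token in field.split():
        label, count = token.rsplit(":", 1)
        pairs.append((label, int(count)))
    return pairs


def parse_kraken_line(line: str) -> Dict[str, object]:
    """Parse one five-field, tab-delimited Kraken output line."""
    status, seq_id, tax_id, length, lca = line.rstrip("\n").split("\t", 4)
    return {
        "classified": status == "C",      # "C" or "U"
        "sequence_id": seq_id,            # from the FASTA/FASTQ header
        "taxonomy_id": int(tax_id),       # 0 if unclassified
        "length": int(length),            # sequence length in bp
        "lca_mapping": parse_lca_field(lca),
    }


if __name__ == "__main__":
    # Hypothetical line: the read ID and length are made up for illustration;
    # the LCA field is the example given in the help text.
    sample = "C\tread_1\t562\t82\t562:13 561:4 A:31 0:1 562:3"
    record = parse_kraken_line(sample)
    for label, count in record["lca_mapping"]:
        print(label, count)
```

Keeping the labels as strings avoids a special case for the "A" marker; a caller that needs numeric taxonomy IDs can filter it out before converting.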