comparison README.md @ 0:8bbf903bf0cb draft

Uploaded
author devteam
date Wed, 22 Apr 2015 13:04:06 -0400
parents
children
-1:000000000000 0:8bbf903bf0cb
1 Introduction
2 ============
3
4 [Kraken] is a taxonomic sequence classifier that assigns taxonomic
5 labels to short DNA reads. It does this by examining the $k$-mers
6 within a read and querying a database with those $k$-mers. This database
7 contains a mapping of every $k$-mer in [Kraken]'s genomic library to the
8 lowest common ancestor (LCA) in a taxonomic tree of all genomes that
9 contain that $k$-mer. The set of LCA taxa that correspond to the $k$-mers
10 in a read is then analyzed to create a single taxonomic label for the
11 read; this label can be any of the nodes in the taxonomic tree.
12 [Kraken] is designed to be rapid, sensitive, and highly precise. Our
13 tests on various real and simulated data have shown [Kraken] to have
14 sensitivity slightly lower than Megablast with precision being slightly
15 higher. On a set of simulated 100 bp reads, [Kraken] processed over 1.3
16 million reads per minute on a single core in normal operation, and over
17 4.1 million reads per minute in quick operation.
18
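The lookup-and-label idea described above can be illustrated with a small sketch. This is not Kraken's implementation: the toy taxonomy in `PARENT`, the function names, and the example sequences are all hypothetical, and the scoring rule (pick the taxon whose root-to-node path collects the most $k$-mer hits) is an assumption made for illustration, since the text above only says the LCA taxa are "analyzed". Reverse complements and tie handling between branches are ignored.

```python
from collections import Counter

# Toy taxonomy: child -> parent (hypothetical names; the root is its own parent).
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "Salmonella": "Bacteria",
}

def path_to_root(taxon):
    """Return the taxa from `taxon` up to and including the root."""
    path = [taxon]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path

def lca(a, b):
    """Lowest common ancestor of two taxa in the toy tree."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa do not share a root")

def kmers(seq, k):
    """Yield every length-k substring of `seq`, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_database(genomes, k):
    """Map each k-mer to the LCA of all taxa whose genome contains it."""
    db = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db

def classify(read, db, k):
    """Return one taxon for the read, or None if no k-mer is in the database."""
    hits = Counter(db[kmer] for kmer in kmers(read, k) if kmer in db)
    if not hits:
        return None

    def score(taxon):
        # Hits on the taxon itself plus hits on all of its ancestors.
        return sum(hits[t] for t in path_to_root(taxon))

    # Prefer the highest path score; break ties in favor of the deeper taxon.
    return max(hits, key=lambda t: (score(t), len(path_to_root(t))))

# Hypothetical 9 bp "genomes", far shorter than anything real.
genomes = {"E. coli": "ACGTACGGA", "Salmonella": "ACGTTTGCA"}
db = build_database(genomes, k=4)
label = classify("ACGTACG", db, k=4)
```

In this toy example the $k$-mer `ACGT` occurs in both genomes, so it maps to their common ancestor, while the other $k$-mers in the read are specific to one genome and pull the label down to that leaf.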
19 The latest released version of Kraken is available at the
20 [Kraken website], and the latest updates to the Kraken source code
21 are available at the [Kraken GitHub repository].
22
23 If you use [Kraken] in your research, please cite the [Kraken paper].
24 Thank you!
25
26 [Kraken]: http://ccb.jhu.edu/software/kraken/
27 [Kraken website]: http://ccb.jhu.edu/software/kraken/
28 [Kraken paper]: http://genomebiology.com/2014/15/3/R46
29 [Kraken GitHub repository]: https://github.com/DerrickWood/kraken
30
31
32 System Requirements
33 ===================
34
35 Note: Users concerned about the disk or memory requirements should
36 read the paragraph about MiniKraken, below.
37
38 * **Disk space**: Construction of Kraken's standard database will require at
39 least 160 GB of disk space. Customized databases may require
40 more or less space. Disk space used is linearly proportional
41 to the number of distinct $k$-mers; as of Feb. 2015, Kraken's
42 default database contains just under 6 billion (6e9) distinct $k$-mers.
43
44 In addition, the disk used to store the database should be
45 locally-attached storage. Storing the database on a network
46 filesystem (NFS) partition can make Kraken run very slowly
47 or stall completely. Because NFS accesses are
48 much slower than local disk accesses, both preloading and database
49 building will be slowed by use of NFS.
50
51 * **Memory**: To run efficiently, Kraken requires enough free memory to
52 hold the database in RAM. While this can be accomplished using a
53 ramdisk, Kraken supplies a utility for loading the database into
54 RAM via the OS cache. The default database size is 75 GB (as of
55 Feb. 2015), and so you will need at least that much RAM if you want
56 to build or run with the default database.
57
58 * **Dependencies**: Kraken currently makes extensive use of Linux utilities
59 such as sed, find, and wget. Many scripts are written using the
60 Bash shell, and the main scripts are written using Perl. Core
61 programs needed to build the database and run the classifier are
62 written in C++, and need to be compiled using g++. Multithreading
63 is handled using OpenMP. Downloads of NCBI data are performed by
64 wget and, in some cases, by rsync. Most Linux systems that have any
65 sort of development package installed will have all of the above
66 listed programs and libraries available.
67
68 Finally, if you want to build your own database, you will need to
69 install the [Jellyfish] $k$-mer counter. Note that Kraken only
70 supports use of Jellyfish version 1. Jellyfish version 2 is not
71 yet compatible with Kraken.
72
73 * **Network connectivity**: Kraken's standard database build and download
74 commands expect unfettered FTP and rsync access to the NCBI FTP
75 server. If you're working behind a proxy, you may need to set
76 certain environment variables (such as `ftp_proxy` or `RSYNC_PROXY`)
77 in order to get these commands to work properly.
78
79 * **MiniKraken**: To allow users with low-memory computing environments to
80 use Kraken, we supply a reduced standard database that can be
81 downloaded from the Kraken website. When Kraken is run with a
82 reduced database, we call it MiniKraken.
83
84 The database we make available is only 4 GB in size, and should
85 run well on computers with as little as 8 GB of RAM. Disk space
86 required for this database is also only 4 GB.
87
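As a rough aid for sizing a custom database, the figures in the list above (75 GB and just under 6 billion distinct $k$-mers for the default database, as of Feb. 2015) can be turned into a back-of-the-envelope estimate. The sketch below assumes that size scales strictly in proportion to the number of distinct $k$-mers with no fixed overhead, that 1 GB means 1e9 bytes, and that the $k$-mer count is rounded up to 6e9. The function names are hypothetical. The estimate covers the finished database only, not the extra working space needed during construction.

```python
# Figures quoted in the requirements list (as of Feb. 2015).
DEFAULT_DB_BYTES = 75e9   # default database size, taking 1 GB = 1e9 bytes (assumption)
DEFAULT_DB_KMERS = 6e9    # "just under 6 billion" distinct k-mers, rounded up (assumption)

def estimate_db_gb(distinct_kmers):
    """Estimate database size in GB, assuming size is proportional to k-mer count."""
    bytes_per_kmer = DEFAULT_DB_BYTES / DEFAULT_DB_KMERS
    return distinct_kmers * bytes_per_kmer / 1e9

def fits_in_ram(distinct_kmers, ram_gb):
    """True if the estimated database would fit in `ram_gb` of free memory."""
    return estimate_db_gb(distinct_kmers) <= ram_gb

# Example: a custom database with one billion distinct k-mers.
custom_gb = estimate_db_gb(1e9)
```

Treat the result as an order-of-magnitude guide; the actual size of a built database is the number to rely on.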
88 [Jellyfish]: http://www.cbcb.umd.edu/software/jellyfish/
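For users behind a proxy, the environment variables named under "Network connectivity" can be prepared before launching a download or build command. The sketch below is only an illustration: the helper name, host, and port are hypothetical, and it assumes an HTTP proxy that also relays FTP. wget reads a proxy URL from `ftp_proxy`, while rsync expects a bare `host:port` pair in `RSYNC_PROXY`.

```python
import os

def proxy_environment(proxy_host, proxy_port, base_env=None):
    """Return a copy of the environment with FTP and rsync proxy settings added."""
    env = dict(os.environ if base_env is None else base_env)
    env["ftp_proxy"] = f"http://{proxy_host}:{proxy_port}/"  # URL form, read by wget
    env["RSYNC_PROXY"] = f"{proxy_host}:{proxy_port}"        # host:port form, read by rsync
    return env

# Hypothetical proxy address; replace with the values for your site.
env = proxy_environment("proxy.example.org", 3128, base_env={})
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the command that needs network access.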