annotate README.md @ 1:b0276d1141fe default tip

Fix test case
author Jim Johnson <jj@umn.edu>
date Thu, 30 Jan 2014 13:10:12 -0600
parents 08439b004404

# Scythe - A very simple adapter trimmer (version 0.981 BETA)

Scythe and all supporting documentation
Copyright (c) Vince Buffalo, 2011-2012

Contact: Vince Buffalo <vsbuffaloAAAAA@gmail.com> (with the poly-A tail removed)

If you wish to report a bug, please open an issue on GitHub
(http://github.com/vsbuffalo/scythe/issues) so that it can be
tracked. You can contact me as well, but please open an issue first.

## About

Scythe uses a Naive Bayesian approach to classify contaminant
substrings in sequence reads. It considers quality information, which
can make it robust in picking out 3'-end adapters, which often include
poor quality bases.

Most next-generation sequencing reads have deteriorating quality
towards the 3'-end. It's common for a quality-based trimmer to be
employed before mapping, assemblies, and analysis to remove these poor
quality bases. However, quality-based trimming could remove bases that
are helpful in identifying (and removing) 3'-end adapter
contaminants. Thus, it is recommended you run Scythe *before*
quality-based trimming, as part of a read quality control pipeline.

The Bayesian approach Scythe uses compares two likelihood models: the
probability of seeing the observed matches in a sequence given
contamination, and the probability given no contamination. Given that
the read is contaminated, the probability of seeing a certain number of
matches and mismatches is a function of the quality of the sequence.
Given that the read is not contaminated (and is thus assumed to be
random sequence), the probability of seeing a certain number of matches
and mismatches is determined by chance alone. The posterior is
calculated across both these likelihood models, and the class
(contaminated or not contaminated) with the maximum posterior
probability is the class selected.

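The comparison of the two likelihood models can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Scythe's actual C implementation: the function name, the per-base independence assumption, and the 1/4 chance-match probability for random sequence are assumptions made for this sketch. The default prior of 0.05 is the one documented under Usage.

```python
def posterior_contaminated(matches, error_probs, prior=0.05):
    """Posterior probability that an aligned 3'-end substring is adapter.

    matches     -- list of bools, True where read base equals adapter base
    error_probs -- per-base error probabilities derived from the qualities
    prior       -- prior contamination rate

    Assumption of this sketch: bases are treated as independent.
    """
    like_contam = 1.0  # P(observed matches | contaminated)
    like_random = 1.0  # P(observed matches | random sequence)
    for is_match, p_err in zip(matches, error_probs):
        # If the read really is adapter, a match is a correctly called base
        # and a mismatch is a sequencing error.
        like_contam *= (1.0 - p_err) if is_match else p_err
        # If the read is random sequence, a match happens by chance
        # (assumed 1/4 for four equally likely bases).
        like_random *= 0.25 if is_match else 0.75
    numerator = prior * like_contam
    return numerator / (numerator + (1.0 - prior) * like_random)


def is_contaminated(matches, error_probs, prior=0.05):
    # Select the class with the maximum posterior probability.
    return posterior_contaminated(matches, error_probs, prior) > 0.5
```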
## Requirements

Scythe can be compiled using GCC or Clang; compilation during
development used the latter. Scythe relies on Heng Li's kseq.h, which
is bundled with the source.

Scythe requires Zlib, which can be obtained at <http://www.zlib.net/>.

## Building and Installing Scythe

To build Scythe, enter:

    make build

Then, copy or move "scythe" to a directory in your $PATH.

## Usage

Scythe can be run minimally with:

    scythe -a adapter_file.fasta -o trimmed_sequences.fastq sequences.fastq

By default, the prior contamination rate is 0.05. This can be changed
(and one is encouraged to do so!) with:

    scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq

If you'd like to use standard out, it is recommended you use the
--quiet option:

    scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq

Also, more detailed output about matches can be obtained with:

    scythe -a adapter_file.fasta -o trimmed_sequences.fastq -m matches.txt sequences.fastq

By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger
or Solexa (pipeline < 1.3) qualities can be specified with -q:

    scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fastq sequences.fastq

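The choice of quality scheme matters because each one maps FASTQ quality characters to error probabilities differently. The sketch below uses the conventional encodings (Sanger: Phred scores with ASCII offset 33; Illumina 1.3+: Phred scores with offset 64; Solexa: Solexa odds-based scores with offset 64). The function name is hypothetical, and the scheme labels other than `solexa` are labels chosen for this sketch rather than confirmed `-q` values.

```python
def error_probability(qual_char, scheme="illumina"):
    """Convert one FASTQ quality character to a base-call error probability."""
    if scheme == "sanger":
        q = ord(qual_char) - 33          # Phred score, ASCII offset 33
        return 10 ** (-q / 10)
    if scheme == "illumina":
        q = ord(qual_char) - 64          # Phred score, ASCII offset 64
        return 10 ** (-q / 10)
    if scheme == "solexa":
        q = ord(qual_char) - 64          # Solexa score, ASCII offset 64
        # Solexa scores are log-odds, so the conversion differs from Phred.
        return 1 / (10 ** (q / 10) + 1)
    raise ValueError(f"unknown quality scheme: {scheme!r}")


# The same character means different things under different schemes.
for scheme in ("sanger", "illumina", "solexa"):
    print(scheme, error_probability("I", scheme))
```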
Lastly, a minimum match length argument can be specified with -n <integer>:

    scythe -a adapter_file.fasta -n 0 -o trimmed_sequences.fastq sequences.fastq

The default is 5. If this pre-processing is upstream of assembly on a
very contaminated lane, decreasing this parameter could lead to *very*
liberal trimming, i.e. trimming of matches only a few bases long.

## Notes

Scythe only checks for 3'-end contaminants, up to the adapter's length
into the 3'-end. For reads with contamination in *any* position, the
program TagDust (<http://genome.gsc.riken.jp/osc/english/dataresource/>)
is recommended. Scythe has the advantages of allowing fuzzier matching
and being base quality-aware, while TagDust has the advantages of very
fast matching (but allowing few mismatches, and not considering
quality) and FDR. TagDust also removes contaminated reads *entirely*, while
Scythe trims off contaminants.

A possible pipeline would run FASTQ reads through Scythe, then
TagDust, then a quality-based trimmer, and finally through a read
quality statistics program such as qrqc
(<http://bioconductor.org/packages/devel/bioc/html/qrqc.html>) or FastQC
(<http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/>).

## FAQ

### Does Scythe work with paired-end data?

Scythe does work with paired-end data. Each file must be run
separately, but because Scythe trims contaminants rather than removing
reads entirely, it will not leave mismatched pairs.

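Running each file of a pair separately could be scripted as below. This is a sketch under stated assumptions: the helper names, the `trimmed_` output prefix, and the example file names are hypothetical, and only flags documented in the Usage section (`-a`, `-p`, `-q`, `-o`) are used. Each returned command list could be passed to `subprocess.run(cmd, check=True)`.

```python
from pathlib import Path


def scythe_command(adapters, reads, output, prior=None, quality=None):
    """Build the argument list for one Scythe run (nothing is executed)."""
    cmd = ["scythe", "-a", adapters]
    if prior is not None:
        cmd += ["-p", str(prior)]
    if quality is not None:
        cmd += ["-q", quality]
    cmd += ["-o", output, reads]
    return cmd


def paired_end_commands(adapters, mate1, mate2, **options):
    """One command per mate file; outputs get a hypothetical 'trimmed_' prefix."""
    commands = []
    for reads in (mate1, mate2):
        path = Path(reads)
        output = str(path.with_name("trimmed_" + path.name))
        commands.append(scythe_command(adapters, reads, output, **options))
    return commands


for cmd in paired_end_commands("adapter_file.fasta",
                               "reads_1.fastq", "reads_2.fastq", prior=0.1):
    print(" ".join(cmd))
```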
In some cases, barcodes are ligated to both the 3'-end and 5'-end of
reads. 5'-end removal is trivial since base calling is near-perfect
there, but 3'-end removal can be trickier. Some users have created
Scythe adapter files that contain all possible barcodes concatenated
with possible adapters, so that both can be recognized and
removed. This has worked well and is recommended for cases when 3'-end
quality deteriorates and prevents barcode removal. Newer Illumina
chemistry has the barcode separated from the fragment, so that it
appears as an entirely separate read and is used to demultiplex sample
reads by Illumina's CASAVA pipeline.

### Does Scythe work on 5'-end or other contaminants?

No. Embracing the Unix tool philosophy that tools should do one thing
very well, Scythe just removes 3'-end contaminants where there could
be multiple base mismatches due to poor base quality. N-mismatch
algorithms (such as TagDust) don't consider base qualities. Scythe
will allow more mismatches in an alignment if the mismatched bases are
of low quality.

**Scythe only checks as far in as the entire adapter contaminant's
length.** However, some investigation has shown that Illumina
pipelines sometimes produce reads longer than the read length +
adapter length. The extra bases have always been observed to be
A's. Some testing has shown this can be addressed by appending A's to
the adapters in the adapters file. Since Scythe begins by checking for
contamination from the 5'-end of the adapter, this won't affect the
normal adapter contaminant cases.

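Appending A's to every adapter in an adapters file could be done with a small script like the following. The function name and the default of 10 extra A's are assumptions for illustration (the text above does not say how many to append), and the input is assumed to be well-formed FASTA text.

```python
def append_poly_a(fasta_text, n_extra=10):
    """Return FASTA text with n_extra A's appended to each adapter sequence.

    Wrapped (multi-line) sequences are joined first, so the A's are added
    once per record, at the 3'-end of the adapter.
    """
    records = []               # list of (header, sequence) pairs
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue           # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return "".join(f"{h}\n{s}{'A' * n_extra}\n" for h, s in records)


print(append_poly_a(">fake adapter\nACGTT\n", n_extra=3), end="")
```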
### What does the numeric output from Scythe mean?

For each adapter in the file, the contaminants removed by position are
returned via standard error. For example:

    Adapter 1 'fake adapter' contamination occurences:
    [10, 2, 4, 5, 6]

indicates that "fake adapter" is 5 bases long (the length of the array
returned), and that there were 10 contaminants found of the first base (-n
was set to 0 then), 2 of the first two bases, 4 contaminants of the
first 3 bases, 5 of the first 4 bases, etc.

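The array from the example above can be unpacked programmatically. The following sketch assumes the bracketed line has already been extracted from standard error as a string; the function name and return shape are hypothetical.

```python
import ast


def summarize_counts(array_line):
    """Interpret one per-adapter count array from Scythe's standard error.

    Returns (adapter_length, counts_by_match_length, total_trimmed), where
    counts_by_match_length maps "first i bases of the adapter" to a count.
    """
    counts = ast.literal_eval(array_line.strip())   # safely parse "[10, 2, ...]"
    by_length = {i: n for i, n in enumerate(counts, start=1)}
    return len(counts), by_length, sum(counts)


length, by_length, total = summarize_counts("[10, 2, 4, 5, 6]")
for matched, n in by_length.items():
    print(f"{n} contaminants matching the first {matched} adapter base(s)")
```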
### Does Scythe work on FASTA files?

No, as these have no quality information.

### How can I report a bug?

See the Reporting Bugs section below.

### How does Scythe compare to program "x"?

As far as I know, Scythe is the only program that employs a Bayesian
model that allows prior contaminant estimates to be used. This prior
is a more realistic approach than setting a fixed number of mismatches
because the contamination rate can be estimated visually, for example
by paging through the reads with the Unix tool `less`.

Scythe also looks at base-level qualities, *not* just a fixed level of
mismatches. A fixed number of mismatches is a bad approach with data
our group (the UC Davis Bioinformatics Core) has seen, as a short run
of poor-quality bases can quickly exhaust even a high number of fixed
mismatches and lead to more false negatives.

## Reporting Bugs

Scythe is free software and is provided without a warranty. However, I
am proud of this software and I will do my best to provide updates,
bug fixes, and additional documentation as needed. Please report all
bugs and issues to GitHub's issue tracker
(http://github.com/vsbuffalo/scythe/issues). If you want to email me,
do so in addition to an issue request.

If you have a suggestion or comment on Scythe's methods, you can email
me directly.

## Is there a paper about Scythe?

I am currently writing a paper on Scythe's methods. In my preliminary
testing, Scythe has fewer false positives and false negatives than
its competitors.