comparison README.md @ 0:08439b004404

Uploaded
author jjohnson
date Mon, 13 Jan 2014 14:57:53 -0500
-1:000000000000 0:08439b004404
1 # Scythe - A very simple adapter trimmer (version 0.981 BETA)
2
3 Scythe and all supporting documentation
4 Copyright (c) Vince Buffalo, 2011-2012
5
6 Contact: Vince Buffalo <vsbuffaloAAAAA@gmail.com> (with the poly-A tail removed)
7
8 If you wish to report a bug, please open an issue on GitHub
9 (http://github.com/vsbuffalo/scythe/issues) so that it can be
10 tracked. You can contact me as well, but please open an issue first.
11
12 ## About
13
14 Scythe uses a Naive Bayesian approach to classify contaminant
15 substrings in sequence reads. It considers quality information, which
16 can make it robust in picking out 3'-end adapters, which often include
17 poor quality bases.
18
19 Most next generation sequencing reads have deteriorating quality
20 towards the 3'-end. It's common for a quality-based trimmer to be
21 employed before mapping, assemblies, and analysis to remove these poor
22 quality bases. However, quality-based trimming could remove bases that
23 are helpful in identifying (and removing) 3'-end adapter
24 contaminants. Thus, it is recommended you run Scythe *before*
25 quality-based trimming, as part of a read quality control pipeline.
26
27 The Bayesian approach Scythe uses compares two likelihood models: the
28 probability of seeing the matches in a sequence given contamination,
29 and not given contamination. Given that the read is contaminated, the
30 probability of seeing a certain number of matches and mismatches is a
31 function of the quality of the sequence. Given the read is not
32 contaminated (and is thus assumed to be random sequence), the
33 probability of seeing a certain number of matches and mismatches is
34 chance. The posterior is calculated across both these likelihood
35 models, and the class (contaminated or not contaminated) with the
36 maximum posterior probability is the class selected.
37
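The comparison of the two likelihood models can be illustrated with a minimal Python sketch. It is not taken from Scythe's source. It assumes that positions are independent, that a contaminated base matches the adapter with probability 1 - e (where e is that base's error probability), and that a random base matches with probability 1/4. The function names and the 0.5 decision threshold (equivalent to picking the class with the larger posterior) are illustrative.

```python
def posterior_contaminated(adapter, read_end, error_probs, prior=0.05):
    """Posterior probability that read_end is adapter contamination.

    Assumptions (not from Scythe's source): adapter, read_end and
    error_probs have equal length, positions are independent, and a
    random (uncontaminated) base matches with probability 1/4.
    """
    like_contam = 1.0  # P(matches and mismatches | contaminated)
    like_random = 1.0  # P(matches and mismatches | random sequence)
    for a, r, e in zip(adapter, read_end, error_probs):
        if a == r:
            like_contam *= 1.0 - e
            like_random *= 0.25
        else:
            like_contam *= e
            like_random *= 0.75
    numerator = prior * like_contam
    return numerator / (numerator + (1.0 - prior) * like_random)


def is_contaminated(adapter, read_end, error_probs, prior=0.05):
    """Select the class with the larger posterior probability."""
    return posterior_contaminated(adapter, read_end, error_probs, prior) > 0.5
```

Under these assumptions, a mismatch at a low-quality base (large e) costs the contaminated model little, which is why poor-quality 3'-end adapters can still be classified as contaminants.
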
38 ## Requirements
39
40 Scythe can be compiled using GCC or Clang; compilation during
41 development used the latter. Scythe relies on Heng Li's kseq.h, which
42 is bundled with the source.
43
44 Scythe requires Zlib, which can be obtained at <http://www.zlib.net/>.
45
46 ## Building and Installing Scythe
47
48 To build Scythe, enter:
49
50 make build
51
52 Then, copy or move "scythe" to a directory in your $PATH.
53
54 ## Usage
55
56 Scythe can be run minimally with:
57
58 scythe -a adapter_file.fasta -o trimmed_sequences.fastq sequences.fastq
59
60 By default, the prior contamination rate is 0.05. This can be changed
61 (and one is encouraged to do so!) with:
62
63 scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq
64
65 If you'd like to use standard out, it is recommended you use the
66 --quiet option:
67
68 scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq
69
70 Also, more detailed output about matches can be obtained with:
71
72 scythe -a adapter_file.fasta -o trimmed_sequences.fastq -m matches.txt sequences.fastq
73
74 By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger
75 or Solexa (pipeline < 1.3) qualities can be specified with -q:
76
77 scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fastq sequences.fastq
78
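To show why the quality scheme matters, here is a hedged Python sketch of how a FASTQ quality character maps to an error probability under the three common encodings. It uses the standard conventions (Sanger: Phred scores at ASCII offset 33; Illumina 1.3+: Phred scores at offset 64; Solexa: Solexa-scaled scores at offset 64). The function and the scheme keys are illustrative and are not Scythe's internals.

```python
def error_probability(qual_char, scheme="illumina"):
    """Convert one FASTQ quality character to an error probability.

    The scheme keys are hypothetical names for this sketch.
    """
    offsets = {"sanger": 33, "illumina": 64, "solexa": 64}
    q = ord(qual_char) - offsets[scheme]
    if scheme == "solexa":
        # Solexa scores are log-odds: Q = -10 * log10(p / (1 - p))
        return 1.0 / (1.0 + 10.0 ** (q / 10.0))
    # Phred scores: Q = -10 * log10(p)
    return 10.0 ** (-q / 10.0)
```

The same character decodes to very different error probabilities under different offsets, so choosing the wrong scheme would distort any quality-aware classification.
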
79 Lastly, a minimum match length argument can be specified with -n <integer>:
80
81 scythe -a adapter_file.fasta -n 0 -o trimmed_sequences.fastq sequences.fastq
82
83 The default is 5. If this pre-processing is upstream of assembly on a
84 very contaminated lane, decreasing this parameter could lead to *very*
85 liberal trimming, i.e. trimming of matches only a few bases long.
86
87 ## Notes
88
89 Scythe only checks for 3'-end contaminants, up to the adapter's length
90 into the 3'-end. For reads with contamination in *any* position, the
91 program TagDust (<http://genome.gsc.riken.jp/osc/english/dataresource/>)
92 is recommended. Scythe has the advantages of allowing fuzzier matching
93 and being base quality-aware, while TagDust has the advantages of very
94 fast matching (but allowing few mismatches, and not considering
95 quality) and FDR. TagDust also removes contaminated reads *entirely*, while
96 Scythe trims off contaminants.
97
98 A possible pipeline would run FASTQ reads through Scythe, then
99 TagDust, then a quality-based trimmer, and finally through a read
100 quality statistics program such as qrqc
101 (<http://bioconductor.org/packages/devel/bioc/html/qrqc.html>) or FASTqc
102 (<http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/>).
103
104 ## FAQ
105
106 ### Does Scythe work with paired-end data?
107
108 Scythe does work with paired-end data. Each file must be run
109 separately. Because Scythe trims reads rather than removing them
110 entirely, it does not leave mismatched pairs.
111
112 In some cases, barcodes are ligated to both the 3'-end and 5'-end of
113 reads. 5'-end removal is trivial since base calling is near-perfect
114 there, but 3'-end removal can be trickier. Some users have created
115 Scythe adapter files that contain all possible barcodes concatenated
116 with possible adapters, so that both can be recognized and
117 removed. This has worked well and is recommended for cases when 3'-end
118 quality deteriorates and prevents barcode removal. Newer Illumina
119 chemistry has the barcode separated from the fragment, so that it
120 appears as an entirely separate read and is used to demultiplex sample
121 reads by Illumina's CASAVA pipeline.
122
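The barcode-plus-adapter file described above could be generated with a short Python sketch like the following. The helper name, the record naming, and the concatenation order (barcode first, then adapter) are assumptions for illustration; check the order against your own library design.

```python
from itertools import product


def barcode_adapter_fasta(barcodes, adapters):
    """Build FASTA text with every barcode/adapter combination.

    barcodes and adapters map record names to sequences. Placing the
    barcode before the adapter is an assumption of this sketch.
    """
    lines = []
    for (b_name, b_seq), (a_name, a_seq) in product(
        barcodes.items(), adapters.items()
    ):
        lines.append(f">{b_name}_{a_name}")
        lines.append(b_seq + a_seq)
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file and passed to Scythe with -a, like any other adapter file.
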
123 ### Does Scythe work on 5'-end or other contaminants?
124
125 No. Embracing the Unix tool philosophy that tools should do one thing
126 very well, Scythe just removes 3'-end contaminants where there could
127 be multiple base mismatches due to poor base quality. N-mismatch
128 algorithms (such as TagDust) don't consider base qualities. Scythe
129 will allow more mismatches in an alignment if the mismatched bases are
130 of low quality.
131
132 **Scythe only checks as far in as the entire adapter contaminant's
133 length.** However, some investigation has shown that Illumina
134 pipelines sometimes produce reads longer than the read length +
135 adapter length. The extra bases have always been observed to be
136 A's. Some testing has shown this can be addressed by appending A's to
137 the adapters in the adapters file. Since Scythe begins by checking for
138 contamination from the 5'-end of the adapter, this won't affect the
139 normal adapter contaminant cases.
140
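Appending A's to every adapter can be scripted. The Python sketch below is one way to do it. The function name and the default of 10 A's are arbitrary choices for illustration, not values recommended by Scythe.

```python
def append_poly_a(fasta_text, n_a=10):
    """Return FASTA text with n_a A's appended to each sequence.

    Multi-line sequences are joined onto one line before the A's are
    added, so the tail lands at the 3'-end of the whole adapter.
    """
    records = []  # list of [header, sequence] pairs
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line
    return "".join(f"{h}\n{s}{'A' * n_a}\n" for h, s in records)
```

Because the A's are added only at the 3'-end of each adapter, matches that start at the adapter's 5'-end are unaffected, as noted above.
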
141 ### What does the numeric output from Scythe mean?
142
143 For each adapter in the file, the contaminants removed by position are
144 returned via standard error. For example:
145
146 Adapter 1 'fake adapter' contamination occurences:
147 [10, 2, 4, 5, 6]
148
149 indicates that "fake adapter" is 5 bases long (the length of the array
150 returned), and that there were 10 contaminants found of the first base
151 (-n was set to 0 for this run), 2 of the first two bases, 4 of the
152 first 3 bases, 5 of the first 4 bases, and 6 of all 5 bases.
153
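As a small illustration of this reading, the Python sketch below turns one such array line into a mapping from matched length to count. The function name is hypothetical, and the sketch assumes the array is printed exactly as in the example above.

```python
import ast


def summarize_counts(array_line):
    """Map matched adapter length (in bases) to contaminant count."""
    counts = ast.literal_eval(array_line.strip())
    return {length: n for length, n in enumerate(counts, start=1)}


summary = summarize_counts("[10, 2, 4, 5, 6]")
total = sum(summary.values())  # contaminants found for this adapter
```

For the example array, `summary[1]` is 10 and `total` is 27.
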
154 ### Does Scythe work on FASTA files?
155
156 No, as these have no quality information.
157
158 ### How can I report a bug?
159
160 See the section below.
161
162 ### How does Scythe compare to program "x"?
163
164 As far as I know, Scythe is the only program that employs a Bayesian
165 model that allows prior contaminant estimates to be used. This prior
166 is a more realistic approach than setting a fixed number of mismatches
167 because we can visually estimate it with the Unix tool `less`.
168
169 Scythe also looks at base-level qualities, *not* just a fixed level of
170 mismatches. A fixed number of mismatches is a bad approach with data
171 our group (the UC Davis Bioinformatics Core) has seen, as a short run of
172 low-quality bases can quickly exhaust even a high number of fixed
173 mismatches and lead to more false negatives.
174
175 ## Reporting Bugs
176
177 Scythe is free software and is provided without a warranty. However, I
178 am proud of this software and I will do my best to provide updates,
179 bug fixes, and additional documentation as needed. Please report all
180 bugs and issues to the GitHub issue tracker
181 (http://github.com/vsbuffalo/scythe/issues). If you want to email me,
182 do so in addition to an issue request.
183
184 If you have a suggestion or comment on Scythe's methods, you can email
185 me directly.
186
187 ## Is there a paper about Scythe?
188
189 I am currently writing a paper on Scythe's methods. In my preliminary
190 testing, Scythe has fewer false positives and false negatives than
191 its competitors.