# Scythe - A very simple adapter trimmer (version 0.981 BETA)

Scythe and all supporting documentation
Copyright (c) Vince Buffalo, 2011-2012

Contact: Vince Buffalo <vsbuffaloAAAAA@gmail.com> (with the poly-A tail removed)

If you wish to report a bug, please open an issue on Github
(http://github.com/vsbuffalo/scythe/issues) so that it can be
tracked. You can contact me as well, but please open an issue first.

## About

Scythe uses a Naive Bayesian approach to classify contaminant
substrings in sequence reads. It considers quality information, which
can make it robust in picking out 3'-end adapters, which often include
poor quality bases.

Most next generation sequencing reads have deteriorating quality
towards the 3'-end. It's common for a quality-based trimmer to be
employed before mapping, assemblies, and analysis to remove these poor
quality bases. However, quality-based trimming could remove bases that
are helpful in identifying (and removing) 3'-end adapter
contaminants. Thus, it is recommended you run Scythe *before*
quality-based trimming, as part of a read quality control pipeline.

The Bayesian approach Scythe uses compares two likelihood models: the
probability of seeing the matches in a sequence given contamination,
and not given contamination. Given that the read is contaminated, the
probability of seeing a certain number of matches and mismatches is a
function of the quality of the sequence. Given the read is not
contaminated (and is thus assumed to be random sequence), the
probability of seeing a certain number of matches and mismatches is
determined by chance alone. The posterior is calculated across both
these likelihood models, and the class (contaminated or not
contaminated) with the maximum posterior probability is the class
selected.

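The comparison described above can be illustrated with a minimal Python sketch. It is not Scythe's C implementation: the function names, the treatment of each base as independent, the use of the Phred error probability as the mismatch probability under contamination, and the 1/4 match probability for random sequence are all assumptions made for illustration.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def posterior_contaminated(read, adapter, quals, prior=0.05):
    """Posterior probability that `read` is adapter sequence (toy model).

    read, adapter: aligned strings compared position by position.
    quals: Phred scores for the bases of `read`.
    prior: prior contamination rate (0.05 mirrors the documented default).
    """
    like_contam = 1.0  # P(observed matches | contaminated)
    like_random = 1.0  # P(observed matches | random sequence)
    for base, adapter_base, q in zip(read, adapter, quals):
        err = phred_to_error(q)
        if base == adapter_base:
            like_contam *= 1 - err
            like_random *= 0.25  # assumed: uniform base composition
        else:
            like_contam *= err   # assumed: a mismatch is a miscalled base
            like_random *= 0.75
    weighted_contam = prior * like_contam
    weighted_random = (1 - prior) * like_random
    return weighted_contam / (weighted_contam + weighted_random)


def is_contaminated(read, adapter, quals, prior=0.05):
    """Pick the class with the larger posterior (two classes, so > 0.5)."""
    return posterior_contaminated(read, adapter, quals, prior) > 0.5
```

In this toy model, a single mismatch at a low-quality base leaves the posterior for contamination much higher than the same mismatch at a high-quality base, which is the behavior a quality-aware approach is meant to capture.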
## Requirements

Scythe can be compiled using GCC or Clang; compilation during
development used the latter. Scythe relies on Heng Li's kseq.h, which
is bundled with the source.

Scythe requires Zlib, which can be obtained at <http://www.zlib.net/>.

## Building and Installing Scythe

To build Scythe, enter:

    make build

Then, copy or move "scythe" to a directory in your $PATH.

## Usage

Scythe can be run minimally with:

    scythe -a adapter_file.fasta -o trimmed_sequences.fastq sequences.fastq

By default, the prior contamination rate is 0.05. This can be changed
(and one is encouraged to do so!) with:

    scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq

If you'd like to use standard out, it is recommended you use the
--quiet option:

    scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq

Also, more detailed output about matches can be obtained with:

    scythe -a adapter_file.fasta -o trimmed_sequences.fastq -m matches.txt sequences.fastq

By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger
or Solexa (pipeline < 1.3) qualities can be specified with -q:

    scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fastq sequences.fastq

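To show why the quality scheme matters, here is a rough Python sketch of how one FASTQ quality character maps to an error probability under the three common encodings. It uses the standard published conventions (Sanger: Phred with ASCII offset 33; Illumina 1.3+: Phred with offset 64; Solexa: odds-based score with offset 64). The function name and the scheme labels are illustrative and are not taken from Scythe's source.

```python
def error_probability(qual_char, scheme="illumina"):
    """Return the base-call error probability encoded by one quality character.

    The scheme labels here are illustrative names for this sketch only.
    """
    if scheme == "sanger":
        q = ord(qual_char) - 33
        return 10 ** (-q / 10)
    if scheme == "illumina":
        q = ord(qual_char) - 64
        return 10 ** (-q / 10)
    if scheme == "solexa":
        # Solexa scores encode odds: Q = -10 * log10(p / (1 - p))
        q = ord(qual_char) - 64
        return 1 / (1 + 10 ** (q / 10))
    raise ValueError(f"unknown quality scheme: {scheme}")
```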
Lastly, a minimum match length argument can be specified with `-n <integer>`:

    scythe -a adapter_file.fasta -n 0 -o trimmed_sequences.fastq sequences.fastq

The default is 5. If this pre-processing is upstream of assembly on a
very contaminated lane, decreasing this parameter could lead to *very*
liberal trimming, i.e. of only a few bases.

## Notes

Scythe only checks for 3'-end contaminants, up to the adapter's length
into the 3'-end. For reads with contamination in *any* position, the
program TagDust (<http://genome.gsc.riken.jp/osc/english/dataresource/>)
is recommended. Scythe has the advantages of allowing fuzzier matching
and being base quality-aware, while TagDust has the advantages of very
fast matching (but allowing few mismatches, and not considering
quality) and FDR. TagDust also removes contaminated reads *entirely*, while
Scythe trims off contaminants.

A possible pipeline would run FASTQ reads through Scythe, then
TagDust, then a quality-based trimmer, and finally through a read
quality statistics program such as qrqc
(<http://bioconductor.org/packages/devel/bioc/html/qrqc.html>) or FASTqc
(<http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/>).

## FAQ

### Does Scythe work with paired-end data?

Scythe does work with paired-end data. Each file must be run
separately. Because Scythe trims reads rather than removing them
entirely, it will not leave mismatched pairs.

In some cases, barcodes are ligated to both the 3'-end and 5'-end of
reads. 5'-end removal is trivial since base calling is near-perfect
there, but 3'-end removal can be trickier. Some users have created
Scythe adapter files that contain all possible barcodes concatenated
with possible adapters, so that both can be recognized and
removed. This has worked well and is recommended for cases when 3'-end
quality deteriorates and prevents barcode removal. Newer Illumina
chemistry has the barcode separated from the fragment, so that it
appears as an entirely separate read and is used to demultiplex sample
reads by Illumina's CASAVA pipeline.

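A combined barcode and adapter file of the kind described above could be generated with a short Python sketch like the following. The function names, the record naming, and the barcode-then-adapter concatenation order are assumptions for illustration; the correct order depends on how the library was constructed, and any sequences passed in are the user's own.

```python
from itertools import product


def barcode_adapter_records(barcodes, adapters):
    """Yield (name, sequence) for every barcode/adapter combination.

    barcodes, adapters: dicts mapping a name to a sequence.
    Assumed order: barcode followed by adapter.
    """
    for (bc_name, bc_seq), (ad_name, ad_seq) in product(
        barcodes.items(), adapters.items()
    ):
        yield f"{bc_name}_{ad_name}", bc_seq + ad_seq


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text, one sequence per line."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

The resulting text can be written to a file and passed to Scythe with `-a` in place of a plain adapter file.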
### Does Scythe work on 5'-end or other contaminants?

No. Embracing the Unix tool philosophy that tools should do one thing
very well, Scythe just removes 3'-end contaminants where there could
be multiple base mismatches due to poor base quality. N-mismatch
algorithms (such as TagDust) don't consider base qualities. Scythe
will allow more mismatches in an alignment if the mismatched bases are
of low quality.

**Scythe only checks as far in as the entire adapter contaminant's
length.** However, some investigation has shown that Illumina
pipelines sometimes produce reads longer than the read length +
adapter length. The extra bases have always been observed to be
A's. Some testing has shown this can be addressed by appending A's to
the adapters in the adapters file. Since Scythe begins by checking for
contamination from the 5'-end of the adapter, this won't affect the
normal adapter contaminant cases.

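Appending A's to each adapter can be done by hand, or with a small Python sketch such as the one below. The function name and the default of 10 A's are assumptions (the right number depends on how many extra bases appear in your reads), and the sketch assumes each adapter sequence sits on a single line of the FASTA file.

```python
def append_poly_a(fasta_text, n_a=10):
    """Append n_a A's to every sequence line of a FASTA string.

    Assumes one sequence line per record; header and blank lines
    are passed through unchanged.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">") or not line.strip():
            out.append(line)
        else:
            out.append(line.strip() + "A" * n_a)
    return "\n".join(out) + "\n"
```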
### What does the numeric output from Scythe mean?

For each adapter in the file, the contaminants removed by position are
returned via standard error. For example:

    Adapter 1 'fake adapter' contamination occurences:
    [10, 2, 4, 5, 6]

indicates that "fake adapter" is 5 bases long (the length of the array
returned), and that there were 10 contaminants found of the first base
(so -n was set to 0 for this run), 2 of the first two bases, 4
contaminants of the first 3 bases, 5 of the first 4 bases, etc.

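The interpretation above can be made concrete with a small Python sketch that reads the bracketed line and pairs each count with the contaminant length it refers to. The function names are hypothetical, and the sketch assumes the counts line looks exactly like the example (a bracketed, comma-separated list of integers).

```python
import ast


def contaminant_counts(counts_line):
    """Return (contaminant_length, count) pairs from a line like '[10, 2, 4, 5, 6]'."""
    counts = ast.literal_eval(counts_line.strip())
    return [(length, count) for length, count in enumerate(counts, start=1)]


def total_trimmed_reads(counts_line):
    """Total number of reads that had a contaminant of any length removed."""
    return sum(count for _, count in contaminant_counts(counts_line))
```

For the example line, the pairs are (1, 10), (2, 2), (3, 4), (4, 5), (5, 6), and 27 reads were trimmed in total.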
### Does Scythe work on FASTA files?

No, as these have no quality information.

### How can I report a bug?

See the Reporting Bugs section below.

### How does Scythe compare to program "x"?

As far as I know, Scythe is the only program that employs a Bayesian
model that allows prior contaminant estimates to be used. This prior
is a more realistic approach than setting a fixed number of mismatches
because it can be estimated visually, for example by paging through
reads with the Unix tool `less`.

Scythe also looks at base-level qualities, *not* just a fixed level of
mismatches. A fixed number of mismatches is a bad approach with data
our group (the UC Davis Bioinformatics Core) has seen, as a small bad
quality run can quickly exhaust even a high number of fixed
mismatches and lead to higher false negatives.

## Reporting Bugs

Scythe is free software and is provided without a warranty. However, I
am proud of this software and I will do my best to provide updates,
bug fixes, and additional documentation as needed. Please report all
bugs and issues to Github's issue tracker
(http://github.com/vsbuffalo/scythe/issues). If you want to email me,
do so in addition to an issue request.

If you have a suggestion or comment on Scythe's methods, you can email
me directly.

## Is there a paper about Scythe?

I am currently writing a paper on Scythe's methods. In my preliminary
testing, Scythe has fewer false positives and false negatives than
its competitors.