comparison README.md @ 0:08439b004404
Uploaded
author | jjohnson |
---|---|
date | Mon, 13 Jan 2014 14:57:53 -0500 |
parents | |
children |
# Scythe - A very simple adapter trimmer (version 0.981 BETA)

Scythe and all supporting documentation
Copyright (c) Vince Buffalo, 2011-2012

Contact: Vince Buffalo <vsbuffaloAAAAA@gmail.com> (with the poly-A tail removed)

If you wish to report a bug, please open an issue on GitHub
(http://github.com/vsbuffalo/scythe/issues) so that it can be
tracked. You can contact me as well, but please open an issue first.

## About

Scythe uses a naive Bayesian approach to classify contaminant
substrings in sequence reads. It considers quality information, which
can make it robust in picking out 3'-end adapters, which often include
poor quality bases.

Most next generation sequencing reads have deteriorating quality
towards the 3'-end. It's common for a quality-based trimmer to be
employed before mapping, assemblies, and analysis to remove these poor
quality bases. However, quality-based trimming could remove bases that
are helpful in identifying (and removing) 3'-end adapter
contaminants. Thus, it is recommended you run Scythe *before*
quality-based trimming, as part of a read quality control pipeline.

The Bayesian approach Scythe uses compares two likelihood models: the
probability of seeing the matches in a sequence given contamination,
and the probability of seeing them given no contamination. Given that
the read is contaminated, the probability of seeing a certain number
of matches and mismatches is a function of the quality of the
sequence. Given that the read is not contaminated (and is thus assumed
to be random sequence), the probability of seeing a certain number of
matches and mismatches is chance. The posterior is calculated across
both these likelihood models, and the class (contaminated or not
contaminated) with the maximum posterior probability is the class
selected.
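
The comparison described above can be sketched in a few lines of
Python. This is an illustration only, not Scythe's actual code: the
function names are hypothetical, and the sketch assumes Phred-scaled
qualities, a per-base error probability of `10^(-Q/10)`, and a 1-in-4
chance of a match at each position of a random sequence.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def posterior_contaminated(matches, quals, prior=0.05):
    """Posterior probability that an aligned stretch is adapter contamination.

    matches: list of booleans, True where the read base equals the adapter base
    quals:   list of Phred quality scores for the same read bases
    prior:   prior contamination rate (0.05 mirrors the default quoted in Usage)
    """
    like_contam = 1.0  # likelihood assuming the stretch is adapter
    like_random = 1.0  # likelihood assuming the stretch is random sequence
    for is_match, q in zip(matches, quals):
        p_err = phred_to_error(q)
        # Assumption: under contamination a mismatch is a base-call error.
        like_contam *= (1 - p_err) if is_match else p_err
        # Assumption: random sequence matches by chance 1 time in 4.
        like_random *= 0.25 if is_match else 0.75
    numerator = prior * like_contam
    return numerator / (numerator + (1 - prior) * like_random)


def is_contaminated(matches, quals, prior=0.05):
    """Pick the class with the larger posterior probability."""
    return posterior_contaminated(matches, quals, prior) > 0.5
```

Under these assumptions, two mismatches among ten bases are tolerated
when the mismatched bases have very low quality, but not when they
have high quality, which is the behaviour a fixed mismatch count
cannot express.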

## Requirements

Scythe can be compiled using GCC or Clang; compilation during
development used the latter. Scythe relies on Heng Li's kseq.h, which
is bundled with the source.

Scythe requires zlib, which can be obtained at <http://www.zlib.net/>.

## Building and Installing Scythe

To build Scythe, enter:

    make build

Then, copy or move `scythe` to a directory in your `$PATH`.

## Usage

Scythe can be run minimally with:

    scythe -a adapter_file.fasta -o trimmed_sequences.fastq sequences.fastq

By default, the prior contamination rate is 0.05. This can be changed
(and you are encouraged to do so!) with `-p`:

    scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq

If you'd like to use standard out, it is recommended you use the
`--quiet` option:

    scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq

More detailed output about matches can be obtained with `-m`:

    scythe -a adapter_file.fasta -o trimmed_sequences.fastq -m matches.txt sequences.fastq

By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger
or Solexa (pipeline < 1.3) qualities can be specified with `-q`:

    scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fastq sequences.fastq

Lastly, a minimum match length can be specified with `-n <integer>`:

    scythe -a adapter_file.fasta -n 0 -o trimmed_sequences.fastq sequences.fastq

The default is 5. If this pre-processing is upstream of assembly on a
very contaminated lane, decreasing this parameter could lead to *very*
liberal trimming, i.e. trimming of matches only a few bases long.

## Notes

Scythe only checks for 3'-end contaminants, up to the adapter's length
into the 3'-end. For reads with contamination in *any* position, the
program TagDust (<http://genome.gsc.riken.jp/osc/english/dataresource/>)
is recommended. Scythe has the advantages of allowing fuzzier matching
and being base quality-aware, while TagDust has the advantages of very
fast matching (but allowing few mismatches, and not considering
quality) and FDR control. TagDust also removes contaminated reads
*entirely*, while Scythe trims off contaminants.

A possible pipeline would run FASTQ reads through Scythe, then
TagDust, then a quality-based trimmer, and finally through a read
quality statistics program such as qrqc
(<http://bioconductor.org/packages/devel/bioc/html/qrqc.html>) or FastQC
(<http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/>).

## FAQ

### Does Scythe work with paired-end data?

Scythe does work with paired-end data. Each file must be run
separately, but because Scythe never removes a read entirely, it will
not leave mismatched pairs.

In some cases, barcodes are ligated to both the 3'-end and 5'-end of
reads. 5'-end removal is trivial since base calling is near-perfect
there, but 3'-end removal can be trickier. Some users have created
Scythe adapter files that contain all possible barcodes concatenated
with possible adapters, so that both can be recognized and
removed. This has worked well and is recommended for cases when 3'-end
quality deteriorates and prevents barcode removal. Newer Illumina
chemistry has the barcode separated from the fragment, so that it
appears as an entirely separate read and is used to demultiplex sample
reads by Illumina's CASAVA pipeline.

### Does Scythe work on 5'-end or other contaminants?

No. Embracing the Unix tool philosophy that tools should do one thing
very well, Scythe just removes 3'-end contaminants where there could
be multiple base mismatches due to poor base quality. N-mismatch
algorithms (such as TagDust) don't consider base qualities. Scythe
will allow more mismatches in an alignment if the mismatched bases are
of low quality.

**Scythe only checks as far in as the entire adapter contaminant's
length.** However, some investigation has shown that Illumina
pipelines sometimes produce reads longer than the read length +
adapter length. The extra bases have always been observed to be
A's. Some testing has shown this can be addressed by appending A's to
the adapters in the adapters file. Since Scythe begins by checking for
contamination from the 5'-end of the adapter, this won't affect the
normal adapter contaminant cases.
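
Appending A's to every adapter can be scripted. The helper below is a
minimal sketch, not part of Scythe: `append_poly_a` is a hypothetical
name, the default of 10 A's is an arbitrary choice for illustration,
and the input is assumed to be plain FASTA text that starts with a
`>` header line.

```python
def append_poly_a(fasta_text, n_as=10):
    """Return FASTA text with n_as 'A' bases appended to each sequence.

    Multi-line sequences are joined onto a single line. Sequence lines
    that appear before the first '>' header are ignored.
    """
    records = []  # list of (header, sequence) tuples
    header, seq_parts = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_parts)))
            header, seq_parts = line, []
        else:
            seq_parts.append(line)
    if header is not None:
        records.append((header, "".join(seq_parts)))
    return "".join(f"{h}\n{s}{'A' * n_as}\n" for h, s in records)
```

The returned text can be written to a new adapters file and passed to
Scythe with `-a` in place of the original.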

### What does the numeric output from Scythe mean?

For each adapter in the file, the number of contaminants removed, by
position, is reported via standard error. For example:

    Adapter 1 'fake adapter' contamination occurences:
    [10, 2, 4, 5, 6]

indicates that "fake adapter" is 5 bases long (the length of the array
returned), and that there were 10 contaminants found consisting of
only the first base (`-n` was set to 0 for this run), 2 of the first
two bases, 4 of the first 3 bases, 5 of the first 4 bases, etc.
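
To make the reading of that array concrete, here is a small Python
sketch that summarizes it. The function name is hypothetical, and the
"bases removed" total assumes that an entry at position k means k
adapter bases were trimmed from each of those reads, which follows the
interpretation given above.

```python
def summarize_counts(counts):
    """Summarize a per-position contamination array printed by Scythe.

    counts[i] is taken to be the number of reads in which the first
    i + 1 bases of the adapter were found and trimmed.
    """
    trimmed_reads = sum(counts)
    # Each read counted at index i loses i + 1 bases.
    bases_removed = sum((i + 1) * c for i, c in enumerate(counts))
    return {
        "adapter_length": len(counts),
        "trimmed_reads": trimmed_reads,
        "bases_removed": bases_removed,
    }


summary = summarize_counts([10, 2, 4, 5, 6])
```

For the example array, this gives an adapter length of 5 and 27
trimmed reads in total.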

### Does Scythe work on FASTA files?

No, as these have no quality information.

### How can I report a bug?

See the "Reporting Bugs" section below.

### How does Scythe compare to program "x"?

As far as I know, Scythe is the only program that employs a Bayesian
model that allows prior contaminant estimates to be used. This prior
is a more realistic approach than setting a fixed number of mismatches
because we can visually estimate it with the Unix tool `less`.

Scythe also looks at base-level qualities, *not* just a fixed level of
mismatches. A fixed number of mismatches is a bad approach with data
our group (the UC Davis Bioinformatics Core) has seen, as a small run
of bad-quality bases can quickly exhaust even a high number of fixed
mismatches and lead to more false negatives.

## Reporting Bugs

Scythe is free software and is provided without a warranty. However, I
am proud of this software and I will do my best to provide updates,
bug fixes, and additional documentation as needed. Please report all
bugs and issues to GitHub's issue tracker
(http://github.com/vsbuffalo/scythe/issues). If you want to email me,
do so in addition to opening an issue.

If you have a suggestion or comment on Scythe's methods, you can email
me directly.

## Is there a paper about Scythe?

I am currently writing a paper on Scythe's methods. In my preliminary
testing, Scythe has fewer false positives and false negatives than
its competitors.