scythe: README.md annotate

annotate README.md @ 1:b0276d1141fe default tip

Fix test case

author	Jim Johnson <jj@umn.edu>
date	Thu, 30 Jan 2014 13:10:12 -0600
parents	08439b004404
children

# Scythe - A very simple adapter trimmer (version 0.981 BETA)

Scythe and all supporting documentation
Copyright (c) Vince Buffalo, 2011-2012

Contact: Vince Buffalo <vsbuffaloAAAAA@gmail.com> (with the poly-A tail removed)

If you wish to report a bug, please open an issue on GitHub
(http://github.com/vsbuffalo/scythe/issues) so that it can be
tracked. You can contact me as well, but please open an issue first.

## About

Scythe uses a Naive Bayesian approach to classify contaminant
substrings in sequence reads. It considers quality information, which
can make it robust in picking out 3'-end adapters, which often include
poor quality bases.

Most next generation sequencing reads have deteriorating quality
towards the 3'-end. It's common for a quality-based trimmer to be
employed before mapping, assemblies, and analysis to remove these poor
quality bases. However, quality-based trimming could remove bases that
are helpful in identifying (and removing) 3'-end adapter
contaminants. Thus, it is recommended you run Scythe before
quality-based trimming, as part of a read quality control pipeline.

The Bayesian approach Scythe uses compares two likelihood models: the
probability of seeing the matches in a sequence given contamination,
and not given contamination. Given that the read is contaminated, the
probability of seeing a certain number of matches and mismatches is a
function of the quality of the sequence. Given the read is not
contaminated (and is thus assumed to be random sequence), the
probability of seeing a certain number of matches and mismatches is
chance. The posterior is calculated across both these likelihood
models, and the class (contaminated or not contaminated) with the
maximum posterior probability is the class selected.

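The comparison described above can be sketched in a few lines of
Python. This is an illustration only, not Scythe's actual C code. It
assumes Phred-scaled qualities, that under contamination a match means
the base call was right and a mismatch means it was wrong, and that
under random sequence a match occurs with probability 1/4. The function
names are hypothetical, and alignment and the minimum match length are
left out.

```python
from math import prod


def error_prob(q):
    """Phred quality score -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


def posterior_contaminated(matches, quals, prior=0.05):
    """Posterior probability that an aligned window is adapter contamination.

    matches: list of bools, True where the read base equals the adapter base.
    quals:   Phred quality scores of the read bases, same length as matches.
    prior:   prior contamination rate (0.05 is the default named in Usage).
    """
    # Likelihood given contamination (assumed model): a match means the
    # base was called correctly, a mismatch means it was called wrongly.
    like_contam = prod(
        (1 - error_prob(q)) if m else error_prob(q)
        for m, q in zip(matches, quals)
    )
    # Likelihood given random sequence: matches happen by chance (1 in 4).
    like_random = prod(0.25 if m else 0.75 for m in matches)

    numerator = prior * like_contam
    return numerator / (numerator + (1 - prior) * like_random)


def is_contaminated(matches, quals, prior=0.05):
    """Pick the class with the larger posterior probability."""
    return posterior_contaminated(matches, quals, prior) > 0.5
```

Under these assumptions a mismatch at a low-quality base lowers the
posterior far less than a mismatch at a high-quality base, which is the
behaviour the paragraph above describes.
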
## Requirements

Scythe can be compiled using GCC or Clang; compilation during
development used the latter. Scythe relies on Heng Li's kseq.h, which
is bundled with the source.

Scythe requires Zlib, which can be obtained at <http://www.zlib.net/>.

## Building and Installing Scythe

To build Scythe, enter:

    make build

Then, copy or move "scythe" to a directory in your $PATH.

## Usage

Scythe can be run minimally with:

    scythe -a adapter_file.fasta -o trimmed_sequences.fastq sequences.fastq

By default, the prior contamination rate is 0.05. This can be changed
(and one is encouraged to do so!) with:

    scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq

If you'd like to use standard out, it is recommended you use the
--quiet option:

    scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq

Also, more detailed output about matches can be obtained with:

    scythe -a adapter_file.fasta -o trimmed_sequences.fastq -m matches.txt sequences.fastq

By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger
or Solexa (pipeline < 1.3) qualities can be specified with -q:

    scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fastq sequences.fastq

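The quality scheme matters because the same FASTQ character decodes to
a different error probability under each one. The sketch below uses the
standard encodings (Sanger: Phred+33, Illumina 1.3+: Phred+64, Solexa:
offset 64 with an odds-based score). It is an illustration, not
Scythe's own code; the function name is hypothetical, and the scheme
names `sanger` and `illumina` are assumed here by analogy with the
`solexa` value shown above.

```python
def base_error_prob(char, scheme="illumina"):
    """Decode one FASTQ quality character into an error probability."""
    if scheme == "sanger":
        # Phred score with ASCII offset 33.
        return 10 ** (-(ord(char) - 33) / 10)
    if scheme == "illumina":
        # Phred score with ASCII offset 64 (Illumina pipeline 1.3+).
        return 10 ** (-(ord(char) - 64) / 10)
    if scheme == "solexa":
        # Solexa scores are log-odds, Q = -10 * log10(p / (1 - p)),
        # stored with ASCII offset 64.
        q = ord(char) - 64
        return 1 / (1 + 10 ** (q / 10))
    raise ValueError(f"unknown quality scheme: {scheme}")
```
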
Lastly, a minimum match length argument can be specified with -n <integer>:

    scythe -a adapter_file.fasta -n 0 -o trimmed_sequences.fastq sequences.fastq

The default is 5. If this pre-processing is upstream of assembly on a
very contaminated lane, decreasing this parameter could lead to very
liberal trimming, i.e. trimming matches of only a few bases.

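When Scythe is driven from a script, the options above can be assembled
programmatically. The helper below is a hypothetical sketch that only
uses the flags shown in this section (`-a`, `-p`, `-q`, `-n`, `-m`,
`-o`, `--quiet`); it builds the argument list and leaves running it to
the caller.

```python
def build_scythe_command(adapters, reads, output=None, prior=None,
                         quality=None, min_match=None, matches_file=None,
                         quiet=False):
    """Return a scythe argument list built from the options in Usage."""
    cmd = ["scythe", "-a", adapters]
    if prior is not None:
        cmd += ["-p", str(prior)]        # prior contamination rate
    if quality is not None:
        cmd += ["-q", quality]           # quality scheme, e.g. "solexa"
    if min_match is not None:
        cmd += ["-n", str(min_match)]    # minimum match length
    if matches_file is not None:
        cmd += ["-m", matches_file]      # detailed match output
    if output is not None:
        cmd += ["-o", output]            # trimmed reads; stdout if omitted
    if quiet:
        cmd.append("--quiet")
    cmd.append(reads)                    # input FASTQ goes last
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`,
provided `scythe` is on your `$PATH`.
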
## Notes

Scythe only checks for 3'-end contaminants, up to the adapter's length
into the 3'-end. For reads with contamination in any position, the
program TagDust (<http://genome.gsc.riken.jp/osc/english/dataresource/>)
is recommended. Scythe has the advantages of allowing fuzzier matching
and being base quality-aware, while TagDust has the advantages of very
fast matching (but allowing few mismatches, and not considering
quality) and FDR. TagDust also removes contaminated reads entirely, while
Scythe trims off contaminants.

A possible pipeline would run FASTQ reads through Scythe, then
TagDust, then a quality-based trimmer, and finally through a read
quality statistics program such as qrqc
(<http://bioconductor.org/packages/devel/bioc/html/qrqc.html>) or FastQC
(<http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/>).

## FAQ

### Does Scythe work with paired-end data?

Scythe does work with paired-end data. Each file must be run
separately, but because Scythe trims reads rather than removing them
entirely, it does not leave mismatched pairs.

In some cases, barcodes are ligated to both the 3'-end and 5'-end of
reads. 5'-end removal is trivial since base calling is near-perfect
there, but 3'-end removal can be trickier. Some users have created
Scythe adapter files that contain all possible barcodes concatenated
with possible adapters, so that both can be recognized and
removed. This has worked well and is recommended for cases when 3'-end
quality deteriorates and prevents barcode removal. Newer Illumina
chemistry has the barcode separated from the fragment, so that it
appears as an entirely separate read and is used to demultiplex sample
reads by Illumina's CASAVA pipeline.

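An adapter file of that kind can be generated rather than written by
hand. The sketch below is hypothetical: the function name and record
naming are made up, and it assumes the barcode is placed before the
adapter, which the text above does not specify. Adjust the order to
match your library construction.

```python
from itertools import product


def barcode_adapter_fasta(barcodes, adapters):
    """Build FASTA text with every barcode/adapter combination.

    barcodes, adapters: dicts mapping a name to a sequence.
    Assumption: each record is barcode followed by adapter.
    """
    records = []
    for (bname, bseq), (aname, aseq) in product(barcodes.items(),
                                                adapters.items()):
        records.append(f">{bname}_{aname}\n{bseq}{aseq}\n")
    return "".join(records)
```

Writing the returned text to a file gives something that can be passed
to Scythe with `-a`.
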
### Does Scythe work on 5'-end or other contaminants?

No. Embracing the Unix tool philosophy that tools should do one thing
very well, Scythe just removes 3'-end contaminants where there could
be multiple base mismatches due to poor base quality. N-mismatch
algorithms (such as TagDust) don't consider base qualities. Scythe
will allow more mismatches in an alignment if the mismatched bases are
of low quality.

**Scythe only checks as far in as the entire adapter contaminant's
length.** However, some investigation has shown that Illumina
pipelines sometimes produce reads longer than the read length +
adapter length. The extra bases have always been observed to be
A's. Some testing has shown this can be addressed by appending A's to
the adapters in the adapters file. Since Scythe begins by checking for
contamination from the 5'-end of the adapter, this won't affect the
normal adapter contaminant cases.

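Appending A's to every adapter can be scripted. The helper below is a
hypothetical sketch; the default of 10 A's is an arbitrary placeholder,
since the text above does not say how many are needed.

```python
def append_poly_a(fasta_text, n_a=10):
    """Return FASTA text with n_a A's appended to every sequence.

    Multi-line sequences are joined onto one line first, so the A's are
    added once per record rather than once per line.
    """
    records = []  # list of [header, sequence] pairs
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line
    return "".join(f"{h}\n{s}{'A' * n_a}\n" for h, s in records)
```
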
### What does the numeric output from Scythe mean?

For each adapter in the file, the contaminants removed by position are
returned via standard error. For example:

    Adapter 1 'fake adapter' contamination occurences:
    [10, 2, 4, 5, 6]

indicates that "fake adapter" is 5 bases long (the length of the array
returned), and that there were 10 contaminants found of the first base
(-n was set to 0 in this run), 2 of the first two bases, 4 of the first
3 bases, 5 of the first 4 bases, and 6 of all 5 bases.

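If you want to post-process these counts, the array line can be parsed
directly. The sketch below assumes the array is printed exactly as in
the example above; the function name is hypothetical.

```python
import ast


def contaminant_counts(array_line):
    """Map matched adapter length -> number of contaminants removed.

    array_line: the bracketed line from Scythe's standard error,
    e.g. "[10, 2, 4, 5, 6]".
    """
    counts = ast.literal_eval(array_line.strip())
    # Position i in the array counts matches of the first i + 1 bases.
    return {length: count for length, count in enumerate(counts, start=1)}
```

The length of the returned dictionary is the adapter length, and the
sum of its values is the total number of reads trimmed for that adapter.
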
### Does Scythe work on FASTA files?

No, as these have no quality information.

### How can I report a bug?

See the "Reporting Bugs" section below.

### How does Scythe compare to program "x"?

As far as I know, Scythe is the only program that employs a Bayesian
model that allows prior contaminant estimates to be used. Using a prior
is more realistic than setting a fixed number of mismatches, because
the contamination rate can be estimated visually by paging through the
reads with the Unix tool `less`.

Scythe also looks at base-level qualities, not just a fixed level of
mismatches. A fixed number of mismatches works poorly on data our
group (the UC Davis Bioinformatics Core) has seen, as a short run of
bad-quality bases can quickly exhaust even a high number of allowed
mismatches and lead to more false negatives.

## Reporting Bugs

Scythe is free software and is provided without a warranty. However, I
am proud of this software and I will do my best to provide updates,
bug fixes, and additional documentation as needed. Please report all
bugs and issues to GitHub's issue tracker
(http://github.com/vsbuffalo/scythe/issues). If you want to email me,
do so in addition to an issue request.

If you have a suggestion or comment on Scythe's methods, you can email
me directly.

## Is there a paper about Scythe?

I am currently writing a paper on Scythe's methods. In my preliminary
testing, Scythe has fewer false positives and false negatives than
its competitors.
