diff README.md @ 0:08439b004404
Uploaded
author | jjohnson |
---|---|
date | Mon, 13 Jan 2014 14:57:53 -0500 |
parents | |
children |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md Mon Jan 13 14:57:53 2014 -0500
@@ -0,0 +1,191 @@
+# Scythe - A very simple adapter trimmer (version 0.981 BETA)
+
+Scythe and all supporting documentation
+Copyright (c) Vince Buffalo, 2011-2012
+
+Contact: Vince Buffalo <vsbuffaloAAAAA@gmail.com> (with the poly-A tail removed)
+
+If you wish to report a bug, please open an issue on GitHub
+(http://github.com/vsbuffalo/scythe/issues) so that it can be
+tracked. You can contact me as well, but please open an issue first.
+
+## About
+
+Scythe uses a Naive Bayesian approach to classify contaminant
+substrings in sequence reads. It considers quality information, which
+can make it robust in picking out 3'-end adapters, which often include
+poor quality bases.
+
+Most next generation sequencing reads have deteriorating quality
+towards the 3'-end. It's common for a quality-based trimmer to be
+employed before mapping, assemblies, and analysis to remove these poor
+quality bases. However, quality-based trimming could remove bases that
+are helpful in identifying (and removing) 3'-end adapter
+contaminants. Thus, it is recommended you run Scythe *before*
+quality-based trimming, as part of a read quality control pipeline.
+
+The Bayesian approach Scythe uses compares two likelihood models: the
+probability of seeing the matches in a sequence given contamination,
+and the probability given no contamination. Given that the read is
+contaminated, the probability of seeing a certain number of matches and
+mismatches is a function of the quality of the sequence. Given the read
+is not contaminated (and is thus assumed to be random sequence), the
+probability of seeing a certain number of matches and mismatches is
+due to chance alone. The posterior is calculated across both these
+likelihood models, and the class (contaminated or not contaminated) with
+the maximum posterior probability is the class selected.
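The two-model comparison described in the About section can be illustrated with a short Python sketch. This is not Scythe's actual C implementation. It assumes Phred-scaled qualities, a per-base error probability of 10^(-Q/10), a 0.25 chance of a match in random sequence, and a 0.5 posterior cutoff. All function names here are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score into a base-call error probability."""
    e = 10 ** (-q / 10)
    # Clamp away from 0 and 1 so the logarithms below stay finite (an assumption).
    return min(max(e, 1e-10), 1 - 1e-10)


def log_likelihood_contaminated(matches, quals):
    """Log-likelihood of the match pattern if the adapter is really present.

    A match is explained by a correct base call (1 - e), a mismatch by a
    sequencing error (e), where e is derived from the base quality.
    """
    total = 0.0
    for is_match, q in zip(matches, quals):
        e = phred_to_error_prob(q)
        total += math.log(1 - e) if is_match else math.log(e)
    return total


def log_likelihood_random(matches, p_match=0.25):
    """Log-likelihood of the match pattern if the read is random sequence."""
    return sum(math.log(p_match if is_match else 1 - p_match)
               for is_match in matches)


def posterior_contaminated(matches, quals, prior=0.05):
    """Posterior probability of contamination given the match pattern."""
    log_c = math.log(prior) + log_likelihood_contaminated(matches, quals)
    log_r = math.log(1 - prior) + log_likelihood_random(matches)
    # Subtract the larger term before exponentiating for numerical stability.
    top = max(log_c, log_r)
    weight_c = math.exp(log_c - top)
    weight_r = math.exp(log_r - top)
    return weight_c / (weight_c + weight_r)


def classify(adapter, read_tail, quals, prior=0.05):
    """Return True if 'contaminated' is the maximum-posterior class."""
    matches = [a == b for a, b in zip(adapter, read_tail)]
    return posterior_contaminated(matches, quals, prior) > 0.5
```

Under these assumptions, two mismatches at very low quality still leave the read classified as contaminated, while the same two mismatches at high quality do not, because a confident base call that disagrees with the adapter is strong evidence against contamination.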
+
+## Requirements
+
+Scythe can be compiled using GCC or Clang; compilation during
+development used the latter. Scythe relies on Heng Li's kseq.h, which
+is bundled with the source.
+
+Scythe requires Zlib, which can be obtained at <http://www.zlib.net/>.
+
+## Building and Installing Scythe
+
+To build Scythe, enter:
+
+    make build
+
+Then, copy or move "scythe" to a directory in your $PATH.
+
+## Usage
+
+Scythe can be run minimally with:
+
+    scythe -a adapter_file.fasta -o trimmed_sequences.fasta sequences.fastq
+
+By default, the prior contamination rate is 0.05. This can be changed
+(and one is encouraged to do so!) with:
+
+    scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq
+
+If you'd like to use standard out, it is recommended you use the
+--quiet option:
+
+    scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq
+
+Also, more detailed output about matches can be obtained with:
+
+    scythe -a adapter_file.fasta -o trimmed_sequences.fasta -m matches.txt sequences.fastq
+
+By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger
+or Solexa (pipeline < 1.3) qualities can be specified with -q:
+
+    scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fasta sequences.fastq
+
+Lastly, a minimum match length argument can be specified with -n <integer>:
+
+    scythe -a adapter_file.fasta -n 0 -o trimmed_sequences.fasta sequences.fastq
+
+The default is 5. If this pre-processing is upstream of assembly on a
+very contaminated lane, decreasing this parameter could lead to *very*
+liberal trimming, i.e. of only a few bases.
+
+## Notes
+
+Scythe only checks for 3'-end contaminants, up to the adapter's length
+into the 3'-end. For reads with contamination in *any* position, the
+program TagDust (<http://genome.gsc.riken.jp/osc/english/dataresource/>)
+is recommended.
+Scythe has the advantages of allowing fuzzier matching
+and being base quality-aware, while TagDust has the advantages of very
+fast matching (but allowing few mismatches, and not considering
+quality) and FDR control. TagDust also removes contaminated reads
+*entirely*, while Scythe trims off contaminants.
+
+A possible pipeline would run FASTQ reads through Scythe, then
+TagDust, then a quality-based trimmer, and finally through a read
+quality statistics program such as qrqc
+(<http://bioconductor.org/packages/devel/bioc/html/qrqc.html>) or FastQC
+(<http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/>).
+
+## FAQ
+
+### Does Scythe work with paired-end data?
+
+Scythe does work with paired-end data. Each file must be run
+separately, but Scythe never removes reads entirely, so it will not
+leave mismatched pairs.
+
+In some cases, barcodes are ligated to both the 3'-end and 5'-end of
+reads. 5'-end removal is trivial since base calling is near-perfect
+there, but 3'-end removal can be trickier. Some users have created
+Scythe adapter files that contain all possible barcodes concatenated
+with possible adapters, so that both can be recognized and
+removed. This has worked well and is recommended for cases when 3'-end
+quality deteriorates and prevents barcode removal. Newer Illumina
+chemistry has the barcode separated from the fragment, so that it
+appears as an entirely separate read and is used to demultiplex sample
+reads by Illumina's CASAVA pipeline.
+
+### Does Scythe work on 5'-end or other contaminants?
+
+No. Embracing the Unix tool philosophy that tools should do one thing
+very well, Scythe just removes 3'-end contaminants where there could
+be multiple base mismatches due to poor base quality. N-mismatch
+algorithms (such as TagDust) don't consider base qualities. Scythe
+will allow more mismatches in an alignment if the mismatched bases are
+of low quality.
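To make the contrast with fixed N-mismatch rules concrete, the Python sketch below expresses each base as a log-odds contribution and counts how many mismatches a candidate match can absorb before the odds turn against contamination. It is an illustration under stated assumptions, not Scythe's code: qualities are Phred-scaled, random sequence matches with probability 0.25, and the error probability is capped at 0.75 so that a very poor base simply carries no evidence. The function names are hypothetical.

```python
import math


def error_prob(q):
    """Phred quality to error probability, capped at 0.75 (an assumption)."""
    return min(10 ** (-q / 10), 0.75)


def match_log_odds(q):
    """Evidence for contamination contributed by one matching base."""
    return math.log((1 - error_prob(q)) / 0.25)


def mismatch_log_odds(q):
    """Evidence contributed by one mismatching base (zero or negative)."""
    return math.log(error_prob(q) / 0.75)


def tolerated_mismatches(n_bases, q_match, q_mismatch, prior=0.05):
    """Largest mismatch count that keeps the posterior odds above 1.

    Returns None if even a perfect match of n_bases is not convincing.
    """
    prior_log_odds = math.log(prior / (1 - prior))
    best = None
    for k in range(n_bases + 1):
        total = (prior_log_odds
                 + (n_bases - k) * match_log_odds(q_match)
                 + k * mismatch_log_odds(q_mismatch))
        if total <= 0:
            break  # each extra mismatch only lowers the total further
        best = k
    return best
```

For example, `tolerated_mismatches(20, 30, 30)` can be compared with `tolerated_mismatches(20, 30, 5)`: in this toy model the second call tolerates more mismatches, because a mismatch at a low-quality position costs far less evidence than one at a high-quality position.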
+
+**Scythe only checks as far in as the entire adapter contaminant's
+length.** However, some investigation has shown that Illumina
+pipelines sometimes produce reads longer than the read length +
+adapter length. The extra bases have always been observed to be
+A's. Some testing has shown this can be addressed by appending A's to
+the adapters in the adapters file. Since Scythe begins by checking for
+contamination from the 5'-end of the adapter, this won't affect the
+normal adapter contaminant cases.
+
+### What does the numeric output from Scythe mean?
+
+For each adapter in the file, the contaminants removed by position are
+returned via standard error. For example:
+
+    Adapter 1 'fake adapter' contamination occurences:
+    [10, 2, 4, 5, 6]
+
+indicates that "fake adapter" is 5 bases long (the length of the array
+returned), and that there were 10 contaminants found of the first base
+(-n was set to 0 then), 2 of the first two bases, 4 contaminants of the
+first 3 bases, 5 of the first 4 bases, etc.
+
+### Does Scythe work on FASTA files?
+
+No, as these have no quality information.
+
+### How can I report a bug?
+
+See the "Reporting Bugs" section below.
+
+### How does Scythe compare to program "x"?
+
+As far as I know, Scythe is the only program that employs a Bayesian
+model that allows prior contaminant estimates to be used. This prior
+is a more realistic approach than setting a fixed number of mismatches
+because we can visually estimate it with the Unix tool `less`.
+
+Scythe also looks at base-level qualities, *not* just a fixed level of
+mismatches. A fixed number of mismatches is a bad approach with data
+our group (the UC Davis Bioinformatics Core) has seen, as a small bad
+quality run can quickly exhaust even a high number of fixed
+mismatches and lead to more false negatives.
+
+## Reporting Bugs
+
+Scythe is free software and is provided without a warranty.
+However, I am proud of this software and I will do my best to provide
+updates, bug fixes, and additional documentation as needed. Please
+report all bugs and issues to GitHub's issue tracker
+(http://github.com/vsbuffalo/scythe/issues). If you want to email me,
+do so in addition to an issue request.
+
+If you have a suggestion or comment on Scythe's methods, you can email
+me directly.
+
+## Is there a paper about Scythe?
+
+I am currently writing a paper on Scythe's methods. In my preliminary
+testing, Scythe has fewer false positives and false negatives than
+its competitors.
\ No newline at end of file