Mercurial > repos > jjohnson > package_scythe_0_991
changeset 0:3e82f8dbfdf0 draft default tip
Uploaded
author | jjohnson |
---|---|
date | Mon, 13 Jan 2014 14:32:36 -0500 |
parents | |
children | |
files | tool_dependencies.xml |
diffstat | 1 files changed, 32 insertions(+), 0 deletions(-) [+] |
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tool_dependencies.xml	Mon Jan 13 14:32:36 2014 -0500
@@ -0,0 +1,32 @@
+<?xml version="1.0"?>
+<tool_dependency>
+    <package name="scythe" version="0.991">
+        <install version="1.0">
+            <actions>
+                <action type="shell_command">git clone git://github.com/vsbuffalo/scythe.git</action>
+                <action type="shell_command">git reset --hard 9b965ee399a18caf1f96e433f78d405620e3a1df</action>
+                <action type="shell_command">make</action>
+                <action type="move_file">
+                    <source>scythe</source>
+                    <destination>$INSTALL_DIR</destination>
+                </action>
+                <action type="set_environment">
+                    <environment_variable name="PATH" action="prepend_to">$INSTALL_DIR</environment_variable>
+                </action>
+            </actions>
+        </install>
+        <readme>
+Scythe - A Bayesian adapter trimmer
+Scythe and all supporting documentation Copyright (c) Vince Buffalo, 2011-2012
+https://github.com/vsbuffalo/scythe
+
+Scythe uses a Naive Bayesian approach to classify contaminant substrings in sequence reads. It considers quality information, which can make it robust in picking out 3'-end adapters, which often include poor quality bases.
+
+Most next generation sequencing reads have deteriorating quality towards the 3'-end. It's common for a quality-based trimmer to be employed before mapping, assemblies, and analysis to remove these poor quality bases. However, quality-based trimming could remove bases that are helpful in identifying (and removing) 3'-end adapter contaminants. Thus, it is recommended you run Scythe before quality-based trimming, as part of a read quality control pipeline.
+
+The Bayesian approach Scythe uses compares two likelihood models: the probability of seeing the matches in a sequence given contamination, and not given contamination. Given that the read is contaminated, the probability of seeing a certain number of matches and mismatches is a function of the quality of the sequence. Given the read is not contaminated (and is thus assumed to be random sequence), the probability of seeing a certain number of matches and mismatches is chance. The posterior is calculated across both these likelihood models, and the class (contaminated or not contaminated) with the maximum posterior probability is the class selected.
+
+        </readme>
+    </package>
+</tool_dependency>
+
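
The two-likelihood comparison described in the readme can be illustrated with a short Python sketch. This is not Scythe's actual code. It assumes Phred-scaled qualities, a per-base match probability of 1 - e under contamination (e being the base's error probability), a 1/4 chance of a match under the random-sequence model, and a prior contamination rate of 0.3. The function names and the prior value are hypothetical and chosen only for illustration.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def contamination_posterior(read, adapter, quals, prior=0.3):
    """Posterior probability that `read` is adapter contamination.

    Assumed model (a simplification, not Scythe's implementation):
      contaminated: match with prob 1 - e, mismatch with prob e
      random:       match with prob 1/4,   mismatch with prob 3/4
    """
    like_contam = 1.0
    like_random = 1.0
    for base, adapter_base, q in zip(read, adapter, quals):
        e = phred_to_error(q)
        if base == adapter_base:
            like_contam *= 1 - e
            like_random *= 0.25
        else:
            like_contam *= e
            like_random *= 0.75
    numerator = prior * like_contam
    return numerator / (numerator + (1 - prior) * like_random)


def classify(read, adapter, quals, prior=0.3):
    """Pick the class with the larger posterior probability."""
    post = contamination_posterior(read, adapter, quals, prior)
    return "contaminated" if post > 0.5 else "not contaminated"


if __name__ == "__main__":
    adapter = "AGATCGGA"
    read = "AGATCGTT"  # last two bases disagree with the adapter
    low_q = [30] * 6 + [2] * 2    # mismatches fall on poor-quality bases
    high_q = [30] * 8             # mismatches fall on confident bases
    print(classify(read, adapter, low_q))
    print(classify(read, adapter, high_q))
```

Under these assumptions, mismatches on low-quality bases cost the contamination model very little, which is why the readme recommends running Scythe before quality-based trimming.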