comparison downsample.xml @ 1:03aeb837e398 draft default tip
Uploaded
author | dave |
---|---|
date | Tue, 01 Oct 2019 16:25:02 -0400 |
parents | 20823bce09e7 |
children |
0:20823bce09e7 | 1:03aeb837e398 |
---|---|
1 <?xml version="1.0"?> | 1 <?xml version="1.0"?> |
2 <tool id="dynamic_downsample" name="Dynamically downsample" version="1.0.0"> | 2 <tool id="dynamic_downsample" name="Downsample" version="1.0.0"> |
3 <description>reads to desired coverage</description> | 3 <description>reads to desired coverage</description> |
4 <requirements> | 4 <requirements> |
5 <requirement type="package" version="1.9">samtools</requirement> | 5 <requirement type="package" version="1.9">samtools</requirement> |
6 <requirement type="package" version="5.0.1">gawk</requirement> | 6 <requirement type="package" version="5.0.1">gawk</requirement> |
7 </requirements> | 7 </requirements> |
8 <command><![CDATA[ | 8 <command><![CDATA[ |
9 if FACTOR=\$(samtools depth '$reads' | awk '{ a[i++]=\$3; } END { x=int((i+1)/2); if (x < (i+1)/2) y=(a[x-1]+a[x])/2; else y=a[x-1]; f = 1/(y/$coverage) ; if (f >= 1) exit 1 ; else print f }') ; | 9 if FACTOR=\$(samtools depth '$reads' | awk '{ readcovs[x++]=\$3; } END { n = asort(readcovs) ; idx=int((x+1)/2) ; coverage = ((idx==(x+1)/2) ? readcovs[idx] : (readcovs[idx]+readcovs[idx+1])/2) ; factor = 1/(coverage/$target_coverage) ; if (factor >= 1) exit 1 ; else print factor }') ; |
10 then samtools view '$reads' -s \$FACTOR -O $reads.datatype -o '$output' ; | 10 then samtools view '$reads' -s \$FACTOR -O BAM -o '$output' -@ \${GALAXY_SLOTS:-1} ; |
11 else ; | 11 else samtools view -O BAM '$reads' -o '$output' ; |
12 cp '$reads' '$output' | |
13 fi | 12 fi |
14 ]]> | 13 ]]> |
15 </command> | 14 </command> |
16 <inputs> | 15 <inputs> |
17 <param name="reads" type="data" format="sam,bam" label="Reads to downsample" /> | 16 <param name="reads" type="data" format="sam,bam" label="Reads to downsample" /> |
18 <param name="coverage" type="integer" value="1000" label="Target coverage" /> | 17 <param name="target_coverage" type="integer" value="1000" label="Target coverage" /> |
19 </inputs> | 18 </inputs> |
20 <outputs> | 19 <outputs> |
21 <data format="bam" name="output" label="${tool.name} on ${on_string} (Downsampled to ${coverage}x coverage)"> | 20 <data format="bam" name="output" label="Downsample ${on_string} to ${target_coverage}x coverage" /> |
22 <change_format> | |
23 <when input="reads" value="sam" format="sam" /> | |
24 </change_format> | |
25 </data> | |
26 </outputs> | 21 </outputs> |
27 <tests> | 22 <tests> |
23 <test> | |
24 <param name="reads" ftype="bam" value="downsample-in1.bam" /> | |
25 <param name="target_coverage" value="100" /> | |
26 <output name="output" file="downsample-out1.bam" /> | |
27 </test> | |
28 </tests> | 28 </tests> |
29 <help> | 29 <help><![CDATA[ |
30 .. role:: bash(code) | |
31 :language: bash | |
32 | |
33 | |
34 Dynamic Downsampling | |
35 ~~~~~~~~~~~~~~~~~~~~ | |
36 | |
37 A known issue with variant analysis is that when small genomes are sequenced, | |
38 e.g. HIV at 9.7 kb or the human mitochondrial genome at 16.6 kb, the resulting | |
39 coverage can easily exceed 10,000x. This can cause performance issues for some | |
40 variant callers, especially those that employ a haplotyping approach to variant | |
41 detection. | |
42 | |
43 This tool mitigates that issue by downsampling its input to the target | |
44 coverage. It uses :bash:`samtools depth` to determine the median coverage of | |
45 the input file, then runs :bash:`samtools view -s` on the file if the factor | |
46 1 / (median coverage / target coverage) is less than 1. | |
47 | |
48 .. code-block:: bash | |
49 | |
50 -s FLOAT subsample reads (given INT.FRAC option value, 0.FRAC is the fraction of templates/read pairs to keep; INT part sets seed) | |
51 | |
52 The median coverage is determined by piping the output of :bash:`samtools depth` | |
53 through the following :bash:`awk` script, where :bash:`$target_coverage` is the | |
54 value specified in the tool form: | |
55 | |
56 .. code-block:: awk | |
57 | |
58 '{ readcovs[x++]=$3; } | |
59 END { | |
60 n = asort(readcovs) ; | |
61 idx=int((x+1)/2) ; | |
62 coverage = ((idx==(x+1)/2) ? readcovs[idx] : (readcovs[idx]+readcovs[idx+1])/2) ; | |
63 factor = 1/(coverage/$target_coverage) ; | |
64 if (factor >= 1) exit 1 ; | |
65 else print factor | |
66 }' | |
67 | |
68 On an exit code of 1, the tool writes every input read to the BAM output without | |
69 downsampling. If the :bash:`awk` step prints a factor instead, the tool runs | |
70 :bash:`samtools view -s` with that factor, 1 / (median coverage / target coverage). | |
71 | |
72 ]]> | |
30 </help> | 73 </help> |
31 <citations> | 74 <citations> |
32 </citations> | 75 </citations> |
33 </tool> | 76 </tool> |
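
The median-and-factor logic described in the help text of revision 1 can be sketched outside of awk. The following is a minimal illustration in Python, not the tool's implementation: the function name `downsample_fraction` and the sample depth values are hypothetical, and returning `None` for empty input is an assumption made for this sketch. It takes the per-position depths (the third column of `samtools depth` output) and returns the fraction to pass to `samtools view -s`, or `None` when the median coverage is already at or below the target.

```python
from statistics import median


def downsample_fraction(depths, target_coverage):
    """Return the fraction of reads to keep, or None if no downsampling is needed.

    Hypothetical helper mirroring the awk step: factor = 1 / (median / target).
    """
    if not depths:
        # Assumption for this sketch: with no depth data, do not downsample.
        return None
    med = median(depths)
    if med <= 0:
        # Avoid dividing by zero when the median depth is zero.
        return None
    fraction = target_coverage / med  # algebraically equal to 1 / (med / target)
    # A factor of 1 or more means coverage is already at or below the target.
    return fraction if fraction < 1 else None


# Made-up depths with a median of 2000x.
depths = [1800, 2200, 2000, 2100, 1900]
print(downsample_fraction(depths, 1000))  # 0.5
print(downsample_fraction(depths, 5000))  # None
```

A returned value of 0.5 corresponds to keeping roughly half of the templates; `None` corresponds to the branch in which the tool writes the reads out without subsampling.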