annotate allele-counts.py @ 14:e5d39283fe4d

allele-counts.py: Document output columns.
author nicksto <nmapsy@gmail.com>
date Wed, 09 Dec 2015 11:03:33 -0500
parents df1fb577db0d
children 44c3abd1b767
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
1 #!/usr/bin/python
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
2 """
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
3 Run with -h option or see DESCRIPTION for description.
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
4 This script's functionality is being obsoleted by the new, and much more sanely
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
5 written, nvc-filter.py.
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
6
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
7 New in this version:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
8 - Calculate strand bias
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
9 - Rename MINOR.FREQ.PERC to MAF
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
10
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
11 Naive Variant Caller variant count parsing one-liner:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
12 $ cat variants.vcf | grep -v '^#' | cut -f 10 | cut -d ':' -f 4 | tr ',=' '\t:'
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
13 """
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
14 from __future__ import division
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
15 import os
57e85db3f13c Uploaded script
nick
parents:
diff changeset
16 import sys
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
17 import errno
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
18 import random
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
19 from optparse import OptionParser
57e85db3f13c Uploaded script
nick
parents:
diff changeset
20
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
21 COLUMNS = ['sample', 'chr', 'pos', 'A', 'C', 'G', 'T', 'coverage', 'alleles',
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
22 'major', 'minor', 'freq', 'bias']
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
23 COLUMN_LABELS = ['SAMPLE', 'CHR', 'POS', 'A', 'C', 'G', 'T', 'CVRG', 'ALLELES',
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
24 'MAJOR', 'MINOR', 'MAF', 'BIAS']
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
25 CANONICAL_VARIANTS = ['A', 'C', 'G', 'T']
14
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
26 USAGE = """Usage: %prog [options] -i variants.vcf -o alleles.tsv
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
27 cat variants.vcf | %prog [options] > alleles.tsv"""
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
28 OPT_DEFAULTS = {'infile':'-', 'outfile':'-', 'freq_thres':1.0, 'covg_thres':100,
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
29 'print_header':False, 'stdin':False, 'stranded':False, 'no_filter':False,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
30 'debug_loc':'', 'seed':''}
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
31 DESCRIPTION = """This will parse the VCF output of the "Naive Variant Caller"
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
32 (aka "BAM Coverage") Galaxy tool. For each position reported, it counts the
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
33 number of reads of each base, determines the major allele, minor allele (second
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
34 most frequent variant), and number of alleles above a threshold. So currently
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
35 it only considers SNVs (ACGT), including in the coverage figure. By default it
14
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
36 reads from stdin and prints to stdout.
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
37 Prints a tab-delimited set of statistics to stdout.
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
38 To print output column labels, run "$ echo -n | ./allele-counts.py -H".
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
39 The columns are: 1:SAMPLE 2:CHR 3:POS 4:A 5:C 6:G 7:T 8:CVRG 9:ALLELES 10:MAJOR
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
40 11:MINOR 12:MAF 13:BIAS,
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
41 unless the --stranded option is used, in which case they are:
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
42 1:SAMPLE 2:CHR 3:POS 4:+A 5:+C 6:+G 7:+T 8:-A 9:-C 10:-G 11:-T 12:CVRG
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
43 13:ALLELES 14:MAJOR 15:MINOR 16:MAF 17:BIAS.
e5d39283fe4d allele-counts.py: Document output columns.
nicksto <nmapsy@gmail.com>
parents: 13
diff changeset
44 """
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
45 EPILOG = """Requirements:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
46 The input VCF must report the variants for each strand.
57e85db3f13c Uploaded script
nick
parents:
diff changeset
47 The variants should be case-sensitive (e.g. all capital base letters).
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
48 Strand bias: Both strands must show the same bases passing the frequency
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
49 threshold (but not necessarily in the same order). If the site fails this test,
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
50 the number of alleles is reported as 0."""
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
51
57e85db3f13c Uploaded script
nick
parents:
diff changeset
52 def get_options(defaults, usage, description='', epilog=''):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
53 """Get options, print usage text."""
57e85db3f13c Uploaded script
nick
parents:
diff changeset
54
57e85db3f13c Uploaded script
nick
parents:
diff changeset
55 parser = OptionParser(usage=usage, description=description, epilog=epilog)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
56
57e85db3f13c Uploaded script
nick
parents:
diff changeset
57 parser.add_option('-i', '--infile', dest='infile',
57e85db3f13c Uploaded script
nick
parents:
diff changeset
58 default=defaults.get('infile'),
57e85db3f13c Uploaded script
nick
parents:
diff changeset
59 help='Read input VCF data from this file instead of stdin.')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
60 parser.add_option('-o', '--outfile', dest='outfile',
57e85db3f13c Uploaded script
nick
parents:
diff changeset
61 default=defaults.get('outfile'),
57e85db3f13c Uploaded script
nick
parents:
diff changeset
62 help='Print output data to this file instead of stdout.')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
63 parser.add_option('-f', '--freq-thres', dest='freq_thres', type='float',
57e85db3f13c Uploaded script
nick
parents:
diff changeset
64 default=defaults.get('freq_thres'),
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
65 help=('Frequency threshold for counting alleles, given in percentage: -f 1 '
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
66 +'= 1% frequency. Default is %default%.'))
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
67 parser.add_option('-c', '--covg-thres', dest='covg_thres', type='int',
57e85db3f13c Uploaded script
nick
parents:
diff changeset
68 default=defaults.get('covg_thres'),
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
69 help=('Coverage threshold. Each site must be supported by at least this '
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
70 +'many reads on each strand. Otherwise the site will not be printed in '
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
71 +'the output. The default is %default reads per strand.'))
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
72 parser.add_option('-n', '--no-filter', dest='no_filter', action='store_const',
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
73 const=not(defaults.get('no_filter')), default=defaults.get('no_filter'),
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
74 help=('Operate without a frequency threshold or coverage threshold. '
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
75 +'Equivalent to "-c 0 -f 0".'))
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
76 parser.add_option('-H', '--header', dest='print_header', action='store_const',
57e85db3f13c Uploaded script
nick
parents:
diff changeset
77 const=not(defaults.get('print_header')), default=defaults.get('print_header'),
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
78 help=('Print header line. This is a #-commented line with the column '
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
79 +'labels. Off by default.'))
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
80 parser.add_option('-s', '--stranded', dest='stranded', action='store_const',
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
81 const=not(defaults.get('stranded')), default=defaults.get('stranded'),
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
82 help='Report variant counts by strand, in separate columns. Off by default.')
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
83 parser.add_option('-r', '--rand-seed', dest='seed',
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
84 default=defaults.get('seed'), help=('Seed for random number generator.'))
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
85 parser.add_option('-d', '--debug', dest='debug_loc',
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
86 default=defaults.get('debug_loc'),
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
87 help=('Turn on debug mode and specify a single site to process using UCSC '
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
88 +'coordinate format. You can also append a sample ID after another ":" '
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
89 +'to restrict it further.'))
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
90
57e85db3f13c Uploaded script
nick
parents:
diff changeset
91 (options, args) = parser.parse_args()
57e85db3f13c Uploaded script
nick
parents:
diff changeset
92
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
93 return (options, args)
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
94
57e85db3f13c Uploaded script
nick
parents:
diff changeset
95
57e85db3f13c Uploaded script
nick
parents:
diff changeset
96 def main():
57e85db3f13c Uploaded script
nick
parents:
diff changeset
97
57e85db3f13c Uploaded script
nick
parents:
diff changeset
98 (options, args) = get_options(OPT_DEFAULTS, USAGE, DESCRIPTION, EPILOG)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
99
57e85db3f13c Uploaded script
nick
parents:
diff changeset
100 infile = options.infile
57e85db3f13c Uploaded script
nick
parents:
diff changeset
101 outfile = options.outfile
57e85db3f13c Uploaded script
nick
parents:
diff changeset
102 print_header = options.print_header
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
103 freq_thres = options.freq_thres / 100
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
104 covg_thres = options.covg_thres
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
105 stranded = options.stranded
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
106 debug_loc = options.debug_loc
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
107 seed = options.seed
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
108
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
109 if options.no_filter:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
110 freq_thres = 0
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
111 covg_thres = 0
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
112
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
113 if seed:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
114 random.seed(seed)
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
115
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
116 debug = False
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
117 print_sample = ''
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
118 if debug_loc:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
119 debug = True
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
120 coords = debug_loc.split(':')
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
121 print_chr = coords[0]
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
122 print_pos = ''
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
123 if len(coords) > 1: print_pos = coords[1]
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
124 if len(coords) > 2: print_sample = coords[2]
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
125
57e85db3f13c Uploaded script
nick
parents:
diff changeset
126 # set infile_handle to either stdin or the input file
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
127 global infile_handle
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
128 if infile == OPT_DEFAULTS.get('infile'):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
129 infile_handle = sys.stdin
57e85db3f13c Uploaded script
nick
parents:
diff changeset
130 sys.stderr.write("Reading from standard input..\n")
57e85db3f13c Uploaded script
nick
parents:
diff changeset
131 else:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
132 if os.path.exists(infile):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
133 infile_handle = open(infile, 'r')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
134 else:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
135 fail('Error: Input VCF file '+infile+' not found.')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
136
57e85db3f13c Uploaded script
nick
parents:
diff changeset
137 # set outfile_handle to either stdout or the output file
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
138 global outfile_handle
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
139 if outfile == OPT_DEFAULTS.get('outfile'):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
140 outfile_handle = sys.stdout
57e85db3f13c Uploaded script
nick
parents:
diff changeset
141 else:
2
d83368b907f7 No error on existing output file
nick
parents: 0
diff changeset
142 try:
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
143 outfile_handle = open(outfile, 'w')
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
144 except IOError:
2
d83368b907f7 No error on existing output file
nick
parents: 0
diff changeset
145 fail('Error: The given output filename '+outfile+' could not be opened.')
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
146
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
147 # Take care of column names, print header
3
4627d99aa105 New script version - change in header
nick
parents: 2
diff changeset
148 if len(COLUMNS) != len(COLUMN_LABELS):
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
149 fail('Error: Internal column names list do not match column labels list.')
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
150 if stranded:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
151 COLUMNS[3:7] = ['+A', '+C', '+G', '+T', '-A', '-C', '-G', '-T']
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
152 COLUMN_LABELS[3:7] = ['+A', '+C', '+G', '+T', '-A', '-C', '-G', '-T']
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
153 if print_header:
4
900d91d653cb Add '#' back to header line
nick
parents: 3
diff changeset
154 outfile_handle.write('#'+'\t'.join(COLUMN_LABELS)+"\n")
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
155
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
156 # main loop
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
157 # each iteration processes one VCF line and prints one or more output lines
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
158 # one VCF line = one site, one or more samples
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
159 # one output line = one site, one sample
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
160 sample_names = []
57e85db3f13c Uploaded script
nick
parents:
diff changeset
161 for line in infile_handle:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
162 line = line.rstrip('\r\n')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
163
57e85db3f13c Uploaded script
nick
parents:
diff changeset
164 # header lines
57e85db3f13c Uploaded script
nick
parents:
diff changeset
165 if line[0] == '#':
57e85db3f13c Uploaded script
nick
parents:
diff changeset
166 if line[0:6].upper() == '#CHROM':
57e85db3f13c Uploaded script
nick
parents:
diff changeset
167 sample_names = line.split('\t')[9:]
57e85db3f13c Uploaded script
nick
parents:
diff changeset
168 continue
57e85db3f13c Uploaded script
nick
parents:
diff changeset
169
57e85db3f13c Uploaded script
nick
parents:
diff changeset
170 if not sample_names:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
171 fail("Error in input VCF: Data line encountered before header line. "
57e85db3f13c Uploaded script
nick
parents:
diff changeset
172 +"Failed on line:\n"+line)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
173
57e85db3f13c Uploaded script
nick
parents:
diff changeset
174 site_data = read_site(line, sample_names, CANONICAL_VARIANTS)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
175
57e85db3f13c Uploaded script
nick
parents:
diff changeset
176 if debug:
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
177 if print_pos != site_data['pos']:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
178 continue
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
179 if print_chr != site_data['chr'] and print_chr != '':
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
180 continue
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
181 if print_sample != '':
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
182 for sample in site_data['samples'].keys():
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
183 if sample.lower() != print_sample.lower():
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
184 site_data['samples'].pop(sample, None)
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
185 if len(site_data['samples']) == 0:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
186 sys.stderr.write("Error: Sample '"+print_sample+"' not found.\n")
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
187 sys.exit(1)
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
188
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
189
57e85db3f13c Uploaded script
nick
parents:
diff changeset
190 site_summary = summarize_site(site_data, sample_names, CANONICAL_VARIANTS,
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
191 freq_thres, covg_thres, stranded, debug=debug)
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
192
57e85db3f13c Uploaded script
nick
parents:
diff changeset
193 if debug and site_summary[0]['print']:
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
194 print line.split('\t')[9].split(':')[-1]
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
195
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
196 try:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
197 print_site(outfile_handle, site_summary, COLUMNS)
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
198 except IOError as ioe:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
199 if ioe.errno == errno.EPIPE:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
200 cleanup()
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
201 sys.exit(0)
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
202
57e85db3f13c Uploaded script
nick
parents:
diff changeset
203 # close any open filehandles
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
204 cleanup()
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
205
57e85db3f13c Uploaded script
nick
parents:
diff changeset
206 # keeps Galaxy from giving an error if there were messages on stderr
57e85db3f13c Uploaded script
nick
parents:
diff changeset
207 sys.exit(0)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
208
57e85db3f13c Uploaded script
nick
parents:
diff changeset
209
57e85db3f13c Uploaded script
nick
parents:
diff changeset
210
57e85db3f13c Uploaded script
nick
parents:
diff changeset
211 def read_site(line, sample_names, canonical):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
212 """Read in a line, parse the variants into a data structure, and return it.
57e85db3f13c Uploaded script
nick
parents:
diff changeset
213 The line should be actual site data, not a header line, so check beforehand.
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
214 Only the variants in 'canonical' will be read; all others are ignored.
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
215 Note: the line is assumed to have been chomped.
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
216 The returned data is stored in a dict, with the following structure:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
217 {
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
218 'chr': 'chr1',
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
219 'pos': '2617',
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
220 'samples': {
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
221 'THYROID': {
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
222 '+A': 32,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
223 '-A': 45,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
224 '-G': 1,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
225 },
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
226 'BLOOD': {
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
227 '+A': 2,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
228 '-C': 1,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
229 '+G': 37,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
230 '-G': 42,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
231 },
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
232 },
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
233 }
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
234 """
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
235
57e85db3f13c Uploaded script
nick
parents:
diff changeset
236 site = {}
57e85db3f13c Uploaded script
nick
parents:
diff changeset
237 fields = line.split('\t')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
238
57e85db3f13c Uploaded script
nick
parents:
diff changeset
239 if len(fields) < 9:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
240 fail("Error in input VCF: wrong number of fields in data line. "
57e85db3f13c Uploaded script
nick
parents:
diff changeset
241 +"Failed on line:\n"+line)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
242
57e85db3f13c Uploaded script
nick
parents:
diff changeset
243 site['chr'] = fields[0]
57e85db3f13c Uploaded script
nick
parents:
diff changeset
244 site['pos'] = fields[1]
57e85db3f13c Uploaded script
nick
parents:
diff changeset
245 samples = fields[9:]
57e85db3f13c Uploaded script
nick
parents:
diff changeset
246
57e85db3f13c Uploaded script
nick
parents:
diff changeset
247 if len(samples) < len(sample_names):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
248 fail("Error in input VCF: missing sample fields in data line. "
57e85db3f13c Uploaded script
nick
parents:
diff changeset
249 +"Failed on line:\n"+line)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
250 elif len(samples) > len(sample_names):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
251 fail("Error in input VCF: more sample fields in data line than in header. "
57e85db3f13c Uploaded script
nick
parents:
diff changeset
252 +"Failed on line:\n"+line)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
253
57e85db3f13c Uploaded script
nick
parents:
diff changeset
254 sample_counts = {}
57e85db3f13c Uploaded script
nick
parents:
diff changeset
255 for i in range(len(samples)):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
256
57e85db3f13c Uploaded script
nick
parents:
diff changeset
257 variant_counts = {}
57e85db3f13c Uploaded script
nick
parents:
diff changeset
258 counts = samples[i].split(':')[-1]
57e85db3f13c Uploaded script
nick
parents:
diff changeset
259 counts = counts.split(',')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
260
57e85db3f13c Uploaded script
nick
parents:
diff changeset
261 for count in counts:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
262 if not count:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
263 continue
57e85db3f13c Uploaded script
nick
parents:
diff changeset
264 fields = count.split('=')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
265 if len(fields) != 2:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
266 fail("Error in input VCF: Incorrect variant data format (must contain "
57e85db3f13c Uploaded script
nick
parents:
diff changeset
267 +"a single '='). Failed on line:\n"+line)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
268 (variant, reads) = fields
57e85db3f13c Uploaded script
nick
parents:
diff changeset
269 if variant[1:] not in canonical:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
270 continue
57e85db3f13c Uploaded script
nick
parents:
diff changeset
271 if variant[0] != '-' and variant[0] != '+':
57e85db3f13c Uploaded script
nick
parents:
diff changeset
272 fail("Error in input VCF: variant data not strand-specific. "
57e85db3f13c Uploaded script
nick
parents:
diff changeset
273 +"Failed on line:\n"+line)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
274 try:
6
f6d9742c1da0 fixed error on decimal variant counts
nick
parents: 4
diff changeset
275 variant_counts[variant] = int(float(reads))
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
276 except ValueError:
7
34892857567b Ready for release - script
nick
parents: 6
diff changeset
277 fail("Error in input VCF: Variant count not a valid number. Failed on variant count string '"+reads+"'\nIn the following line:\n"+line)
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
278
57e85db3f13c Uploaded script
nick
parents:
diff changeset
279 sample_counts[sample_names[i]] = variant_counts
57e85db3f13c Uploaded script
nick
parents:
diff changeset
280
57e85db3f13c Uploaded script
nick
parents:
diff changeset
281 site['samples'] = sample_counts
57e85db3f13c Uploaded script
nick
parents:
diff changeset
282
57e85db3f13c Uploaded script
nick
parents:
diff changeset
283 return site
57e85db3f13c Uploaded script
nick
parents:
diff changeset
284
57e85db3f13c Uploaded script
nick
parents:
diff changeset
285
57e85db3f13c Uploaded script
nick
parents:
diff changeset
286 def summarize_site(site, sample_names, canonical, freq_thres, covg_thres,
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
287 stranded=False, debug=False):
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
288 """Take the raw data from the VCF line and transform it into the summary data
57e85db3f13c Uploaded script
nick
parents:
diff changeset
289 to be printed in the output format."""
57e85db3f13c Uploaded script
nick
parents:
diff changeset
290
57e85db3f13c Uploaded script
nick
parents:
diff changeset
291 site_summary = []
57e85db3f13c Uploaded script
nick
parents:
diff changeset
292 for sample_name in sample_names:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
293
57e85db3f13c Uploaded script
nick
parents:
diff changeset
294 sample = {'print':False}
57e85db3f13c Uploaded script
nick
parents:
diff changeset
295 variants = site['samples'].get(sample_name)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
296
57e85db3f13c Uploaded script
nick
parents:
diff changeset
297 sample['sample'] = sample_name
57e85db3f13c Uploaded script
nick
parents:
diff changeset
298 sample['chr'] = site['chr']
57e85db3f13c Uploaded script
nick
parents:
diff changeset
299 sample['pos'] = site['pos']
57e85db3f13c Uploaded script
nick
parents:
diff changeset
300
57e85db3f13c Uploaded script
nick
parents:
diff changeset
301 coverage = sum(variants.values())
57e85db3f13c Uploaded script
nick
parents:
diff changeset
302
57e85db3f13c Uploaded script
nick
parents:
diff changeset
303 # get stranded coverage
57e85db3f13c Uploaded script
nick
parents:
diff changeset
304 covg_plus = 0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
305 covg_minus = 0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
306 for variant in variants:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
307 if variant[0] == '+':
57e85db3f13c Uploaded script
nick
parents:
diff changeset
308 covg_plus += variants[variant]
57e85db3f13c Uploaded script
nick
parents:
diff changeset
309 elif variant[0] == '-':
57e85db3f13c Uploaded script
nick
parents:
diff changeset
310 covg_minus += variants[variant]
57e85db3f13c Uploaded script
nick
parents:
diff changeset
311 # stranded coverage threshold
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
312 if covg_plus < covg_thres or covg_minus < covg_thres:
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
313 site_summary.append(sample)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
314 continue
57e85db3f13c Uploaded script
nick
parents:
diff changeset
315 else:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
316 sample['print'] = True
57e85db3f13c Uploaded script
nick
parents:
diff changeset
317
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
318 # get an ordered list of read counts for all variants (both strands)
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
319 bases = get_read_counts(variants, '+-')
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
320 ranked_bases = process_read_counts(bases, sort=True, debug=debug)
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
321
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
322 # prepare stranded or unstranded lists of base counts
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
323 base_count_lists = []
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
324 if stranded:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
325 strands = ['+', '-']
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
326 base_count_lists.append(get_read_counts(variants, '+'))
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
327 base_count_lists.append(get_read_counts(variants, '-'))
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
328 else:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
329 strands = ['']
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
330 base_count_lists.append(ranked_bases)
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
331
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
332 # record read counts into output dict
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
333 # If stranded, this will loop twice, once for each strand, and prepend '+'
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
334 # or '-' to the base name. If not stranded, it will loop once, and prepend
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
335 # nothing ('').
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
336 for (strand, base_count_list) in zip(strands, base_count_lists):
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
337 for base_count in base_count_list:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
338 sample[strand+base_count[0]] = base_count[1]
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
339 # fill in any zeros
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
340 for base in canonical:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
341 if not sample.has_key(strand+base):
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
342 sample[strand+base] = 0
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
343
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
344 sample['alleles'] = count_alleles(variants, freq_thres, debug=debug)
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
345
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
346 # If there's a tie for 2nd, randomly choose one to be 2nd
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
347 if len(ranked_bases) >= 3 and ranked_bases[1][1] == ranked_bases[2][1]:
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
348 swap = random.choice([True,False])
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
349 if swap:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
350 tmp_base = ranked_bases[1]
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
351 ranked_bases[1] = ranked_bases[2]
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
352 ranked_bases[2] = tmp_base
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
353
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
354 if debug: print "ranked +-: "+str(ranked_bases)
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
355
57e85db3f13c Uploaded script
nick
parents:
diff changeset
356 sample['coverage'] = coverage
57e85db3f13c Uploaded script
nick
parents:
diff changeset
357 try:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
358 sample['major'] = ranked_bases[0][0]
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
359 except IndexError:
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
360 sample['major'] = '.'
57e85db3f13c Uploaded script
nick
parents:
diff changeset
361 try:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
362 sample['minor'] = ranked_bases[1][0]
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
363 sample['freq'] = round(ranked_bases[1][1]/coverage, 5)
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
364 except IndexError:
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
365 sample['minor'] = '.'
57e85db3f13c Uploaded script
nick
parents:
diff changeset
366 sample['freq'] = 0.0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
367
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
368 # Now that we've decided major and minor, we can calculate strand bias
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
369 bias = get_strand_bias(sample, variants)
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
370 if bias is None:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
371 sample['bias'] = '.'
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
372 else:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
373 sample['bias'] = round(bias, 5)
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
374
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
375 site_summary.append(sample)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
376
57e85db3f13c Uploaded script
nick
parents:
diff changeset
377 return site_summary
57e85db3f13c Uploaded script
nick
parents:
diff changeset
378
57e85db3f13c Uploaded script
nick
parents:
diff changeset
379
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
380 def get_read_counts(stranded_counts, strands='+-'):
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
381 """Do a simple sum of the read counts per variant, on the specified strands.
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
382 Arguments:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
383 stranded_counts: Dict of the stranded variants (keys) and their read counts
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
384 (values).
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
385 strands: Which strand(s) to count. Can be '+', '-', or '+-' for both (default).
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
386 Return value:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
387 summed_counts: A list of the alleles and their read counts. The elements are
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
388 tuples (variant, read count)."""
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
389
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
390 variants = stranded_counts.keys()
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
391
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
392 summed_counts = {}
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
393 for variant in variants:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
394 strand = variant[0]
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
395 base = variant[1:]
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
396 if strand in strands:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
397 summed_counts[base] = stranded_counts[variant] + summed_counts.get(base, 0)
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
398
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
399 return summed_counts.items()
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
400
57e85db3f13c Uploaded script
nick
parents:
diff changeset
401
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
402 def process_read_counts(variant_counts, freq_thres=0, sort=False, debug=False):
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
403 """Process a list of read counts by frequency filtering and/or sorting.
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
404 Arguments:
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
405 variant_counts: List of the non-stranded variants and their read counts. The
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
406 elements are tuples (variant, read count).
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
407 freq_thres: The frequency threshold each allele needs to pass to be included.
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
408 sort: Whether to sort the list in descending order of read counts.
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
409 Return value:
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
410 variant_counts: A list of the processed alleles and their read counts. The
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
411 elements are tuples (variant, read count)."""
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
412
57e85db3f13c Uploaded script
nick
parents:
diff changeset
413 # get coverage for the specified strands
57e85db3f13c Uploaded script
nick
parents:
diff changeset
414 coverage = 0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
415 for variant in variant_counts:
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
416 coverage += variant[1]
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
417
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
418 if coverage <= 0:
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
419 return []
57e85db3f13c Uploaded script
nick
parents:
diff changeset
420
57e85db3f13c Uploaded script
nick
parents:
diff changeset
421 # sort the list of alleles by read count
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
422 if sort:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
423 variant_counts.sort(reverse=True, key=lambda variant: variant[1])
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
424
57e85db3f13c Uploaded script
nick
parents:
diff changeset
425 if debug:
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
426 print 'coverage: '+str(coverage)+', freq_thres: '+str(freq_thres)
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
427 for variant in variant_counts:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
428 print (variant[0]+': '+str(variant[1])+'/'+str(float(coverage))+' = '+
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
429 str(variant[1]/coverage))
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
430
57e85db3f13c Uploaded script
nick
parents:
diff changeset
431 # remove bases below the frequency threshold
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
432 if freq_thres > 0:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
433 variant_counts = [variant for variant in variant_counts
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
434 if variant[1]/coverage >= freq_thres]
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
435
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
436 return variant_counts
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
437
57e85db3f13c Uploaded script
nick
parents:
diff changeset
438
57e85db3f13c Uploaded script
nick
parents:
diff changeset
439 def count_alleles(variant_counts, freq_thres, debug=False):
57e85db3f13c Uploaded script
nick
parents:
diff changeset
440 """Determine how many alleles to report, based on filtering rules.
57e85db3f13c Uploaded script
nick
parents:
diff changeset
441 The current rule determines which bases pass the frequency threshold on each
57e85db3f13c Uploaded script
nick
parents:
diff changeset
442 strand individually, then compares the two sets of bases. If they are the same
57e85db3f13c Uploaded script
nick
parents:
diff changeset
443 (regardless of order), the allele count is the number of bases. Otherwise it
57e85db3f13c Uploaded script
nick
parents:
diff changeset
444 is zero."""
57e85db3f13c Uploaded script
nick
parents:
diff changeset
445 allele_count = 0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
446
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
447 alleles_plus = get_read_counts(variant_counts, '+')
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
448 alleles_plus = process_read_counts(alleles_plus, freq_thres=freq_thres,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
449 sort=False, debug=debug)
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
450 alleles_minus = get_read_counts(variant_counts, '-')
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
451 alleles_minus = process_read_counts(alleles_minus, freq_thres=freq_thres,
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
452 sort=False, debug=debug)
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
453
57e85db3f13c Uploaded script
nick
parents:
diff changeset
454 if debug:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
455 print '+ '+str(alleles_plus)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
456 print '- '+str(alleles_minus)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
457
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
458 # Check if each strand reports the same set of alleles.
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
459 # Sorting by base is to compare lists without regard to order (as sets).
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
460 alleles_plus_sorted = sorted([base[0] for base in alleles_plus if base[1]])
57e85db3f13c Uploaded script
nick
parents:
diff changeset
461 alleles_minus_sorted = sorted([base[0] for base in alleles_minus if base[1]])
57e85db3f13c Uploaded script
nick
parents:
diff changeset
462 if alleles_plus_sorted == alleles_minus_sorted:
57e85db3f13c Uploaded script
nick
parents:
diff changeset
463 allele_count = len(alleles_plus)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
464
57e85db3f13c Uploaded script
nick
parents:
diff changeset
465 return allele_count
57e85db3f13c Uploaded script
nick
parents:
diff changeset
466
57e85db3f13c Uploaded script
nick
parents:
diff changeset
467
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
468 def get_strand_bias(sample, variants):
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
469 """Based on method 1 (SB) of Guo et al., 2012
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
470 If there a denominator would be 0, there is no valid result and this will
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
471 return None. This occurs when there are no reads on one of the strands, or
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
472 when there are no minor allele reads."""
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
473 # using same variable names as in paper
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
474 a = variants.get('+'+sample['major'], 0) # forward major allele
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
475 b = variants.get('+'+sample['minor'], 0) # forward minor allele
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
476 c = variants.get('-'+sample['major'], 0) # reverse major allele
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
477 d = variants.get('-'+sample['minor'], 0) # reverse minor allele
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
478 # no minor allele reads
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
479 try:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
480 return abs(b/(a+b) - d/(c+d)) / ((b+d) / (a+b+c+d))
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
481 except ZeroDivisionError:
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
482 return None
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
483
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
484
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
485 def print_site(filehandle, site, columns):
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
486 """Print the output lines for one site (one per sample).
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
487 filehandle must be open."""
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
488 for sample in site:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
489 if sample['print']:
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
490 fields = [str(sample.get(column)) for column in columns]
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
491 filehandle.write('\t'.join(fields)+"\n")
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
492
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
493
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
494 def fail(message):
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
495 cleanup()
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
496 sys.stderr.write(message+'\n')
57e85db3f13c Uploaded script
nick
parents:
diff changeset
497 sys.exit(1)
57e85db3f13c Uploaded script
nick
parents:
diff changeset
498
10
db6f217dc45a Uploaded tarball.
nick
parents: 7
diff changeset
499
13
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
500 def cleanup():
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
501 if isinstance(infile_handle, file):
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
502 infile_handle.close()
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
503 if isinstance(outfile_handle, file):
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
504 outfile_handle.close()
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
505
df1fb577db0d Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents: 10
diff changeset
506
0
57e85db3f13c Uploaded script
nick
parents:
diff changeset
507 if __name__ == "__main__":
57e85db3f13c Uploaded script
nick
parents:
diff changeset
508 main()