Mercurial > repos > nick > allele_counts_1
annotate allele-counts.py @ 13:df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
| author | me <nmapsy@gmail.com> |
|---|---|
| date | Tue, 03 Dec 2013 13:45:16 -0500 |
| parents | db6f217dc45a |
| children | e5d39283fe4d |
| rev | line source |
|---|---|
| 0 | 1 #!/usr/bin/python |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
2 """ |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
3 Run with -h option or see DESCRIPTION for description. |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
4 This script's functionality is being obsoleted by the new, and much more sanely |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
5 written, nvc-filter.py. |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
6 |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
7 New in this version: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
8 - Calculate strand bias |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
9 - Rename MINOR.FREQ.PERC to MAF |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
10 |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
11 Naive Variant Caller variant count parsing one-liner: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
12 $ cat variants.vcf | grep -v '^#' | cut -f 10 | cut -d ':' -f 4 | tr ',=' '\t:' |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
13 """ |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
14 from __future__ import division |
| 0 | 15 import os |
| 16 import sys | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
17 import errno |
| 10 | 18 import random |
| 0 | 19 from optparse import OptionParser |
| 20 | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
21 COLUMNS = ['sample', 'chr', 'pos', 'A', 'C', 'G', 'T', 'coverage', 'alleles', |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
22 'major', 'minor', 'freq', 'bias'] |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
23 COLUMN_LABELS = ['SAMPLE', 'CHR', 'POS', 'A', 'C', 'G', 'T', 'CVRG', 'ALLELES', |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
24 'MAJOR', 'MINOR', 'MAF', 'BIAS'] |
| 0 | 25 CANONICAL_VARIANTS = ['A', 'C', 'G', 'T'] |
| 10 | 26 USAGE = """Usage: %prog [options] -i variants.vcf -o alleles.csv |
| 27 cat variants.vcf | %prog [options] > alleles.csv""" | |
| 0 | 28 OPT_DEFAULTS = {'infile':'-', 'outfile':'-', 'freq_thres':1.0, 'covg_thres':100, |
| 10 | 29 'print_header':False, 'stdin':False, 'stranded':False, 'no_filter':False, |
| 30 'debug_loc':'', 'seed':''} | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
31 DESCRIPTION = """This will parse the VCF output of the "Naive Variant Caller" |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
32 (aka "BAM Coverage") Galaxy tool. For each position reported, it counts the |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
33 number of reads of each base, determines the major allele, minor allele (second |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
34 most frequent variant), and number of alleles above a threshold. So currently |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
35 it only considers SNVs (ACGT), including in the coverage figure. By default it |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
36 reads from stdin and prints to stdout.""" |
| 0 | 37 EPILOG = """Requirements: |
| 38 The input VCF must report the variants for each strand. | |
| 39 The variants should be case-sensitive (e.g. all capital base letters). | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
40 Strand bias: Both strands must show the same bases passing the frequency |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
41 threshold (but not necessarily in the same order). If the site fails this test, |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
42 the number of alleles is reported as 0.""" |
| 0 | 43 |
| 44 def get_options(defaults, usage, description='', epilog=''): | |
| 45 """Get options, print usage text.""" | |
| 46 | |
| 47 parser = OptionParser(usage=usage, description=description, epilog=epilog) | |
| 48 | |
| 49 parser.add_option('-i', '--infile', dest='infile', | |
| 50 default=defaults.get('infile'), | |
| 51 help='Read input VCF data from this file instead of stdin.') | |
| 52 parser.add_option('-o', '--outfile', dest='outfile', | |
| 53 default=defaults.get('outfile'), | |
| 54 help='Print output data to this file instead of stdout.') | |
| 55 parser.add_option('-f', '--freq-thres', dest='freq_thres', type='float', | |
| 56 default=defaults.get('freq_thres'), | |
| 10 | 57 help=('Frequency threshold for counting alleles, given in percentage: -f 1 ' |
| 58 +'= 1% frequency. Default is %default%.')) | |
| 0 | 59 parser.add_option('-c', '--covg-thres', dest='covg_thres', type='int', |
| 60 default=defaults.get('covg_thres'), | |
| 10 | 61 help=('Coverage threshold. Each site must be supported by at least this ' |
| 62 +'many reads on each strand. Otherwise the site will not be printed in ' | |
| 63 +'the output. The default is %default reads per strand.')) | |
| 64 parser.add_option('-n', '--no-filter', dest='no_filter', action='store_const', | |
| 65 const=not(defaults.get('no_filter')), default=defaults.get('no_filter'), | |
| 66 help=('Operate without a frequency threshold or coverage threshold. ' | |
| 67 +'Equivalent to "-c 0 -f 0".')) | |
| 0 | 68 parser.add_option('-H', '--header', dest='print_header', action='store_const', |
| 69 const=not(defaults.get('print_header')), default=defaults.get('print_header'), | |
| 10 | 70 help=('Print header line. This is a #-commented line with the column ' |
| 71 +'labels. Off by default.')) | |
| 72 parser.add_option('-s', '--stranded', dest='stranded', action='store_const', | |
| 73 const=not(defaults.get('stranded')), default=defaults.get('stranded'), | |
| 74 help='Report variant counts by strand, in separate columns. Off by default.') | |
| 75 parser.add_option('-r', '--rand-seed', dest='seed', | |
| 76 default=defaults.get('seed'), help=('Seed for random number generator.')) | |
| 77 parser.add_option('-d', '--debug', dest='debug_loc', | |
| 78 default=defaults.get('debug_loc'), | |
| 79 help=('Turn on debug mode and specify a single site to process using UCSC ' | |
| 80 +'coordinate format. You can also append a sample ID after another ":" ' | |
| 81 +'to restrict it further.')) | |
| 0 | 82 |
| 83 (options, args) = parser.parse_args() | |
| 84 | |
| 10 | 85 return (options, args) |
| 0 | 86 |
| 87 | |
| 88 def main(): | |
| 89 | |
| 90 (options, args) = get_options(OPT_DEFAULTS, USAGE, DESCRIPTION, EPILOG) | |
| 91 | |
| 92 infile = options.infile | |
| 93 outfile = options.outfile | |
| 94 print_header = options.print_header | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
95 freq_thres = options.freq_thres / 100 |
| 0 | 96 covg_thres = options.covg_thres |
| 10 | 97 stranded = options.stranded |
| 98 debug_loc = options.debug_loc | |
| 99 seed = options.seed | |
| 100 | |
| 101 if options.no_filter: | |
| 102 freq_thres = 0 | |
| 103 covg_thres = 0 | |
| 0 | 104 |
| 10 | 105 if seed: |
| 106 random.seed(seed) | |
| 107 | |
| 108 debug = False | |
| 109 print_sample = '' | |
| 110 if debug_loc: | |
| 111 debug = True | |
| 112 coords = debug_loc.split(':') | |
| 113 print_chr = coords[0] | |
| 114 print_pos = '' | |
| 115 if len(coords) > 1: print_pos = coords[1] | |
| 116 if len(coords) > 2: print_sample = coords[2] | |
| 0 | 117 |
| 118 # set infile_handle to either stdin or the input file | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
119 global infile_handle |
| 0 | 120 if infile == OPT_DEFAULTS.get('infile'): |
| 121 infile_handle = sys.stdin | |
| 122 sys.stderr.write("Reading from standard input..\n") | |
| 123 else: | |
| 124 if os.path.exists(infile): | |
| 125 infile_handle = open(infile, 'r') | |
| 126 else: | |
| 127 fail('Error: Input VCF file '+infile+' not found.') | |
| 128 | |
| 129 # set outfile_handle to either stdout or the output file | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
130 global outfile_handle |
| 0 | 131 if outfile == OPT_DEFAULTS.get('outfile'): |
| 132 outfile_handle = sys.stdout | |
| 133 else: | |
| 2 | 134 try: |
| 0 | 135 outfile_handle = open(outfile, 'w') |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
136 except IOError: |
| 2 | 137 fail('Error: The given output filename '+outfile+' could not be opened.') |
| 0 | 138 |
| 10 | 139 # Take care of column names, print header |
| 3 | 140 if len(COLUMNS) != len(COLUMN_LABELS): |
| 10 | 141 fail('Error: Internal column names list do not match column labels list.') |
| 142 if stranded: | |
| 143 COLUMNS[3:7] = ['+A', '+C', '+G', '+T', '-A', '-C', '-G', '-T'] | |
| 144 COLUMN_LABELS[3:7] = ['+A', '+C', '+G', '+T', '-A', '-C', '-G', '-T'] | |
| 0 | 145 if print_header: |
| 4 | 146 outfile_handle.write('#'+'\t'.join(COLUMN_LABELS)+"\n") |
| 0 | 147 |
| 10 | 148 # main loop |
| 149 # each iteration processes one VCF line and prints one or more output lines | |
| 150 # one VCF line = one site, one or more samples | |
| 151 # one output line = one site, one sample | |
| 0 | 152 sample_names = [] |
| 153 for line in infile_handle: | |
| 154 line = line.rstrip('\r\n') | |
| 155 | |
| 156 # header lines | |
| 157 if line[0] == '#': | |
| 158 if line[0:6].upper() == '#CHROM': | |
| 159 sample_names = line.split('\t')[9:] | |
| 160 continue | |
| 161 | |
| 162 if not sample_names: | |
| 163 fail("Error in input VCF: Data line encountered before header line. " | |
| 164 +"Failed on line:\n"+line) | |
| 165 | |
| 166 site_data = read_site(line, sample_names, CANONICAL_VARIANTS) | |
| 167 | |
| 168 if debug: | |
| 10 | 169 if print_pos != site_data['pos']: |
| 170 continue | |
| 171 if print_chr != site_data['chr'] and print_chr != '': | |
| 0 | 172 continue |
| 10 | 173 if print_sample != '': |
| 174 for sample in site_data['samples'].keys(): | |
| 175 if sample.lower() != print_sample.lower(): | |
| 176 site_data['samples'].pop(sample, None) | |
| 177 if len(site_data['samples']) == 0: | |
| 178 sys.stderr.write("Error: Sample '"+print_sample+"' not found.\n") | |
| 179 sys.exit(1) | |
| 180 | |
| 0 | 181 |
| 182 site_summary = summarize_site(site_data, sample_names, CANONICAL_VARIANTS, | |
| 10 | 183 freq_thres, covg_thres, stranded, debug=debug) |
| 0 | 184 |
| 185 if debug and site_summary[0]['print']: | |
| 10 | 186 print line.split('\t')[9].split(':')[-1] |
| 0 | 187 |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
188 try: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
189 print_site(outfile_handle, site_summary, COLUMNS) |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
190 except IOError as ioe: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
191 if ioe.errno == errno.EPIPE: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
192 cleanup() |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
193 sys.exit(0) |
| 0 | 194 |
| 195 # close any open filehandles | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
196 cleanup() |
| 0 | 197 |
| 198 # keeps Galaxy from giving an error if there were messages on stderr | |
| 199 sys.exit(0) | |
| 200 | |
| 201 | |
| 202 | |
| 203 def read_site(line, sample_names, canonical): | |
| 204 """Read in a line, parse the variants into a data structure, and return it. | |
| 205 The line should be actual site data, not a header line, so check beforehand. | |
| 10 | 206 Only the variants in 'canonical' will be read; all others are ignored. |
| 207 Note: the line is assumed to have been chomped. | |
| 208 The returned data is stored in a dict, with the following structure: | |
| 209 { | |
| 210 'chr': 'chr1', | |
| 211 'pos': '2617', | |
| 212 'samples': { | |
| 213 'THYROID': { | |
| 214 '+A': 32, | |
| 215 '-A': 45, | |
| 216 '-G': 1, | |
| 217 }, | |
| 218 'BLOOD': { | |
| 219 '+A': 2, | |
| 220 '-C': 1, | |
| 221 '+G': 37, | |
| 222 '-G': 42, | |
| 223 }, | |
| 224 }, | |
| 225 } | |
| 226 """ | |
| 0 | 227 |
| 228 site = {} | |
| 229 fields = line.split('\t') | |
| 230 | |
| 231 if len(fields) < 9: | |
| 232 fail("Error in input VCF: wrong number of fields in data line. " | |
| 233 +"Failed on line:\n"+line) | |
| 234 | |
| 235 site['chr'] = fields[0] | |
| 236 site['pos'] = fields[1] | |
| 237 samples = fields[9:] | |
| 238 | |
| 239 if len(samples) < len(sample_names): | |
| 240 fail("Error in input VCF: missing sample fields in data line. " | |
| 241 +"Failed on line:\n"+line) | |
| 242 elif len(samples) > len(sample_names): | |
| 243 fail("Error in input VCF: more sample fields in data line than in header. " | |
| 244 +"Failed on line:\n"+line) | |
| 245 | |
| 246 sample_counts = {} | |
| 247 for i in range(len(samples)): | |
| 248 | |
| 249 variant_counts = {} | |
| 250 counts = samples[i].split(':')[-1] | |
| 251 counts = counts.split(',') | |
| 252 | |
| 253 for count in counts: | |
| 254 if not count: | |
| 255 continue | |
| 256 fields = count.split('=') | |
| 257 if len(fields) != 2: | |
| 258 fail("Error in input VCF: Incorrect variant data format (must contain " | |
| 259 +"a single '='). Failed on line:\n"+line) | |
| 260 (variant, reads) = fields | |
| 261 if variant[1:] not in canonical: | |
| 262 continue | |
| 263 if variant[0] != '-' and variant[0] != '+': | |
| 264 fail("Error in input VCF: variant data not strand-specific. " | |
| 265 +"Failed on line:\n"+line) | |
| 266 try: | |
| 6 | 267 variant_counts[variant] = int(float(reads)) |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
268 except ValueError: |
| 7 | 269 fail("Error in input VCF: Variant count not a valid number. Failed on variant count string '"+reads+"'\nIn the following line:\n"+line) |
| 0 | 270 |
| 271 sample_counts[sample_names[i]] = variant_counts | |
| 272 | |
| 273 site['samples'] = sample_counts | |
| 274 | |
| 275 return site | |
| 276 | |
| 277 | |
| 278 def summarize_site(site, sample_names, canonical, freq_thres, covg_thres, | |
| 10 | 279 stranded=False, debug=False): |
| 0 | 280 """Take the raw data from the VCF line and transform it into the summary data |
| 281 to be printed in the output format.""" | |
| 282 | |
| 283 site_summary = [] | |
| 284 for sample_name in sample_names: | |
| 285 | |
| 286 sample = {'print':False} | |
| 287 variants = site['samples'].get(sample_name) | |
| 288 | |
| 289 sample['sample'] = sample_name | |
| 290 sample['chr'] = site['chr'] | |
| 291 sample['pos'] = site['pos'] | |
| 292 | |
| 293 coverage = sum(variants.values()) | |
| 294 | |
| 295 # get stranded coverage | |
| 296 covg_plus = 0 | |
| 297 covg_minus = 0 | |
| 298 for variant in variants: | |
| 299 if variant[0] == '+': | |
| 300 covg_plus += variants[variant] | |
| 301 elif variant[0] == '-': | |
| 302 covg_minus += variants[variant] | |
| 303 # stranded coverage threshold | |
| 10 | 304 if covg_plus < covg_thres or covg_minus < covg_thres: |
| 0 | 305 site_summary.append(sample) |
| 306 continue | |
| 307 else: | |
| 308 sample['print'] = True | |
| 309 | |
| 10 | 310 # get an ordered list of read counts for all variants (both strands) |
| 311 bases = get_read_counts(variants, '+-') | |
| 312 ranked_bases = process_read_counts(bases, sort=True, debug=debug) | |
| 313 | |
| 314 # prepare stranded or unstranded lists of base counts | |
| 315 base_count_lists = [] | |
| 316 if stranded: | |
| 317 strands = ['+', '-'] | |
| 318 base_count_lists.append(get_read_counts(variants, '+')) | |
| 319 base_count_lists.append(get_read_counts(variants, '-')) | |
| 320 else: | |
| 321 strands = [''] | |
| 322 base_count_lists.append(ranked_bases) | |
| 0 | 323 |
| 10 | 324 # record read counts into output dict |
| 325 # If stranded, this will loop twice, once for each strand, and prepend '+' | |
| 326 # or '-' to the base name. If not stranded, it will loop once, and prepend | |
| 327 # nothing (''). | |
| 328 for (strand, base_count_list) in zip(strands, base_count_lists): | |
| 329 for base_count in base_count_list: | |
| 330 sample[strand+base_count[0]] = base_count[1] | |
| 331 # fill in any zeros | |
| 332 for base in canonical: | |
| 333 if not sample.has_key(strand+base): | |
| 334 sample[strand+base] = 0 | |
| 0 | 335 |
| 10 | 336 sample['alleles'] = count_alleles(variants, freq_thres, debug=debug) |
| 0 | 337 |
| 10 | 338 # If there's a tie for 2nd, randomly choose one to be 2nd |
| 0 | 339 if len(ranked_bases) >= 3 and ranked_bases[1][1] == ranked_bases[2][1]: |
| 10 | 340 swap = random.choice([True,False]) |
| 341 if swap: | |
| 342 tmp_base = ranked_bases[1] | |
| 343 ranked_bases[1] = ranked_bases[2] | |
| 344 ranked_bases[2] = tmp_base | |
| 0 | 345 |
| 10 | 346 if debug: print "ranked +-: "+str(ranked_bases) |
| 0 | 347 |
| 348 sample['coverage'] = coverage | |
| 349 try: | |
| 350 sample['major'] = ranked_bases[0][0] | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
351 except IndexError: |
| 0 | 352 sample['major'] = '.' |
| 353 try: | |
| 354 sample['minor'] = ranked_bases[1][0] | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
355 sample['freq'] = round(ranked_bases[1][1]/coverage, 5) |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
356 except IndexError: |
| 0 | 357 sample['minor'] = '.' |
| 358 sample['freq'] = 0.0 | |
| 359 | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
360 # Now that we've decided major and minor, we can calculate strand bias |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
361 bias = get_strand_bias(sample, variants) |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
362 if bias is None: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
363 sample['bias'] = '.' |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
364 else: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
365 sample['bias'] = round(bias, 5) |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
366 |
| 0 | 367 site_summary.append(sample) |
| 368 | |
| 369 return site_summary | |
| 370 | |
| 371 | |
| 10 | 372 def get_read_counts(stranded_counts, strands='+-'): |
| 373 """Do a simple sum of the read counts per variant, on the specified strands. | |
| 374 Arguments: | |
| 375 stranded_counts: Dict of the stranded variants (keys) and their read counts | |
| 376 (values). | |
| 377 strands: Which strand(s) to count. Can be '+', '-', or '+-' for both (default). | |
| 378 Return value: | |
| 379 summed_counts: A list of the alleles and their read counts. The elements are | |
| 380 tuples (variant, read count).""" | |
| 381 | |
| 382 variants = stranded_counts.keys() | |
| 383 | |
| 384 summed_counts = {} | |
| 385 for variant in variants: | |
| 386 strand = variant[0] | |
| 387 base = variant[1:] | |
| 388 if strand in strands: | |
| 389 summed_counts[base] = stranded_counts[variant] + summed_counts.get(base, 0) | |
| 390 | |
| 391 return summed_counts.items() | |
| 0 | 392 |
| 393 | |
| 10 | 394 def process_read_counts(variant_counts, freq_thres=0, sort=False, debug=False): |
| 395 """Process a list of read counts by frequency filtering and/or sorting. | |
| 0 | 396 Arguments: |
| 10 | 397 variant_counts: List of the non-stranded variants and their read counts. The |
| 398 elements are tuples (variant, read count). | |
| 0 | 399 freq_thres: The frequency threshold each allele needs to pass to be included. |
| 10 | 400 sort: Whether to sort the list in descending order of read counts. |
| 0 | 401 Return value: |
| 10 | 402 variant_counts: A list of the processed alleles and their read counts. The |
| 403 elements are tuples (variant, read count).""" | |
| 0 | 404 |
| 405 # get coverage for the specified strands | |
| 406 coverage = 0 | |
| 407 for variant in variant_counts: | |
| 10 | 408 coverage += variant[1] |
| 0 | 409 |
| 10 | 410 if coverage <= 0: |
| 0 | 411 return [] |
| 412 | |
| 413 # sort the list of alleles by read count | |
| 10 | 414 if sort: |
| 415 variant_counts.sort(reverse=True, key=lambda variant: variant[1]) | |
| 0 | 416 |
| 417 if debug: | |
| 10 | 418 print 'coverage: '+str(coverage)+', freq_thres: '+str(freq_thres) |
| 419 for variant in variant_counts: | |
| 420 print (variant[0]+': '+str(variant[1])+'/'+str(float(coverage))+' = '+ | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
421 str(variant[1]/coverage)) |
| 0 | 422 |
| 423 # remove bases below the frequency threshold | |
| 10 | 424 if freq_thres > 0: |
| 425 variant_counts = [variant for variant in variant_counts | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
426 if variant[1]/coverage >= freq_thres] |
| 0 | 427 |
| 10 | 428 return variant_counts |
| 0 | 429 |
| 430 | |
| 431 def count_alleles(variant_counts, freq_thres, debug=False): | |
| 432 """Determine how many alleles to report, based on filtering rules. | |
| 433 The current rule determines which bases pass the frequency threshold on each | |
| 434 strand individually, then compares the two sets of bases. If they are the same | |
| 435 (regardless of order), the allele count is the number of bases. Otherwise it | |
| 436 is zero.""" | |
| 437 allele_count = 0 | |
| 438 | |
| 10 | 439 alleles_plus = get_read_counts(variant_counts, '+') |
| 440 alleles_plus = process_read_counts(alleles_plus, freq_thres=freq_thres, | |
| 441 sort=False, debug=debug) | |
| 442 alleles_minus = get_read_counts(variant_counts, '-') | |
| 443 alleles_minus = process_read_counts(alleles_minus, freq_thres=freq_thres, | |
| 444 sort=False, debug=debug) | |
| 0 | 445 |
| 446 if debug: | |
| 447 print '+ '+str(alleles_plus) | |
| 448 print '- '+str(alleles_minus) | |
| 449 | |
| 10 | 450 # Check if each strand reports the same set of alleles. |
| 451 # Sorting by base is to compare lists without regard to order (as sets). | |
| 0 | 452 alleles_plus_sorted = sorted([base[0] for base in alleles_plus if base[1]]) |
| 453 alleles_minus_sorted = sorted([base[0] for base in alleles_minus if base[1]]) | |
| 454 if alleles_plus_sorted == alleles_minus_sorted: | |
| 455 allele_count = len(alleles_plus) | |
| 456 | |
| 457 return allele_count | |
| 458 | |
| 459 | |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
460 def get_strand_bias(sample, variants): |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
461 """Based on method 1 (SB) of Guo et al., 2012 |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
462 If there a denominator would be 0, there is no valid result and this will |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
463 return None. This occurs when there are no reads on one of the strands, or |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
464 when there are no minor allele reads.""" |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
465 # using same variable names as in paper |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
466 a = variants.get('+'+sample['major'], 0) # forward major allele |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
467 b = variants.get('+'+sample['minor'], 0) # forward minor allele |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
468 c = variants.get('-'+sample['major'], 0) # reverse major allele |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
469 d = variants.get('-'+sample['minor'], 0) # reverse minor allele |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
470 # no minor allele reads |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
471 try: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
472 return abs(b/(a+b) - d/(c+d)) / ((b+d) / (a+b+c+d)) |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
473 except ZeroDivisionError: |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
474 return None |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
475 |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
476 |
| 10 | 477 def print_site(filehandle, site, columns): |
| 478 """Print the output lines for one site (one per sample). | |
| 479 filehandle must be open.""" | |
| 480 for sample in site: | |
| 481 if sample['print']: | |
| 482 fields = [str(sample.get(column)) for column in columns] | |
| 483 filehandle.write('\t'.join(fields)+"\n") | |
| 484 | |
| 485 | |
| 0 | 486 def fail(message): |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
487 cleanup() |
| 0 | 488 sys.stderr.write(message+'\n') |
| 489 sys.exit(1) | |
| 490 | |
| 10 | 491 |
|
13
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
492 def cleanup(): |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
493 if isinstance(infile_handle, file): |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
494 infile_handle.close() |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
495 if isinstance(outfile_handle, file): |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
496 outfile_handle.close() |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
497 |
|
df1fb577db0d
Version 1.2: Add strand bias column, rename minor allele frequency column.
me <nmapsy@gmail.com>
parents:
10
diff
changeset
|
498 |
| 0 | 499 if __name__ == "__main__": |
| 500 main() |
