annotate vcf.py @ 7:1d150e860c4d

Expanded the functionality of the merge genomic datasets tool, to generate an output dataset with the file (or label) indicating where each column came from
author melissacline
date Mon, 09 Mar 2015 19:58:03 -0700
parents 60efb9214eaa
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
1 #!/usr/bin/env python
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
2 '''A VCFv4.0 parser for Python.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
3
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
4 The intent of this module is to mimic the ``csv`` module in the Python stdlib,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
5 as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
6 will attempt to parse the content of each record based on the data types
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
7 specified in the meta-information lines -- specifically the ##INFO and
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
8 ##FORMAT lines. If these lines are missing or incomplete, it will check
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
9 against the reserved types mentioned in the spec. Failing that, it will just
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
10 return strings.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
11
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
12 There is currently one piece of interface: ``VCFReader``. It takes a file-like
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
13 object and acts as a reader::
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
14
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
15 >>> import vcf
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
16 >>> vcf_reader = vcf.VCFReader(open('example.vcf', 'rb'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
17 >>> for record in vcf_reader:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
18 ... print record
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
19 Record(CHROM='20', POS=14370, ID='rs6054257', REF='G', ALT=['A'], QUAL=29,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
20 FILTER='PASS', INFO={'H2': True, 'NS': 3, 'DB': True, 'DP': 14, 'AF': [0.5]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
21 }, FORMAT='GT:GQ:DP:HQ', samples=[{'GT': '0', 'HQ': [58, 50], 'DP': 3, 'GQ'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
22 : 49, 'name': 'NA00001'}, {'GT': '0', 'HQ': [65, 3], 'DP': 5, 'GQ': 3, 'nam
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
23 e' : 'NA00002'}, {'GT': '0', 'DP': 3, 'GQ': 41, 'name': 'NA00003'}])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
24
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
25 This produces a great deal of information, but it is conveniently accessed.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
26 The attributes of a Record are the 8 fixed fields from the VCF spec plus two
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
27 more. That is:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
28
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
29 * ``Record.CHROM``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
30 * ``Record.POS``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
31 * ``Record.ID``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
32 * ``Record.REF``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
33 * ``Record.ALT``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
34 * ``Record.QUAL``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
35 * ``Record.FILTER``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
36 * ``Record.INFO``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
37
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
38 plus two more attributes to handle genotype information:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
39
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
40 * ``Record.FORMAT``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
41 * ``Record.samples``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
42
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
43 ``samples``, not being the title of any column, is left lowercase. The format
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
44 of the fixed fields is from the spec. Comma-separated lists in the VCF are
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
45 converted to lists. In particular, one-entry VCF lists are converted to
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
46 one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
47 of key=value pairs are converted to Python dictionaries, with flags being given
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
48 a ``True`` value. Integers and floats are handled exactly as you'd expect::
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
49
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
50 >>> record = vcf_reader.next()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
51 >>> print record.POS
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
52 17330
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
53 >>> print record.ALT
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
54 ['A']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
55 >>> print record.INFO['AF']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
56 [0.017]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
57
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
58 ``record.FORMAT`` will be a string specifying the format of the genotype
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
59 fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
60 ``None``. Finally, ``record.samples`` is a list of dictionaries containing the
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
61 parsed sample column::
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
62
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
63 >>> record = vcf_reader.next()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
64 >>> for sample in record.samples:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
65 ... print sample['GT']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
66 '1|2'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
67 '2|1'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
68 '2/2'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
69
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
70 Metadata regarding the VCF file itself can be investigated through the
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
71 following attributes:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
72
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
73 * ``VCFReader.metadata``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
74 * ``VCFReader.infos``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
75 * ``VCFReader.filters``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
76 * ``VCFReader.formats``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
77 * ``VCFReader.samples``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
78
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
79 For example::
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
80
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
81 >>> vcf_reader.metadata['fileDate']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
82 20090805
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
83 >>> vcf_reader.samples
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
84 ['NA00001', 'NA00002', 'NA00003']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
85 >>> vcf_reader.filters
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
86 {'q10': Filter(id='q10', desc='Quality below 10'),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
87 's50': Filter(id='s50', desc='Less than 50% of samples have data')}
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
88 >>> vcf_reader.infos['AA'].desc
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
89 Ancestral Allele
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
90
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
91 '''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
92 import collections
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
93 import re
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
94
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
95
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
96 # Metadata parsers/constants
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
97 RESERVED_INFO = {
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
98 'AA': 'String', 'AC': 'Integer', 'AF': 'Float', 'AN': 'Integer',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
99 'BQ': 'Float', 'CIGAR': 'String', 'DB': 'Flag', 'DP': 'Integer',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
100 'END': 'Integer', 'H2': 'Flag', 'MQ': 'Float', 'MQ0': 'Integer',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
101 'NS': 'Integer', 'SB': 'String', 'SOMATIC': 'Flag', 'VALIDATED': 'Flag'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
102 }
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
103
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
104 RESERVED_FORMAT = {
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
105 'GT': 'String', 'DP': 'Integer', 'FT': 'String', 'GL': 'Float',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
106 'GQ': 'Float', 'HQ': 'Float'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
107 }
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
108
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
109
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
110 _Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc'])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
111 _Filter = collections.namedtuple('Filter', ['id', 'desc'])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
112 _Format = collections.namedtuple('Format', ['id', 'num', 'type', 'desc'])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
113
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
114
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
115 class _vcf_metadata_parser(object):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
116 '''Parse the metadat in the header of a VCF file.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
117 def __init__(self, aggressive=False):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
118 super(_vcf_metadata_parser, self).__init__()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
119 self.aggro = aggressive
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
120 self.info_pattern = re.compile(r'''\#\#INFO=<
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
121 ID=(?P<id>[^,]+),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
122 Number=(?P<number>\d+|\.|[AG]),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
123 Type=(?P<type>Integer|Float|Flag|Character|String),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
124 Description="(?P<desc>[^"]*)"
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
125 >''', re.VERBOSE)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
126 self.filter_pattern = re.compile(r'''\#\#FILTER=<
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
127 ID=(?P<id>[^,]+),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
128 Description="(?P<desc>[^"]*)"
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
129 >''', re.VERBOSE)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
130 self.format_pattern = re.compile(r'''\#\#FORMAT=<
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
131 ID=(?P<id>.+),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
132 Number=(?P<number>\d+|\.|[AG]),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
133 Type=(?P<type>.+),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
134 Description="(?P<desc>.*)"
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
135 >''', re.VERBOSE)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
136 self.meta_pattern = re.compile(r'''##(?P<key>.+)=(?P<val>.+)''')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
137
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
138 def read_info(self, info_string):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
139 '''Read a meta-information INFO line.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
140 match = self.info_pattern.match(info_string)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
141 if not match:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
142 raise SyntaxError(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
143 "One of the INFO lines is malformed: {}".format(info_string))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
144
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
145 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
146 num = int(match.group('number'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
147 except ValueError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
148 num = None if self.aggro else '.'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
149
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
150 info = _Info(match.group('id'), num,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
151 match.group('type'), match.group('desc'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
152
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
153 return (match.group('id'), info)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
154
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
155 def read_filter(self, filter_string):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
156 '''Read a meta-information FILTER line.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
157 match = self.filter_pattern.match(filter_string)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
158 if not match:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
159 raise SyntaxError(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
160 "One of the FILTER lines is malformed: {}".format(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
161 filter_string))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
162
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
163 filt = _Filter(match.group('id'), match.group('desc'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
164
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
165 return (match.group('id'), filt)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
166
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
167 def read_format(self, format_string):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
168 '''Read a meta-information FORMAT line.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
169 match = self.format_pattern.match(format_string)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
170 if not match:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
171 raise SyntaxError(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
172 "One of the FORMAT lines is malformed: {}".format(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
173 format_string))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
174
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
175 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
176 num = int(match.group('number'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
177 except ValueError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
178 num = None if self.aggro else '.'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
179
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
180 form = _Format(match.group('id'), num,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
181 match.group('type'), match.group('desc'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
182
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
183 return (match.group('id'), form)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
184
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
185 def read_meta(self, meta_string):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
186 match = self.meta_pattern.match(meta_string)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
187 return match.group('key'), match.group('val')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
188
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
189
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
190 # Reader class
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
191 class _meta_info(object):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
192 '''Decorator for a property stored in the header info.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
193 def __init__(self, func):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
194 self.func = func
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
195
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
196 def __call__(self, fself):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
197 if getattr(fself, "_%s" % self.func.__name__) is None:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
198 fself._parse_metainfo()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
199
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
200 return self.func(fself)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
201
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
202 def __repr__(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
203 '''Return the function's docstring.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
204 return self.func.__doc__
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
205
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
206 def __doc__(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
207 '''Return the function's docstring.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
208 return self.func.__doc__
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
209
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
210 _Record = collections.namedtuple('Record', [
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
211 'CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
212 'samples'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
213 ])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
214
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
215
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
216 class VCFReader(object):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
217 '''Read and parse a VCF v 4.0 file'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
218 def __init__(self, fsock, aggressive=False):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
219 super(VCFReader, self).__init__()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
220 self.aggro = aggressive
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
221 self._metadata = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
222 self._infos = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
223 self._filters = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
224 self._formats = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
225 self._samples = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
226 self.reader = fsock
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
227 if aggressive:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
228 self._mapper = self._none_map
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
229 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
230 self._mapper = self._pass_map
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
231
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
232 def __iter__(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
233 return self
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
234
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
235 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
236 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
237 def metadata(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
238 '''Return the information from lines starting "##"'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
239 return self._metadata
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
240
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
241 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
242 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
243 def infos(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
244 '''Return the information from lines starting "##INFO"'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
245 return self._infos
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
246
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
247 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
248 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
249 def filters(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
250 '''Return the information from lines starting "##FILTER"'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
251 return self._filters
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
252
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
253 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
254 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
255 def formats(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
256 '''Return the information from lines starting "##FORMAT"'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
257 return self._formats
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
258
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
259 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
260 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
261 def samples(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
262 '''Return the names of the genotype fields.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
263 return self._samples
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
264
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
265 def _parse_metainfo(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
266 '''Parse the information stored in the metainfo of the VCF.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
267
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
268 The end user shouldn't have to use this. She can access the metainfo
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
269 directly with ``self.metadata``.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
270 for attr in ('_metadata', '_infos', '_filters', '_formats'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
271 setattr(self, attr, {})
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
272
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
273 parser = _vcf_metadata_parser()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
274
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
275 line = self.reader.next()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
276 while line.startswith('##'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
277 line = line.strip()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
278 if line.startswith('##INFO'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
279 key, val = parser.read_info(line)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
280 self._infos[key] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
281
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
282 elif line.startswith('##FILTER'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
283 key, val = parser.read_filter(line)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
284 self._filters[key] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
285
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
286 elif line.startswith('##FORMAT'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
287 key, val = parser.read_format(line)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
288 self._formats[key] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
289
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
290 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
291 key, val = parser.read_meta(line.strip())
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
292 self._metadata[key] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
293
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
294 line = self.reader.next()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
295
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
296 fields = line.split()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
297 self._samples = fields[9:]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
298
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
299 def _none_map(self, func, iterable, bad='.'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
300 '''``map``, but make bad values None.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
301 return [func(x) if x != bad else None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
302 for x in iterable]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
303
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
304 def _pass_map(self, func, iterable, bad='.'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
305 '''``map``, but make bad values None.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
306 return [func(x) if x != bad else bad
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
307 for x in iterable]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
308
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
309 def _parse_info(self, info_str):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
310 '''Parse the INFO field of a VCF entry into a dictionary of Python
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
311 types.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
312
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
313 '''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
314 entries = info_str.split(';')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
315 retdict = {}
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
316 for entry in entries:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
317 entry = entry.split('=')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
318 ID = entry[0]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
319 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
320 entry_type = self.infos[ID].type
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
321 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
322 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
323 entry_type = RESERVED_INFO[ID]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
324 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
325 if entry[1:]:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
326 entry_type = 'String'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
327 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
328 entry_type = 'Flag'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
329
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
330 if entry_type == 'Integer':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
331 vals = entry[1].split(',')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
332 val = self._mapper(int, vals)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
333 elif entry_type == 'Float':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
334 vals = entry[1].split(',')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
335 val = self._mapper(float, vals)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
336 elif entry_type == 'Flag':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
337 val = True
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
338 elif entry_type == 'String':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
339 val = entry[1]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
340
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
341 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
342 if self.infos[ID].num == 1:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
343 val = val[0]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
344 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
345 pass
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
346
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
347 retdict[ID] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
348
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
349 return retdict
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
350
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
351 def _parse_samples(self, samples, samp_fmt):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
352 '''Parse a sample entry according to the format specified in the FORMAT
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
353 column.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
354 samp_data = []
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
355 samp_fmt = samp_fmt.split(':')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
356 for sample in samples:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
357 sampdict = dict(zip(samp_fmt, sample.split(':')))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
358 for fmt in sampdict:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
359 vals = sampdict[fmt].split(',')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
360 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
361 entry_type = self.formats[fmt].type
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
362 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
363 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
364 entry_type = RESERVED_FORMAT[fmt]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
365 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
366 entry_type = 'String'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
367
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
368 if entry_type == 'Integer':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
369 sampdict[fmt] = self._mapper(int, vals)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
370 elif entry_type == 'Float' or entry_type == 'Numeric':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
371 sampdict[fmt] = self._mapper(float, vals)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
372 elif sampdict[fmt] == './.' and self.aggro:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
373 sampdict[fmt] = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
374
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
375 samp_data.append(sampdict)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
376
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
377 for name, data in zip(self.samples, samp_data):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
378 data['name'] = name
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
379
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
380 return samp_data
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
381
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
382 def next(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
383 '''Return the next record in the file.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
384 if self._samples is None:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
385 self._parse_metainfo()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
386 row = self.reader.next().split()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
387 chrom = row[0]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
388 pos = int(row[1])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
389
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
390 if row[2] != '.':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
391 ID = row[2]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
392 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
393 ID = None if self.aggro else row[2]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
394
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
395 ref = row[3]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
396 alt = self._mapper(str, row[4].split(','))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
397 qual = float(row[5]) if '.' in row[5] else int(row[5])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
398 filt = row[6].split(';') if ';' in row[6] else row[6]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
399 if filt == 'PASS' and self.aggro:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
400 filt = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
401 info = self._parse_info(row[7])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
402
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
403 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
404 fmt = row[8]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
405 except IndexError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
406 fmt = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
407 samples = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
408 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
409 samples = self._parse_samples(row[9:], fmt)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
410
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
411 record = _Record(chrom, pos, ID, ref, alt, qual, filt, info, fmt,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
412 samples)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
413 return record
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
414
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
415
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
416 def main():
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
417 '''Parse the example VCF file from the specification and print every
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
418 record.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
419 import contextlib
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
420 import StringIO
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
421 import textwrap
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
422 buff = '''\
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
423 ##fileformat=VCFv4.0
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
424 ##fileDate=20090805
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
425 ##source=myImputationProgramV3.1
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
426 ##reference=1000GenomesPilot-NCBI36
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
427 ##phasing=partial
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
428 ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
429 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
430 ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
431 ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
432 ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
433 ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
434 ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
435 ##FILTER=<ID=q10,Description="Quality below 10">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
436 ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
437 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
438 ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
439 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
440 ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
441 #CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
442 20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
443 20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
444 20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10;AF=0.333,0.667;AA=T;DB\tGT:GQ:DP:HQ\t1|2:21:6:23,27\t2|1:2:0:18,2\t2/2:35:4
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
445 20\t1230237\t.\tT\t.\t47\tPASS\tNS=3;DP=13;AA=T\tGT:GQ:DP:HQ\t0|0:54:7:56,60\t0|0:48:4:51,51\t0/0:61:2
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
446 20\t1234567\tmicrosat1\tGTCT\tG,GTACT\t50\tPASS\tNS=3;DP=9;AA=G\tGT:GQ:DP\t./.:35:4\t0/2:17:2\t1/1:40:3
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
447 '''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
448 with contextlib.closing(StringIO.StringIO(textwrap.dedent(buff))) as sock:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
449 vcf_file = VCFReader(sock, aggressive=True)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
450 for record in vcf_file:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
451 print record
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
452
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
453
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
454 if __name__ == '__main__':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
455 main()