annotate vcf.py @ 17:0b0a6f326dad

Cleaned up the output dataset names for Merge Genomic Datasets
author melissacline
date Fri, 20 Mar 2015 14:22:02 -0700
parents 60efb9214eaa
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
1 #!/usr/bin/env python
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
2 '''A VCFv4.0 parser for Python.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
3
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
4 The intent of this module is to mimic the ``csv`` module in the Python stdlib,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
5 as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
6 will attempt to parse the content of each record based on the data types
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
7 specified in the meta-information lines -- specifically the ##INFO and
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
8 ##FORMAT lines. If these lines are missing or incomplete, it will check
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
9 against the reserved types mentioned in the spec. Failing that, it will just
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
10 return strings.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
11
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
12 There is currently one piece of interface: ``VCFReader``. It takes a file-like
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
13 object and acts as a reader::
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
14
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
15 >>> import vcf
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
16 >>> vcf_reader = vcf.VCFReader(open('example.vcf', 'rb'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
17 >>> for record in vcf_reader:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
18 ... print record
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
19 Record(CHROM='20', POS=14370, ID='rs6054257', REF='G', ALT=['A'], QUAL=29,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
20 FILTER='PASS', INFO={'H2': True, 'NS': 3, 'DB': True, 'DP': 14, 'AF': [0.5]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
21 }, FORMAT='GT:GQ:DP:HQ', samples=[{'GT': '0', 'HQ': [58, 50], 'DP': 3, 'GQ'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
22 : 49, 'name': 'NA00001'}, {'GT': '0', 'HQ': [65, 3], 'DP': 5, 'GQ': 3, 'nam
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
23 e' : 'NA00002'}, {'GT': '0', 'DP': 3, 'GQ': 41, 'name': 'NA00003'}])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
24
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
25 This produces a great deal of information, but it is conveniently accessed.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
26 The attributes of a Record are the 8 fixed fields from the VCF spec plus two
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
27 more. That is:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
28
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
29 * ``Record.CHROM``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
30 * ``Record.POS``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
31 * ``Record.ID``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
32 * ``Record.REF``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
33 * ``Record.ALT``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
34 * ``Record.QUAL``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
35 * ``Record.FILTER``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
36 * ``Record.INFO``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
37
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
38 plus two more attributes to handle genotype information:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
39
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
40 * ``Record.FORMAT``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
41 * ``Record.samples``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
42
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
43 ``samples``, not being the title of any column, is left lowercase. The format
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
44 of the fixed fields is from the spec. Comma-separated lists in the VCF are
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
45 converted to lists. In particular, one-entry VCF lists are converted to
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
46 one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
47 of key=value pairs are converted to Python dictionaries, with flags being given
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
48 a ``True`` value. Integers and floats are handled exactly as you'd expect::
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
49
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
50 >>> record = vcf_reader.next()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
51 >>> print record.POS
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
52 17330
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
53 >>> print record.ALT
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
54 ['A']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
55 >>> print record.INFO['AF']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
56 [0.017]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
57
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
58 ``record.FORMAT`` will be a string specifying the format of the genotype
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
59 fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
60 ``None``. Finally, ``record.samples`` is a list of dictionaries containing the
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
61 parsed sample column::
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
62
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
63 >>> record = vcf_reader.next()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
64 >>> for sample in record.samples:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
65 ... print sample['GT']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
66 '1|2'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
67 '2|1'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
68 '2/2'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
69
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
70 Metadata regarding the VCF file itself can be investigated through the
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
71 following attributes:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
72
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
73 * ``VCFReader.metadata``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
74 * ``VCFReader.infos``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
75 * ``VCFReader.filters``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
76 * ``VCFReader.formats``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
77 * ``VCFReader.samples``
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
78
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
79 For example::
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
80
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
81 >>> vcf_reader.metadata['fileDate']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
82 20090805
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
83 >>> vcf_reader.samples
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
84 ['NA00001', 'NA00002', 'NA00003']
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
85 >>> vcf_reader.filters
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
86 {'q10': Filter(id='q10', desc='Quality below 10'),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
87 's50': Filter(id='s50', desc='Less than 50% of samples have data')}
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
88 >>> vcf_reader.infos['AA'].desc
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
89 Ancestral Allele
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
90
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
91 '''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
92 import collections
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
93 import re
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
94
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
95
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
96 # Metadata parsers/constants
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
97 RESERVED_INFO = {
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
98 'AA': 'String', 'AC': 'Integer', 'AF': 'Float', 'AN': 'Integer',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
99 'BQ': 'Float', 'CIGAR': 'String', 'DB': 'Flag', 'DP': 'Integer',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
100 'END': 'Integer', 'H2': 'Flag', 'MQ': 'Float', 'MQ0': 'Integer',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
101 'NS': 'Integer', 'SB': 'String', 'SOMATIC': 'Flag', 'VALIDATED': 'Flag'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
102 }
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
103
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
104 RESERVED_FORMAT = {
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
105 'GT': 'String', 'DP': 'Integer', 'FT': 'String', 'GL': 'Float',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
106 'GQ': 'Float', 'HQ': 'Float'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
107 }
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
108
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
109
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
110 _Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc'])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
111 _Filter = collections.namedtuple('Filter', ['id', 'desc'])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
112 _Format = collections.namedtuple('Format', ['id', 'num', 'type', 'desc'])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
113
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
114
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
115 class _vcf_metadata_parser(object):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
116 '''Parse the metadat in the header of a VCF file.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
117 def __init__(self, aggressive=False):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
118 super(_vcf_metadata_parser, self).__init__()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
119 self.aggro = aggressive
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
120 self.info_pattern = re.compile(r'''\#\#INFO=<
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
121 ID=(?P<id>[^,]+),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
122 Number=(?P<number>\d+|\.|[AG]),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
123 Type=(?P<type>Integer|Float|Flag|Character|String),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
124 Description="(?P<desc>[^"]*)"
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
125 >''', re.VERBOSE)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
126 self.filter_pattern = re.compile(r'''\#\#FILTER=<
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
127 ID=(?P<id>[^,]+),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
128 Description="(?P<desc>[^"]*)"
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
129 >''', re.VERBOSE)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
130 self.format_pattern = re.compile(r'''\#\#FORMAT=<
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
131 ID=(?P<id>.+),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
132 Number=(?P<number>\d+|\.|[AG]),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
133 Type=(?P<type>.+),
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
134 Description="(?P<desc>.*)"
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
135 >''', re.VERBOSE)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
136 self.meta_pattern = re.compile(r'''##(?P<key>.+)=(?P<val>.+)''')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
137
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
138 def read_info(self, info_string):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
139 '''Read a meta-information INFO line.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
140 match = self.info_pattern.match(info_string)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
141 if not match:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
142 raise SyntaxError(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
143 "One of the INFO lines is malformed: {}".format(info_string))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
144
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
145 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
146 num = int(match.group('number'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
147 except ValueError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
148 num = None if self.aggro else '.'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
149
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
150 info = _Info(match.group('id'), num,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
151 match.group('type'), match.group('desc'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
152
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
153 return (match.group('id'), info)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
154
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
155 def read_filter(self, filter_string):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
156 '''Read a meta-information FILTER line.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
157 match = self.filter_pattern.match(filter_string)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
158 if not match:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
159 raise SyntaxError(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
160 "One of the FILTER lines is malformed: {}".format(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
161 filter_string))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
162
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
163 filt = _Filter(match.group('id'), match.group('desc'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
164
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
165 return (match.group('id'), filt)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
166
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
167 def read_format(self, format_string):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
168 '''Read a meta-information FORMAT line.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
169 match = self.format_pattern.match(format_string)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
170 if not match:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
171 raise SyntaxError(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
172 "One of the FORMAT lines is malformed: {}".format(
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
173 format_string))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
174
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
175 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
176 num = int(match.group('number'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
177 except ValueError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
178 num = None if self.aggro else '.'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
179
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
180 form = _Format(match.group('id'), num,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
181 match.group('type'), match.group('desc'))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
182
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
183 return (match.group('id'), form)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
184
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
185 def read_meta(self, meta_string):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
186 match = self.meta_pattern.match(meta_string)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
187 return match.group('key'), match.group('val')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
188
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
189
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
190 # Reader class
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
191 class _meta_info(object):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
192 '''Decorator for a property stored in the header info.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
193 def __init__(self, func):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
194 self.func = func
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
195
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
196 def __call__(self, fself):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
197 if getattr(fself, "_%s" % self.func.__name__) is None:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
198 fself._parse_metainfo()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
199
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
200 return self.func(fself)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
201
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
202 def __repr__(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
203 '''Return the function's docstring.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
204 return self.func.__doc__
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
205
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
206 def __doc__(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
207 '''Return the function's docstring.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
208 return self.func.__doc__
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
209
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
210 _Record = collections.namedtuple('Record', [
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
211 'CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT',
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
212 'samples'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
213 ])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
214
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
215
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
216 class VCFReader(object):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
217 '''Read and parse a VCF v 4.0 file'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
218 def __init__(self, fsock, aggressive=False):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
219 super(VCFReader, self).__init__()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
220 self.aggro = aggressive
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
221 self._metadata = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
222 self._infos = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
223 self._filters = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
224 self._formats = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
225 self._samples = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
226 self.reader = fsock
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
227 if aggressive:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
228 self._mapper = self._none_map
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
229 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
230 self._mapper = self._pass_map
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
231
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
232 def __iter__(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
233 return self
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
234
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
235 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
236 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
237 def metadata(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
238 '''Return the information from lines starting "##"'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
239 return self._metadata
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
240
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
241 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
242 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
243 def infos(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
244 '''Return the information from lines starting "##INFO"'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
245 return self._infos
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
246
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
247 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
248 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
249 def filters(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
250 '''Return the information from lines starting "##FILTER"'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
251 return self._filters
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
252
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
253 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
254 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
255 def formats(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
256 '''Return the information from lines starting "##FORMAT"'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
257 return self._formats
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
258
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
259 @property
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
260 @_meta_info
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
261 def samples(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
262 '''Return the names of the genotype fields.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
263 return self._samples
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
264
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
265 def _parse_metainfo(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
266 '''Parse the information stored in the metainfo of the VCF.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
267
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
268 The end user shouldn't have to use this. She can access the metainfo
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
269 directly with ``self.metadata``.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
270 for attr in ('_metadata', '_infos', '_filters', '_formats'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
271 setattr(self, attr, {})
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
272
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
273 parser = _vcf_metadata_parser()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
274
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
275 line = self.reader.next()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
276 while line.startswith('##'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
277 line = line.strip()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
278 if line.startswith('##INFO'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
279 key, val = parser.read_info(line)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
280 self._infos[key] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
281
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
282 elif line.startswith('##FILTER'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
283 key, val = parser.read_filter(line)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
284 self._filters[key] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
285
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
286 elif line.startswith('##FORMAT'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
287 key, val = parser.read_format(line)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
288 self._formats[key] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
289
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
290 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
291 key, val = parser.read_meta(line.strip())
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
292 self._metadata[key] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
293
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
294 line = self.reader.next()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
295
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
296 fields = line.split()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
297 self._samples = fields[9:]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
298
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
299 def _none_map(self, func, iterable, bad='.'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
300 '''``map``, but make bad values None.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
301 return [func(x) if x != bad else None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
302 for x in iterable]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
303
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
304 def _pass_map(self, func, iterable, bad='.'):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
305 '''``map``, but make bad values None.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
306 return [func(x) if x != bad else bad
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
307 for x in iterable]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
308
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
309 def _parse_info(self, info_str):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
310 '''Parse the INFO field of a VCF entry into a dictionary of Python
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
311 types.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
312
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
313 '''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
314 entries = info_str.split(';')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
315 retdict = {}
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
316 for entry in entries:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
317 entry = entry.split('=')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
318 ID = entry[0]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
319 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
320 entry_type = self.infos[ID].type
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
321 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
322 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
323 entry_type = RESERVED_INFO[ID]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
324 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
325 if entry[1:]:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
326 entry_type = 'String'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
327 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
328 entry_type = 'Flag'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
329
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
330 if entry_type == 'Integer':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
331 vals = entry[1].split(',')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
332 val = self._mapper(int, vals)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
333 elif entry_type == 'Float':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
334 vals = entry[1].split(',')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
335 val = self._mapper(float, vals)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
336 elif entry_type == 'Flag':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
337 val = True
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
338 elif entry_type == 'String':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
339 val = entry[1]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
340
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
341 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
342 if self.infos[ID].num == 1:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
343 val = val[0]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
344 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
345 pass
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
346
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
347 retdict[ID] = val
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
348
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
349 return retdict
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
350
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
351 def _parse_samples(self, samples, samp_fmt):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
352 '''Parse a sample entry according to the format specified in the FORMAT
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
353 column.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
354 samp_data = []
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
355 samp_fmt = samp_fmt.split(':')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
356 for sample in samples:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
357 sampdict = dict(zip(samp_fmt, sample.split(':')))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
358 for fmt in sampdict:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
359 vals = sampdict[fmt].split(',')
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
360 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
361 entry_type = self.formats[fmt].type
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
362 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
363 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
364 entry_type = RESERVED_FORMAT[fmt]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
365 except KeyError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
366 entry_type = 'String'
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
367
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
368 if entry_type == 'Integer':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
369 sampdict[fmt] = self._mapper(int, vals)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
370 elif entry_type == 'Float' or entry_type == 'Numeric':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
371 sampdict[fmt] = self._mapper(float, vals)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
372 elif sampdict[fmt] == './.' and self.aggro:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
373 sampdict[fmt] = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
374
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
375 samp_data.append(sampdict)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
376
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
377 for name, data in zip(self.samples, samp_data):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
378 data['name'] = name
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
379
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
380 return samp_data
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
381
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
382 def next(self):
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
383 '''Return the next record in the file.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
384 if self._samples is None:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
385 self._parse_metainfo()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
386 row = self.reader.next().split()
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
387 chrom = row[0]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
388 pos = int(row[1])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
389
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
390 if row[2] != '.':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
391 ID = row[2]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
392 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
393 ID = None if self.aggro else row[2]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
394
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
395 ref = row[3]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
396 alt = self._mapper(str, row[4].split(','))
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
397 qual = float(row[5]) if '.' in row[5] else int(row[5])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
398 filt = row[6].split(';') if ';' in row[6] else row[6]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
399 if filt == 'PASS' and self.aggro:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
400 filt = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
401 info = self._parse_info(row[7])
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
402
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
403 try:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
404 fmt = row[8]
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
405 except IndexError:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
406 fmt = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
407 samples = None
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
408 else:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
409 samples = self._parse_samples(row[9:], fmt)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
410
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
411 record = _Record(chrom, pos, ID, ref, alt, qual, filt, info, fmt,
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
412 samples)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
413 return record
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
414
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
415
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
416 def main():
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
417 '''Parse the example VCF file from the specification and print every
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
418 record.'''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
419 import contextlib
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
420 import StringIO
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
421 import textwrap
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
422 buff = '''\
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
423 ##fileformat=VCFv4.0
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
424 ##fileDate=20090805
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
425 ##source=myImputationProgramV3.1
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
426 ##reference=1000GenomesPilot-NCBI36
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
427 ##phasing=partial
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
428 ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
429 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
430 ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
431 ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
432 ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
433 ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
434 ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
435 ##FILTER=<ID=q10,Description="Quality below 10">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
436 ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
437 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
438 ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
439 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
440 ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
441 #CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
442 20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,.
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
443 20\t17330\t.\tT\tA\t3\tq10\tNS=3;DP=11;AF=0.017\tGT:GQ:DP:HQ\t0|0:49:3:58,50\t0|1:3:5:65,3\t0/0:41:3
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
444 20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10;AF=0.333,0.667;AA=T;DB\tGT:GQ:DP:HQ\t1|2:21:6:23,27\t2|1:2:0:18,2\t2/2:35:4
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
445 20\t1230237\t.\tT\t.\t47\tPASS\tNS=3;DP=13;AA=T\tGT:GQ:DP:HQ\t0|0:54:7:56,60\t0|0:48:4:51,51\t0/0:61:2
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
446 20\t1234567\tmicrosat1\tGTCT\tG,GTACT\t50\tPASS\tNS=3;DP=9;AA=G\tGT:GQ:DP\t./.:35:4\t0/2:17:2\t1/1:40:3
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
447 '''
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
448 with contextlib.closing(StringIO.StringIO(textwrap.dedent(buff))) as sock:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
449 vcf_file = VCFReader(sock, aggressive=True)
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
450 for record in vcf_file:
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
451 print record
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
452
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
453
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
454 if __name__ == '__main__':
60efb9214eaa Uploaded
melissacline
parents:
diff changeset
455 main()