annotate tools/blast2go/massage_xml_for_blast2go.py @ 24:05eef6b222af draft

planemo upload for repository https://github.com/peterjc/galaxy_blast/tools/blast2go commit 6f3c1a8da279f3b34d3bc627c97713d8dfe5f8ed
author peterjc
date Fri, 15 May 2015 06:01:16 -0400
parents 31cb702eb5a8
children 242cf17c3bf9
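Revisions: every line dates from rev 23 (31cb702eb5a8, peterjc, "Uploaded v0.0.9, embed citation, updated README") except the three print() calls in prepare_xml, which were last changed in rev 24 (05eef6b222af, peterjc).
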
#!/usr/bin/env python
"""Script for reformatting Blast XML to suit Blast2GO.

This script takes exactly two command line arguments:
 * Input BLAST XML filename
 * Output BLAST XML filename

Sadly b2g4pipe (at least v2.3.5 to v2.5.0) cannot cope with current
style large BLAST XML files (e.g. from BLAST 2.2.25+), so we reformat
these to avoid it crashing with a Java heap space OutOfMemoryError.

As part of this reformatting, we check for BLASTP or BLASTX output
(otherwise raise an error), and print the query count.

This script is called from my Galaxy wrapper for Blast2GO for pipelines,
available from the Galaxy Tool Shed here:
http://toolshed.g2.bx.psu.edu/view/peterjc/blast2go

This script is under version control here:
https://github.com/peterjc/galaxy_blast/tree/master/blast2go
"""
import sys
import os

def stop_err(msg, error_level=1):
    """Print error message to stderr and quit with given error level."""
    sys.stderr.write("%s\n" % msg)
    sys.exit(error_level)

def prepare_xml(original_xml, mangled_xml):
    """Reformat BLAST XML to suit Blast2GO.

    Blast2GO can't cope with 1000s of <Iteration> tags within a
    single <BlastResult> tag, so instead split this into one
    full XML record per iteration (i.e. per query). This gives
    a concatenated XML file mimicking old versions of BLAST.

    This also checks for BLASTP or BLASTX output, and outputs
    the number of queries. Galaxy will show this as "info".
    """
    in_handle = open(original_xml)
    footer = " </BlastOutput_iterations>\n</BlastOutput>\n"
    # Everything before the first <Iteration> line is the shared header
    header = ""
    while True:
        line = in_handle.readline()
        if not line:
            # No hits?
            in_handle.close()
            stop_err("Problem with XML file?")
        if line.strip() == "<Iteration>":
            break
        header += line

    if "<BlastOutput_program>blastx</BlastOutput_program>" in header:
        print("BLASTX output identified")
    elif "<BlastOutput_program>blastp</BlastOutput_program>" in header:
        print("BLASTP output identified")
    else:
        in_handle.close()
        stop_err("Expect BLASTP or BLASTX output")

    out_handle = open(mangled_xml, "w")
    out_handle.write(header)
    out_handle.write(line)  # the first <Iteration> line
    count = 1
    while True:
        line = in_handle.readline()
        if not line:
            break
        elif line.strip() == "<Iteration>":
            # Insert footer/header to start a new XML record
            out_handle.write(footer)
            out_handle.write(header)
            count += 1
        out_handle.write(line)

    out_handle.close()
    in_handle.close()
    print("Input has %i queries" % count)


if __name__ == "__main__":
    # Run the conversion...
    if len(sys.argv) != 3:
        stop_err("Require two arguments: XML input filename, XML output filename")

    xml_file, out_xml_file = sys.argv[1:]

    if not os.path.isfile(xml_file):
        stop_err("Input BLAST XML file not found: %s" % xml_file)

    prepare_xml(xml_file, out_xml_file)
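
The header/footer splitting done by prepare_xml can be tried on a tiny in-memory document. The sketch below is illustrative only: the helper name `split_iterations`, the `FOOTER` constant and the `SAMPLE` text are assumptions made for this example and are not part of the script, and the BLASTP/BLASTX check is left out for brevity.

```python
import io

# Same closing tags that prepare_xml inserts between records
FOOTER = " </BlastOutput_iterations>\n</BlastOutput>\n"


def split_iterations(in_handle, out_handle):
    """Hypothetical helper: write one full XML record per <Iteration>.

    Returns the number of <Iteration> blocks seen.
    """
    # Collect everything before the first <Iteration> line as the header
    header = ""
    for line in in_handle:
        if line.strip() == "<Iteration>":
            break
        header += line
    else:
        raise ValueError("No <Iteration> tag found")

    out_handle.write(header)
    out_handle.write(line)  # the first <Iteration> line
    count = 1
    for line in in_handle:
        if line.strip() == "<Iteration>":
            # Close the previous record and repeat the header for the next
            out_handle.write(FOOTER)
            out_handle.write(header)
            count += 1
        out_handle.write(line)
    return count


# Made-up, heavily trimmed BLAST XML with two queries
SAMPLE = (
    '<?xml version="1.0"?>\n'
    "<BlastOutput>\n"
    " <BlastOutput_program>blastp</BlastOutput_program>\n"
    " <BlastOutput_iterations>\n"
    "<Iteration>\n"
    " <Iteration_iter-num>1</Iteration_iter-num>\n"
    "</Iteration>\n"
    "<Iteration>\n"
    " <Iteration_iter-num>2</Iteration_iter-num>\n"
    "</Iteration>\n"
    " </BlastOutput_iterations>\n"
    "</BlastOutput>\n"
)

result = io.StringIO()
n = split_iterations(io.StringIO(SAMPLE), result)
print("Input has %i queries" % n)
print(result.getvalue())
```

The output should contain the header once per query, so each record opens and closes its own <BlastOutput> element. The closing tags of the last record come from the tail of the input itself, not from the inserted footer.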