annotate tools/blast2go/massage_xml_for_blast2go.py @ 25:242cf17c3bf9 draft default tip

"planemo upload for repository https://github.com/peterjc/galaxy_blast/tree/master/tools/blast2go commit 0c82b9ef284c686cbffd30582d2586e4fb52881e"
author peterjc
date Wed, 09 Sep 2020 15:01:39 +0000
parents 05eef6b222af
children
#!/usr/bin/env python
"""Script for reformatting Blast XML to suit Blast2GO.

This script takes exactly two command line arguments:
* Input BLAST XML filename
* Output BLAST XML filename

Sadly b2g4pipe (at least v2.3.5 to v2.5.0) cannot cope with current
style large BLAST XML files (e.g. from BLAST 2.2.25+), so we reformat
these to avoid it crashing with a Java heap space OutOfMemoryError.

As part of this reformatting, we check for BLASTP or BLASTX output
(otherwise raise an error), and print the query count.

This script is called from my Galaxy wrapper for Blast2GO for pipelines,
available from the Galaxy Tool Shed here:
http://toolshed.g2.bx.psu.edu/view/peterjc/blast2go

This script is under version control here:
https://github.com/peterjc/galaxy_blast/tree/master/blast2go
"""

import os
import sys


def prepare_xml(original_xml, mangled_xml):
    """Reformat BLAST XML to suit Blast2GO.

    Blast2GO can't cope with 1000s of <Iteration> tags within a
    single <BlastResult> tag, so instead split this into one
    full XML record per iteration (i.e. per query). This gives
    a concatenated XML file mimicking old versions of BLAST.

    This also checks for BLASTP or BLASTX output, and outputs
    the number of queries. Galaxy will show this as "info".
    """
    in_handle = open(original_xml)
    footer = " </BlastOutput_iterations>\n</BlastOutput>\n"
    header = ""
    # Everything before the first <Iteration> line is the shared header
    while True:
        line = in_handle.readline()
        if not line:
            # No hits?
            in_handle.close()
            sys.exit("Problem with XML file?")
        if line.strip() == "<Iteration>":
            break
        header += line

    if "<BlastOutput_program>blastx</BlastOutput_program>" in header:
        print("BLASTX output identified")
    elif "<BlastOutput_program>blastp</BlastOutput_program>" in header:
        print("BLASTP output identified")
    else:
        in_handle.close()
        sys.exit("Expect BLASTP or BLASTX output")

    out_handle = open(mangled_xml, "w")
    out_handle.write(header)
    out_handle.write(line)
    count = 1
    while True:
        line = in_handle.readline()
        if not line:
            break
        elif line.strip() == "<Iteration>":
            # Insert footer/header
            out_handle.write(footer)
            out_handle.write(header)
            count += 1
        out_handle.write(line)

    out_handle.close()
    in_handle.close()
    print("Input has %i queries" % count)


if __name__ == "__main__":
    # Run the conversion...
    if len(sys.argv) != 3:
        sys.exit("Require two arguments: XML input filename, XML output filename")

    xml_file, out_xml_file = sys.argv[1:]

    if not os.path.isfile(xml_file):
        sys.exit("Input BLAST XML file not found: %s" % xml_file)

    prepare_xml(xml_file, out_xml_file)
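
The splitting step in prepare_xml can also be tried without touching the filesystem. The block below is a minimal sketch of the same idea, not part of the original script: split_iterations is a hypothetical helper name, it takes any file-like objects instead of filenames, and the sample XML is a made-up fragment far smaller than real BLAST output. It skips the BLASTP/BLASTX check and only shows how the header and footer are repeated around each <Iteration>.

```python
import io


def split_iterations(in_handle, out_handle):
    """Copy BLAST XML, repeating the header/footer around each <Iteration>.

    Hypothetical helper (an assumption for illustration); it mirrors the
    loop logic of prepare_xml but works on file-like objects.
    Returns the number of <Iteration> blocks seen.
    """
    footer = " </BlastOutput_iterations>\n</BlastOutput>\n"
    header = ""
    count = 0
    for line in in_handle:
        if line.strip() == "<Iteration>":
            if count:
                # Close the previous record before starting a new one
                out_handle.write(footer)
            out_handle.write(header)
            count += 1
        elif count == 0:
            # Still before the first <Iteration>: accumulate the header
            header += line
            continue
        out_handle.write(line)
    return count


# Made-up two-query input, just enough structure to exercise the loop
sample = (
    "<BlastOutput>\n"
    "  <BlastOutput_program>blastp</BlastOutput_program>\n"
    "  <BlastOutput_iterations>\n"
    "<Iteration>\n"
    "  <Iteration_iter-num>1</Iteration_iter-num>\n"
    "</Iteration>\n"
    "<Iteration>\n"
    "  <Iteration_iter-num>2</Iteration_iter-num>\n"
    "</Iteration>\n"
    "  </BlastOutput_iterations>\n"
    "</BlastOutput>\n"
)
result = io.StringIO()
n_queries = split_iterations(io.StringIO(sample), result)
print("Input has %i queries" % n_queries)
```

As in prepare_xml, the closing tags of the final record are not added by the code; they are simply the last lines of the input copied through. Unlike prepare_xml, this sketch writes nothing and returns 0 when no <Iteration> line is found, rather than exiting with an error.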