changeset 20:688f3fb09a6a draft
Uploaded v0.0.20 preview 11, moved to GitHub, MIT license, reST markup.
author | peterjc |
---|---|
date | Tue, 30 Jul 2013 07:33:46 -0400 |
parents | c1a6e5aefee0 |
children | 6902315b7730 |
files | ncbi_blast_plus/README.rst ncbi_blast_plus/blastxml_to_tabular.py ncbi_blast_plus/blastxml_to_tabular.xml ncbi_blast_plus/ncbi_blastdbcmd_info.xml ncbi_blast_plus/ncbi_blastdbcmd_wrapper.xml ncbi_blast_plus/ncbi_blastn_wrapper.xml ncbi_blast_plus/ncbi_blastp_wrapper.xml ncbi_blast_plus/ncbi_blastx_wrapper.xml ncbi_blast_plus/ncbi_makeblastdb.xml ncbi_blast_plus/ncbi_rpsblast_wrapper.xml ncbi_blast_plus/ncbi_rpstblastn_wrapper.xml ncbi_blast_plus/ncbi_tblastn_wrapper.xml ncbi_blast_plus/ncbi_tblastx_wrapper.xml ncbi_blast_plus/repository_dependencies.xml ncbi_blast_plus/tool_dependencies.xml test-data/blastx_sample.xml tools/ncbi_blast_plus/blastxml_to_tabular.py tools/ncbi_blast_plus/blastxml_to_tabular.xml tools/ncbi_blast_plus/ncbi_blast_plus.txt tools/ncbi_blast_plus/ncbi_blastdbcmd_info.xml tools/ncbi_blast_plus/ncbi_blastdbcmd_wrapper.xml tools/ncbi_blast_plus/ncbi_blastn_wrapper.xml tools/ncbi_blast_plus/ncbi_blastp_wrapper.xml tools/ncbi_blast_plus/ncbi_blastx_wrapper.xml tools/ncbi_blast_plus/ncbi_makeblastdb.xml tools/ncbi_blast_plus/ncbi_rpsblast_wrapper.xml tools/ncbi_blast_plus/ncbi_rpstblastn_wrapper.xml tools/ncbi_blast_plus/ncbi_tblastn_wrapper.xml tools/ncbi_blast_plus/ncbi_tblastx_wrapper.xml tools/ncbi_blast_plus/repository_dependencies.xml tools/ncbi_blast_plus/tool_dependencies.xml |
diffstat | 30 files changed, 2894 insertions(+), 2875 deletions(-) |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/README.rst Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,166 @@ +Galaxy wrappers for NCBI BLAST+ suite +===================================== + +These wrappers are copyright 2010-2013 by Peter Cock, The James Hutton Institute +(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. +See the licence text below. + +Currently tested with NCBI BLAST 2.2.26+ (i.e. version 2.2.26 of BLAST+), +and does not work with the NCBI 'legacy' BLAST suite (e.g. blastall). + +Note that these wrappers (and the associated datatypes) were originally +distributed as part of the main Galaxy repository, but as of August 2012 +moved to the Galaxy Tool Shed as 'ncbi_blast_plus' (and 'blast_datatypes'). +My thanks to Dannon Baker from the Galaxy development team for his assistance +with this. + +These wrappers are available from the Galaxy Tool Shed at: +http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + + +Automated Installation +====================== + +Galaxy should be able to automatically install the dependencies, i.e. the +'blast_datatypes' repository which defines the BLAST XML file format +('blastxml') and protein and nucleotide BLAST databases ('blastdbp' and +'blastdbn'). + +You must tell Galaxy about any system level BLAST databases using configuration +files blastdb.loc (nucleotide databases like NT) and blastdb_p.loc (protein +databases like NR), and blastdb_d.loc (protein domain databases like CDD or +SMART) which are located in the tool-data/ folder. Sample files are included +which explain the tab-based format to use. 
+ +You can download the NCBI provided databases as tar-balls from here: + +* ftp://ftp.ncbi.nlm.nih.gov/blast/db/ (nucleotide and protein databases like NR) +* ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/ (domain databases like CDD) + + +Manual Installation +=================== + +For those not using Galaxy's automated installation from the Tool Shed, put +the XML and Python files in the tools/ncbi_blast_plus/ folder and add the XML +files to your tool_conf.xml as normal (and do the same in tool_conf.xml.sample +in order to run the unit tests). For example, use:: + + <section name="NCBI BLAST+" id="ncbi_blast_plus_tools"> + <tool file="ncbi_blast_plus/ncbi_blastn_wrapper.xml" /> + <tool file="ncbi_blast_plus/ncbi_blastp_wrapper.xml" /> + <tool file="ncbi_blast_plus/ncbi_blastx_wrapper.xml" /> + <tool file="ncbi_blast_plus/ncbi_tblastn_wrapper.xml" /> + <tool file="ncbi_blast_plus/ncbi_tblastx_wrapper.xml" /> + <tool file="ncbi_blast_plus/ncbi_makeblastdb.xml" /> + <tool file="ncbi_blast_plus/ncbi_blastdbcmd_wrapper.xml" /> + <tool file="ncbi_blast_plus/ncbi_blastdbcmd_info.xml" /> + <tool file="ncbi_blast_plus/ncbi_rpsblast_wrapper.xml" /> + <tool file="ncbi_blast_plus/ncbi_rpstblastn_wrapper.xml" /> + <tool file="ncbi_blast_plus/blastxml_to_tabular.xml" /> + </section> + +You will also need to install 'blast_datatypes' from the Tool Shed. This +defines the BLAST XML file format ('blastxml') and protein and nucleotide +BLAST databases composite file formats ('blastdbp' and 'blastdbn'). + +As described above for an automated installation, you must also tell Galaxy +about any system level BLAST databases using the tool-data/blastdb*.loc files. + +You must install the NCBI BLAST+ standalone tools somewhere on the system +path. Currently the unit tests are written using "BLAST 2.2.26+". 
+ +Run the functional tests (adjusting the section identifier to match your +tool_conf.xml.sample file):: + + ./run_functional_tests.sh -sid NCBI_BLAST+-ncbi_blast_plus_tools + + +History +======= + +======= ====================================================================== +Version Changes +------- ---------------------------------------------------------------------- +v0.0.11 - Final revision as part of the Galaxy main repository, and the + first release via the Tool Shed +v0.0.12 - Implements genetic code option for translation searches. + - Changes <parallelism> to 1000 sequences at a time (to cope with + very large sets of queries where BLAST+ can become memory hungry) + - Include warning that BLAST+ with subject FASTA gives pairwise + e-values +v0.0.13 - Use the new error handling options in Galaxy (the previously + bundled hide_stderr.py script is no longer needed). +v0.0.14 - Support for makeblastdb and blastdbinfo with local BLAST databases + in the history (using work from Edward Kirton), requires v0.0.14 + of the 'blast_datatypes' repository from the Tool Shed. +v0.0.15 - Stronger warning in help text against searching against subject + FASTA files (better looking e-values than you might be expecting). +v0.0.16 - Added repository_dependencies.xml for automates installation of the + 'blast_datatypes' repository from the Tool Shed. +v0.0.17 - The BLAST+ search tools now default to extended tabular output + (all too often our users where having to re-run searches just to + get one of the missing columns like query or subject length) +v0.0.18 - Defensive quoting of filenames in case of spaces (where possible, + BLAST+ handling of some mult-file arguments is problematic). +v0.0.19 - Added wrappers for rpsblast and rpstblastn, and new blastdb_d.loc + for the domain databases they use (e.g. CDD, PFAM or SMART). + - Correct case of exception regular expression (for error handling + fall-back in case the return code is not set properly). 
+ - Clearer naming of output files. +v0.0.20 - Added unit tests for BLASTN and TBLASTX. + - Added percentage identity option to BLASTN. + - Fallback on ElementTree if cElementTree missing in XML to tabular. + - Link to Tool Shed added to help text and this documentation. + - Tweak dependency on blast_datatypes to also work on Test Tool Shed + - Adopted standard MIT License. + - Development moved to GitHub, https://github.com/peterjc/galaxy_blast +======= ====================================================================== + + +Bug Reports +=========== + +You can file an issue here https://github.com/peterjc/galaxy_blast/issues or ask +us on the Galaxy development list http://lists.bx.psu.edu/listinfo/galaxy-dev + + +Developers +========== + +This script and related tools were originally developed on the 'tools' branch +of the following Mercurial repository: +https://bitbucket.org/peterjc/galaxy-central/ + +As of July 2013, development is continuing on a dedicated GitHub repository: +https://github.com/peterjc/galaxy_blast + +For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball I use +the following command from the GitHub repository root folder:: + + $ ./ncbi_blast_plus/make_ncbi_blast_plus.sh + +This simplifies ensuring a consistent set of files is bundled each time, +including all the relevant test files. + + +Licence (MIT) +============= + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. 
+ +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/blastxml_to_tabular.py Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,261 @@ +#!/usr/bin/env python +"""Convert a BLAST XML file to tabular output. + +Takes three command line options, input BLAST XML filename, output tabular +BLAST filename, output format (std for standard 12 columns, or ext for the +extended 24 columns offered in the BLAST+ wrappers). + +The 12 columns output are 'qseqid sseqid pident length mismatch gapopen qstart +qend sstart send evalue bitscore' or 'std' at the BLAST+ command line, which +mean: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The additional columns offered in the Galaxy BLAST+ wrappers are: + +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= 
=========================================== + +Most of these fields are given explicitly in the XML file, others some like +the percentage identity and the number of gap openings must be calculated. + +Be aware that the sequence in the extended tabular output or XML direct from +BLAST+ may or may not use XXXX masking on regions of low complexity. This +can throw the off the calculation of percentage identity and gap openings. +[In fact, both BLAST 2.2.24+ and 2.2.25+ have a subtle bug in this regard, +with these numbers changing depending on whether or not the low complexity +filter is used.] + +This script attempts to produce identical output to what BLAST+ would have done. +However, check this with "diff -b ..." since BLAST+ sometimes includes an extra +space character (probably a bug). +""" +import sys +import re + +if "-v" in sys.argv or "--version" in sys.argv: + print "v0.0.12" + sys.exit(0) + +if sys.version_info[:2] >= ( 2, 5 ): + try: + from xml.etree import cElementTree as ElementTree + except ImportError: + from xml.etree import ElementTree as ElementTree +else: + from galaxy import eggs + import pkg_resources; pkg_resources.require( "elementtree" ) + from elementtree import ElementTree + +def stop_err( msg ): + sys.stderr.write("%s\n" % msg) + sys.exit(1) + +#Parse Command Line +try: + in_file, out_file, out_fmt = sys.argv[1:] +except: + stop_err("Expect 3 arguments: input BLAST XML file, output tabular file, out format (std or ext)") + +if out_fmt == "std": + extended = False +elif out_fmt == "x22": + stop_err("Format argument x22 has been replaced with ext (extended 24 columns)") +elif out_fmt == "ext": + extended = True +else: + stop_err("Format argument should be std (12 column) or ext (extended 24 columns)") + + +# get an iterable +try: + context = ElementTree.iterparse(in_file, events=("start", "end")) +except: + stop_err("Invalid data format.") +# turn it into an iterator +context = iter(context) +# get the root element +try: + event, root = 
context.next() +except: + stop_err( "Invalid data format." ) + + +re_default_query_id = re.compile("^Query_\d+$") +assert re_default_query_id.match("Query_101") +assert not re_default_query_id.match("Query_101a") +assert not re_default_query_id.match("MyQuery_101") +re_default_subject_id = re.compile("^Subject_\d+$") +assert re_default_subject_id.match("Subject_1") +assert not re_default_subject_id.match("Subject_") +assert not re_default_subject_id.match("Subject_12a") +assert not re_default_subject_id.match("TheSubject_1") + + +outfile = open(out_file, 'w') +blast_program = None +for event, elem in context: + if event == "end" and elem.tag == "BlastOutput_program": + blast_program = elem.text + # for every <Iteration> tag + if event == "end" and elem.tag == "Iteration": + #Expecting either this, from BLAST 2.2.25+ using FASTA vs FASTA + # <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID> + # <Iteration_query-def>Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1</Iteration_query-def> + # <Iteration_query-len>406</Iteration_query-len> + # <Iteration_hits></Iteration_hits> + # + #Or, from BLAST 2.2.24+ run online + # <Iteration_query-ID>Query_1</Iteration_query-ID> + # <Iteration_query-def>Sample</Iteration_query-def> + # <Iteration_query-len>516</Iteration_query-len> + # <Iteration_hits>... 
+ qseqid = elem.findtext("Iteration_query-ID") + if re_default_query_id.match(qseqid): + #Place holder ID, take the first word of the query definition + qseqid = elem.findtext("Iteration_query-def").split(None,1)[0] + qlen = int(elem.findtext("Iteration_query-len")) + + # for every <Hit> within <Iteration> + for hit in elem.findall("Iteration_hits/Hit"): + #Expecting either this, + # <Hit_id>gi|3024260|sp|P56514.1|OPSD_BUFBU</Hit_id> + # <Hit_def>RecName: Full=Rhodopsin</Hit_def> + # <Hit_accession>P56514</Hit_accession> + #or, + # <Hit_id>Subject_1</Hit_id> + # <Hit_def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</Hit_def> + # <Hit_accession>Subject_1</Hit_accession> + # + #apparently depending on the parse_deflines switch + sseqid = hit.findtext("Hit_id").split(None,1)[0] + hit_def = sseqid + " " + hit.findtext("Hit_def") + if re_default_subject_id.match(sseqid) \ + and sseqid == hit.findtext("Hit_accession"): + #Place holder ID, take the first word of the subject definition + hit_def = hit.findtext("Hit_def") + sseqid = hit_def.split(None,1)[0] + # for every <Hsp> within <Hit> + for hsp in hit.findall("Hit_hsps/Hsp"): + nident = hsp.findtext("Hsp_identity") + length = hsp.findtext("Hsp_align-len") + pident = "%0.2f" % (100*float(nident)/float(length)) + + q_seq = hsp.findtext("Hsp_qseq") + h_seq = hsp.findtext("Hsp_hseq") + m_seq = hsp.findtext("Hsp_midline") + assert len(q_seq) == len(h_seq) == len(m_seq) == int(length) + gapopen = str(len(q_seq.replace('-', ' ').split())-1 + \ + len(h_seq.replace('-', ' ').split())-1) + + mismatch = m_seq.count(' ') + m_seq.count('+') \ + - q_seq.count('-') - h_seq.count('-') + #TODO - Remove this alternative mismatch calculation and test + #once satisifed there are no problems + expected_mismatch = len(q_seq) \ + - sum(1 for q,h in zip(q_seq, h_seq) \ + if q == h or q == "-" or h == "-") + xx = sum(1 for q,h in zip(q_seq, h_seq) if q=="X" and h=="X") + if not (expected_mismatch - q_seq.count("X") <= int(mismatch) 
<= expected_mismatch + xx): + stop_err("%s vs %s mismatches, expected %i <= %i <= %i" \ + % (qseqid, sseqid, expected_mismatch - q_seq.count("X"), + int(mismatch), expected_mismatch)) + + #TODO - Remove this alternative identity calculation and test + #once satisifed there are no problems + expected_identity = sum(1 for q,h in zip(q_seq, h_seq) if q == h) + if not (expected_identity - xx <= int(nident) <= expected_identity + q_seq.count("X")): + stop_err("%s vs %s identities, expected %i <= %i <= %i" \ + % (qseqid, sseqid, expected_identity, int(nident), + expected_identity + q_seq.count("X"))) + + + evalue = hsp.findtext("Hsp_evalue") + if evalue == "0": + evalue = "0.0" + else: + evalue = "%0.0e" % float(evalue) + + bitscore = float(hsp.findtext("Hsp_bit-score")) + if bitscore < 100: + #Seems to show one decimal place for lower scores + bitscore = "%0.1f" % bitscore + else: + #Note BLAST does not round to nearest int, it truncates + bitscore = "%i" % bitscore + + values = [qseqid, + sseqid, + pident, + length, #hsp.findtext("Hsp_align-len") + str(mismatch), + gapopen, + hsp.findtext("Hsp_query-from"), #qstart, + hsp.findtext("Hsp_query-to"), #qend, + hsp.findtext("Hsp_hit-from"), #sstart, + hsp.findtext("Hsp_hit-to"), #send, + evalue, #hsp.findtext("Hsp_evalue") in scientific notation + bitscore, #hsp.findtext("Hsp_bit-score") rounded + ] + + if extended: + sallseqid = ";".join(name.split(None,1)[0] for name in hit_def.split(">")) + #print hit_def, "-->", sallseqid + positive = hsp.findtext("Hsp_positive") + ppos = "%0.2f" % (100*float(positive)/float(length)) + qframe = hsp.findtext("Hsp_query-frame") + sframe = hsp.findtext("Hsp_hit-frame") + if blast_program == "blastp": + #Probably a bug in BLASTP that they use 0 or 1 depending on format + if qframe == "0": qframe = "1" + if sframe == "0": sframe = "1" + slen = int(hit.findtext("Hit_len")) + values.extend([sallseqid, + hsp.findtext("Hsp_score"), #score, + nident, + positive, + hsp.findtext("Hsp_gaps"), #gaps, 
+ ppos, + qframe, + sframe, + #NOTE - for blastp, XML shows original seq, tabular uses XXX masking + q_seq, + h_seq, + str(qlen), + str(slen), + ]) + #print "\t".join(values) + outfile.write("\t".join(values) + "\n") + # prevents ElementTree from growing large datastructure + root.clear() + elem.clear() +outfile.close()
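The calculated columns in blastxml_to_tabular.py above (pident, gapopen, evalue and bitscore) can be tried in isolation. The following is a minimal Python 3 sketch, not part of the changeset: the function names are hypothetical, and only the formatting expressions are taken from the script.

```python
def count_gap_openings(aligned_seq):
    """Count the gap runs that separate residues in one aligned sequence."""
    # Turning '-' into spaces and splitting leaves one chunk per ungapped
    # stretch; N chunks are separated by N - 1 gap runs.
    return len(aligned_seq.replace("-", " ").split()) - 1


def gap_openings(q_seq, h_seq):
    """Total gap openings over the query and hit strings, as in the script."""
    return count_gap_openings(q_seq) + count_gap_openings(h_seq)


def format_pident(nident, length):
    """Percentage identity to two decimal places."""
    return "%0.2f" % (100 * float(nident) / float(length))


def format_evalue(evalue):
    """E-value in scientific notation with no decimals; '0' becomes '0.0'."""
    if evalue == "0":
        return "0.0"
    return "%0.0e" % float(evalue)


def format_bitscore(bitscore):
    """One decimal place below 100, otherwise truncated to an integer."""
    if bitscore < 100:
        return "%0.1f" % bitscore
    return "%i" % bitscore  # %i truncates a float, it does not round
```

For example, `gap_openings("ACG--TAC", "ACGTT-AC")` counts one gap run in each string, and `format_bitscore(250.9)` truncates instead of rounding up.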
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/blastxml_to_tabular.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,137 @@ +<tool id="blastxml_to_tabular" name="BLAST XML to tabular" version="0.0.11"> + <description>Convert BLAST XML output to tabular</description> + <version_command interpreter="python">blastxml_to_tabular.py --version</version_command> + <command interpreter="python"> + blastxml_to_tabular.py $blastxml_file $tabular_file $out_format + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + </stdio> + <inputs> + <param name="blastxml_file" type="data" format="blastxml" label="BLAST results as XML"/> + <param name="out_format" type="select" label="Output format"> + <option value="std">Tabular (standard 12 columns)</option> + <option value="ext" selected="True">Tabular (extended 24 columns)</option> + </param> + </inputs> + <outputs> + <data name="tabular_file" format="tabular" label="BLAST results as tabular" /> + </outputs> + <requirements> + </requirements> + <tests> + <test> + <param name="blastxml_file" value="blastp_four_human_vs_rhodopsin.xml" ftype="blastxml" /> + <param name="out_format" value="std" /> + <!-- Note this has some white space differences from the actual blastp output blast_four_human_vs_rhodopsin.tabluar --> + <output name="tabular_file" file="blastp_four_human_vs_rhodopsin_converted.tabular" ftype="tabular" /> + </test> + <test> + <param name="blastxml_file" value="blastp_four_human_vs_rhodopsin.xml" ftype="blastxml" /> + <param name="out_format" value="ext" /> + <!-- Note this has some white space differences from the actual blastp output blast_four_human_vs_rhodopsin_22c.tabluar --> + <output name="tabular_file" file="blastp_four_human_vs_rhodopsin_converted_ext.tabular" ftype="tabular" /> + </test> + <test> + <param name="blastxml_file" value="blastp_sample.xml" ftype="blastxml" /> + <param name="out_format" value="std" /> + <!-- Note 
this has some white space differences from the actual blastp output --> + <output name="tabular_file" file="blastp_sample_converted.tabular" ftype="tabular" /> + </test> + <test> + <param name="blastxml_file" value="blastx_rhodopsin_vs_four_human.xml" ftype="blastxml" /> + <param name="out_format" value="std" /> + <!-- Note this has some white space differences from the actual blastx output --> + <output name="tabular_file" file="blastx_rhodopsin_vs_four_human_converted.tabular" ftype="tabular" /> + </test> + <test> + <param name="blastxml_file" value="blastx_rhodopsin_vs_four_human.xml" ftype="blastxml" /> + <param name="out_format" value="ext" /> + <!-- Note this has some white space and XXXX masking differences from the actual blastx output --> + <output name="tabular_file" file="blastx_rhodopsin_vs_four_human_converted_ext.tabular" ftype="tabular" /> + </test> + <test> + <param name="blastxml_file" value="blastx_sample.xml" ftype="blastxml" /> + <param name="out_format" value="std" /> + <!-- Note this has some white space differences from the actual blastx output --> + <output name="tabular_file" file="blastx_sample_converted.tabular" ftype="tabular" /> + </test> + <test> + <param name="blastxml_file" value="blastp_human_vs_pdb_seg_no.xml" ftype="blastxml" /> + <param name="out_format" value="std" /> + <!-- Note this has some white space differences from the actual blastp output --> + <output name="tabular_file" file="blastp_human_vs_pdb_seg_no_converted_std.tabular" ftype="tabular" /> + </test> + <test> + <param name="blastxml_file" value="blastp_human_vs_pdb_seg_no.xml" ftype="blastxml" /> + <param name="out_format" value="ext" /> + <!-- Note this has some white space differences from the actual blastp output --> + <output name="tabular_file" file="blastp_human_vs_pdb_seg_no_converted_ext.tabular" ftype="tabular" /> + </test> + </tests> + <help> + +**What it does** + +NCBI BLAST+ (and the older NCBI 'legacy' BLAST) can output in a range of +formats including 
tabular and a more detailed XML format. A complex workflow +may need both the XML and the tabular output - but running BLAST twice is +slow and wasteful. + +This tool takes the BLAST XML output and can convert it into the +standard 12 column tabular equivalent: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The BLAST+ tools can optionally output additional columns of information, +but this takes longer to calculate. Most (but not all) of these columns are +included by selecting the extended tabular output. The extra columns are +included *after* the standard 12 columns. This is so that you can write +workflow filtering steps that accept either the 12 or 22 column tabular +BLAST output. This tool now uses this extended 24 column output by default. 
+ +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= =========================================== + +Beware that the XML file (and thus the conversion) and the tabular output +direct from BLAST+ may differ in the presence of XXXX masking on regions +low complexity (columns 21 and 22), and thus also calculated figures like +the percentage identity (column 3). + +**References** + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_blastdbcmd_info.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,67 @@ +<tool id="ncbi_blastdbcmd_info" name="NCBI BLAST+ database info" version="0.0.6"> + <description>Show BLAST database information from blastdbcmd</description> + <requirements> + <requirement type="binary">blastdbcmd</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>blastdbcmd -version</version_command> + <command> +blastdbcmd -dbtype $db_opts.db_type -db "${db_opts.database.fields.path}" -info -out "$info" + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- Suspect blastdbcmd sometimes fails to set error level --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <conditional name="db_opts"> + <param name="db_type" type="select" label="Type of BLAST database"> + <option value="nucl" selected="True">Nucleotide</option> + <option value="prot">Protein</option> + </param> + <when value="nucl"> + <param name="database" type="select" label="Nucleotide BLAST database"> + <options from_file="blastdb.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + </when> + <when value="prot"> + <param name="database" type="select" label="Protein BLAST database"> + <options from_file="blastdb_p.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + </when> + </conditional> + </inputs> + <outputs> + <data name="info" format="txt" label="${db_opts.database.fields.name} info" /> + </outputs> + <help> + +**What it does** + +Calls the NCBI BLAST+ blastdbcmd command line tool with the -info +switch to give summary information about a BLAST database, such as +the size (number of sequences and total length) and date. 
+ +------- + +**References** + +Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. + +Schaffer et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. 2001. Nucleic Acids Res. 29:2994-3005. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_blastdbcmd_wrapper.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,139 @@ +<tool id="ncbi_blastdbcmd_wrapper" name="NCBI BLAST+ blastdbcmd entry(s)" version="0.0.6"> + <description>Extract sequence(s) from BLAST database</description> + <requirements> + <requirement type="binary">blastdbcmd</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>blastdbcmd -version</version_command> + <command> +## The command is a Cheetah template which allows some Python based syntax. +## Lines starting hash hash are comments. Galaxy will turn newlines into spaces +blastdbcmd -dbtype $db_opts.db_type -db "${db_opts.database.fields.path}" + +##TODO: What about -ctrl_a and -target_only as advanced options? + +#if $id_opts.id_type=="file": +-entry_batch "$id_opts.entries" +#else: +##Perform some simple search/replaces to remove whitespace +##and make it comma separated, and escape any pipe characters +-entry "$id_opts.entries.replace('\r',',').replace('\n',',').replace(' ','').replace(',,',',').replace(',,',',').strip(',').replace('|','\|')" +#end if + +##When building a BLAST database, to ensure unique IDs makeblastdb will +##do things like turning a FASTA entry with ID of ERP44 into lcl|ERP44 +##(if using -parse_seqids) or simply assign it an ID using the record +##number like gnl|BL_ORD_ID|123 (to cope with duplicate IDs in the FASTA +##file). In -parse_seqids mode, a duplicate FASTA ID gives an error. +## +##The BLAST plain text and XML output will contain these BLAST IDs, but +##the tabular output does not (at least, not in BLAST 2.2.25+). +##Therefore in general, Galaxy users won't care about the (internal) +##BLAST identifiers. +## +##The blastdbcmd FASTA output will also contain these IDs, but in the +##context of the BLAST tabular output they are not helpful. 
Therefore +##to recover the original ID as used in the FASTA file for makeblastdb +##we need a litte post processing. +## +##We remove the NCBI's lcl|... or gnl|BL_ORD_ID|123 prefixes +##using sed, however the exact syntax differs for Mac OS X's sed + +#if str($outfmt)=="blastid": +-out "$seq" +#else if sys.platform == "darwin": +| sed -E 's/^>(lcl\||gnl\|BL_ORD_ID\|[0-9]* )/>/1' > "$seq" +#else: +| sed 's/>\(lcl|\|gnl|BL_ORD_ID|[0-9]* \)/>/1' > "$seq" +#end if + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- Suspect blastdbcmd sometimes fails to set error level --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <conditional name="db_opts"> + <param name="db_type" type="select" label="Type of BLAST database"> + <option value="nucl" selected="True">Nucleotide</option> + <option value="prot">Protein</option> + </param> + <when value="nucl"> + <param name="database" type="select" label="Nucleotide BLAST database"> + <options from_file="blastdb.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + </when> + <when value="prot"> + <param name="database" type="select" label="Protein BLAST database"> + <options from_file="blastdb_p.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + </when> + </conditional> + <conditional name="id_opts"> + <param name="id_type" type="select" label="Type of identifier list"> + <option value="file">From file</option> + <option value="prompt">User entered</option> + </param> + <when value="file"> + <param name="entries" type="data" format="txt,tabular" label="Sequence identifier(s)" help="Plain text file with one ID per line (i.e. 
single column tabular file)"/> + </when> + <when value="prompt"> + <param name="entries" type="text" label="Sequence identifier(s)" help="Comma or new line separated list." optional="False" area="True" size="10x30"/> + </when> + </conditional> + <param name="outfmt" type="select" label="Output format"> + <option value="original">FASTA with original identifiers</option> + <option value="blastid">FASTA with BLAST assigned identifiers</option> + </param> + </inputs> + <outputs> + <data name="seq" format="fasta" label="Sequences from ${db_opts.database.fields.name}" /> + </outputs> + <help> + +**What it does** + +Extracts FASTA formatted sequences from a BLAST database +using the NCBI BLAST+ blastdbcmd command line tool. + +.. class:: warningmark + +**BLAST assigned identifiers** + +When a BLAST database is constructed from a FASTA file, the +original identifiers can be replaced with BLAST assigned +identifiers, partly to ensure uniqueness. e.g. Sometimes +a prefix of 'lcl|' is added (lcl is short for local), +or an arbitrary name starting 'gnl|BL_ORD_ID|' is created. + +If you are using the tabular output from BLAST, it will contain +the original identifiers - not the BLAST assigned identifiers +suitable for use with the blastdbcmd tool. + +If you are using the XML or plain text output, this will also +contain the BLAST assigned identifiers. However, this means +getting a list of BLAST assigned identifiers isn't straightforward. + +------- + +**References** + +Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. + +Schaffer et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. 2001. Nucleic Acids Res. 29:2994-3005. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
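The identifier handling in the blastdbcmd wrapper above can be tried outside of Galaxy. What follows is a minimal Python sketch, not part of the wrapper itself: `normalise_entries` mirrors the chain of `replace`/`strip` calls applied to user-entered IDs in the Cheetah template, and `strip_blast_prefix` mirrors the Mac OS X variant of the `sed` expression that removes `lcl|` or `gnl|BL_ORD_ID|123 ` prefixes. Both function names are hypothetical and chosen for illustration only.

```python
import re

# Hypothetical helper mirroring the Cheetah expression used for -entry:
# line breaks become commas, spaces are dropped, doubled commas are
# collapsed (two passes, as in the template), leading/trailing commas
# are removed, and pipe characters are escaped for the shell.
def normalise_entries(text):
    return (text.replace("\r", ",")
                .replace("\n", ",")
                .replace(" ", "")
                .replace(",,", ",")
                .replace(",,", ",")
                .strip(",")
                .replace("|", "\\|"))

# Same pattern as the anchored sed -E call: drop a leading "lcl|" or
# "gnl|BL_ORD_ID|<digits> " that directly follows the FASTA ">".
_BLAST_PREFIX = re.compile(r"^>(lcl\||gnl\|BL_ORD_ID\|[0-9]* )")

def strip_blast_prefix(fasta_line):
    # Lines that do not match (e.g. sequence lines) are returned unchanged.
    return _BLAST_PREFIX.sub(">", fasta_line, count=1)
```

Because the template only collapses doubled commas twice, input with long runs of blank lines could still leave empty entries; the sketch deliberately keeps that behaviour rather than improving on it.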
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_blastn_wrapper.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,257 @@ +<tool id="ncbi_blastn_wrapper" name="NCBI BLAST+ blastn" version="0.0.20"> + <description>Search nucleotide database with nucleotide query sequence(s)</description> + <!-- If job splitting is enabled, break up the query file into parts --> + <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> + <requirements> + <requirement type="binary">blastn</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>blastn -version</version_command> + <command> +## The command is a Cheetah template which allows some Python based syntax. +## Lines starting hash hash are comments. Galaxy will turn newlines into spaces +blastn +-query "$query" +#if $db_opts.db_opts_selector == "db": + -db "${db_opts.database.fields.path}" +#elif $db_opts.db_opts_selector == "histdb": + -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" +#else: + -subject "$db_opts.subject" +#end if +-task $blast_type +-evalue $evalue_cutoff +-out "$output1" +##Set the extended list here so if/when we add things, saved workflows are not affected +#if str($out_format)=="ext": + -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" +#else: + -outfmt $out_format +#end if +-num_threads 8 +#if $adv_opts.adv_opts_selector=="advanced": +$adv_opts.filter_query +$adv_opts.strand +## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string +## Note -max_target_seqs overrides -num_descriptions and -num_alignments +#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): +-max_target_seqs $adv_opts.max_hits +#end if +#if (str($adv_opts.identity_cutoff) and float(str($adv_opts.identity_cutoff)) > 0 ): +-perc_identity 
$adv_opts.identity_cutoff +#end if +#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): +-word_size $adv_opts.word_size +#end if +$adv_opts.ungapped +$adv_opts.parse_deflines +## End of advanced options: +#end if + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- In case the return code has not been set properly check stderr too --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <param name="query" type="data" format="fasta" label="Nucleotide query sequence(s)"/> + <conditional name="db_opts"> + <param name="db_opts_selector" type="select" label="Subject database/sequences"> + <option value="db" selected="True">Locally installed BLAST database</option> + <option value="histdb">BLAST database from your history</option> + <option value="file">FASTA file from your history (see warning note below)</option> + </param> + <when value="db"> + <param name="database" type="select" label="Nucleotide BLAST database"> + <options from_file="blastdb.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="histdb"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="data" format="blastdbn" label="Nucleotide BLAST database" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="file"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="data" format="fasta" label="Nucleotide FASTA file to use as database"/> + </when> + </conditional> + <param name="blast_type" type="select" display="radio" label="Type of BLAST"> + <option value="megablast">megablast</option> + <option value="blastn">blastn</option> + <option 
value="blastn-short">blastn-short</option> + <option value="dc-megablast">dc-megablast</option> + <!-- Using BLAST 2.2.24+ this gives an error: + BLAST engine error: Program type 'vecscreen' not supported + <option value="vecscreen">vecscreen</option> + --> + </param> + <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> + <param name="out_format" type="select" label="Output format"> + <option value="6">Tabular (standard 12 columns)</option> + <option value="ext" selected="True">Tabular (extended 24 columns)</option> + <option value="5">BLAST XML</option> + <option value="0">Pairwise text</option> + <option value="0 -html">Pairwise HTML</option> + <option value="2">Query-anchored text</option> + <option value="2 -html">Query-anchored HTML</option> + <option value="4">Flat query-anchored text</option> + <option value="4 -html">Flat query-anchored HTML</option> + <!-- + <option value="-outfmt 11">BLAST archive format (ASN.1)</option> + --> + </param> + <conditional name="adv_opts"> + <param name="adv_opts_selector" type="select" label="Advanced Options"> + <option value="basic" selected="True">Hide Advanced Options</option> + <option value="advanced">Show Advanced Options</option> + </param> + <when value="basic" /> + <when value="advanced"> + <!-- Could use a select (yes, no, other) where other allows setting 'level window linker' --> + <param name="filter_query" type="boolean" label="Filter out low complexity regions (with DUST)" truevalue="-dust yes" falsevalue="-dust no" checked="true" /> + <param name="strand" type="select" label="Query strand(s) to search against database/subject"> + <option value="-strand both">Both</option> + <option value="-strand plus">Plus (forward)</option> + <option value="-strand minus">Minus (reverse complement)</option> + </param> + <!-- Why doesn't optional override a validator? 
I want to accept an empty string OR a non-negative integer --> + <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> + <validator type="in_range" min="0" /> + </param> + <param name="identity_cutoff" type="float" min="0" max="100" value="0" label="Percent identity cutoff (-perc_identity)" help="Use zero for no cutoff" /> + <!-- I'd like word_size to be optional, with minimum 4 for blastn --> + <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 4."> + <validator type="in_range" min="0" /> + </param> + <param name="ungapped" type="boolean" label="Perform ungapped alignment only?" truevalue="-ungapped" falsevalue="" checked="false" /> + <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> + </when> + </conditional> + </inputs> + <outputs> + <data name="output1" format="tabular" label="${blast_type.value_label} on ${on_string}"> + <change_format> + <when input="out_format" value="0" format="txt"/> + <when input="out_format" value="0 -html" format="html"/> + <when input="out_format" value="2" format="txt"/> + <when input="out_format" value="2 -html" format="html"/> + <when input="out_format" value="4" format="txt"/> + <when input="out_format" value="4 -html" format="html"/> + <when input="out_format" value="5" format="blastxml"/> + </change_format> + </data> + </outputs> + <tests> + <test> + <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="three_human_mRNA.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-40" /> + <param name="out_format" value="6" /> + <param name="adv_opts_selector" value="basic" /> + 
<output name="output1" file="blastn_rhodopsin_vs_three_human.tabular" ftype="tabular" /> + </test> + </tests> + <help> + +.. class:: warningmark + +**Note**. Database searches may take a substantial amount of time. +For large input datasets it is advisable to allow overnight processing. + +----- + +**What it does** + +Search a *nucleotide database* using a *nucleotide query*, +using the NCBI BLAST+ blastn command line tool. +Algorithms include blastn, megablast, and discontiguous megablast. + +.. class:: warningmark + +You can also search against a FASTA file of subject nucleotide +sequences. This is *not* advised because it is slower (only one +CPU is used), but more importantly gives e-values for pairwise +searches (very small e-values which will look overly significant). +In most cases you should instead turn the other FASTA file into a +database first using *makeblastdb* and search against that. + +----- + +**Output format** + +Because Galaxy focuses on processing tabular data, the default output of this +tool is tabular. The standard BLAST+ tabular output contains 12 columns: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The BLAST+ tools can optionally output additional columns of information, +but this takes longer to calculate. 
Most (but not all) of these columns are +included by selecting the extended tabular output. The extra columns are +included *after* the standard 12 columns. This is so that you can write +workflow filtering steps that accept either the 12 or 24 column tabular +BLAST output. Galaxy now uses this extended 24 column output by default. + +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= =========================================== + +The third option is BLAST XML output, which is designed to be parsed by +another program, and is understood by some Galaxy tools. + +You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). +The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. +The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. +The two query anchored outputs show a multiple sequence alignment between the query and all the matches, +and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). + +------- + +**References** + +Zhang et al. A Greedy Algorithm for Aligning DNA Sequences. 2000. JCB: 203-214. 
+ +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
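The help text above notes that the extra columns come *after* the standard 12 so that downstream steps can accept either layout. As a rough illustration of that idea, here is a minimal Python sketch (not part of the wrapper) that reads one line of 12 or 24 column tabular output into a dictionary and applies a filter using only standard columns. The column names are taken from the tables above; the function names and the example thresholds are assumptions made for this sketch.

```python
# Column names as listed in the wrapper's help text.
STD_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
EXT_COLUMNS = ["sallseqid", "score", "nident", "positive", "gaps", "ppos",
               "qframe", "sframe", "qseq", "sseq", "qlen", "slen"]

def parse_hit(line):
    """Hypothetical parser: map one tab separated hit line to a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (12, 24):
        raise ValueError("expected 12 or 24 columns, got %d" % len(fields))
    # zip stops at the shorter sequence, so a 12 column line simply
    # yields the 12 standard keys and nothing else.
    return dict(zip(STD_COLUMNS + EXT_COLUMNS, fields))

def passes(hit, min_pident=90.0, max_evalue=1e-10):
    """Example filter touching only standard columns (thresholds are arbitrary)."""
    return (float(hit["pident"]) >= min_pident
            and float(hit["evalue"]) <= max_evalue)
```

Since `passes` only looks at `pident` and `evalue`, the same filter works unchanged on both the standard and the extended output.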
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_blastp_wrapper.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,308 @@ +<tool id="ncbi_blastp_wrapper" name="NCBI BLAST+ blastp" version="0.0.20"> + <description>Search protein database with protein query sequence(s)</description> + <!-- If job splitting is enabled, break up the query file into parts --> + <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> + <requirements> + <requirement type="binary">blastp</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>blastp -version</version_command> + <command> +## The command is a Cheetah template which allows some Python based syntax. +## Lines starting hash hash are comments. Galaxy will turn newlines into spaces +blastp +-query "$query" +#if $db_opts.db_opts_selector == "db": + -db "${db_opts.database.fields.path}" +#elif $db_opts.db_opts_selector == "histdb": + -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" +#else: + -subject "$db_opts.subject" +#end if +-task $blast_type +-evalue $evalue_cutoff +-out "$output1" +##Set the extended list here so if/when we add things, saved workflows are not affected +#if str($out_format)=="ext": + -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" +#else: + -outfmt $out_format +#end if +-num_threads 8 +#if $adv_opts.adv_opts_selector=="advanced": +$adv_opts.filter_query +-matrix $adv_opts.matrix +## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string +## Note -max_target_seqs overrides -num_descriptions and -num_alignments +#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): +-max_target_seqs $adv_opts.max_hits +#end if +#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): +-word_size $adv_opts.word_size +#end if 
+##Ungapped disabled for now - see comments below +##$adv_opts.ungapped +$adv_opts.parse_deflines +## End of advanced options: +#end if + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- In case the return code has not been set properly check stderr too --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <param name="query" type="data" format="fasta" label="Protein query sequence(s)"/> + <conditional name="db_opts"> + <param name="db_opts_selector" type="select" label="Subject database/sequences"> + <option value="db" selected="True">Locally installed BLAST database</option> + <option value="histdb">BLAST database from your history</option> + <option value="file">FASTA file from your history (see warning note below)</option> + </param> + <when value="db"> + <param name="database" type="select" label="Protein BLAST database"> + <options from_file="blastdb_p.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="histdb"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="data" format="blastdbp" label="Protein BLAST database" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="file"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="data" format="fasta" label="Protein FASTA file to use as database"/> + </when> + </conditional> + <param name="blast_type" type="select" display="radio" label="Type of BLAST"> + <option value="blastp">blastp</option> + <option value="blastp-short">blastp-short</option> + </param> + <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> + <param 
name="out_format" type="select" label="Output format"> + <option value="6">Tabular (standard 12 columns)</option> + <option value="ext" selected="True">Tabular (extended 24 columns)</option> + <option value="5">BLAST XML</option> + <option value="0">Pairwise text</option> + <option value="0 -html">Pairwise HTML</option> + <option value="2">Query-anchored text</option> + <option value="2 -html">Query-anchored HTML</option> + <option value="4">Flat query-anchored text</option> + <option value="4 -html">Flat query-anchored HTML</option> + <!-- + <option value="-outfmt 11">BLAST archive format (ASN.1)</option> + --> + </param> + <conditional name="adv_opts"> + <param name="adv_opts_selector" type="select" label="Advanced Options"> + <option value="basic" selected="True">Hide Advanced Options</option> + <option value="advanced">Show Advanced Options</option> + </param> + <when value="basic" /> + <when value="advanced"> + <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> + <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="false" /> + <param name="matrix" type="select" label="Scoring matrix"> + <option value="BLOSUM90">BLOSUM90</option> + <option value="BLOSUM80">BLOSUM80</option> + <option value="BLOSUM62" selected="true">BLOSUM62 (default)</option> + <option value="BLOSUM50">BLOSUM50</option> + <option value="BLOSUM45">BLOSUM45</option> + <option value="PAM250">PAM250</option> + <option value="PAM70">PAM70</option> + <option value="PAM30">PAM30</option> + </param> + <!-- Why doesn't optional override a validator? 
I want to accept an empty string OR a non-negative integer --> + <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> + <validator type="in_range" min="0" /> + </param> + <!-- I'd like word_size to be optional, with minimum 2 for blastp --> + <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> + <validator type="in_range" min="0" /> + </param> + <!-- + Can't use '-ungapped' on its own, error back is: + Composition-adjusted searched are not supported with an ungapped search, please add -comp_based_stats F or do a gapped search + Tried using '-ungapped -comp_based_stats F' and blastp crashed with 'Attempt to access NULL pointer.' + <param name="ungapped" type="boolean" label="Perform ungapped alignment only?" truevalue="-ungapped -comp_based_stats F" falsevalue="" checked="false" /> + --> + <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> + </when> + </conditional> + </inputs> + <outputs> + <data name="output1" format="tabular" label="${blast_type.value_label} on ${on_string}"> + <change_format> + <when input="out_format" value="0" format="txt"/> + <when input="out_format" value="0 -html" format="html"/> + <when input="out_format" value="2" format="txt"/> + <when input="out_format" value="2 -html" format="html"/> + <when input="out_format" value="4" format="txt"/> + <when input="out_format" value="4 -html" format="html"/> + <when input="out_format" value="5" format="blastxml"/> + </change_format> + </data> + </outputs> + <tests> + <test> + <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="rhodopsin_proteins.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-8" /> + <param name="blast_type" value="blastp" /> + <param name="out_format" value="5" /> + <param name="adv_opts_selector" value="advanced" /> + <param name="filter_query" value="False" /> + <param name="matrix" value="BLOSUM62" /> + <param name="max_hits" value="0" /> + <param name="word_size" value="0" /> + <param name="parse_deflines" value="True" /> + <output name="output1" file="blastp_four_human_vs_rhodopsin.xml" ftype="blastxml" /> + </test> + <test> + <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="rhodopsin_proteins.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-8" /> + <param name="blast_type" value="blastp" /> + <param name="out_format" value="6" /> + <param name="adv_opts_selector" value="advanced" /> + <param name="filter_query" value="False" /> + <param name="matrix" value="BLOSUM62" /> + <param 
name="max_hits" value="0" /> + <param name="word_size" value="0" /> + <param name="parse_deflines" value="True" /> + <output name="output1" file="blastp_four_human_vs_rhodopsin.tabular" ftype="tabular" /> + </test> + <test> + <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="rhodopsin_proteins.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-8" /> + <param name="blast_type" value="blastp" /> + <param name="out_format" value="ext" /> + <param name="adv_opts_selector" value="advanced" /> + <param name="filter_query" value="False" /> + <param name="matrix" value="BLOSUM62" /> + <param name="max_hits" value="0" /> + <param name="word_size" value="0" /> + <param name="parse_deflines" value="True" /> + <output name="output1" file="blastp_four_human_vs_rhodopsin_ext.tabular" ftype="tabular" /> + </test> + <test> + <param name="query" value="rhodopsin_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-8" /> + <param name="blast_type" value="blastp" /> + <param name="out_format" value="6" /> + <param name="adv_opts_selector" value="basic" /> + <output name="output1" file="blastp_rhodopsin_vs_four_human.tabular" ftype="tabular" /> + </test> + </tests> + <help> + +.. class:: warningmark + +**Note**. Database searches may take a substantial amount of time. +For large input datasets it is advisable to allow overnight processing. + +----- + +**What it does** + +Search a *protein database* using a *protein query*, +using the NCBI BLAST+ blastp command line tool. + +.. class:: warningmark + +You can also search against a FASTA file of subject protein +sequences. 
This is *not* advised because it is slower (only one +CPU is used), but more importantly gives e-values for pairwise +searches (very small e-values which will look overly significant). +In most cases you should instead turn the other FASTA file into a +database first using *makeblastdb* and search against that. + +----- + +**Output format** + +Because Galaxy focuses on processing tabular data, the default output of this +tool is tabular. The standard BLAST+ tabular output contains 12 columns: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The BLAST+ tools can optionally output additional columns of information, +but this takes longer to calculate. Most (but not all) of these columns are +included by selecting the extended tabular output. The extra columns are +included *after* the standard 12 columns. This is so that you can write +workflow filtering steps that accept either the 12 or 24 column tabular +BLAST output. Galaxy now uses this extended 24 column output by default. 
+ +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= =========================================== + +The third option is BLAST XML output, which is designed to be parsed by +another program, and is understood by some Galaxy tools. + +You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). +The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. +The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. +The two query anchored outputs show a multiple sequence alignment between the query and all the matches, +and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). + +------- + +**References** + +Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. + +Schaffer et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. 2001. Nucleic Acids Res. 29:2994-3005. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
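The advanced options in these wrappers treat zero (or an empty value) as "use the BLAST+ default": the Cheetah guards of the form `str(x) and int(str(x)) > 0` only add a flag when the value is a positive integer. The Python sketch below restates that convention outside of Cheetah. It is an illustration only, and the helper names `optional_int_flag` and `advanced_args` are hypothetical.

```python
def optional_int_flag(flag, value):
    """Return [flag, value] only for a positive integer, else an empty list.

    Mirrors the template guard: the value is converted to a string first
    (in Galaxy it is a wrapper object, not a plain int), an empty string
    is skipped, and zero means "leave the BLAST+ default alone".
    """
    text = str(value)
    if text and int(text) > 0:
        return [flag, text]
    return []

def advanced_args(max_hits, word_size):
    """Assemble the two optional integer arguments shown in the wrapper."""
    return (optional_int_flag("-max_target_seqs", max_hits)
            + optional_int_flag("-word_size", word_size))
```

For example, `advanced_args(0, 0)` yields an empty list, so neither `-max_target_seqs` nor `-word_size` would appear on the command line.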
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_blastx_wrapper.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,294 @@ +<tool id="ncbi_blastx_wrapper" name="NCBI BLAST+ blastx" version="0.0.19"> + <description>Search protein database with translated nucleotide query sequence(s)</description> + <!-- If job splitting is enabled, break up the query file into parts --> + <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> + <requirements> + <requirement type="binary">blastx</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>blastx -version</version_command> + <command> +## The command is a Cheetah template which allows some Python based syntax. +## Lines starting hash hash are comments. Galaxy will turn newlines into spaces +blastx +-query "$query" +#if $db_opts.db_opts_selector == "db": + -db "${db_opts.database.fields.path}" +#elif $db_opts.db_opts_selector == "histdb": + -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" +#else: + -subject "$db_opts.subject" +#end if +-query_gencode $query_gencode +-evalue $evalue_cutoff +-out "$output1" +##Set the extended list here so if/when we add things, saved workflows are not affected +#if str($out_format)=="ext": + -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" +#else: + -outfmt $out_format +#end if +-num_threads 8 +#if $adv_opts.adv_opts_selector=="advanced": +$adv_opts.filter_query +$adv_opts.strand +-matrix $adv_opts.matrix +## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string +## Note -max_target_seqs overrides -num_descriptions and -num_alignments +#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): +-max_target_seqs $adv_opts.max_hits +#end if +#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): 
+-word_size $adv_opts.word_size +#end if +$adv_opts.ungapped +$adv_opts.parse_deflines +## End of advanced options: +#end if + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- In case the return code has not been set properly check stderr too --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <param name="query" type="data" format="fasta" label="Nucleotide query sequence(s)"/> + <conditional name="db_opts"> + <param name="db_opts_selector" type="select" label="Subject database/sequences"> + <option value="db" selected="True">Locally installed BLAST database</option> + <option value="histdb">BLAST database from your history</option> + <option value="file">FASTA file from your history (see warning note below)</option> + </param> + <when value="db"> + <param name="database" type="select" label="Protein BLAST database"> + <options from_file="blastdb_p.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="histdb"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="data" format="blastdbp" label="Protein BLAST database" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="file"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="data" format="fasta" label="Protein FASTA file to use as database"/> + </when> + </conditional> + <param name="query_gencode" type="select" label="Query genetic code"> + <!-- See http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details --> + <option value="1" selected="True">1. Standard</option> + <option value="2">2. Vertebrate Mitochondrial</option> + <option value="3">3. 
Yeast Mitochondrial</option> + <option value="4">4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code</option> + <option value="5">5. Invertebrate Mitochondrial</option> + <option value="6">6. Ciliate, Dasycladacean and Hexamita Nuclear Code</option> + <option value="9">9. Echinoderm Mitochondrial</option> + <option value="10">10. Euplotid Nuclear</option> + <option value="11">11. Bacteria and Archaea</option> + <option value="12">12. Alternative Yeast Nuclear</option> + <option value="13">13. Ascidian Mitochondrial</option> + <option value="14">14. Flatworm Mitochondrial</option> + <option value="15">15. Blepharisma Macronuclear</option> + <option value="16">16. Chlorophycean Mitochondrial Code</option> + <option value="21">21. Trematode Mitochondrial Code</option> + <option value="22">22. Scenedesmus obliquus mitochondrial Code</option> + <option value="23">23. Thraustochytrium Mitochondrial Code</option> + <option value="24">24. Pterobranchia mitochondrial code</option> + </param> + <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> + <param name="out_format" type="select" label="Output format"> + <option value="6">Tabular (standard 12 columns)</option> + <option value="ext" selected="True">Tabular (extended 24 columns)</option> + <option value="5">BLAST XML</option> + <option value="0">Pairwise text</option> + <option value="0 -html">Pairwise HTML</option> + <option value="2">Query-anchored text</option> + <option value="2 -html">Query-anchored HTML</option> + <option value="4">Flat query-anchored text</option> + <option value="4 -html">Flat query-anchored HTML</option> + <!-- + <option value="-outfmt 11">BLAST archive format (ASN.1)</option> + --> + </param> + <conditional name="adv_opts"> + <param name="adv_opts_selector" type="select" label="Advanced Options"> + <option value="basic" selected="True">Hide Advanced Options</option> + <option value="advanced">Show 
Advanced Options</option> + </param> + <when value="basic" /> + <when value="advanced"> + <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> + <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="true" /> + <param name="strand" type="select" label="Query strand(s) to search against database/subject"> + <option value="-strand both">Both</option> + <option value="-strand plus">Plus (forward)</option> + <option value="-strand minus">Minus (reverse complement)</option> + </param> + <param name="matrix" type="select" label="Scoring matrix"> + <option value="BLOSUM90">BLOSUM90</option> + <option value="BLOSUM80">BLOSUM80</option> + <option value="BLOSUM62" selected="true">BLOSUM62 (default)</option> + <option value="BLOSUM50">BLOSUM50</option> + <option value="BLOSUM45">BLOSUM45</option> + <option value="PAM250">PAM250</option> + <option value="PAM70">PAM70</option> + <option value="PAM30">PAM30</option> + </param> + <!-- Why doesn't optional override a validator? I want to accept an empty string OR a non-negative integer --> + <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> + <validator type="in_range" min="0" /> + </param> + <!-- I'd like word_size to be optional, with minimum 2 for blastx --> + <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> + <validator type="in_range" min="0" /> + </param> + <param name="ungapped" type="boolean" label="Perform ungapped alignment only?" truevalue="-ungapped" falsevalue="" checked="false" /> + <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> + </when> + </conditional> + </inputs> + <outputs> + <data name="output1" format="tabular" label="blastx on ${on_string}"> + <change_format> + <when input="out_format" value="0" format="txt"/> + <when input="out_format" value="0 -html" format="html"/> + <when input="out_format" value="2" format="txt"/> + <when input="out_format" value="2 -html" format="html"/> + <when input="out_format" value="4" format="txt"/> + <when input="out_format" value="4 -html" format="html"/> + <when input="out_format" value="5" format="blastxml"/> + </change_format> + </data> + </outputs> + <tests> + <test> + <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-10" /> + <param name="out_format" value="5" /> + <param name="adv_opts_selector" value="basic" /> + <output name="output1" file="blastx_rhodopsin_vs_four_human.xml" ftype="blastxml" /> + </test> + <test> + <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-10" /> + <param name="out_format" value="6" /> + <param name="adv_opts_selector" value="basic" /> + <output name="output1" file="blastx_rhodopsin_vs_four_human.tabular" ftype="tabular" /> + </test> + <test> + <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-10" /> + <param name="out_format" value="ext" /> + <param 
name="adv_opts_selector" value="basic" /> + <output name="output1" file="blastx_rhodopsin_vs_four_human_ext.tabular" ftype="tabular" /> + </test> + </tests> + <help> + +.. class:: warningmark + +**Note**. Database searches may take a substantial amount of time. +For large input datasets it is advisable to allow overnight processing. + +----- + +**What it does** + +Search a *protein database* using a *translated nucleotide query*, +using the NCBI BLAST+ blastx command line tool. + +.. class:: warningmark + +You can also search against a FASTA file of subject protein +sequences. This is *not* advised because it is slower (only one +CPU is used), but more importantly gives e-values for pairwise +searches (very small e-values which will look overly significant). +In most cases you should instead turn the other FASTA file into a +database first using *makeblastdb* and search against that. + +----- + +**Output format** + +Because Galaxy focuses on processing tabular data, the default output of this +tool is tabular. The standard BLAST+ tabular output contains 12 columns: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The BLAST+ tools can optionally output additional columns of information, +but this takes longer to calculate. Most (but not all) of these columns are +included by selecting the extended tabular output.
The extra columns are +included *after* the standard 12 columns. This is so that you can write +workflow filtering steps that accept either the 12 or 24 column tabular +BLAST output. Galaxy now uses this extended 24 column output by default. + +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= =========================================== + +The third option is BLAST XML output, which is designed to be parsed by +another program, and is understood by some Galaxy tools. + +You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). +The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. +The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. +The two query anchored outputs show a multiple sequence alignment between the query and all the matches, +and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). + +------- + +**References** + +Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
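The help text above says the extra columns follow the standard 12 so that a filtering step can accept either layout. A minimal Python sketch of such a step is given here, assuming tab-separated rows. The helper names `parse_blast_tabular_line` and `filter_hits` are hypothetical and not part of this repository; the column names are taken from the two tables in the help text.

```python
# Column names as listed in the wrapper's help text.
STANDARD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
EXTENDED_COLUMNS = [
    "sallseqid", "score", "nident", "positive", "gaps", "ppos",
    "qframe", "sframe", "qseq", "sseq", "qlen", "slen",
]


def parse_blast_tabular_line(line):
    """Map one tab-separated row (12 or 24 fields) to a dict of strings."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 12:
        names = STANDARD_COLUMNS
    elif len(fields) == 24:
        names = STANDARD_COLUMNS + EXTENDED_COLUMNS
    else:
        raise ValueError("Expected 12 or 24 columns, got %d" % len(fields))
    return dict(zip(names, fields))


def filter_hits(lines, max_evalue=1e-10):
    """Yield hits whose e-value is at or below max_evalue.

    Only the first 12 columns are consulted, so both layouts work.
    Blank lines and lines starting with '#' are skipped (an assumption
    made for this sketch).
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        hit = parse_blast_tabular_line(line)
        if float(hit["evalue"]) <= max_evalue:
            yield hit
```

Because the filter reads only columns 1 to 12, the same step can follow either the standard or the extended tabular output without modification.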
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_makeblastdb.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,129 @@ +<tool id="ncbi_makeblastdb" name="NCBI BLAST+ makeblastdb" version="0.0.5"> + <description>Make BLAST database</description> + <requirements> + <requirement type="binary">makeblastdb</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>makeblastdb -version</version_command> + <command> +makeblastdb -out "${os.path.join($outfile.extra_files_path,'blastdb')}" +$parse_seqids +$hash_index +## Single call to -in with multiple filenames space separated with outer quotes +## (presumably any filenames with spaces would be a problem). Note this gives +## some extra spaces, e.g. -in " file1 file2 file3 " but BLAST seems happy: +-in " +#for $i in $in +${i.file} #end for +" +#if $title: +-title "$title" +#else: +##Would default to being based on the cryptic Galaxy filenames, which is unhelpful +-title "BLAST Database" +#end if +-dbtype $dbtype +## #set $sep = '-mask_data ' +## #for $i in $mask_data +## $sep${i.file} +## #set $set = ', ' +## #end for +## #set $sep = '-gi_mask -gi_mask_name ' +## #for $i in $gi_mask +## $sep${i.file} +## #set $set = ', ' +## #end for +## #if $tax.select == 'id': +## -taxid $tax.id +## #else if $tax.select == 'map': +## -taxid_map $tax.map +## #end if +</command> +<stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- In case the return code has not been set properly check stderr too --> + <regex match="Error:" /> + <regex match="Exception:" /> +</stdio> +<inputs> + <param name="dbtype" type="select" display="radio" label="Molecule type of input"> + <option value="prot">protein</option> + <option value="nucl">nucleotide</option> + </param> + <!-- TODO Allow merging of existing BLAST databases (conditional on the database type) + <repeat name="in" title="Blast or Fasta Database"
min="1"> + <param name="file" type="data" format="fasta,blastdbn,blastdbp" label="Blast or Fasta database" /> + </repeat> + --> + <repeat name="in" title="FASTA file" min="1"> + <param name="file" type="data" format="fasta" /> + </repeat> + <param name="title" type="text" value="" label="Title for BLAST database" help="This is the database name shown in BLAST search output" /> + <param name="parse_seqids" type="boolean" truevalue="-parse_seqids" falsevalue="" checked="False" label="Parse the sequence identifiers" help="This is only advised if your FASTA file follows the NCBI naming conventions using pipe '|' symbols" /> + <param name="hash_index" type="boolean" truevalue="-hash_index" falsevalue="" checked="true" label="Enable the creation of sequence hash values." help="These hash values can then be used to quickly determine if a given sequence data exists in this BLAST database." /> + + <!-- SEQUENCE MASKING OPTIONS --> + <!-- TODO + <repeat name="mask_data" title="Provide one or more files containing masking data"> + <param name="file" type="data" format="asnb" label="File containing masking data" help="As produced by NCBI masking applications (e.g. 
dustmasker, segmasker, windowmasker)" /> + </repeat> + <repeat name="gi_mask" title="Create GI indexed masking data"> + <param name="file" type="data" format="asnb" label="Masking data output file" /> + </repeat> + --> + + <!-- TAXONOMY OPTIONS --> + <!-- TODO + <conditional name="tax"> + <param name="select" type="select" label="Taxonomy options"> + <option value="">Do not assign sequences to Taxonomy IDs</option> + <option value="id">Assign all sequences to one Taxonomy ID</option> + <option value="map">Supply text file mapping sequence IDs to taxonomy IDs</option> + </param> + <when value=""> + </when> + <when value="id"> + <param name="id" type="integer" value="" label="NCBI taxonomy ID" help="Integer >=0" /> + </when> + <when value="map"> + <param name="file" type="data" format="txt" label="Seq ID : Tax ID mapping file" help="Format: SequenceId TaxonomyId" /> + </when> + </conditional> + --> +</inputs> +<outputs> + <!-- If we only accepted one FASTA file, we could use its human name here... --> + <data name="outfile" format="data" label="${dbtype.value_label} BLAST database from ${on_string}"> + <change_format> + <when input="dbtype" value="nucl" format="blastdbn"/> + <when input="dbtype" value="prot" format="blastdbp"/> + </change_format> + </data> +</outputs> +<help> +**What it does** + +Make BLAST database from one or more FASTA files and/or BLAST databases. + +This is a wrapper for the NCBI BLAST+ tool 'makeblastdb', which is the +replacement for the 'formatdb' tool in the NCBI 'legacy' BLAST suite. + +<!-- +Applying masks to an existing BLAST database will not change the original database; a new database will be created. +For this reason, it's best to apply all masks at once to minimize the number of unnecessary intermediate databases. +--> + +**Documentation** + +http://www.ncbi.nlm.nih.gov/books/NBK1763/ + +**References** + +Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res.
25:3389-3402. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus +</help> +</tool>
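The makeblastdb command template above passes every FASTA file through a single quoted -in argument and falls back to a fixed title when none is given. The same assembly can be sketched in Python as follows. This is an illustration only, not how Galaxy runs the tool: the function name `build_makeblastdb_args` is hypothetical, and the rejection of paths containing spaces is an assumption based on the template's own comment that such filenames would presumably be a problem.

```python
def build_makeblastdb_args(fasta_paths, out_prefix, dbtype, title="",
                           parse_seqids=False, hash_index=True):
    """Return an argument list mirroring the wrapper's command template.

    Defaults follow the wrapper's inputs: -hash_index on, -parse_seqids off.
    """
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    if not fasta_paths:
        raise ValueError("At least one FASTA file is required")
    for path in fasta_paths:
        # All files share one space-separated -in value, so a space
        # inside a path would be read as a separator.
        if " " in path:
            raise ValueError("Path contains a space: %r" % path)
    args = ["makeblastdb", "-out", out_prefix]
    if parse_seqids:
        args.append("-parse_seqids")
    if hash_index:
        args.append("-hash_index")
    # One list element here is equivalent to one quoted shell argument.
    args += ["-in", " ".join(fasta_paths)]
    # Same fallback title as the template.
    args += ["-title", title or "BLAST Database"]
    args += ["-dbtype", dbtype]
    return args
```

The returned list could be handed to `subprocess.run` without a shell, which avoids the quoting concerns of a command string; unlike the template, it produces no extra spaces inside the -in value.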
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_rpsblast_wrapper.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,238 @@ +<tool id="ncbi_rpsblast_wrapper" name="NCBI BLAST+ rpsblast" version="0.0.4"> + <description>Search protein domain database (PSSMs) with protein query sequence(s)</description> + <!-- If job splitting is enabled, break up the query file into parts --> + <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject" merge_outputs="output1"></parallelism> + <requirements> + <requirement type="binary">rpsblast</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>rpsblast -version</version_command> + <command> +## The command is a Cheetah template which allows some Python based syntax. +## Lines starting hash hash are comments. Galaxy will turn newlines into spaces +rpsblast +-query "$query" +#if $db_opts.db_opts_selector == "db": + -db "${db_opts.database.fields.path}" +#elif $db_opts.db_opts_selector == "histdb": + -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" +#end if +-evalue $evalue_cutoff +-out "$output1" +##Set the extended list here so if/when we add things, saved workflows are not affected +#if str($out_format)=="ext": + -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" +#else: + -outfmt $out_format +#end if +-num_threads 8 +#if $adv_opts.adv_opts_selector=="advanced": +$adv_opts.filter_query +## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string +## Note -max_target_seqs overrides -num_descriptions and -num_alignments +#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): +-max_target_seqs $adv_opts.max_hits +#end if +#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): +-word_size $adv_opts.word_size +#end if +$adv_opts.parse_deflines +## End of advanced options: +#end if + 
</command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- In case the return code has not been set properly check stderr too --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <param name="query" type="data" format="fasta" label="Protein query sequence(s)"/> + <conditional name="db_opts"> + <param name="db_opts_selector" type="select" label="Protein domain database (PSSM)"> + <option value="db" selected="True">Locally installed BLAST database</option> + <!-- TODO - define new datatype + <option value="histdb">BLAST protein domain database from your history</option> + --> + </param> + <when value="db"> + <param name="database" type="select" label="Protein domain database"> + <options from_file="blastdb_d.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="hidden" value="" /> + </when> + <!-- TODO - define new datatype + <when value="histdb"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="data" format="blastdbd" label="Protein domain database" /> + <param name="subject" type="hidden" value="" /> + </when> + --> + </conditional> + <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> + <param name="out_format" type="select" label="Output format"> + <option value="6">Tabular (standard 12 columns)</option> + <option value="ext" selected="True">Tabular (extended 24 columns)</option> + <option value="5">BLAST XML</option> + <option value="0">Pairwise text</option> + <option value="0 -html">Pairwise HTML</option> + <option value="2">Query-anchored text</option> + <option value="2 -html">Query-anchored HTML</option> + <option value="4">Flat query-anchored text</option> + <option value="4 -html">Flat query-anchored HTML</option> +
<!-- + <option value="-outfmt 11">BLAST archive format (ASN.1)</option> + --> + </param> + <conditional name="adv_opts"> + <param name="adv_opts_selector" type="select" label="Advanced Options"> + <option value="basic" selected="True">Hide Advanced Options</option> + <option value="advanced">Show Advanced Options</option> + </param> + <when value="basic" /> + <when value="advanced"> + <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> + <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="false" /> + <!-- Why doesn't optional override a validator? I want to accept an empty string OR a non-negative integer --> + <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> + <validator type="in_range" min="0" /> + </param> + <!-- I'd like word_size to be optional, with minimum 2 for rpsblast --> + <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> + <validator type="in_range" min="0" /> + </param> + <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> + </when> + </conditional> + </inputs> + <outputs> + <data name="output1" format="tabular" label="rpsblast on ${on_string}"> + <change_format> + <when input="out_format" value="0" format="txt"/> + <when input="out_format" value="0 -html" format="html"/> + <when input="out_format" value="2" format="txt"/> + <when input="out_format" value="2 -html" format="html"/> + <when input="out_format" value="4" format="txt"/> + <when input="out_format" value="4 -html" format="html"/> + <when input="out_format" value="5" format="blastxml"/> + </change_format> + </data> + </outputs> + <help> + +.. class:: warningmark + +**Note**. Database searches may take a substantial amount of time. +For large input datasets it is advisable to allow overnight processing. + +----- + +**What it does** + +Search a *protein domain database* using a *protein query*, +using the NCBI BLAST+ rpsblast command line tool. 
+ +The protein domain databases use position-specific scoring matrices +(PSSMs) and are available for a number of domain collections including: + +*CDD* - NCBI curated meta-collection of domains, see +http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#NCBI_curated_domains + +*Kog* - PSSMs from automatically aligned sequences and sequence +fragments classified in the KOGs resource, the eukaryotic +counterpart to COGs, see http://www.ncbi.nlm.nih.gov/COG/new/ + +*Cog* - PSSMs from automatically aligned sequences and sequence +fragments classified in the COGs resource, which focuses primarily +on prokaryotes, see http://www.ncbi.nlm.nih.gov/COG/new/ + +*Pfam* - PSSMs from Pfam-A seed alignment database, see +http://pfam.sanger.ac.uk/ + +*Smart* - PSSMs from SMART domain alignment database, see +http://smart.embl-heidelberg.de/ + +*Tigr* - PSSMs from TIGRFAM database of protein families, see +http://www.jcvi.org/cms/research/projects/tigrfams/overview/ + +*Prk* - PSSMs from automatically aligned stable clusters in the +Protein Clusters database, see +http://www.ncbi.nlm.nih.gov/proteinclusters?cmd=search&db=proteinclusters + +The exact list of domain databases offered will depend on how your +local Galaxy has been configured. + +----- + +**Output format** + +Because Galaxy focuses on processing tabular data, the default output of this +tool is tabular.
The standard BLAST+ tabular output contains 12 columns: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The BLAST+ tools can optionally output additional columns of information, +but this takes longer to calculate. Most (but not all) of these columns are +included by selecting the extended tabular output. The extra columns are +included *after* the standard 12 columns. This is so that you can write +workflow filtering steps that accept either the 12 or 24 column tabular +BLAST output. Galaxy now uses this extended 24 column output by default. 
+ +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= =========================================== + +The third option is BLAST XML output, which is designed to be parsed by +another program, and is understood by some Galaxy tools. + +You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). +The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. +The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. +The two query anchored outputs show a multiple sequence alignment between the query and all the matches, +and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). + +------- + +**References** + +Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W327-31. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
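The rpsblast wrapper above fills its database menu from blastdb_d.loc, reading columns 0, 1 and 2 as value, name and path. A minimal Python sketch of reading such a file is given here. It assumes tab-separated fields and that lines starting with '#' are comments, and it silently skips rows with fewer than three fields, which is a choice made for this sketch rather than documented Galaxy behaviour. The function name `parse_loc_lines` is hypothetical.

```python
def parse_loc_lines(lines):
    """Parse .loc style lines into dicts with value, name and path keys.

    Column indices follow the wrapper's <column> elements:
    0 = value, 1 = name, 2 = path.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines (assumed '#' prefix).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            # Incomplete row; ignored in this sketch.
            continue
        entries.append({
            "value": fields[0],
            "name": fields[1],
            "path": fields[2],
        })
    return entries
```

A row such as `cdd<TAB>CDD domains<TAB>/data/blast/cdd` (an invented example) would yield one entry whose path is what the wrapper passes to -db.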
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_rpstblastn_wrapper.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,239 @@ +<tool id="ncbi_rpstblastn_wrapper" name="NCBI BLAST+ rpstblastn" version="0.0.4"> + <description>Search protein domain database (PSSMs) with translated nucleotide query sequence(s)</description> + <!-- If job splitting is enabled, break up the query file into parts --> + <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject" merge_outputs="output1"></parallelism> + <requirements> + <requirement type="binary">rpstblastn</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>rpstblastn -version</version_command> + <command> +## The command is a Cheetah template which allows some Python based syntax. +## Lines starting hash hash are comments. Galaxy will turn newlines into spaces +rpstblastn +-query "$query" +#if $db_opts.db_opts_selector == "db": + -db "${db_opts.database.fields.path}" +#elif $db_opts.db_opts_selector == "histdb": + -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" +#end if +-evalue $evalue_cutoff +-out "$output1" +##Set the extended list here so if/when we add things, saved workflows are not affected +#if str($out_format)=="ext": + -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" +#else: + -outfmt $out_format +#end if +##Seems rpstblastn does not currently support multiple threads :( +##-num_threads 8 +#if $adv_opts.adv_opts_selector=="advanced": +$adv_opts.filter_query +## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string +## Note -max_target_seqs overrides -num_descriptions and -num_alignments +#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): +-max_target_seqs $adv_opts.max_hits +#end if +#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): +-word_size 
$adv_opts.word_size +#end if +$adv_opts.parse_deflines +## End of advanced options: +#end if + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- In case the return code has not been set properly check stderr too --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <param name="query" type="data" format="fasta" label="Nucleotide query sequence(s)"/> + <conditional name="db_opts"> + <param name="db_opts_selector" type="select" label="Protein domain database (PSSM)"> + <option value="db" selected="True">Locally installed BLAST database</option> + <!-- TODO - define new datatype + <option value="histdb">BLAST protein domain database from your history</option> + --> + </param> + <when value="db"> + <param name="database" type="select" label="Protein domain database"> + <options from_file="blastdb_d.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="hidden" value="" /> + </when> + <!-- TODO - define new datatype + <when value="histdb"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="data" format="blastdbd" label="Protein domain database" /> + <param name="subject" type="hidden" value="" /> + </when> + --> + </conditional> + <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> + <param name="out_format" type="select" label="Output format"> + <option value="6">Tabular (standard 12 columns)</option> + <option value="ext" selected="True">Tabular (extended 24 columns)</option> + <option value="5">BLAST XML</option> + <option value="0">Pairwise text</option> + <option value="0 -html">Pairwise HTML</option> + <option value="2">Query-anchored text</option> + <option value="2 -html">Query-anchored HTML</option> + <option
value="4">Flat query-anchored text</option> + <option value="4 -html">Flat query-anchored HTML</option> + <!-- + <option value="-outfmt 11">BLAST archive format (ASN.1)</option> + --> + </param> + <conditional name="adv_opts"> + <param name="adv_opts_selector" type="select" label="Advanced Options"> + <option value="basic" selected="True">Hide Advanced Options</option> + <option value="advanced">Show Advanced Options</option> + </param> + <when value="basic" /> + <when value="advanced"> + <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> + <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="false" /> + <!-- Why doesn't optional override a validator? I want to accept an empty string OR a non-negative integer --> + <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> + <validator type="in_range" min="0" /> + </param> + <!-- I'd like word_size to be optional, with minimum 2 for rpsblast --> + <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> + <validator type="in_range" min="0" /> + </param> + <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> + </when> + </conditional> + </inputs> + <outputs> + <data name="output1" format="tabular" label="rpstblastn on ${on_string}"> + <change_format> + <when input="out_format" value="0" format="txt"/> + <when input="out_format" value="0 -html" format="html"/> + <when input="out_format" value="2" format="txt"/> + <when input="out_format" value="2 -html" format="html"/> + <when input="out_format" value="4" format="txt"/> + <when input="out_format" value="4 -html" format="html"/> + <when input="out_format" value="5" format="blastxml"/> + </change_format> + </data> + </outputs> + <help> + +.. class:: warningmark + +**Note**. Database searches may take a substantial amount of time. +For large input datasets it is advisable to allow overnight processing. + +----- + +**What it does** + +Search a *protein domain database* using a *nucleotide query*, +using the NCBI BLAST+ rpstblastn command line tool. 
+ +The protein domain databases use position-specific scoring matrices +(PSSMs) and are available for a number of domain collections including: + +*CDD* - NCBI curated meta-collection of domains, see +http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#NCBI_curated_domains + +*Kog* - PSSMs from automatically aligned sequences and sequence +fragments classified in the KOGs resource, the eukaryotic +counterpart to COGs, see http://www.ncbi.nlm.nih.gov/COG/new/ + +*Cog* - PSSMs from automatically aligned sequences and sequence +fragments classified in the COGs resource, which focuses primarily +on prokaryotes, see http://www.ncbi.nlm.nih.gov/COG/new/ + +*Pfam* - PSSMs from Pfam-A seed alignment database, see +http://pfam.sanger.ac.uk/ + +*Smart* - PSSMs from SMART domain alignment database, see +http://smart.embl-heidelberg.de/ + +*Tigr* - PSSMs from TIGRFAM database of protein families, see +http://www.jcvi.org/cms/research/projects/tigrfams/overview/ + +*Prk* - PSSMs from automatically aligned stable clusters in the +Protein Clusters database, see +http://www.ncbi.nlm.nih.gov/proteinclusters?cmd=search&db=proteinclusters + +The exact list of domain databases offered will depend on how your +local Galaxy has been configured. + +----- + +**Output format** + +Because Galaxy focuses on processing tabular data, the default output of this +tool is tabular.
The standard BLAST+ tabular output contains 12 columns: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The BLAST+ tools can optionally output additional columns of information, +but this takes longer to calculate. Most (but not all) of these columns are +included by selecting the extended tabular output. The extra columns are +included *after* the standard 12 columns. This is so that you can write +workflow filtering steps that accept either the 12 or 24 column tabular +BLAST output. Galaxy now uses this extended 24 column output by default. 
+ +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= =========================================== + +The third option is BLAST XML output, which is designed to be parsed by +another program, and is understood by some Galaxy tools. + +You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). +The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. +The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. +The two query anchored outputs show a multiple sequence alignment between the query and all the matches, +and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). + +------- + +**References** + +Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W327-31. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_tblastn_wrapper.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,340 @@ +<tool id="ncbi_tblastn_wrapper" name="NCBI BLAST+ tblastn" version="0.0.20"> + <description>Search translated nucleotide database with protein query sequence(s)</description> + <!-- If job splitting is enabled, break up the query file into parts --> + <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> + <requirements> + <requirement type="binary">tblastn</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>tblastn -version</version_command> + <command> +## The command is a Cheetah template which allows some Python based syntax. +## Lines starting hash hash are comments. Galaxy will turn newlines into spaces +tblastn +-query "$query" +#if $db_opts.db_opts_selector == "db": + -db "${db_opts.database.fields.path}" +#elif $db_opts.db_opts_selector == "histdb": + -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" +#else: + -subject "$db_opts.subject" +#end if +-evalue $evalue_cutoff +-out "$output1" +##Set the extended list here so if/when we add things, saved workflows are not affected +#if str($out_format)=="ext": + -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" +#else: + -outfmt $out_format +#end if +-num_threads 8 +#if $adv_opts.adv_opts_selector=="advanced": +-db_gencode $adv_opts.db_gencode +$adv_opts.filter_query +-matrix $adv_opts.matrix +## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string +## Note -max_target_seqs overrides -num_descriptions and -num_alignments +#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): +-max_target_seqs $adv_opts.max_hits +#end if +#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): 
+-word_size $adv_opts.word_size +#end if +##Ungapped disabled for now - see comments below +##$adv_opts.ungapped +$adv_opts.parse_deflines +## End of advanced options: +#end if + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- In case the return code has not been set properly, check stderr too --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <param name="query" type="data" format="fasta" label="Protein query sequence(s)"/> + <conditional name="db_opts"> + <param name="db_opts_selector" type="select" label="Subject database/sequences"> + <option value="db" selected="True">Locally installed BLAST database</option> + <option value="histdb">BLAST database from your history</option> + <option value="file">FASTA file from your history (see warning note below)</option> + </param> + <when value="db"> + <param name="database" type="select" label="Nucleotide BLAST database"> + <options from_file="blastdb.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="histdb"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="data" format="blastdbn" label="Nucleotide BLAST database" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="file"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="data" format="fasta" label="Nucleotide FASTA file to use as database"/> + </when> + </conditional> + <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> + <param name="out_format" type="select" label="Output format"> + <option value="6">Tabular (standard 12 columns)</option> + <option value="ext" 
selected="True">Tabular (extended 24 columns)</option> + <option value="5">BLAST XML</option> + <option value="0">Pairwise text</option> + <option value="0 -html">Pairwise HTML</option> + <option value="2">Query-anchored text</option> + <option value="2 -html">Query-anchored HTML</option> + <option value="4">Flat query-anchored text</option> + <option value="4 -html">Flat query-anchored HTML</option> + <!-- + <option value="-outfmt 11">BLAST archive format (ASN.1)</option> + --> + </param> + <conditional name="adv_opts"> + <param name="adv_opts_selector" type="select" label="Advanced Options"> + <option value="basic" selected="True">Hide Advanced Options</option> + <option value="advanced">Show Advanced Options</option> + </param> + <when value="basic" /> + <when value="advanced"> + <param name="db_gencode" type="select" label="Database/subject genetic code"> + <!-- See http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details --> + <option value="1" select="True">1. Standard</option> + <option value="2">2. Vertebrate Mitochondrial</option> + <option value="3">3. Yeast Mitochondrial</option> + <option value="4">4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code</option> + <option value="5">5. Invertebrate Mitochondrial</option> + <option value="6">6. Ciliate, Dasycladacean and Hexamita Nuclear Code</option> + <option value="9">9. Echinoderm Mitochondrial</option> + <option value="10">10. Euplotid Nuclear</option> + <option value="11">11. Bacteria and Archaea</option> + <option value="12">12. Alternative Yeast Nuclear</option> + <option value="13">13. Ascidian Mitochondrial</option> + <option value="14">14. Flatworm Mitochondrial</option> + <option value="15">15. Blepharisma Macronuclear</option> + <option value="16">16. Chlorophycean Mitochondrial Code</option> + <option value="21">21. Trematode Mitochondrial Code</option> + <option value="22">22. 
Scenedesmus obliquus mitochondrial Code</option> + <option value="23">23. Thraustochytrium Mitochondrial Code</option> + <option value="24">24. Pterobranchia mitochondrial code</option> + </param> + <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> + <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="true" /> + <param name="matrix" type="select" label="Scoring matrix"> + <option value="BLOSUM90">BLOSUM90</option> + <option value="BLOSUM80">BLOSUM80</option> + <option value="BLOSUM62" selected="true">BLOSUM62 (default)</option> + <option value="BLOSUM50">BLOSUM50</option> + <option value="BLOSUM45">BLOSUM45</option> + <option value="PAM250">PAM250</option> + <option value="PAM70">PAM70</option> + <option value="PAM30">PAM30</option> + </param> + <!-- Why doesn't optional override a validator? I want to accept an empty string OR a non-negative integer --> + <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> + <validator type="in_range" min="0" /> + </param> + <!-- I'd like word_size to be optional, with minimum 2 for blastp --> + <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> + <validator type="in_range" min="0" /> + </param> + <!-- + Can't use '-ungapped' on its own, error back is: + Composition-adjusted searched are not supported with an ungapped search, please add -comp_based_stats F or do a gapped search + Tried using '-ungapped -comp_based_stats F' and tblastn crashed with 'Attempt to access NULL pointer.' + <param name="ungapped" type="boolean" label="Perform ungapped alignment only?" truevalue="-ungapped -comp_based_stats F" falsevalue="" checked="false" /> + --> + <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> + </when> + </conditional> + </inputs> + <outputs> + <data name="output1" format="tabular" label="tblastn on ${on_string}"> + <change_format> + <when input="out_format" value="0" format="txt"/> + <when input="out_format" value="0 -html" format="html"/> + <when input="out_format" value="2" format="txt"/> + <when input="out_format" value="2 -html" format="html"/> + <when input="out_format" value="4" format="txt"/> + <when input="out_format" value="4 -html" format="html"/> + <when input="out_format" value="5" format="blastxml"/> + </change_format> + </data> + </outputs> + <tests> + <test> + <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-10" /> + <param name="out_format" value="5" /> + <param name="adv_opts_selector" value="advanced" /> + <param name="filter_query" value="false" /> + <param name="matrix" value="BLOSUM80" /> + <param name="max_hits" value="0" /> + <param name="word_size" value="0" /> + <param name="parse_deflines" value="false" /> + <output name="output1" file="tblastn_four_human_vs_rhodopsin.xml" ftype="blastxml" /> + </test> + <test> + <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-10" /> + <param name="out_format" value="ext" /> + <param name="adv_opts_selector" value="advanced" /> + <param name="filter_query" value="false" /> + <param name="matrix" value="BLOSUM80" /> + <param name="max_hits" value="0" /> + <param name="word_size" value="0" /> + <param name="parse_deflines" value="false" /> + 
<output name="output1" file="tblastn_four_human_vs_rhodopsin_ext.tabular" ftype="tabular" /> + </test> + <test> + <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-10" /> + <param name="out_format" value="6" /> + <param name="adv_opts_selector" value="advanced" /> + <param name="filter_query" value="false" /> + <param name="matrix" value="BLOSUM80" /> + <param name="max_hits" value="0" /> + <param name="word_size" value="0" /> + <param name="parse_deflines" value="false" /> + <output name="output1" file="tblastn_four_human_vs_rhodopsin.tabular" ftype="tabular" /> + </test> + <test> + <!-- Same as above, but parse deflines - on BLAST 2.2.25+ - 2.2.27+ makes no difference --> + <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-10" /> + <param name="out_format" value="6" /> + <param name="adv_opts_selector" value="advanced" /> + <param name="filter_query" value="false" /> + <param name="matrix" value="BLOSUM80" /> + <param name="max_hits" value="0" /> + <param name="word_size" value="0" /> + <param name="parse_deflines" value="true" /> + <output name="output1" file="tblastn_four_human_vs_rhodopsin.tabular" ftype="tabular" /> + </test> + <test> + <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-10" /> + <param name="out_format" value="0 -html" /> + <param name="adv_opts_selector" value="advanced" /> + <param name="filter_query" 
 value="false" /> + <param name="matrix" value="BLOSUM80" /> + <param name="max_hits" value="0" /> + <param name="word_size" value="0" /> + <param name="parse_deflines" value="false" /> + <output name="output1" file="tblastn_four_human_vs_rhodopsin.html" ftype="html" /> + </test> + </tests> + <help> + +.. class:: warningmark + +**Note**. Database searches may take a substantial amount of time. +For large input datasets it is advisable to allow overnight processing. + +----- + +**What it does** + +Search a *translated nucleotide database* using a *protein query*, +using the NCBI BLAST+ tblastn command line tool. + +.. class:: warningmark + +You can also search against a FASTA file of subject nucleotide +sequences. This is *not* advised because it is slower (only one +CPU is used), but more importantly gives e-values for pairwise +searches (very small e-values which will look overly significant). +In most cases you should instead turn the other FASTA file into a +database first using *makeblastdb* and search against that. + +----- + +**Output format** + +Because Galaxy focuses on processing tabular data, the default output of this +tool is tabular. 
The standard BLAST+ tabular output contains 12 columns: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The BLAST+ tools can optionally output additional columns of information, +but this takes longer to calculate. Most (but not all) of these columns are +included by selecting the extended tabular output. The extra columns are +included *after* the standard 12 columns. This is so that you can write +workflow filtering steps that accept either the 12 or 24 column tabular +BLAST output. Galaxy now uses this extended 24 column output by default. 
+ +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= =========================================== + +The third option is BLAST XML output, which is designed to be parsed by +another program, and is understood by some Galaxy tools. + +You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). +The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. +The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. +The two query anchored outputs show a multiple sequence alignment between the query and all the matches, +and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). + +------- + +**References** + +Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
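The help text above explains that the extra columns are placed after the standard 12 so that a filtering step can accept either the 12 or the 24 column layout. A minimal sketch of such a step follows, assuming tab-separated input without a header row; the function names `read_hits` and `filter_hits` and the threshold parameters are hypothetical and are not part of these wrappers.

```python
# Column names taken from the two tables in the help text.
STD_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
EXT_COLUMNS = ["sallseqid", "score", "nident", "positive", "gaps", "ppos",
               "qframe", "sframe", "qseq", "sseq", "qlen", "slen"]


def read_hits(handle):
    """Yield one dict per hit from 12 or 24 column tabular BLAST output."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment lines
        fields = line.split("\t")
        if len(fields) == 12:
            names = STD_COLUMNS
        elif len(fields) == 24:
            names = STD_COLUMNS + EXT_COLUMNS
        else:
            raise ValueError("Expected 12 or 24 columns, got %d" % len(fields))
        yield dict(zip(names, fields))


def filter_hits(handle, min_pident=0.0, max_evalue=10.0):
    """Yield hits passing thresholds that only use the first 12 columns."""
    for hit in read_hits(handle):
        if float(hit["pident"]) >= min_pident and float(hit["evalue"]) <= max_evalue:
            yield hit
```

Because the thresholds only touch columns 1 to 12, the same filter works unchanged on both layouts; the extended fields are simply carried along when present.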
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/ncbi_tblastx_wrapper.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,294 @@ +<tool id="ncbi_tblastx_wrapper" name="NCBI BLAST+ tblastx" version="0.0.20"> + <description>Search translated nucleotide database with translated nucleotide query sequence(s)</description> + <!-- If job splitting is enabled, break up the query file into parts --> + <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> + <requirements> + <requirement type="binary">tblastx</requirement> + <requirement type="package" version="2.2.26+">blast+</requirement> + </requirements> + <version_command>tblastx -version</version_command> + <command> +## The command is a Cheetah template which allows some Python based syntax. +## Lines starting hash hash are comments. Galaxy will turn newlines into spaces +tblastx +-query "$query" +#if $db_opts.db_opts_selector == "db": + -db "${db_opts.database.fields.path}" +#elif $db_opts.db_opts_selector == "histdb": + -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" +#else: + -subject "$db_opts.subject" +#end if +-query_gencode $query_gencode +-evalue $evalue_cutoff +-out "$output1" +##Set the extended list here so if/when we add things, saved workflows are not affected +#if str($out_format)=="ext": + -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" +#else: + -outfmt $out_format +#end if +-num_threads 8 +#if $adv_opts.adv_opts_selector=="advanced": +-db_gencode $adv_opts.db_gencode +$adv_opts.filter_query +$adv_opts.strand +-matrix $adv_opts.matrix +## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string +## Note -max_target_seqs overrides -num_descriptions and -num_alignments +#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): +-max_target_seqs $adv_opts.max_hits +#end if +#if 
(str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): +-word_size $adv_opts.word_size +#end if +$adv_opts.parse_deflines +## End of advanced options: +#end if + </command> + <stdio> + <!-- Anything other than zero is an error --> + <exit_code range="1:" /> + <exit_code range=":-1" /> + <!-- In case the return code has not been set properly, check stderr too --> + <regex match="Error:" /> + <regex match="Exception:" /> + </stdio> + <inputs> + <param name="query" type="data" format="fasta" label="Nucleotide query sequence(s)"/> + <conditional name="db_opts"> + <param name="db_opts_selector" type="select" label="Subject database/sequences"> + <option value="db" selected="True">Locally installed BLAST database</option> + <option value="histdb">BLAST database from your history</option> + <option value="file">FASTA file from your history (see warning note below)</option> + </param> + <when value="db"> + <param name="database" type="select" label="Nucleotide BLAST database"> + <options from_file="blastdb.loc"> + <column name="value" index="0"/> + <column name="name" index="1"/> + <column name="path" index="2"/> + </options> + </param> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="histdb"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="data" format="blastdbn" label="Nucleotide BLAST database" /> + <param name="subject" type="hidden" value="" /> + </when> + <when value="file"> + <param name="database" type="hidden" value="" /> + <param name="histdb" type="hidden" value="" /> + <param name="subject" type="data" format="fasta" label="Nucleotide FASTA file to use as database"/> + </when> + </conditional> + <param name="query_gencode" type="select" label="Query genetic code"> + <!-- See http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details --> + <option value="1" select="True">1. Standard</option> + <option value="2">2. 
Vertebrate Mitochondrial</option> + <option value="3">3. Yeast Mitochondrial</option> + <option value="4">4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code</option> + <option value="5">5. Invertebrate Mitochondrial</option> + <option value="6">6. Ciliate, Dasycladacean and Hexamita Nuclear Code</option> + <option value="9">9. Echinoderm Mitochondrial</option> + <option value="10">10. Euplotid Nuclear</option> + <option value="11">11. Bacteria and Archaea</option> + <option value="12">12. Alternative Yeast Nuclear</option> + <option value="13">13. Ascidian Mitochondrial</option> + <option value="14">14. Flatworm Mitochondrial</option> + <option value="15">15. Blepharisma Macronuclear</option> + <option value="16">16. Chlorophycean Mitochondrial Code</option> + <option value="21">21. Trematode Mitochondrial Code</option> + <option value="22">22. Scenedesmus obliquus mitochondrial Code</option> + <option value="23">23. Thraustochytrium Mitochondrial Code</option> + <option value="24">24. 
Pterobranchia mitochondrial code</option> + </param> + <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> + <param name="out_format" type="select" label="Output format"> + <option value="6">Tabular (standard 12 columns)</option> + <option value="ext" selected="True">Tabular (extended 24 columns)</option> + <option value="5">BLAST XML</option> + <option value="0">Pairwise text</option> + <option value="0 -html">Pairwise HTML</option> + <option value="2">Query-anchored text</option> + <option value="2 -html">Query-anchored HTML</option> + <option value="4">Flat query-anchored text</option> + <option value="4 -html">Flat query-anchored HTML</option> + <!-- + <option value="-outfmt 11">BLAST archive format (ASN.1)</option> + --> + </param> + <conditional name="adv_opts"> + <param name="adv_opts_selector" type="select" label="Advanced Options"> + <option value="basic" selected="True">Hide Advanced Options</option> + <option value="advanced">Show Advanced Options</option> + </param> + <when value="basic" /> + <when value="advanced"> + <param name="db_gencode" type="select" label="Database/subject genetic code"> + <!-- See http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details --> + <option value="1" select="True">1. Standard</option> + <option value="2">2. Vertebrate Mitochondrial</option> + <option value="3">3. Yeast Mitochondrial</option> + <option value="4">4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code</option> + <option value="5">5. Invertebrate Mitochondrial</option> + <option value="6">6. Ciliate, Dasycladacean and Hexamita Nuclear Code</option> + <option value="9">9. Echinoderm Mitochondrial</option> + <option value="10">10. Euplotid Nuclear</option> + <option value="11">11. Bacteria and Archaea</option> + <option value="12">12. Alternative Yeast Nuclear</option> + <option value="13">13. Ascidian Mitochondrial</option> + <option value="14">14. 
Flatworm Mitochondrial</option> + <option value="15">15. Blepharisma Macronuclear</option> + <option value="16">16. Chlorophycean Mitochondrial Code</option> + <option value="21">21. Trematode Mitochondrial Code</option> + <option value="22">22. Scenedesmus obliquus mitochondrial Code</option> + <option value="23">23. Thraustochytrium Mitochondrial Code</option> + <option value="24">24. Pterobranchia mitochondrial code</option> + </param> + <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> + <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="true" /> + <param name="strand" type="select" label="Query strand(s) to search against database/subject"> + <option value="-strand both">Both</option> + <option value="-strand plus">Plus (forward)</option> + <option value="-strand minus">Minus (reverse complement)</option> + </param> + <param name="matrix" type="select" label="Scoring matrix"> + <option value="BLOSUM90">BLOSUM90</option> + <option value="BLOSUM80">BLOSUM80</option> + <option value="BLOSUM62" selected="true">BLOSUM62 (default)</option> + <option value="BLOSUM50">BLOSUM50</option> + <option value="BLOSUM45">BLOSUM45</option> + <option value="PAM250">PAM250</option> + <option value="PAM70">PAM70</option> + <option value="PAM30">PAM30</option> + </param> + <!-- Why doesn't optional override a validator? 
I want to accept an empty string OR a non-negative integer --> + <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> + <validator type="in_range" min="0" /> + </param> + <!-- I'd like word_size to be optional, with minimum 2 for tblastx --> + <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> + <validator type="in_range" min="0" /> + </param> + <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> + </when> + </conditional> + </inputs> + <outputs> + <data name="output1" format="tabular" label="tblastx on ${on_string}"> + <change_format> + <when input="out_format" value="0" format="txt"/> + <when input="out_format" value="0 -html" format="html"/> + <when input="out_format" value="2" format="txt"/> + <when input="out_format" value="2 -html" format="html"/> + <when input="out_format" value="4" format="txt"/> + <when input="out_format" value="4 -html" format="html"/> + <when input="out_format" value="5" format="blastxml"/> + </change_format> + </data> + </outputs> + <tests> + <test> + <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> + <param name="db_opts_selector" value="file" /> + <param name="subject" value="three_human_mRNA.fasta" ftype="fasta" /> + <param name="database" value="" /> + <param name="evalue_cutoff" value="1e-40" /> + <param name="out_format" value="6" /> + <param name="adv_opts_selector" value="basic" /> + <output name="output1" file="tblastx_rhodopsin_vs_three_human.tabular" ftype="tabular" /> + </test> + </tests> + <help> + +.. class:: warningmark + +**Note**. Database searches may take a substantial amount of time. +For large input datasets it is advisable to allow overnight processing. 
+ +----- + +**What it does** + +Search a *translated nucleotide database* using a *translated nucleotide query*, +using the NCBI BLAST+ tblastx command line tool. + +.. class:: warningmark + +You can also search against a FASTA file of subject nucleotide +sequences. This is *not* advised because it is slower (only one +CPU is used), but more importantly gives e-values for pairwise +searches (very small e-values which will look overly significant). +In most cases you should instead turn the other FASTA file into a +database first using *makeblastdb* and search against that. + +----- + +**Output format** + +Because Galaxy focuses on processing tabular data, the default output of this +tool is tabular. The standard BLAST+ tabular output contains 12 columns: + +====== ========= ============================================ +Column NCBI name Description +------ --------- -------------------------------------------- + 1 qseqid Query Seq-id (ID of your sequence) + 2 sseqid Subject Seq-id (ID of the database hit) + 3 pident Percentage of identical matches + 4 length Alignment length + 5 mismatch Number of mismatches + 6 gapopen Number of gap openings + 7 qstart Start of alignment in query + 8 qend End of alignment in query + 9 sstart Start of alignment in subject (database hit) + 10 send End of alignment in subject (database hit) + 11 evalue Expectation value (E-value) + 12 bitscore Bit score +====== ========= ============================================ + +The BLAST+ tools can optionally output additional columns of information, +but this takes longer to calculate. Most (but not all) of these columns are +included by selecting the extended tabular output. The extra columns are +included *after* the standard 12 columns. This is so that you can write +workflow filtering steps that accept either the 12 or 24 column tabular +BLAST output. Galaxy now uses this extended 24 column output by default. 
+ +====== ============= =========================================== +Column NCBI name Description +------ ------------- ------------------------------------------- + 13 sallseqid All subject Seq-id(s), separated by a ';' + 14 score Raw score + 15 nident Number of identical matches + 16 positive Number of positive-scoring matches + 17 gaps Total number of gaps + 18 ppos Percentage of positive-scoring matches + 19 qframe Query frame + 20 sframe Subject frame + 21 qseq Aligned part of query sequence + 22 sseq Aligned part of subject sequence + 23 qlen Query sequence length + 24 slen Subject sequence length +====== ============= =========================================== + +The third option is BLAST XML output, which is designed to be parsed by +another program, and is understood by some Galaxy tools. + +You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). +The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. +The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. +The two query anchored outputs show a multiple sequence alignment between the query and all the matches, +and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). + +------- + +**References** + +Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus + </help> +</tool>
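The Cheetah `<command>` templates in the wrappers above expand the `ext` output choice into a quoted `-outfmt` column list, pass any other choice through as-is (so `0 -html` becomes two separate arguments), and emit `-max_target_seqs` or `-word_size` only for a positive integer. The following is a rough Python approximation of that logic for illustration only; the helper names `outfmt_fragment` and `optional_int_flag` are hypothetical, and Galaxy itself evaluates the Cheetah template rather than code like this.

```python
import shlex

# Column list copied from the -outfmt string in the wrapper templates.
EXT_COLUMNS = ("std sallseqid score nident positive gaps ppos "
               "qframe sframe qseq sseq qlen slen")


def outfmt_fragment(out_format):
    """Return the -outfmt fragment for a given out_format option value."""
    if out_format == "ext":
        # Quoted so the shell passes the whole column list as one argument.
        return '-outfmt "6 %s"' % EXT_COLUMNS
    # Other values are inserted unquoted, so "0 -html" yields two arguments.
    return "-outfmt %s" % out_format


def optional_int_flag(flag, value):
    """Return 'flag value' only when value is a positive integer, else ''."""
    text = str(value)
    if text and int(text) > 0:
        return "%s %s" % (flag, text)
    return ""


if __name__ == "__main__":
    print(shlex.split(outfmt_fragment("ext")))
    print(shlex.split(outfmt_fragment("0 -html")))
    print(repr(optional_int_flag("-max_target_seqs", 0)))
```

Keeping the extended column list in the template rather than in the option value matches the comment in the wrappers: saved workflows store only `ext`, so the list can grow without invalidating them.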
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/ncbi_blast_plus/repository_dependencies.xml Tue Jul 30 07:33:46 2013 -0400 @@ -0,0 +1,5 @@ +<?xml version="1.0"?> +<repositories description="This requires the BLAST datatype definitions (e.g. the BLAST XML format)."> +<!-- Revision 4:f9a7783ed7b6 on the main (and test) tool shed is v0.0.14 which added BLAST databases --> +<repository changeset_revision="f9a7783ed7b6" name="blast_datatypes" owner="devteam" toolshed="http://testtoolshed.g2.bx.psu.edu" /> +</repositories>
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/ncbi_blast_plus/tool_dependencies.xml Tue Jul 30 07:33:46 2013 -0400
@@ -0,0 +1,20 @@
+<?xml version="1.0"?>
+<tool_dependency>
+    <package name="blast+" version="2.2.26+">
+        <install version="1.0">
+            <actions>
+                <action type="download_by_url">ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.26/ncbi-blast-2.2.26+-src.tar.gz</action>
+                <action type="shell_command">cd c++ && ./configure --prefix=$INSTALL_DIR && make && make install</action>
+                <action type="set_environment">
+                    <environment_variable name="PATH" action="prepend_to">$INSTALL_DIR/bin</environment_variable>
+                </action>
+            </actions>
+        </install>
+        <readme>
+Downloads and compiles BLAST+ from the NCBI, which assumes you have
+all the required build dependencies installed. See:
+http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
+        </readme>
+    </package>
+</tool_dependency>
+
--- a/tools/ncbi_blast_plus/blastxml_to_tabular.py Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,261 +0,0 @@ -#!/usr/bin/env python -"""Convert a BLAST XML file to tabular output. - -Takes three command line options, input BLAST XML filename, output tabular -BLAST filename, output format (std for standard 12 columns, or ext for the -extended 24 columns offered in the BLAST+ wrappers). - -The 12 columns output are 'qseqid sseqid pident length mismatch gapopen qstart -qend sstart send evalue bitscore' or 'std' at the BLAST+ command line, which -mean: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The additional columns offered in the Galaxy BLAST+ wrappers are: - -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= 
=========================================== - -Most of these fields are given explicitly in the XML file, others some like -the percentage identity and the number of gap openings must be calculated. - -Be aware that the sequence in the extended tabular output or XML direct from -BLAST+ may or may not use XXXX masking on regions of low complexity. This -can throw the off the calculation of percentage identity and gap openings. -[In fact, both BLAST 2.2.24+ and 2.2.25+ have a subtle bug in this regard, -with these numbers changing depending on whether or not the low complexity -filter is used.] - -This script attempts to produce identical output to what BLAST+ would have done. -However, check this with "diff -b ..." since BLAST+ sometimes includes an extra -space character (probably a bug). -""" -import sys -import re - -if "-v" in sys.argv or "--version" in sys.argv: - print "v0.0.12" - sys.exit(0) - -if sys.version_info[:2] >= ( 2, 5 ): - try: - from xml.etree import cElementTree as ElementTree - except ImportError: - from xml.etree import ElementTree as ElementTree -else: - from galaxy import eggs - import pkg_resources; pkg_resources.require( "elementtree" ) - from elementtree import ElementTree - -def stop_err( msg ): - sys.stderr.write("%s\n" % msg) - sys.exit(1) - -#Parse Command Line -try: - in_file, out_file, out_fmt = sys.argv[1:] -except: - stop_err("Expect 3 arguments: input BLAST XML file, output tabular file, out format (std or ext)") - -if out_fmt == "std": - extended = False -elif out_fmt == "x22": - stop_err("Format argument x22 has been replaced with ext (extended 24 columns)") -elif out_fmt == "ext": - extended = True -else: - stop_err("Format argument should be std (12 column) or ext (extended 24 columns)") - - -# get an iterable -try: - context = ElementTree.iterparse(in_file, events=("start", "end")) -except: - stop_err("Invalid data format.") -# turn it into an iterator -context = iter(context) -# get the root element -try: - event, root = 
context.next() -except: - stop_err( "Invalid data format." ) - - -re_default_query_id = re.compile("^Query_\d+$") -assert re_default_query_id.match("Query_101") -assert not re_default_query_id.match("Query_101a") -assert not re_default_query_id.match("MyQuery_101") -re_default_subject_id = re.compile("^Subject_\d+$") -assert re_default_subject_id.match("Subject_1") -assert not re_default_subject_id.match("Subject_") -assert not re_default_subject_id.match("Subject_12a") -assert not re_default_subject_id.match("TheSubject_1") - - -outfile = open(out_file, 'w') -blast_program = None -for event, elem in context: - if event == "end" and elem.tag == "BlastOutput_program": - blast_program = elem.text - # for every <Iteration> tag - if event == "end" and elem.tag == "Iteration": - #Expecting either this, from BLAST 2.2.25+ using FASTA vs FASTA - # <Iteration_query-ID>sp|Q9BS26|ERP44_HUMAN</Iteration_query-ID> - # <Iteration_query-def>Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1</Iteration_query-def> - # <Iteration_query-len>406</Iteration_query-len> - # <Iteration_hits></Iteration_hits> - # - #Or, from BLAST 2.2.24+ run online - # <Iteration_query-ID>Query_1</Iteration_query-ID> - # <Iteration_query-def>Sample</Iteration_query-def> - # <Iteration_query-len>516</Iteration_query-len> - # <Iteration_hits>... 
- qseqid = elem.findtext("Iteration_query-ID") - if re_default_query_id.match(qseqid): - #Place holder ID, take the first word of the query definition - qseqid = elem.findtext("Iteration_query-def").split(None,1)[0] - qlen = int(elem.findtext("Iteration_query-len")) - - # for every <Hit> within <Iteration> - for hit in elem.findall("Iteration_hits/Hit"): - #Expecting either this, - # <Hit_id>gi|3024260|sp|P56514.1|OPSD_BUFBU</Hit_id> - # <Hit_def>RecName: Full=Rhodopsin</Hit_def> - # <Hit_accession>P56514</Hit_accession> - #or, - # <Hit_id>Subject_1</Hit_id> - # <Hit_def>gi|57163783|ref|NP_001009242.1| rhodopsin [Felis catus]</Hit_def> - # <Hit_accession>Subject_1</Hit_accession> - # - #apparently depending on the parse_deflines switch - sseqid = hit.findtext("Hit_id").split(None,1)[0] - hit_def = sseqid + " " + hit.findtext("Hit_def") - if re_default_subject_id.match(sseqid) \ - and sseqid == hit.findtext("Hit_accession"): - #Place holder ID, take the first word of the subject definition - hit_def = hit.findtext("Hit_def") - sseqid = hit_def.split(None,1)[0] - # for every <Hsp> within <Hit> - for hsp in hit.findall("Hit_hsps/Hsp"): - nident = hsp.findtext("Hsp_identity") - length = hsp.findtext("Hsp_align-len") - pident = "%0.2f" % (100*float(nident)/float(length)) - - q_seq = hsp.findtext("Hsp_qseq") - h_seq = hsp.findtext("Hsp_hseq") - m_seq = hsp.findtext("Hsp_midline") - assert len(q_seq) == len(h_seq) == len(m_seq) == int(length) - gapopen = str(len(q_seq.replace('-', ' ').split())-1 + \ - len(h_seq.replace('-', ' ').split())-1) - - mismatch = m_seq.count(' ') + m_seq.count('+') \ - - q_seq.count('-') - h_seq.count('-') - #TODO - Remove this alternative mismatch calculation and test - #once satisifed there are no problems - expected_mismatch = len(q_seq) \ - - sum(1 for q,h in zip(q_seq, h_seq) \ - if q == h or q == "-" or h == "-") - xx = sum(1 for q,h in zip(q_seq, h_seq) if q=="X" and h=="X") - if not (expected_mismatch - q_seq.count("X") <= int(mismatch) 
<= expected_mismatch + xx): - stop_err("%s vs %s mismatches, expected %i <= %i <= %i" \ - % (qseqid, sseqid, expected_mismatch - q_seq.count("X"), - int(mismatch), expected_mismatch)) - - #TODO - Remove this alternative identity calculation and test - #once satisifed there are no problems - expected_identity = sum(1 for q,h in zip(q_seq, h_seq) if q == h) - if not (expected_identity - xx <= int(nident) <= expected_identity + q_seq.count("X")): - stop_err("%s vs %s identities, expected %i <= %i <= %i" \ - % (qseqid, sseqid, expected_identity, int(nident), - expected_identity + q_seq.count("X"))) - - - evalue = hsp.findtext("Hsp_evalue") - if evalue == "0": - evalue = "0.0" - else: - evalue = "%0.0e" % float(evalue) - - bitscore = float(hsp.findtext("Hsp_bit-score")) - if bitscore < 100: - #Seems to show one decimal place for lower scores - bitscore = "%0.1f" % bitscore - else: - #Note BLAST does not round to nearest int, it truncates - bitscore = "%i" % bitscore - - values = [qseqid, - sseqid, - pident, - length, #hsp.findtext("Hsp_align-len") - str(mismatch), - gapopen, - hsp.findtext("Hsp_query-from"), #qstart, - hsp.findtext("Hsp_query-to"), #qend, - hsp.findtext("Hsp_hit-from"), #sstart, - hsp.findtext("Hsp_hit-to"), #send, - evalue, #hsp.findtext("Hsp_evalue") in scientific notation - bitscore, #hsp.findtext("Hsp_bit-score") rounded - ] - - if extended: - sallseqid = ";".join(name.split(None,1)[0] for name in hit_def.split(">")) - #print hit_def, "-->", sallseqid - positive = hsp.findtext("Hsp_positive") - ppos = "%0.2f" % (100*float(positive)/float(length)) - qframe = hsp.findtext("Hsp_query-frame") - sframe = hsp.findtext("Hsp_hit-frame") - if blast_program == "blastp": - #Probably a bug in BLASTP that they use 0 or 1 depending on format - if qframe == "0": qframe = "1" - if sframe == "0": sframe = "1" - slen = int(hit.findtext("Hit_len")) - values.extend([sallseqid, - hsp.findtext("Hsp_score"), #score, - nident, - positive, - hsp.findtext("Hsp_gaps"), #gaps, 
- ppos, - qframe, - sframe, - #NOTE - for blastp, XML shows original seq, tabular uses XXX masking - q_seq, - h_seq, - str(qlen), - str(slen), - ]) - #print "\t".join(values) - outfile.write("\t".join(values) + "\n") - # prevents ElementTree from growing large datastructure - root.clear() - elem.clear() -outfile.close()
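The removed script's docstring notes that percentage identity and gap openings are not given in the XML and must be calculated. The arithmetic can be isolated as below. This is a Python 3 sketch rather than the script itself: the function name `derived_columns` is hypothetical, and it takes the alignment length from the aligned query string (the script reads `Hsp_align-len` and asserts that the two agree).

```python
def derived_columns(q_seq, h_seq, m_seq, nident):
    """Return (pident, mismatch, gapopen) for one HSP.

    q_seq, h_seq and m_seq are the aligned query, aligned subject and
    midline strings (Hsp_qseq, Hsp_hseq, Hsp_midline); nident is the
    Hsp_identity count. Hypothetical helper for illustration.
    """
    length = len(q_seq)
    assert length == len(h_seq) == len(m_seq)

    # Percentage identity to two decimal places, as in column 3.
    pident = "%0.2f" % (100 * float(nident) / length)

    # Turn every '-' into a space and split on whitespace: each run of
    # gap characters separates two pieces, so pieces - 1 = gap openings.
    gapopen = (len(q_seq.replace("-", " ").split()) - 1
               + len(h_seq.replace("-", " ").split()) - 1)

    # In the midline a space marks a mismatch or a gap and '+' marks a
    # positive-scoring non-identical pair; subtract the gap positions.
    mismatch = (m_seq.count(" ") + m_seq.count("+")
                - q_seq.count("-") - h_seq.count("-"))
    return pident, mismatch, gapopen
```

For example, query `MKTA--LV` against subject `MRTAGGLV` with midline `M+TA  LV` and 5 identities gives one mismatch (K/R) and one gap opening (the two-residue gap in the query). As the docstring warns, XXXX masking of low-complexity regions can shift these figures relative to BLAST+'s own tabular output.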
--- a/tools/ncbi_blast_plus/blastxml_to_tabular.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,137 +0,0 @@ -<tool id="blastxml_to_tabular" name="BLAST XML to tabular" version="0.0.11"> - <description>Convert BLAST XML output to tabular</description> - <version_command interpreter="python">blastxml_to_tabular.py --version</version_command> - <command interpreter="python"> - blastxml_to_tabular.py $blastxml_file $tabular_file $out_format - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - </stdio> - <inputs> - <param name="blastxml_file" type="data" format="blastxml" label="BLAST results as XML"/> - <param name="out_format" type="select" label="Output format"> - <option value="std">Tabular (standard 12 columns)</option> - <option value="ext" selected="True">Tabular (extended 24 columns)</option> - </param> - </inputs> - <outputs> - <data name="tabular_file" format="tabular" label="BLAST results as tabular" /> - </outputs> - <requirements> - </requirements> - <tests> - <test> - <param name="blastxml_file" value="blastp_four_human_vs_rhodopsin.xml" ftype="blastxml" /> - <param name="out_format" value="std" /> - <!-- Note this has some white space differences from the actual blastp output blast_four_human_vs_rhodopsin.tabluar --> - <output name="tabular_file" file="blastp_four_human_vs_rhodopsin_converted.tabular" ftype="tabular" /> - </test> - <test> - <param name="blastxml_file" value="blastp_four_human_vs_rhodopsin.xml" ftype="blastxml" /> - <param name="out_format" value="ext" /> - <!-- Note this has some white space differences from the actual blastp output blast_four_human_vs_rhodopsin_22c.tabluar --> - <output name="tabular_file" file="blastp_four_human_vs_rhodopsin_converted_ext.tabular" ftype="tabular" /> - </test> - <test> - <param name="blastxml_file" value="blastp_sample.xml" ftype="blastxml" /> - <param name="out_format" value="std" /> - <!-- 
Note this has some white space differences from the actual blastp output --> - <output name="tabular_file" file="blastp_sample_converted.tabular" ftype="tabular" /> - </test> - <test> - <param name="blastxml_file" value="blastx_rhodopsin_vs_four_human.xml" ftype="blastxml" /> - <param name="out_format" value="std" /> - <!-- Note this has some white space differences from the actual blastx output --> - <output name="tabular_file" file="blastx_rhodopsin_vs_four_human_converted.tabular" ftype="tabular" /> - </test> - <test> - <param name="blastxml_file" value="blastx_rhodopsin_vs_four_human.xml" ftype="blastxml" /> - <param name="out_format" value="ext" /> - <!-- Note this has some white space and XXXX masking differences from the actual blastx output --> - <output name="tabular_file" file="blastx_rhodopsin_vs_four_human_converted_ext.tabular" ftype="tabular" /> - </test> - <test> - <param name="blastxml_file" value="blastx_sample.xml" ftype="blastxml" /> - <param name="out_format" value="std" /> - <!-- Note this has some white space differences from the actual blastx output --> - <output name="tabular_file" file="blastx_sample_converted.tabular" ftype="tabular" /> - </test> - <test> - <param name="blastxml_file" value="blastp_human_vs_pdb_seg_no.xml" ftype="blastxml" /> - <param name="out_format" value="std" /> - <!-- Note this has some white space differences from the actual blastp output --> - <output name="tabular_file" file="blastp_human_vs_pdb_seg_no_converted_std.tabular" ftype="tabular" /> - </test> - <test> - <param name="blastxml_file" value="blastp_human_vs_pdb_seg_no.xml" ftype="blastxml" /> - <param name="out_format" value="ext" /> - <!-- Note this has some white space differences from the actual blastp output --> - <output name="tabular_file" file="blastp_human_vs_pdb_seg_no_converted_ext.tabular" ftype="tabular" /> - </test> - </tests> - <help> - -**What it does** - -NCBI BLAST+ (and the older NCBI 'legacy' BLAST) can output in a range of -formats 
including tabular and a more detailed XML format. A complex workflow -may need both the XML and the tabular output - but running BLAST twice is -slow and wasteful. - -This tool takes the BLAST XML output and can convert it into the -standard 12 column tabular equivalent: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The BLAST+ tools can optionally output additional columns of information, -but this takes longer to calculate. Most (but not all) of these columns are -included by selecting the extended tabular output. The extra columns are -included *after* the standard 12 columns. This is so that you can write -workflow filtering steps that accept either the 12 or 22 column tabular -BLAST output. This tool now uses this extended 24 column output by default. 
- -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= =========================================== - -Beware that the XML file (and thus the conversion) and the tabular output -direct from BLAST+ may differ in the presence of XXXX masking on regions -low complexity (columns 21 and 22), and thus also calculated figures like -the percentage identity (column 3). - -**References** - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
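The help text above explains that the extra columns are placed after the standard 12 so that a filtering step can accept either layout. A minimal Python 3 sketch of such a filter follows. The function name `filter_hits` and the default thresholds are assumptions for illustration only, not something these wrappers provide.

```python
def filter_hits(lines, min_pident=90.0, max_evalue=1e-5):
    """Yield tabular BLAST lines passing thresholds on pident and evalue.

    Hypothetical helper; the thresholds are arbitrary example values.
    Works on both the 12 column and the extended 24 column output.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        # Only columns 3 (pident) and 11 (evalue) are read, so any
        # trailing extended columns are simply ignored.
        pident = float(fields[2])
        evalue = float(fields[10])
        if pident >= min_pident and evalue <= max_evalue:
            yield line
```

Because only positions within the first 12 columns are indexed, the same step can sit downstream of either output format in a workflow.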
--- a/tools/ncbi_blast_plus/ncbi_blast_plus.txt Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,151 +0,0 @@ -Galaxy wrappers for NCBI BLAST+ suite -===================================== - -These wrappers are copyright 2010-2013 by Peter Cock, The James Hutton Institute -(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. -See the licence text below. - -Currently tested with NCBI BLAST 2.2.26+ (i.e. version 2.2.26 of BLAST+), -and does not work with the NCBI 'legacy' BLAST suite (e.g. blastall). - -Note that these wrappers (and the associated datatypes) were originally -distributed as part of the main Galaxy repository, but as of August 2012 -moved to the Galaxy Tool Shed as 'ncbi_blast_plus' (and 'blast_datatypes'). -My thanks to Dannon Baker from the Galaxy development team for his assistance -with this. - -These wrappers are available from the Galaxy Tool Shed at: -http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - - -Automated Installation -====================== - -Galaxy should be able to automatically install the dependencies, i.e. the -'blast_datatypes' repository which defines the BLAST XML file format -('blastxml') and protein and nucleotide BLAST databases ('blastdbp' and -'blastdbn'). - -You must tell Galaxy about any system level BLAST databases using configuration -files blastdb.loc (nucleotide databases like NT) and blastdb_p.loc (protein -databases like NR), and blastdb_d.loc (protein domain databases like CDD or -SMART) which are located in the tool-data/ folder. Sample files are included -which explain the tab-based format to use. 
- -You can download the NCBI provided databases as tar-balls from here: -ftp://ftp.ncbi.nlm.nih.gov/blast/db/ (nucleotide and protein databases like NR) -ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian/ (domain databases like CDD) - - -Manual Installation -=================== - -For those not using Galaxy's automated installation from the Tool Shed, put -the XML and Python files in the tools/ncbi_blast_plus/ folder and add the XML -files to your tool_conf.xml as normal (and do the same in tool_conf.xml.sample -in order to run the unit tests). For example, use: - - <section name="NCBI BLAST+" id="ncbi_blast_plus_tools"> - <tool file="ncbi_blast_plus/ncbi_blastn_wrapper.xml" /> - <tool file="ncbi_blast_plus/ncbi_blastp_wrapper.xml" /> - <tool file="ncbi_blast_plus/ncbi_blastx_wrapper.xml" /> - <tool file="ncbi_blast_plus/ncbi_tblastn_wrapper.xml" /> - <tool file="ncbi_blast_plus/ncbi_tblastx_wrapper.xml" /> - <tool file="ncbi_blast_plus/ncbi_makeblastdb.xml" /> - <tool file="ncbi_blast_plus/ncbi_blastdbcmd_wrapper.xml" /> - <tool file="ncbi_blast_plus/ncbi_blastdbcmd_info.xml" /> - <tool file="ncbi_blast_plus/ncbi_rpsblast_wrapper.xml" /> - <tool file="ncbi_blast_plus/ncbi_rpstblastn_wrapper.xml" /> - <tool file="ncbi_blast_plus/blastxml_to_tabular.xml" /> - </section> - -You will also need to install 'blast_datatypes' from the Tool Shed. This -defines the BLAST XML file format ('blastxml') and protein and nucleotide -BLAST databases composite file formats ('blastdbp' and 'blastdbn'). - -As described above for an automated installation, you must also tell Galaxy -about any system level BLAST databases using the tool-data/blastdb*.loc files. - -You must install the NCBI BLAST+ standalone tools somewhere on the system -path. Currently the unit tests are written using "BLAST 2.2.26+". 
- -Run the functional tests (adjusting the section identifier to match your -tool_conf.xml.sample file): - -./run_functional_tests.sh -sid NCBI_BLAST+-ncbi_blast_plus_tools - - -History -======= - -v0.0.11 - Final revision as part of the Galaxy main repository, and the - first release via the Tool Shed -v0.0.12 - Implements genetic code option for translation searches. - - Changes <parallelism> to 1000 sequences at a time (to cope with - very large sets of queries where BLAST+ can become memory hungry) - - Include warning that BLAST+ with subject FASTA gives pairwise - e-values -v0.0.13 - Use the new error handling options in Galaxy (the previously - bundled hide_stderr.py script is no longer needed). -v0.0.14 - Support for makeblastdb and blastdbinfo with local BLAST databases - in the history (using work from Edward Kirton), requires v0.0.14 - of the 'blast_datatypes' repository from the Tool Shed. -v0.0.15 - Stronger warning in help text against searching against subject - FASTA files (better looking e-values than you might be expecting). -v0.0.16 - Added repository_dependencies.xml for automates installation of the - 'blast_datatypes' repository from the Tool Shed. -v0.0.17 - The BLAST+ search tools now default to extended tabular output - (all too often our users where having to re-run searches just to - get one of the missing columns like query or subject length) -v0.0.18 - Defensive quoting of filenames in case of spaces (where possible, - BLAST+ handling of some mult-file arguments is problematic). -v0.0.19 - Added wrappers for rpsblast and rpstblastn, and new blastdb_d.loc - for the domain databases they use (e.g. CDD, PFAM or SMART). - - Correct case of exception regular expression (for error handling - fall-back in case the return code is not set properly). - - Clearer naming of output files. -v0.0.20 - Added unit tests for BLASTN and TBLASTX. - - Fallback on ElementTree if cElementTree missing in XML to tabular. 
- - Link to Tool Shed added to help text and this documentation. - - Tweak dependency on blast_datatypes to also work on Test Tool Shed - - -Developers -========== - -This script and related tools are being developed on the 'tools' branch of the -following Mercurial repository: -https://bitbucket.org/peterjc/galaxy-central/ - -For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball I use -the following command from the Galaxy root folder: - -$ ./tools/ncbi_blast_plus/make_ncbi_blast_plus.sh - -This simplifies ensuring a consistent set of files is bundled each time, -including all the relevant test files. - - -Licence (MIT/BSD style) -======================= - -Permission to use, copy, modify, and distribute this software and its -documentation with or without modifications and for any purpose and -without fee is hereby granted, provided that any copyright notices -appear in all copies and that both those copyright notices and this -permission notice appear in supporting documentation, and that the -names of the contributors or copyright holders not be used in -advertising or publicity pertaining to distribution of the software -without specific prior permission. - -THE CONTRIBUTORS AND COPYRIGHT HOLDERS OF THIS SOFTWARE DISCLAIM ALL -WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED -WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE -CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY SPECIAL, INDIRECT -OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS -OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE -OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE -OR PERFORMANCE OF THIS SOFTWARE. - -NOTE: This is the licence for the Galaxy Wrapper only. NCBI BLAST+ and -associated data files are available and licenced separately.
--- a/tools/ncbi_blast_plus/ncbi_blastdbcmd_info.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,67 +0,0 @@ -<tool id="ncbi_blastdbcmd_info" name="NCBI BLAST+ database info" version="0.0.6"> - <description>Show BLAST database information from blastdbcmd</description> - <requirements> - <requirement type="binary">blastdbcmd</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>blastdbcmd -version</version_command> - <command> -blastdbcmd -dbtype $db_opts.db_type -db "${db_opts.database.fields.path}" -info -out "$info" - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- Suspect blastdbcmd sometimes fails to set error level --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <conditional name="db_opts"> - <param name="db_type" type="select" label="Type of BLAST database"> - <option value="nucl" selected="True">Nucleotide</option> - <option value="prot">Protein</option> - </param> - <when value="nucl"> - <param name="database" type="select" label="Nucleotide BLAST database"> - <options from_file="blastdb.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - </when> - <when value="prot"> - <param name="database" type="select" label="Protein BLAST database"> - <options from_file="blastdb_p.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - </when> - </conditional> - </inputs> - <outputs> - <data name="info" format="txt" label="${db_opts.database.fields.name} info" /> - </outputs> - <help> - -**What it does** - -Calls the NCBI BLAST+ blastdbcmd command line tool with the -info -switch to give summary information about a BLAST database, such as -the size (number of sequences and total length) and 
date. - -------- - -**References** - -Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. - -Schaffer et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. 2001. Nucleic Acids Res. 29:2994-3005. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
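For reference, the command template in the removed wrapper above expands to a single `blastdbcmd` call. The Python 3 sketch below only assembles that argument list, using the switches shown in the wrapper (`-dbtype`, `-db`, `-info`, `-out`). The function name `blastdbcmd_info_args` and the example paths are hypothetical, and actually running the command would require BLAST+ on the system path.

```python
def blastdbcmd_info_args(db_type, db_path, out_path):
    """Build the argument list for 'blastdbcmd ... -info'.

    Hypothetical helper mirroring the wrapper's <command> template;
    db_type is 'nucl' or 'prot' as in the wrapper's select options.
    """
    if db_type not in ("nucl", "prot"):
        raise ValueError("db_type should be 'nucl' or 'prot'")
    return ["blastdbcmd",
            "-dbtype", db_type,
            "-db", db_path,
            "-info",
            "-out", out_path]
```

Passing the result as a list to `subprocess.call` (without a shell) would avoid the need for the manual quoting of filenames that the wrapper's template does.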
--- a/tools/ncbi_blast_plus/ncbi_blastdbcmd_wrapper.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,139 +0,0 @@ -<tool id="ncbi_blastdbcmd_wrapper" name="NCBI BLAST+ blastdbcmd entry(s)" version="0.0.6"> - <description>Extract sequence(s) from BLAST database</description> - <requirements> - <requirement type="binary">blastdbcmd</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>blastdbcmd -version</version_command> - <command> -## The command is a Cheetah template which allows some Python based syntax. -## Lines starting hash hash are comments. Galaxy will turn newlines into spaces -blastdbcmd -dbtype $db_opts.db_type -db "${db_opts.database.fields.path}" - -##TODO: What about -ctrl_a and -target_only as advanced options? - -#if $id_opts.id_type=="file": --entry_batch "$id_opts.entries" -#else: -##Perform some simple search/replaces to remove whitespace -##and make it comma separated, and escape any pipe characters --entry "$id_opts.entries.replace('\r',',').replace('\n',',').replace(' ','').replace(',,',',').replace(',,',',').strip(',').replace('|','\|')" -#end if - -##When building a BLAST database, to ensure unique IDs makeblastdb will -##do things like turning a FASTA entry with ID of ERP44 into lcl|ERP44 -##(if using -parse_seqids) or simply assign it an ID using the record -##number like gnl|BL_ORD_ID|123 (to cope with duplicate IDs in the FASTA -##file). In -parse_seqids mode, a duplicate FASTA ID gives an error. -## -##The BLAST plain text and XML output will contain these BLAST IDs, but -##the tabular output does not (at least, not in BLAST 2.2.25+). -##Therefore in general, Galaxy users won't care about the (internal) -##BLAST identifiers. -## -##The blastdbcmd FASTA output will also contain these IDs, but in the -##context of the BLAST tabular output they are not helpful. 
Therefore -##to recover the original ID as used in the FASTA file for makeblastdb -##we need a litte post processing. -## -##We remove the NCBI's lcl|... or gnl|BL_ORD_ID|123 prefixes -##using sed, however the exact syntax differs for Mac OS X's sed - -#if str($outfmt)=="blastid": --out "$seq" -#else if sys.platform == "darwin": -| sed -E 's/^>(lcl\||gnl\|BL_ORD_ID\|[0-9]* )/>/1' > "$seq" -#else: -| sed 's/>\(lcl|\|gnl|BL_ORD_ID|[0-9]* \)/>/1' > "$seq" -#end if - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- Suspect blastdbcmd sometimes fails to set error level --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <conditional name="db_opts"> - <param name="db_type" type="select" label="Type of BLAST database"> - <option value="nucl" selected="True">Nucleotide</option> - <option value="prot">Protein</option> - </param> - <when value="nucl"> - <param name="database" type="select" label="Nucleotide BLAST database"> - <options from_file="blastdb.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - </when> - <when value="prot"> - <param name="database" type="select" label="Protein BLAST database"> - <options from_file="blastdb_p.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - </when> - </conditional> - <conditional name="id_opts"> - <param name="id_type" type="select" label="Type of identifier list"> - <option value="file">From file</option> - <option value="prompt">User entered</option> - </param> - <when value="file"> - <param name="entries" type="data" format="txt,tabular" label="Sequence identifier(s)" help="Plain text file with one ID per line (i.e. 
single column tabular file)"/> - </when> - <when value="prompt"> - <param name="entries" type="text" label="Sequence identifier(s)" help="Comma or new line separated list." optional="False" area="True" size="10x30"/> - </when> - </conditional> - <param name="outfmt" type="select" label="Output format"> - <option value="original">FASTA with original identifiers</option> - <option value="blastid">FASTA with BLAST assigned identifiers</option> - </param> - </inputs> - <outputs> - <data name="seq" format="fasta" label="Sequences from ${db_opts.database.fields.name}" /> - </outputs> - <help> - -**What it does** - -Extracts FASTA formatted sequences from a BLAST database -using the NCBI BLAST+ blastdbcmd command line tool. - -.. class:: warningmark - -**BLAST assigned identifiers** - -When a BLAST database is constructed from a FASTA file, the -original identifiers can be replaced with BLAST assigned -identifiers, partly to ensure uniqueness. e.g. Sometimes -a prefix of 'lcl|' is added (lcl is short for local), -or an arbitrary name starting 'gnl|BL_ORD_ID|' is created. - -If you are using the tabular output from BLAST, it will contain -the original identifiers - not the BLAST assigned identifiers -suitable for use with the blastdbcmd tool. - -If you are using the XML or plain text output, this will also -contain the BLAST assigned identifiers. However, this means -getting a list of BLAST assigned identifiers isn't straightforward. - -------- - -**References** - -Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. - -Schaffer et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. 2001. Nucleic Acids Res. 29:2994-3005. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
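The removed wrapper above does two pieces of string handling: it normalises a user-entered identifier list, and it strips the BLAST assigned `lcl|` or `gnl|BL_ORD_ID|N` prefixes with `sed`. Both are sketched here in Python 3. The function names are hypothetical, and the regular expression follows the anchored (Mac OS X) variant of the `sed` pattern, which is an assumption about which form to prefer.

```python
import re

# Matches '>lcl|' or '>gnl|BL_ORD_ID|<digits> ' at the start of a FASTA title line.
_BLAST_PREFIX = re.compile(r"^>(lcl\||gnl\|BL_ORD_ID\|[0-9]* )")


def strip_blast_prefix(line):
    """Remove a BLAST assigned ID prefix from a FASTA title line (hypothetical helper)."""
    return _BLAST_PREFIX.sub(">", line, count=1)


def clean_entries(text):
    """Turn a pasted ID list into the comma separated form passed to -entry.

    Same chain of replacements as the wrapper's Cheetah template:
    newlines become commas, spaces are dropped, doubled commas are
    collapsed, and pipe characters are backslash-escaped.
    """
    return (text.replace("\r", ",").replace("\n", ",").replace(" ", "")
                .replace(",,", ",").replace(",,", ",").strip(",")
                .replace("|", "\\|"))
```

Sequence lines and titles without one of these prefixes pass through `strip_blast_prefix` unchanged, which matches the intent described in the template comments of recovering the identifiers used in the original FASTA file.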
--- a/tools/ncbi_blast_plus/ncbi_blastn_wrapper.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,253 +0,0 @@ -<tool id="ncbi_blastn_wrapper" name="NCBI BLAST+ blastn" version="0.0.20"> - <description>Search nucleotide database with nucleotide query sequence(s)</description> - <!-- If job splitting is enabled, break up the query file into parts --> - <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> - <requirements> - <requirement type="binary">blastn</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>blastn -version</version_command> - <command> -## The command is a Cheetah template which allows some Python based syntax. -## Lines starting hash hash are comments. Galaxy will turn newlines into spaces -blastn --query "$query" -#if $db_opts.db_opts_selector == "db": - -db "${db_opts.database.fields.path}" -#elif $db_opts.db_opts_selector == "histdb": - -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" -#else: - -subject "$db_opts.subject" -#end if --task $blast_type --evalue $evalue_cutoff --out "$output1" -##Set the extended list here so if/when we add things, saved workflows are not affected -#if str($out_format)=="ext": - -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" -#else: - -outfmt $out_format -#end if --num_threads 8 -#if $adv_opts.adv_opts_selector=="advanced": -$adv_opts.filter_query -$adv_opts.strand -## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string -## Note -max_target_seqs overrides -num_descriptions and -num_alignments -#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): --max_target_seqs $adv_opts.max_hits -#end if -#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): --word_size $adv_opts.word_size -#end 
if -$adv_opts.ungapped -$adv_opts.parse_deflines -## End of advanced options: -#end if - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- In case the return code has not been set propery check stderr too --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <param name="query" type="data" format="fasta" label="Nucleotide query sequence(s)"/> - <conditional name="db_opts"> - <param name="db_opts_selector" type="select" label="Subject database/sequences"> - <option value="db" selected="True">Locally installed BLAST database</option> - <option value="histdb">BLAST database from your history</option> - <option value="file">FASTA file from your history (see warning note below)</option> - </param> - <when value="db"> - <param name="database" type="select" label="Nucleotide BLAST database"> - <options from_file="blastdb.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="histdb"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="data" format="blastdbn" label="Nucleotide BLAST database" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="file"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="data" format="fasta" label="Nucleotide FASTA file to use as database"/> - </when> - </conditional> - <param name="blast_type" type="select" display="radio" label="Type of BLAST"> - <option value="megablast">megablast</option> - <option value="blastn">blastn</option> - <option value="blastn-short">blastn-short</option> - <option value="dc-megablast">dc-megablast</option> - <!-- Using BLAST 2.2.24+ this gives an error: - BLAST engine 
error: Program type 'vecscreen' not supported - <option value="vecscreen">vecscreen</option> - --> - </param> - <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> - <param name="out_format" type="select" label="Output format"> - <option value="6">Tabular (standard 12 columns)</option> - <option value="ext" selected="True">Tabular (extended 24 columns)</option> - <option value="5">BLAST XML</option> - <option value="0">Pairwise text</option> - <option value="0 -html">Pairwise HTML</option> - <option value="2">Query-anchored text</option> - <option value="2 -html">Query-anchored HTML</option> - <option value="4">Flat query-anchored text</option> - <option value="4 -html">Flat query-anchored HTML</option> - <!-- - <option value="-outfmt 11">BLAST archive format (ASN.1)</option> - --> - </param> - <conditional name="adv_opts"> - <param name="adv_opts_selector" type="select" label="Advanced Options"> - <option value="basic" selected="True">Hide Advanced Options</option> - <option value="advanced">Show Advanced Options</option> - </param> - <when value="basic" /> - <when value="advanced"> - <!-- Could use a select (yes, no, other) where other allows setting 'level window linker' --> - <param name="filter_query" type="boolean" label="Filter out low complexity regions (with DUST)" truevalue="-dust yes" falsevalue="-dust no" checked="true" /> - <param name="strand" type="select" label="Query strand(s) to search against database/subject"> - <option value="-strand both">Both</option> - <option value="-strand plus">Plus (forward)</option> - <option value="-strand minus">Minus (reverse complement)</option> - </param> - <!-- Why doesn't optional override a validator? 
I want to accept an empty string OR a non-negative integer --> - <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> - <validator type="in_range" min="0" /> - </param> - <!-- I'd like word_size to be optional, with minimum 4 for blastn --> - <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 4."> - <validator type="in_range" min="0" /> - </param> - <param name="ungapped" type="boolean" label="Perform ungapped alignment only?" truevalue="-ungapped" falsevalue="" checked="false" /> - <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> - </when> - </conditional> - </inputs> - <outputs> - <data name="output1" format="tabular" label="${blast_type.value_label} on ${on_string}"> - <change_format> - <when input="out_format" value="0" format="txt"/> - <when input="out_format" value="0 -html" format="html"/> - <when input="out_format" value="2" format="txt"/> - <when input="out_format" value="2 -html" format="html"/> - <when input="out_format" value="4" format="txt"/> - <when input="out_format" value="4 -html" format="html"/> - <when input="out_format" value="5" format="blastxml"/> - </change_format> - </data> - </outputs> - <tests> - <test> - <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="three_human_mRNA.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-40" /> - <param name="out_format" value="6" /> - <param name="adv_opts_selector" value="basic" /> - <output name="output1" file="blastn_rhodopsin_vs_three_human.tabular" ftype="tabular" /> - </test> - </tests> - <help> - -.. class:: warningmark - -**Note**. 
Database searches may take a substantial amount of time. -For large input datasets it is advisable to allow overnight processing. - ------ - -**What it does** - -Search a *nucleotide database* using a *nucleotide query*, -using the NCBI BLAST+ blastn command line tool. -Algorithms include blastn, megablast, and discontiguous megablast. - -.. class:: warningmark - -You can also search against a FASTA file of subject nucleotide -sequences. This is *not* advised because it is slower (only one -CPU is used), but more importantly gives e-values for pairwise -searches (very small e-values which will look overly signficiant). -In most cases you should instead turn the other FASTA file into a -database first using *makeblastdb* and search against that. - ------ - -**Output format** - -Because Galaxy focuses on processing tabular data, the default output of this -tool is tabular. The standard BLAST+ tabular output contains 12 columns: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The BLAST+ tools can optionally output additional columns of information, -but this takes longer to calculate. Most (but not all) of these columns are -included by selecting the extended tabular output. The extra columns are -included *after* the standard 12 columns. 
This is so that you can write -workflow filtering steps that accept either the 12 or 24 column tabular -BLAST output. Galaxy now uses this extended 24 column output by default. - -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= =========================================== - -The third option is BLAST XML output, which is designed to be parsed by -another program, and is understood by some Galaxy tools. - -You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). -The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. -The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. -The two query anchored outputs show a multiple sequence alignment between the query and all the matches, -and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). - -------- - -**References** - -Zhang et al. A Greedy Algorithm for Aligning DNA Sequences. 2000. JCB: 203-214. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
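The help text in the removed blastn wrapper explains that the extra columns are appended after the standard 12 so that downstream steps can accept either layout. A minimal Python sketch of such a step follows, assuming tab separated input with exactly 12 or 24 fields; the column names come from the tables above, while `parse_hit` is a hypothetical helper and the sample row is made up.

```python
# Column names as listed in the wrapper's help text.
STD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
EXT_COLUMNS = STD_COLUMNS + [
    "sallseqid", "score", "nident", "positive", "gaps", "ppos",
    "qframe", "sframe", "qseq", "sseq", "qlen", "slen",
]


def parse_hit(line):
    """Map one tabular BLAST line to a dict of column name -> string value."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) == len(STD_COLUMNS):
        names = STD_COLUMNS
    elif len(fields) == len(EXT_COLUMNS):
        names = EXT_COLUMNS
    else:
        raise ValueError("expected 12 or 24 columns, got %d" % len(fields))
    return dict(zip(names, fields))


if __name__ == "__main__":
    # Made-up 12 column row, for illustration only.
    row = "\t".join(["q1", "s1", "98.5", "200", "3", "0",
                     "1", "200", "5", "204", "1e-50", "350"])
    hit = parse_hit(row)
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

Because the first 12 names are shared, code that only reads the standard columns works on both outputs without changes.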
--- a/tools/ncbi_blast_plus/ncbi_blastp_wrapper.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,308 +0,0 @@ -<tool id="ncbi_blastp_wrapper" name="NCBI BLAST+ blastp" version="0.0.20"> - <description>Search protein database with protein query sequence(s)</description> - <!-- If job splitting is enabled, break up the query file into parts --> - <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> - <requirements> - <requirement type="binary">blastp</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>blastp -version</version_command> - <command> -## The command is a Cheetah template which allows some Python based syntax. -## Lines starting hash hash are comments. Galaxy will turn newlines into spaces -blastp --query "$query" -#if $db_opts.db_opts_selector == "db": - -db "${db_opts.database.fields.path}" -#elif $db_opts.db_opts_selector == "histdb": - -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" -#else: - -subject "$db_opts.subject" -#end if --task $blast_type --evalue $evalue_cutoff --out "$output1" -##Set the extended list here so if/when we add things, saved workflows are not affected -#if str($out_format)=="ext": - -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" -#else: - -outfmt $out_format -#end if --num_threads 8 -#if $adv_opts.adv_opts_selector=="advanced": -$adv_opts.filter_query --matrix $adv_opts.matrix -## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string -## Note -max_target_seqs overrides -num_descriptions and -num_alignments -#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): --max_target_seqs $adv_opts.max_hits -#end if -#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): --word_size $adv_opts.word_size -#end 
if -##Ungapped disabled for now - see comments below -##$adv_opts.ungapped -$adv_opts.parse_deflines -## End of advanced options: -#end if - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- In case the return code has not been set propery check stderr too --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <param name="query" type="data" format="fasta" label="Protein query sequence(s)"/> - <conditional name="db_opts"> - <param name="db_opts_selector" type="select" label="Subject database/sequences"> - <option value="db" selected="True">Locally installed BLAST database</option> - <option value="histdb">BLAST database from your history</option> - <option value="file">FASTA file from your history (see warning note below)</option> - </param> - <when value="db"> - <param name="database" type="select" label="Protein BLAST database"> - <options from_file="blastdb_p.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="histdb"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="data" format="blastdbp" label="Protein BLAST database" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="file"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="data" format="fasta" label="Protein FASTA file to use as database"/> - </when> - </conditional> - <param name="blast_type" type="select" display="radio" label="Type of BLAST"> - <option value="blastp">blastp</option> - <option value="blastp-short">blastp-short</option> - </param> - <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> - <param 
name="out_format" type="select" label="Output format"> - <option value="6">Tabular (standard 12 columns)</option> - <option value="ext" selected="True">Tabular (extended 24 columns)</option> - <option value="5">BLAST XML</option> - <option value="0">Pairwise text</option> - <option value="0 -html">Pairwise HTML</option> - <option value="2">Query-anchored text</option> - <option value="2 -html">Query-anchored HTML</option> - <option value="4">Flat query-anchored text</option> - <option value="4 -html">Flat query-anchored HTML</option> - <!-- - <option value="-outfmt 11">BLAST archive format (ASN.1)</option> - --> - </param> - <conditional name="adv_opts"> - <param name="adv_opts_selector" type="select" label="Advanced Options"> - <option value="basic" selected="True">Hide Advanced Options</option> - <option value="advanced">Show Advanced Options</option> - </param> - <when value="basic" /> - <when value="advanced"> - <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> - <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="false" /> - <param name="matrix" type="select" label="Scoring matrix"> - <option value="BLOSUM90">BLOSUM90</option> - <option value="BLOSUM80">BLOSUM80</option> - <option value="BLOSUM62" selected="true">BLOSUM62 (default)</option> - <option value="BLOSUM50">BLOSUM50</option> - <option value="BLOSUM45">BLOSUM45</option> - <option value="PAM250">PAM250</option> - <option value="PAM70">PAM70</option> - <option value="PAM30">PAM30</option> - </param> - <!-- Why doesn't optional override a validator? 
I want to accept an empty string OR a non-negative integer --> - <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> - <validator type="in_range" min="0" /> - </param> - <!-- I'd like word_size to be optional, with minimum 2 for blastp --> - <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> - <validator type="in_range" min="0" /> - </param> - <!-- - Can't use '-ungapped' on its own, error back is: - Composition-adjusted searched are not supported with an ungapped search, please add -comp_based_stats F or do a gapped search - Tried using '-ungapped -comp_based_stats F' and blastp crashed with 'Attempt to access NULL pointer.' - <param name="ungapped" type="boolean" label="Perform ungapped alignment only?" truevalue="-ungapped -comp_based_stats F" falsevalue="" checked="false" /> - --> - <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> - </when> - </conditional> - </inputs> - <outputs> - <data name="output1" format="tabular" label="${blast_type.value_label} on ${on_string}"> - <change_format> - <when input="out_format" value="0" format="txt"/> - <when input="out_format" value="0 -html" format="html"/> - <when input="out_format" value="2" format="txt"/> - <when input="out_format" value="2 -html" format="html"/> - <when input="out_format" value="4" format="txt"/> - <when input="out_format" value="4 -html" format="html"/> - <when input="out_format" value="5" format="blastxml"/> - </change_format> - </data> - </outputs> - <tests> - <test> - <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="rhodopsin_proteins.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-8" /> - <param name="blast_type" value="blastp" /> - <param name="out_format" value="5" /> - <param name="adv_opts_selector" value="advanced" /> - <param name="filter_query" value="False" /> - <param name="matrix" value="BLOSUM62" /> - <param name="max_hits" value="0" /> - <param name="word_size" value="0" /> - <param name="parse_deflines" value="True" /> - <output name="output1" file="blastp_four_human_vs_rhodopsin.xml" ftype="blastxml" /> - </test> - <test> - <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="rhodopsin_proteins.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-8" /> - <param name="blast_type" value="blastp" /> - <param name="out_format" value="6" /> - <param name="adv_opts_selector" value="advanced" /> - <param name="filter_query" value="False" /> - <param name="matrix" value="BLOSUM62" /> - <param 
name="max_hits" value="0" /> - <param name="word_size" value="0" /> - <param name="parse_deflines" value="True" /> - <output name="output1" file="blastp_four_human_vs_rhodopsin.tabular" ftype="tabular" /> - </test> - <test> - <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="rhodopsin_proteins.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-8" /> - <param name="blast_type" value="blastp" /> - <param name="out_format" value="ext" /> - <param name="adv_opts_selector" value="advanced" /> - <param name="filter_query" value="False" /> - <param name="matrix" value="BLOSUM62" /> - <param name="max_hits" value="0" /> - <param name="word_size" value="0" /> - <param name="parse_deflines" value="True" /> - <output name="output1" file="blastp_four_human_vs_rhodopsin_ext.tabular" ftype="tabular" /> - </test> - <test> - <param name="query" value="rhodopsin_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-8" /> - <param name="blast_type" value="blastp" /> - <param name="out_format" value="6" /> - <param name="adv_opts_selector" value="basic" /> - <output name="output1" file="blastp_rhodopsin_vs_four_human.tabular" ftype="tabular" /> - </test> - </tests> - <help> - -.. class:: warningmark - -**Note**. Database searches may take a substantial amount of time. -For large input datasets it is advisable to allow overnight processing. - ------ - -**What it does** - -Search a *protein database* using a *protein query*, -using the NCBI BLAST+ blastp command line tool. - -.. class:: warningmark - -You can also search against a FASTA file of subject protein -sequences. 
This is *not* advised because it is slower (only one -CPU is used), but more importantly gives e-values for pairwise -searches (very small e-values which will look overly signficiant). -In most cases you should instead turn the other FASTA file into a -database first using *makeblastdb* and search against that. - ------ - -**Output format** - -Because Galaxy focuses on processing tabular data, the default output of this -tool is tabular. The standard BLAST+ tabular output contains 12 columns: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The BLAST+ tools can optionally output additional columns of information, -but this takes longer to calculate. Most (but not all) of these columns are -included by selecting the extended tabular output. The extra columns are -included *after* the standard 12 columns. This is so that you can write -workflow filtering steps that accept either the 12 or 24 column tabular -BLAST output. Galaxy now uses this extended 24 column output by default. 
- -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= =========================================== - -The third option is BLAST XML output, which is designed to be parsed by -another program, and is understood by some Galaxy tools. - -You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). -The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. -The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. -The two query anchored outputs show a multiple sequence alignment between the query and all the matches, -and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). - -------- - -**References** - -Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. - -Schaffer et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. 2001. Nucleic Acids Res. 29:2994-3005. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
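The Cheetah templates in these removed wrappers only emit `-max_target_seqs` and `-word_size` when the value is non-empty and greater than zero, with zero meaning "use the BLAST default". A minimal Python sketch of that guard follows; the function name `optional_args` is hypothetical, and the real wrappers build a single command string rather than an argument list.

```python
def optional_args(max_hits, word_size):
    """Return extra BLAST+ arguments, skipping empty or zero values.

    Mirrors the template test "str(x) and int(str(x)) > 0": the str() call
    handles wrapper objects, and an empty string short-circuits before int().
    """
    args = []
    if str(max_hits) and int(str(max_hits)) > 0:
        args += ["-max_target_seqs", str(max_hits)]
    if str(word_size) and int(str(word_size)) > 0:
        args += ["-word_size", str(word_size)]
    return args


if __name__ == "__main__":
    print(optional_args(0, 0))    # zero means default, so nothing is added
    print(optional_args(50, ""))  # empty word size is skipped
```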
--- a/tools/ncbi_blast_plus/ncbi_blastx_wrapper.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,294 +0,0 @@ -<tool id="ncbi_blastx_wrapper" name="NCBI BLAST+ blastx" version="0.0.19"> - <description>Search protein database with translated nucleotide query sequence(s)</description> - <!-- If job splitting is enabled, break up the query file into parts --> - <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> - <requirements> - <requirement type="binary">blastx</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>blastx -version</version_command> - <command> -## The command is a Cheetah template which allows some Python based syntax. -## Lines starting hash hash are comments. Galaxy will turn newlines into spaces -blastx --query "$query" -#if $db_opts.db_opts_selector == "db": - -db "${db_opts.database.fields.path}" -#elif $db_opts.db_opts_selector == "histdb": - -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" -#else: - -subject "$db_opts.subject" -#end if --query_gencode $query_gencode --evalue $evalue_cutoff --out "$output1" -##Set the extended list here so if/when we add things, saved workflows are not affected -#if str($out_format)=="ext": - -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" -#else: - -outfmt $out_format -#end if --num_threads 8 -#if $adv_opts.adv_opts_selector=="advanced": -$adv_opts.filter_query -$adv_opts.strand --matrix $adv_opts.matrix -## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string -## Note -max_target_seqs overrides -num_descriptions and -num_alignments -#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): --max_target_seqs $adv_opts.max_hits -#end if -#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) 
> 0): --word_size $adv_opts.word_size -#end if -$adv_opts.ungapped -$adv_opts.parse_deflines -## End of advanced options: -#end if - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- In case the return code has not been set propery check stderr too --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <param name="query" type="data" format="fasta" label="Nucleotide query sequence(s)"/> - <conditional name="db_opts"> - <param name="db_opts_selector" type="select" label="Subject database/sequences"> - <option value="db" selected="True">Locally installed BLAST database</option> - <option value="histdb">BLAST database from your history</option> - <option value="file">FASTA file from your history (see warning note below)</option> - </param> - <when value="db"> - <param name="database" type="select" label="Protein BLAST database"> - <options from_file="blastdb_p.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="histdb"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="data" format="blastdbp" label="Protein BLAST database" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="file"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="data" format="fasta" label="Protein FASTA file to use as database"/> - </when> - </conditional> - <param name="query_gencode" type="select" label="Query genetic code"> - <!-- See http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details --> - <option value="1" select="True">1. Standard</option> - <option value="2">2. Vertebrate Mitochondrial</option> - <option value="3">3. 
Yeast Mitochondrial</option> - <option value="4">4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code</option> - <option value="5">5. Invertebrate Mitochondrial</option> - <option value="6">6. Ciliate, Dasycladacean and Hexamita Nuclear Code</option> - <option value="9">9. Echinoderm Mitochondrial</option> - <option value="10">10. Euplotid Nuclear</option> - <option value="11">11. Bacteria and Archaea</option> - <option value="12">12. Alternative Yeast Nuclear</option> - <option value="13">13. Ascidian Mitochondrial</option> - <option value="14">14. Flatworm Mitochondrial</option> - <option value="15">15. Blepharisma Macronuclear</option> - <option value="16">16. Chlorophycean Mitochondrial Code</option> - <option value="21">21. Trematode Mitochondrial Code</option> - <option value="22">22. Scenedesmus obliquus mitochondrial Code</option> - <option value="23">23. Thraustochytrium Mitochondrial Code</option> - <option value="24">24. Pterobranchia mitochondrial code</option> - </param> - <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> - <param name="out_format" type="select" label="Output format"> - <option value="6">Tabular (standard 12 columns)</option> - <option value="ext" selected="True">Tabular (extended 24 columns)</option> - <option value="5">BLAST XML</option> - <option value="0">Pairwise text</option> - <option value="0 -html">Pairwise HTML</option> - <option value="2">Query-anchored text</option> - <option value="2 -html">Query-anchored HTML</option> - <option value="4">Flat query-anchored text</option> - <option value="4 -html">Flat query-anchored HTML</option> - <!-- - <option value="-outfmt 11">BLAST archive format (ASN.1)</option> - --> - </param> - <conditional name="adv_opts"> - <param name="adv_opts_selector" type="select" label="Advanced Options"> - <option value="basic" selected="True">Hide Advanced Options</option> - <option value="advanced">Show 
Advanced Options</option> - </param> - <when value="basic" /> - <when value="advanced"> - <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> - <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="true" /> - <param name="strand" type="select" label="Query strand(s) to search against database/subject"> - <option value="-strand both">Both</option> - <option value="-strand plus">Plus (forward)</option> - <option value="-strand minus">Minus (reverse complement)</option> - </param> - <param name="matrix" type="select" label="Scoring matrix"> - <option value="BLOSUM90">BLOSUM90</option> - <option value="BLOSUM80">BLOSUM80</option> - <option value="BLOSUM62" selected="true">BLOSUM62 (default)</option> - <option value="BLOSUM50">BLOSUM50</option> - <option value="BLOSUM45">BLOSUM45</option> - <option value="PAM250">PAM250</option> - <option value="PAM70">PAM70</option> - <option value="PAM30">PAM30</option> - </param> - <!-- Why doesn't optional override a validator? I want to accept an empty string OR a non-negative integer --> - <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> - <validator type="in_range" min="0" /> - </param> - <!-- I'd like word_size to be optional, with minimum 2 for blastx --> - <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> - <validator type="in_range" min="0" /> - </param> - <param name="ungapped" type="boolean" label="Perform ungapped alignment only?" truevalue="-ungapped" falsevalue="" checked="false" /> - <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> - </when> - </conditional> - </inputs> - <outputs> - <data name="output1" format="tabular" label="blastx on ${on_string}"> - <change_format> - <when input="out_format" value="0" format="txt"/> - <when input="out_format" value="0 -html" format="html"/> - <when input="out_format" value="2" format="txt"/> - <when input="out_format" value="2 -html" format="html"/> - <when input="out_format" value="4" format="txt"/> - <when input="out_format" value="4 -html" format="html"/> - <when input="out_format" value="5" format="blastxml"/> - </change_format> - </data> - </outputs> - <tests> - <test> - <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-10" /> - <param name="out_format" value="5" /> - <param name="adv_opts_selector" value="basic" /> - <output name="output1" file="blastx_rhodopsin_vs_four_human.xml" ftype="blastxml" /> - </test> - <test> - <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-10" /> - <param name="out_format" value="6" /> - <param name="adv_opts_selector" value="basic" /> - <output name="output1" file="blastx_rhodopsin_vs_four_human.tabular" ftype="tabular" /> - </test> - <test> - <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-10" /> - <param name="out_format" value="ext" /> - <param 
name="adv_opts_selector" value="basic" /> - <output name="output1" file="blastx_rhodopsin_vs_four_human_ext.tabular" ftype="tabular" /> - </test> - </tests> - <help> - -.. class:: warningmark - -**Note**. Database searches may take a substantial amount of time. -For large input datasets it is advisable to allow overnight processing. - ------ - -**What it does** - -Search a *protein database* using a *translated nucleotide query*, -using the NCBI BLAST+ blastx command line tool. - -.. class:: warningmark - -You can also search against a FASTA file of subject protein -sequences. This is *not* advised because it is slower (only one -CPU is used), but more importantly gives e-values for pairwise -searches (very small e-values which will look overly significant). -In most cases you should instead turn the other FASTA file into a -database first using *makeblastdb* and search against that. - ------ - -**Output format** - -Because Galaxy focuses on processing tabular data, the default output of this -tool is tabular. The standard BLAST+ tabular output contains 12 columns: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The BLAST+ tools can optionally output additional columns of information, -but this takes longer to calculate. Most (but not all) of these columns are -included by selecting the extended tabular output.
The extra columns are -included *after* the standard 12 columns. This is so that you can write -workflow filtering steps that accept either the 12 or 24 column tabular -BLAST output. Galaxy now uses this extended 24 column output by default. - -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= =========================================== - -The third option is BLAST XML output, which is designed to be parsed by -another program, and is understood by some Galaxy tools. - -You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). -The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. -The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. -The two query anchored outputs show a multiple sequence alignment between the query and all the matches, -and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). - -------- - -**References** - -Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
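The help text above says the extra columns are appended after the standard 12 so that a filtering step can accept either layout. A minimal Python sketch of such a filter follows. The function name `filter_hits` and its thresholds are illustrative assumptions and are not part of the wrappers; only the column names come from the tables above.

```python
import io

# Column names as listed in the help text: 12 standard columns,
# followed by the 12 extra columns of the extended tabular output.
STD_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
EXT_COLUMNS = ["sallseqid", "score", "nident", "positive", "gaps", "ppos",
               "qframe", "sframe", "qseq", "sseq", "qlen", "slen"]


def filter_hits(handle, max_evalue=1e-10, min_pident=0.0):
    """Yield hits (as dicts) passing the thresholds (hypothetical helper).

    Accepts 12 or 24 column input; only the first 12 columns are used
    for filtering, so both layouts behave the same way.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) not in (12, 24):
            raise ValueError("Expected 12 or 24 columns, got %i" % len(fields))
        # zip() stops at the shorter sequence, so 12 column rows
        # simply get no extended keys.
        row = dict(zip(STD_COLUMNS + EXT_COLUMNS, fields))
        if float(row["evalue"]) <= max_evalue and float(row["pident"]) >= min_pident:
            yield row
```

Because the filter never touches columns 13 to 24, the same step can sit downstream of either tabular output option.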
--- a/tools/ncbi_blast_plus/ncbi_makeblastdb.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,129 +0,0 @@ -<tool id="ncbi_makeblastdb" name="NCBI BLAST+ makeblastdb" version="0.0.5"> - <description>Make BLAST database</description> - <requirements> - <requirement type="binary">makeblastdb</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>makeblastdb -version</version_command> - <command> -makeblastdb -out "${os.path.join($outfile.extra_files_path,'blastdb')}" -$parse_seqids -$hash_index -## Single call to -in with multiple filenames space separated with outer quotes -## (presumably any filenames with spaces would be a problem). Note this gives -## some extra spaces, e.g. -in " file1 file2 file3 " but BLAST seems happy: --in " -#for $i in $in -${i.file} #end for -" -#if $title: --title "$title" -#else: -##Would default to being based on the cryptic Galaxy filenames, which is unhelpful --title "BLAST Database" -#end if --dbtype $dbtype -## #set $sep = '-mask_data ' -## #for $i in $mask_data -## $sep${i.file} -## #set $set = ', ' -## #end for -## #set $sep = '-gi_mask -gi_mask_name ' -## #for $i in $gi_mask -## $sep${i.file} -## #set $set = ', ' -## #end for -## #if $tax.select == 'id': -## -taxid $tax.id -## #else if $tax.select == 'map': -## -taxid_map $tax.map -## #end if -</command> -<stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- In case the return code has not been set propery check stderr too --> - <regex match="Error:" /> - <regex match="Exception:" /> -</stdio> -<inputs> - <param name="dbtype" type="select" display="radio" label="Molecule type of input"> - <option value="prot">protein</option> - <option value="nucl">nucleotide</option> - </param> - <!-- TODO Allow merging of existing BLAST databases (conditional on the database type) - <repeat name="in" title="Blast or Fasta Database" 
min="1"> - <param name="file" type="data" format="fasta,blastdbn,blastdbp" label="Blast or Fasta database" /> - </repeat> - --> - <repeat name="in" title="FASTA file" min="1"> - <param name="file" type="data" format="fasta" /> - </repeat> - <param name="title" type="text" value="" label="Title for BLAST database" help="This is the database name shown in BLAST search output" /> - <param name="parse_seqids" type="boolean" truevalue="-parse_seqids" falsevalue="" checked="False" label="Parse the sequence identifiers" help="This is only advised if your FASTA file follows the NCBI naming conventions using pipe '|' symbols" /> - <param name="hash_index" type="boolean" truevalue="-hash_index" falsevalue="" checked="true" label="Enable the creation of sequence hash values." help="These hash values can then be used to quickly determine if a given sequence data exists in this BLAST database." /> - - <!-- SEQUENCE MASKING OPTIONS --> - <!-- TODO - <repeat name="mask_data" title="Provide one or more files containing masking data"> - <param name="file" type="data" format="asnb" label="File containing masking data" help="As produced by NCBI masking applications (e.g. 
dustmasker, segmasker, windowmasker)" /> - </repeat> - <repeat name="gi_mask" title="Create GI indexed masking data"> - <param name="file" type="data" format="asnb" label="Masking data output file" /> - </repeat> - --> - - <!-- TAXONOMY OPTIONS --> - <!-- TODO - <conditional name="tax"> - <param name="select" type="select" label="Taxonomy options"> - <option value="">Do not assign sequences to Taxonomy IDs</option> - <option value="id">Assign all sequences to one Taxonomy ID</option> - <option value="map">Supply text file mapping sequence IDs to taxonomy IDs</option> - </param> - <when value=""> - </when> - <when value="id"> - <param name="id" type="integer" value="" label="NCBI taxonomy ID" help="Integer >=0" /> - </when> - <when value="map"> - <param name="file" type="data" format="txt" label="Seq ID : Tax ID mapping file" help="Format: SequenceId TaxonomyId" /> - </when> - </conditional> - --> -</inputs> -<outputs> - <!-- If we only accepted one FASTA file, we could use its human name here... --> - <data name="outfile" format="data" label="${dbtype.value_label} BLAST database from ${on_string}"> - <change_format> - <when input="dbtype" value="nucl" format="blastdbn"/> - <when input="dbtype" value="prot" format="blastdbp"/> - </change_format> - </data> -</outputs> -<help> -**What it does** - -Make BLAST database from one or more FASTA files and/or BLAST databases. - -This is a wrapper for the NCBI BLAST+ tool 'makeblastdb', which is the -replacement for the 'formatdb' tool in the NCBI 'legacy' BLAST suite. - -<!-- -Applying masks to an existing BLAST database will not change the original database; a new database will be created. -For this reason, it's best to apply all masks at once to minimize the number of unnecessary intermediate databases. ---> - -**Documentation** - -http://www.ncbi.nlm.nih.gov/books/NBK1763/ - -**References** - -Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res.
25:3389-3402. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus -</help> -</tool>
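The makeblastdb command template above assembles its arguments from the form fields: a single `-in` value holding all FASTA filenames separated by spaces, a fallback title when none is given, and the two optional boolean flags. The same logic might look roughly like this in plain Python. The function `makeblastdb_args` is a hypothetical helper for illustration; only the flags themselves appear in the wrapper.

```python
def makeblastdb_args(fasta_files, out_prefix, dbtype, title="",
                     parse_seqids=False, hash_index=True):
    """Build a makeblastdb argument list mirroring the template (sketch only)."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    if not fasta_files:
        raise ValueError("At least one FASTA file is required")
    args = ["makeblastdb", "-out", out_prefix]
    if parse_seqids:
        args.append("-parse_seqids")
    if hash_index:
        args.append("-hash_index")
    # One -in argument with space separated filenames, as in the template.
    # As the template comment notes, filenames containing spaces would
    # presumably be a problem with this scheme.
    args += ["-in", " ".join(fasta_files)]
    # The template falls back to a fixed title rather than letting BLAST
    # derive one from Galaxy's cryptic dataset filenames.
    args += ["-title", title or "BLAST Database"]
    args += ["-dbtype", dbtype]
    return args
```

Returning a list rather than a single string means the result could be handed to something like `subprocess.call` without shell quoting, which is a design choice of this sketch rather than of the wrapper.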
--- a/tools/ncbi_blast_plus/ncbi_rpsblast_wrapper.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,238 +0,0 @@ -<tool id="ncbi_rpsblast_wrapper" name="NCBI BLAST+ rpsblast" version="0.0.4"> - <description>Search protein domain database (PSSMs) with protein query sequence(s)</description> - <!-- If job splitting is enabled, break up the query file into parts --> - <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject" merge_outputs="output1"></parallelism> - <requirements> - <requirement type="binary">rpsblast</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>rpsblast -version</version_command> - <command> -## The command is a Cheetah template which allows some Python based syntax. -## Lines starting hash hash are comments. Galaxy will turn newlines into spaces -rpsblast --query "$query" -#if $db_opts.db_opts_selector == "db": - -db "${db_opts.database.fields.path}" -#elif $db_opts.db_opts_selector == "histdb": - -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" -#end if --evalue $evalue_cutoff --out "$output1" -##Set the extended list here so if/when we add things, saved workflows are not affected -#if str($out_format)=="ext": - -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" -#else: - -outfmt $out_format -#end if --num_threads 8 -#if $adv_opts.adv_opts_selector=="advanced": -$adv_opts.filter_query -## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string -## Note -max_target_seqs overrides -num_descriptions and -num_alignments -#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): --max_target_seqs $adv_opts.max_hits -#end if -#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): --word_size $adv_opts.word_size -#end if -$adv_opts.parse_deflines -## End of advanced options: -#end 
if - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- In case the return code has not been set propery check stderr too --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <param name="query" type="data" format="fasta" label="Protein query sequence(s)"/> - <conditional name="db_opts"> - <param name="db_opts_selector" type="select" label="Protein domain database (PSSM)"> - <option value="db" selected="True">Locally installed BLAST database</option> - <!-- TODO - define new datatype - <option value="histdb">BLAST protein domain database from your history</option> - --> - </param> - <when value="db"> - <param name="database" type="select" label="Protein domain database"> - <options from_file="blastdb_d.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="hidden" value="" /> - </when> - <!-- TODO - define new datatype - <when value="histdb"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="data" format="blastdbd" label="Protein domain database" /> - <param name="subject" type="hidden" value="" /> - </when> - --> - </conditional> - <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> - <param name="out_format" type="select" label="Output format"> - <option value="6">Tabular (standard 12 columns)</option> - <option value="ext" selected="True">Tabular (extended 24 columns)</option> - <option value="5">BLAST XML</option> - <option value="0">Pairwise text</option> - <option value="0 -html">Pairwise HTML</option> - <option value="2">Query-anchored text</option> - <option value="2 -html">Query-anchored HTML</option> - <option value="4">Flat query-anchored text</option> - <option value="4 -html">Flat query-anchored 
HTML</option> - <!-- - <option value="-outfmt 11">BLAST archive format (ASN.1)</option> - --> - </param> - <conditional name="adv_opts"> - <param name="adv_opts_selector" type="select" label="Advanced Options"> - <option value="basic" selected="True">Hide Advanced Options</option> - <option value="advanced">Show Advanced Options</option> - </param> - <when value="basic" /> - <when value="advanced"> - <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> - <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="false" /> - <!-- Why doesn't optional override a validator? I want to accept an empty string OR a non-negative integer --> - <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> - <validator type="in_range" min="0" /> - </param> - <!-- I'd like word_size to be optional, with minimum 2 for rpsblast --> - <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> - <validator type="in_range" min="0" /> - </param> - <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> - </when> - </conditional> - </inputs> - <outputs> - <data name="output1" format="tabular" label="rpsblast on ${on_string}"> - <change_format> - <when input="out_format" value="0" format="txt"/> - <when input="out_format" value="0 -html" format="html"/> - <when input="out_format" value="2" format="txt"/> - <when input="out_format" value="2 -html" format="html"/> - <when input="out_format" value="4" format="txt"/> - <when input="out_format" value="4 -html" format="html"/> - <when input="out_format" value="5" format="blastxml"/> - </change_format> - </data> - </outputs> - <help> - -.. class:: warningmark - -**Note**. Database searches may take a substantial amount of time. -For large input datasets it is advisable to allow overnight processing. - ------ - -**What it does** - -Search a *protein domain database* using a *protein query*, -using the NCBI BLAST+ rpsblast command line tool. 
- -The protein domain databases use position-specific scoring matrices -(PSSMs) and are available for a number of domain collections including: - -*CDD* - NCBI curated meta-collection of domains, see -http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#NCBI_curated_domains - -*Kog* - PSSMs from automatically aligned sequences and sequence -fragments classified in the KOGs resource, the eukaryotic -counterpart to COGs, see http://www.ncbi.nlm.nih.gov/COG/new/ - -*Cog* - PSSMs from automatically aligned sequences and sequence -fragments classified in the COGs resource, which focuses primarily -on prokaryotes, see http://www.ncbi.nlm.nih.gov/COG/new/ - -*Pfam* - PSSMs from Pfam-A seed alignment database, see -http://pfam.sanger.ac.uk/ - -*Smart* - PSSMs from SMART domain alignment database, see -http://smart.embl-heidelberg.de/ - -*Tigr* - PSSMs from TIGRFAM database of protein families, see -http://www.jcvi.org/cms/research/projects/tigrfams/overview/ - -*Prk* - PSSMs from automatically aligned stable clusters in the -Protein Clusters database, see -http://www.ncbi.nlm.nih.gov/proteinclusters?cmd=search&db=proteinclusters - -The exact list of domain databases offered will depend on how your -local Galaxy has been configured. - ------ - -**Output format** - -Because Galaxy focuses on processing tabular data, the default output of this -tool is tabular.
The standard BLAST+ tabular output contains 12 columns: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The BLAST+ tools can optionally output additional columns of information, -but this takes longer to calculate. Most (but not all) of these columns are -included by selecting the extended tabular output. The extra columns are -included *after* the standard 12 columns. This is so that you can write -workflow filtering steps that accept either the 12 or 24 column tabular -BLAST output. Galaxy now uses this extended 24 column output by default. 
- -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= =========================================== - -The third option is BLAST XML output, which is designed to be parsed by -another program, and is understood by some Galaxy tools. - -You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). -The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. -The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. -The two query anchored outputs show a multiple sequence alignment between the query and all the matches, -and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). - -------- - -**References** - -Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W327-31. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
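The rpsblast command template above expands the `ext` output choice into an explicit column list, passes the other choices straight through (some of which carry a trailing `-html` flag), and only adds `-max_target_seqs` and `-word_size` when the value is above zero. A rough Python equivalent is sketched below. The function `output_args` is a hypothetical name, and splitting the option value on whitespace is an assumption about how values such as `0 -html` end up as separate arguments.

```python
# The extended column list, copied from the command template above.
EXT_OUTFMT = ("6 std sallseqid score nident positive gaps ppos "
              "qframe sframe qseq sseq qlen slen")


def output_args(out_format, max_hits=0, word_size=0):
    """Return the output related arguments for a BLAST+ call (sketch only)."""
    args = []
    if out_format == "ext":
        # Keep the column list in one place so that saved workflows
        # are unaffected if the list is extended later.
        args += ["-outfmt", EXT_OUTFMT]
    else:
        # Values like "0 -html" become: -outfmt 0 -html
        args += ["-outfmt"] + out_format.split()
    # Zero means "use the BLAST+ default", so the flag is omitted.
    if max_hits > 0:
        args += ["-max_target_seqs", str(max_hits)]
    if word_size > 0:
        args += ["-word_size", str(word_size)]
    return args
```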
--- a/tools/ncbi_blast_plus/ncbi_rpstblastn_wrapper.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,239 +0,0 @@ -<tool id="ncbi_rpstblastn_wrapper" name="NCBI BLAST+ rpstblastn" version="0.0.4"> - <description>Search protein domain database (PSSMs) with translated nucleotide query sequence(s)</description> - <!-- If job splitting is enabled, break up the query file into parts --> - <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject" merge_outputs="output1"></parallelism> - <requirements> - <requirement type="binary">rpstblastn</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>rpstblastn -version</version_command> - <command> -## The command is a Cheetah template which allows some Python based syntax. -## Lines starting hash hash are comments. Galaxy will turn newlines into spaces -rpstblastn --query "$query" -#if $db_opts.db_opts_selector == "db": - -db "${db_opts.database.fields.path}" -#elif $db_opts.db_opts_selector == "histdb": - -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" -#end if --evalue $evalue_cutoff --out "$output1" -##Set the extended list here so if/when we add things, saved workflows are not affected -#if str($out_format)=="ext": - -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" -#else: - -outfmt $out_format -#end if -##Seems rpstblastn does not currently support multiple threads :( -##-num_threads 8 -#if $adv_opts.adv_opts_selector=="advanced": -$adv_opts.filter_query -## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string -## Note -max_target_seqs overrides -num_descriptions and -num_alignments -#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): --max_target_seqs $adv_opts.max_hits -#end if -#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): 
--word_size $adv_opts.word_size -#end if -$adv_opts.parse_deflines -## End of advanced options: -#end if - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- In case the return code has not been set propery check stderr too --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <param name="query" type="data" format="fasta" label="Nucleotide query sequence(s)"/> - <conditional name="db_opts"> - <param name="db_opts_selector" type="select" label="Protein domain database (PSSM)"> - <option value="db" selected="True">Locally installed BLAST database</option> - <!-- TODO - define new datatype - <option value="histdb">BLAST protein domain database from your history</option> - --> - </param> - <when value="db"> - <param name="database" type="select" label="Protein domain database"> - <options from_file="blastdb_d.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="hidden" value="" /> - </when> - <!-- TODO - define new datatype - <when value="histdb"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="data" format="blastdbd" label="Protein domain database" /> - <param name="subject" type="hidden" value="" /> - </when> - --> - </conditional> - <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> - <param name="out_format" type="select" label="Output format"> - <option value="6">Tabular (standard 12 columns)</option> - <option value="ext" selected="True">Tabular (extended 24 columns)</option> - <option value="5">BLAST XML</option> - <option value="0">Pairwise text</option> - <option value="0 -html">Pairwise HTML</option> - <option value="2">Query-anchored text</option> - <option value="2 -html">Query-anchored HTML</option> - 
<option value="4">Flat query-anchored text</option> - <option value="4 -html">Flat query-anchored HTML</option> - <!-- - <option value="-outfmt 11">BLAST archive format (ASN.1)</option> - --> - </param> - <conditional name="adv_opts"> - <param name="adv_opts_selector" type="select" label="Advanced Options"> - <option value="basic" selected="True">Hide Advanced Options</option> - <option value="advanced">Show Advanced Options</option> - </param> - <when value="basic" /> - <when value="advanced"> - <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> - <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="false" /> - <!-- Why doesn't optional override a validator? I want to accept an empty string OR a non-negative integer --> - <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> - <validator type="in_range" min="0" /> - </param> - <!-- I'd like word_size to be optional, with minimum 2 for rpsblast --> - <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> - <validator type="in_range" min="0" /> - </param> - <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> - </when> - </conditional> - </inputs> - <outputs> - <data name="output1" format="tabular" label="rpstblastn on ${on_string}"> - <change_format> - <when input="out_format" value="0" format="txt"/> - <when input="out_format" value="0 -html" format="html"/> - <when input="out_format" value="2" format="txt"/> - <when input="out_format" value="2 -html" format="html"/> - <when input="out_format" value="4" format="txt"/> - <when input="out_format" value="4 -html" format="html"/> - <when input="out_format" value="5" format="blastxml"/> - </change_format> - </data> - </outputs> - <help> - -.. class:: warningmark - -**Note**. Database searches may take a substantial amount of time. -For large input datasets it is advisable to allow overnight processing. - ------ - -**What it does** - -Search a *protein domain database* using a *nucleotide query*, -using the NCBI BLAST+ rpstblastn command line tool. 
- -The protein domain databases use position-specific scoring matrices -(PSSMs) and are available for a number of domain collections including: - -*CDD* - NCBI curated meta-collection of domains, see -http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#NCBI_curated_domains - -*Kog* - PSSMs from automatically aligned sequences and sequence -fragments classified in the KOGs resource, the eukaryotic -counterpart to COGs, see http://www.ncbi.nlm.nih.gov/COG/new/ - -*Cog* - PSSMs from automatically aligned sequences and sequence -fragments classified in the COGs resource, which focuses primarily -on prokaryotes, see http://www.ncbi.nlm.nih.gov/COG/new/ - -*Pfam* - PSSMs from Pfam-A seed alignment database, see -http://pfam.sanger.ac.uk/ - -*Smart* - PSSMs from SMART domain alignment database, see -http://smart.embl-heidelberg.de/ - -*Tigr* - PSSMs from TIGRFAM database of protein families, see -http://www.jcvi.org/cms/research/projects/tigrfams/overview/ - -*Prk* - PSSMs from automatically aligned stable clusters in the -Protein Clusters database, see -http://www.ncbi.nlm.nih.gov/proteinclusters?cmd=search&db=proteinclusters - -The exact list of domain databases offered will depend on how your -local Galaxy has been configured. - ------ - -**Output format** - -Because Galaxy focuses on processing tabular data, the default output of this -tool is tabular.
The standard BLAST+ tabular output contains 12 columns: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The BLAST+ tools can optionally output additional columns of information, -but this takes longer to calculate. Most (but not all) of these columns are -included by selecting the extended tabular output. The extra columns are -included *after* the standard 12 columns. This is so that you can write -workflow filtering steps that accept either the 12 or 24 column tabular -BLAST output. Galaxy now uses this extended 24 column output by default. 
- -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= =========================================== - -The third option is BLAST XML output, which is designed to be parsed by -another program, and is understood by some Galaxy tools. - -You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). -The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. -The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. -The two query anchored outputs show a multiple sequence alignment between the query and all the matches, -and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). - -------- - -**References** - -Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W327-31. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
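Each wrapper's `stdio` block treats any non-zero exit code as a failure (ranges `1:` and `:-1`) and additionally looks for `Error:` or `Exception:` in the tool output in case the return code was not set. The intent can be illustrated with a small Python sketch. The function `job_failed` is hypothetical, and scanning only stderr follows the wrapper's own comment; which streams Galaxy actually inspects is not stated in this changeset.

```python
import re

# Patterns copied from the <regex match="..."/> entries in the wrappers.
ERROR_PATTERNS = [re.compile("Error:"), re.compile("Exception:")]


def job_failed(exit_code, stderr_text):
    """Return True if the run should be treated as failed (sketch only)."""
    # Exit code ranges "1:" and ":-1" together cover every non-zero value.
    if exit_code != 0:
        return True
    # Fallback for tools that report problems but still exit with zero.
    return any(pattern.search(stderr_text) for pattern in ERROR_PATTERNS)
```

The patterns are case sensitive in this sketch, so a lowercase "error" inside a longer message would not by itself mark the job as failed.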
--- a/tools/ncbi_blast_plus/ncbi_tblastn_wrapper.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,340 +0,0 @@ -<tool id="ncbi_tblastn_wrapper" name="NCBI BLAST+ tblastn" version="0.0.20"> - <description>Search translated nucleotide database with protein query sequence(s)</description> - <!-- If job splitting is enabled, break up the query file into parts --> - <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> - <requirements> - <requirement type="binary">tblastn</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>tblastn -version</version_command> - <command> -## The command is a Cheetah template which allows some Python based syntax. -## Lines starting hash hash are comments. Galaxy will turn newlines into spaces -tblastn --query "$query" -#if $db_opts.db_opts_selector == "db": - -db "${db_opts.database.fields.path}" -#elif $db_opts.db_opts_selector == "histdb": - -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" -#else: - -subject "$db_opts.subject" -#end if --evalue $evalue_cutoff --out "$output1" -##Set the extended list here so if/when we add things, saved workflows are not affected -#if str($out_format)=="ext": - -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" -#else: - -outfmt $out_format -#end if --num_threads 8 -#if $adv_opts.adv_opts_selector=="advanced": --db_gencode $adv_opts.db_gencode -$adv_opts.filter_query --matrix $adv_opts.matrix -## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string -## Note -max_target_seqs overrides -num_descriptions and -num_alignments -#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): --max_target_seqs $adv_opts.max_hits -#end if -#if (str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): 
--word_size $adv_opts.word_size -#end if -##Ungapped disabled for now - see comments below -##$adv_opts.ungapped -$adv_opts.parse_deflines -## End of advanced options: -#end if - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- In case the return code has not been set propery check stderr too --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <param name="query" type="data" format="fasta" label="Protein query sequence(s)"/> - <conditional name="db_opts"> - <param name="db_opts_selector" type="select" label="Subject database/sequences"> - <option value="db" selected="True">Locally installed BLAST database</option> - <option value="histdb">BLAST database from your history</option> - <option value="file">FASTA file from your history (see warning note below)</option> - </param> - <when value="db"> - <param name="database" type="select" label="Nucleotide BLAST database"> - <options from_file="blastdb.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="histdb"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="data" format="blastdbn" label="Nucleotide BLAST database" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="file"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="data" format="fasta" label="Nucleotide FASTA file to use as database"/> - </when> - </conditional> - <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> - <param name="out_format" type="select" label="Output format"> - <option value="6">Tabular (standard 12 columns)</option> - <option value="ext" 
selected="True">Tabular (extended 24 columns)</option> - <option value="5">BLAST XML</option> - <option value="0">Pairwise text</option> - <option value="0 -html">Pairwise HTML</option> - <option value="2">Query-anchored text</option> - <option value="2 -html">Query-anchored HTML</option> - <option value="4">Flat query-anchored text</option> - <option value="4 -html">Flat query-anchored HTML</option> - <!-- - <option value="-outfmt 11">BLAST archive format (ASN.1)</option> - --> - </param> - <conditional name="adv_opts"> - <param name="adv_opts_selector" type="select" label="Advanced Options"> - <option value="basic" selected="True">Hide Advanced Options</option> - <option value="advanced">Show Advanced Options</option> - </param> - <when value="basic" /> - <when value="advanced"> - <param name="db_gencode" type="select" label="Database/subject genetic code"> - <!-- See http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details --> - <option value="1" select="True">1. Standard</option> - <option value="2">2. Vertebrate Mitochondrial</option> - <option value="3">3. Yeast Mitochondrial</option> - <option value="4">4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code</option> - <option value="5">5. Invertebrate Mitochondrial</option> - <option value="6">6. Ciliate, Dasycladacean and Hexamita Nuclear Code</option> - <option value="9">9. Echinoderm Mitochondrial</option> - <option value="10">10. Euplotid Nuclear</option> - <option value="11">11. Bacteria and Archaea</option> - <option value="12">12. Alternative Yeast Nuclear</option> - <option value="13">13. Ascidian Mitochondrial</option> - <option value="14">14. Flatworm Mitochondrial</option> - <option value="15">15. Blepharisma Macronuclear</option> - <option value="16">16. Chlorophycean Mitochondrial Code</option> - <option value="21">21. Trematode Mitochondrial Code</option> - <option value="22">22. 
Scenedesmus obliquus mitochondrial Code</option> - <option value="23">23. Thraustochytrium Mitochondrial Code</option> - <option value="24">24. Pterobranchia mitochondrial code</option> - </param> - <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> - <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="true" /> - <param name="matrix" type="select" label="Scoring matrix"> - <option value="BLOSUM90">BLOSUM90</option> - <option value="BLOSUM80">BLOSUM80</option> - <option value="BLOSUM62" selected="true">BLOSUM62 (default)</option> - <option value="BLOSUM50">BLOSUM50</option> - <option value="BLOSUM45">BLOSUM45</option> - <option value="PAM250">PAM250</option> - <option value="PAM70">PAM70</option> - <option value="PAM30">PAM30</option> - </param> - <!-- Why doesn't optional override a validator? I want to accept an empty string OR a non-negative integer --> - <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> - <validator type="in_range" min="0" /> - </param> - <!-- I'd like word_size to be optional, with minimum 2 for blastp --> - <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> - <validator type="in_range" min="0" /> - </param> - <!-- - Can't use '-ungapped' on its own, error back is: - Composition-adjusted searched are not supported with an ungapped search, please add -comp_based_stats F or do a gapped search - Tried using '-ungapped -comp_based_stats F' and tblastn crashed with 'Attempt to access NULL pointer.' - <param name="ungapped" type="boolean" label="Perform ungapped alignment only?" truevalue="-ungapped -comp_based_stats F" falsevalue="" checked="false" /> - --> - <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" 
truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> - </when> - </conditional> - </inputs> - <outputs> - <data name="output1" format="tabular" label="tblastn on ${on_string}"> - <change_format> - <when input="out_format" value="0" format="txt"/> - <when input="out_format" value="0 -html" format="html"/> - <when input="out_format" value="2" format="txt"/> - <when input="out_format" value="2 -html" format="html"/> - <when input="out_format" value="4" format="txt"/> - <when input="out_format" value="4 -html" format="html"/> - <when input="out_format" value="5" format="blastxml"/> - </change_format> - </data> - </outputs> - <tests> - <test> - <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-10" /> - <param name="out_format" value="5" /> - <param name="adv_opts_selector" value="advanced" /> - <param name="filter_query" value="false" /> - <param name="matrix" value="BLOSUM80" /> - <param name="max_hits" value="0" /> - <param name="word_size" value="0" /> - <param name="parse_deflines" value="false" /> - <output name="output1" file="tblastn_four_human_vs_rhodopsin.xml" ftype="blastxml" /> - </test> - <test> - <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-10" /> - <param name="out_format" value="ext" /> - <param name="adv_opts_selector" value="advanced" /> - <param name="filter_query" value="false" /> - <param name="matrix" value="BLOSUM80" /> - <param name="max_hits" value="0" /> - <param name="word_size" value="0" /> - <param name="parse_deflines" value="false" /> - 
<output name="output1" file="tblastn_four_human_vs_rhodopsin_ext.tabular" ftype="tabular" /> - </test> - <test> - <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-10" /> - <param name="out_format" value="6" /> - <param name="adv_opts_selector" value="advanced" /> - <param name="filter_query" value="false" /> - <param name="matrix" value="BLOSUM80" /> - <param name="max_hits" value="0" /> - <param name="word_size" value="0" /> - <param name="parse_deflines" value="false" /> - <output name="output1" file="tblastn_four_human_vs_rhodopsin.tabular" ftype="tabular" /> - </test> - <test> - <!-- Same as above, but parse deflines - on BLAST 2.2.25+ - 2.2.27+ makes no difference --> - <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-10" /> - <param name="out_format" value="6" /> - <param name="adv_opts_selector" value="advanced" /> - <param name="filter_query" value="false" /> - <param name="matrix" value="BLOSUM80" /> - <param name="max_hits" value="0" /> - <param name="word_size" value="0" /> - <param name="parse_deflines" value="true" /> - <output name="output1" file="tblastn_four_human_vs_rhodopsin.tabular" ftype="tabular" /> - </test> - <test> - <param name="query" value="four_human_proteins.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-10" /> - <param name="out_format" value="0 -html" /> - <param name="adv_opts_selector" value="advanced" /> - <param name="filter_query" 
value="false" /> - <param name="matrix" value="BLOSUM80" /> - <param name="max_hits" value="0" /> - <param name="word_size" value="0" /> - <param name="parse_deflines" value="false" /> - <output name="output1" file="tblastn_four_human_vs_rhodopsin.html" ftype="html" /> - </test> - </tests> - <help> - -.. class:: warningmark - -**Note**. Database searches may take a substantial amount of time. -For large input datasets it is advisable to allow overnight processing. - ------ - -**What it does** - -Search a *translated nucleotide database* using a *protein query*, -using the NCBI BLAST+ tblastn command line tool. - -.. class:: warningmark - -You can also search against a FASTA file of subject nucleotide -sequences. This is *not* advised because it is slower (only one -CPU is used), but more importantly gives e-values for pairwise -searches (very small e-values which will look overly signficiant). -In most cases you should instead turn the other FASTA file into a -database first using *makeblastdb* and search against that. - ------ - -**Output format** - -Because Galaxy focuses on processing tabular data, the default output of this -tool is tabular. 
The standard BLAST+ tabular output contains 12 columns: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The BLAST+ tools can optionally output additional columns of information, -but this takes longer to calculate. Most (but not all) of these columns are -included by selecting the extended tabular output. The extra columns are -included *after* the standard 12 columns. This is so that you can write -workflow filtering steps that accept either the 12 or 24 column tabular -BLAST output. Galaxy now uses this extended 24 column output by default. 
- -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= =========================================== - -The third option is BLAST XML output, which is designed to be parsed by -another program, and is understood by some Galaxy tools. - -You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). -The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. -The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. -The two query anchored outputs show a multiple sequence alignment between the query and all the matches, -and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). - -------- - -**References** - -Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
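Reviewer note: the help text removed above documents the 12 standard columns and the 12 extra columns of the extended tabular output, and says the extras come *after* the standard ones so that filtering steps can accept either layout. The following is a minimal Python sketch of such a filter, not part of the changeset. The column names are taken from the tables above; the function names `parse_blast_tabular` and `filter_by_evalue` are hypothetical, and the skipping of `#` comment lines is an assumption.

```python
# Column names as listed in the wrapper's help text (NCBI names).
STD_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
EXT_COLUMNS = STD_COLUMNS + [
    "sallseqid", "score", "nident", "positive", "gaps", "ppos",
    "qframe", "sframe", "qseq", "sseq", "qlen", "slen",
]


def parse_blast_tabular(lines):
    """Yield one dict per hit from 12 or 24 column tab-separated lines.

    Hypothetical helper: values are kept as strings; blank lines and
    lines starting with '#' are skipped (an assumption, for safety).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == len(STD_COLUMNS):
            names = STD_COLUMNS
        elif len(fields) == len(EXT_COLUMNS):
            names = EXT_COLUMNS
        else:
            raise ValueError("Expected 12 or 24 columns, got %i" % len(fields))
        yield dict(zip(names, fields))


def filter_by_evalue(rows, max_evalue):
    """Keep hits whose E-value (column 11) is at most max_evalue."""
    return [row for row in rows if float(row["evalue"]) <= max_evalue]
```

Because the first 12 names are shared, a filter written against `evalue` or `bitscore` works unchanged on either layout.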
--- a/tools/ncbi_blast_plus/ncbi_tblastx_wrapper.xml Wed May 29 10:03:48 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,294 +0,0 @@ -<tool id="ncbi_tblastx_wrapper" name="NCBI BLAST+ tblastx" version="0.0.20"> - <description>Search translated nucleotide database with translated nucleotide query sequence(s)</description> - <!-- If job splitting is enabled, break up the query file into parts --> - <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" shared_inputs="subject,histdb" merge_outputs="output1"></parallelism> - <requirements> - <requirement type="binary">tblastx</requirement> - <requirement type="package" version="2.2.26+">blast+</requirement> - </requirements> - <version_command>tblastx -version</version_command> - <command> -## The command is a Cheetah template which allows some Python based syntax. -## Lines starting hash hash are comments. Galaxy will turn newlines into spaces -tblastx --query "$query" -#if $db_opts.db_opts_selector == "db": - -db "${db_opts.database.fields.path}" -#elif $db_opts.db_opts_selector == "histdb": - -db "${os.path.join($db_opts.histdb.extra_files_path,'blastdb')}" -#else: - -subject "$db_opts.subject" -#end if --query_gencode $query_gencode --evalue $evalue_cutoff --out "$output1" -##Set the extended list here so if/when we add things, saved workflows are not affected -#if str($out_format)=="ext": - -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen" -#else: - -outfmt $out_format -#end if --num_threads 8 -#if $adv_opts.adv_opts_selector=="advanced": --db_gencode $adv_opts.db_gencode -$adv_opts.filter_query -$adv_opts.strand --matrix $adv_opts.matrix -## Need int(str(...)) because $adv_opts.max_hits is an InputValueWrapper object not a string -## Note -max_target_seqs overrides -num_descriptions and -num_alignments -#if (str($adv_opts.max_hits) and int(str($adv_opts.max_hits)) > 0): --max_target_seqs $adv_opts.max_hits -#end if -#if 
(str($adv_opts.word_size) and int(str($adv_opts.word_size)) > 0): --word_size $adv_opts.word_size -#end if -$adv_opts.parse_deflines -## End of advanced options: -#end if - </command> - <stdio> - <!-- Anything other than zero is an error --> - <exit_code range="1:" /> - <exit_code range=":-1" /> - <!-- In case the return code has not been set propery check stderr too --> - <regex match="Error:" /> - <regex match="Exception:" /> - </stdio> - <inputs> - <param name="query" type="data" format="fasta" label="Nucleotide query sequence(s)"/> - <conditional name="db_opts"> - <param name="db_opts_selector" type="select" label="Subject database/sequences"> - <option value="db" selected="True">Locally installed BLAST database</option> - <option value="histdb">BLAST database from your history</option> - <option value="file">FASTA file from your history (see warning note below)</option> - </param> - <when value="db"> - <param name="database" type="select" label="Nucleotide BLAST database"> - <options from_file="blastdb.loc"> - <column name="value" index="0"/> - <column name="name" index="1"/> - <column name="path" index="2"/> - </options> - </param> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="histdb"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="data" format="blastdbn" label="Nucleotide BLAST database" /> - <param name="subject" type="hidden" value="" /> - </when> - <when value="file"> - <param name="database" type="hidden" value="" /> - <param name="histdb" type="hidden" value="" /> - <param name="subject" type="data" format="fasta" label="Nucleotide FASTA file to use as database"/> - </when> - </conditional> - <param name="query_gencode" type="select" label="Query genetic code"> - <!-- See http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details --> - <option value="1" select="True">1. Standard</option> - <option value="2">2. 
Vertebrate Mitochondrial</option> - <option value="3">3. Yeast Mitochondrial</option> - <option value="4">4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code</option> - <option value="5">5. Invertebrate Mitochondrial</option> - <option value="6">6. Ciliate, Dasycladacean and Hexamita Nuclear Code</option> - <option value="9">9. Echinoderm Mitochondrial</option> - <option value="10">10. Euplotid Nuclear</option> - <option value="11">11. Bacteria and Archaea</option> - <option value="12">12. Alternative Yeast Nuclear</option> - <option value="13">13. Ascidian Mitochondrial</option> - <option value="14">14. Flatworm Mitochondrial</option> - <option value="15">15. Blepharisma Macronuclear</option> - <option value="16">16. Chlorophycean Mitochondrial Code</option> - <option value="21">21. Trematode Mitochondrial Code</option> - <option value="22">22. Scenedesmus obliquus mitochondrial Code</option> - <option value="23">23. Thraustochytrium Mitochondrial Code</option> - <option value="24">24. 
Pterobranchia mitochondrial code</option> - </param> - <param name="evalue_cutoff" type="float" size="15" value="0.001" label="Set expectation value cutoff" /> - <param name="out_format" type="select" label="Output format"> - <option value="6">Tabular (standard 12 columns)</option> - <option value="ext" selected="True">Tabular (extended 24 columns)</option> - <option value="5">BLAST XML</option> - <option value="0">Pairwise text</option> - <option value="0 -html">Pairwise HTML</option> - <option value="2">Query-anchored text</option> - <option value="2 -html">Query-anchored HTML</option> - <option value="4">Flat query-anchored text</option> - <option value="4 -html">Flat query-anchored HTML</option> - <!-- - <option value="-outfmt 11">BLAST archive format (ASN.1)</option> - --> - </param> - <conditional name="adv_opts"> - <param name="adv_opts_selector" type="select" label="Advanced Options"> - <option value="basic" selected="True">Hide Advanced Options</option> - <option value="advanced">Show Advanced Options</option> - </param> - <when value="basic" /> - <when value="advanced"> - <param name="db_gencode" type="select" label="Database/subject genetic code"> - <!-- See http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details --> - <option value="1" select="True">1. Standard</option> - <option value="2">2. Vertebrate Mitochondrial</option> - <option value="3">3. Yeast Mitochondrial</option> - <option value="4">4. Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code</option> - <option value="5">5. Invertebrate Mitochondrial</option> - <option value="6">6. Ciliate, Dasycladacean and Hexamita Nuclear Code</option> - <option value="9">9. Echinoderm Mitochondrial</option> - <option value="10">10. Euplotid Nuclear</option> - <option value="11">11. Bacteria and Archaea</option> - <option value="12">12. Alternative Yeast Nuclear</option> - <option value="13">13. Ascidian Mitochondrial</option> - <option value="14">14. 
Flatworm Mitochondrial</option> - <option value="15">15. Blepharisma Macronuclear</option> - <option value="16">16. Chlorophycean Mitochondrial Code</option> - <option value="21">21. Trematode Mitochondrial Code</option> - <option value="22">22. Scenedesmus obliquus mitochondrial Code</option> - <option value="23">23. Thraustochytrium Mitochondrial Code</option> - <option value="24">24. Pterobranchia mitochondrial code</option> - </param> - <!-- Could use a select (yes, no, other) where other allows setting 'window locut hicut' --> - <param name="filter_query" type="boolean" label="Filter out low complexity regions (with SEG)" truevalue="-seg yes" falsevalue="-seg no" checked="true" /> - <param name="strand" type="select" label="Query strand(s) to search against database/subject"> - <option value="-strand both">Both</option> - <option value="-strand plus">Plus (forward)</option> - <option value="-strand minus">Minus (reverse complement)</option> - </param> - <param name="matrix" type="select" label="Scoring matrix"> - <option value="BLOSUM90">BLOSUM90</option> - <option value="BLOSUM80">BLOSUM80</option> - <option value="BLOSUM62" selected="true">BLOSUM62 (default)</option> - <option value="BLOSUM50">BLOSUM50</option> - <option value="BLOSUM45">BLOSUM45</option> - <option value="PAM250">PAM250</option> - <option value="PAM70">PAM70</option> - <option value="PAM30">PAM30</option> - </param> - <!-- Why doesn't optional override a validator? 
I want to accept an empty string OR a non-negative integer --> - <param name="max_hits" type="integer" value="0" label="Maximum hits to show" help="Use zero for default limits"> - <validator type="in_range" min="0" /> - </param> - <!-- I'd like word_size to be optional, with minimum 2 for tblastx --> - <param name="word_size" type="integer" value="0" label="Word size for wordfinder algorithm" help="Use zero for default, otherwise minimum 2."> - <validator type="in_range" min="0" /> - </param> - <param name="parse_deflines" type="boolean" label="Should the query and subject defline(s) be parsed?" truevalue="-parse_deflines" falsevalue="" checked="false" help="This affects the formatting of the query/subject ID strings"/> - </when> - </conditional> - </inputs> - <outputs> - <data name="output1" format="tabular" label="tblastx on ${on_string}"> - <change_format> - <when input="out_format" value="0" format="txt"/> - <when input="out_format" value="0 -html" format="html"/> - <when input="out_format" value="2" format="txt"/> - <when input="out_format" value="2 -html" format="html"/> - <when input="out_format" value="4" format="txt"/> - <when input="out_format" value="4 -html" format="html"/> - <when input="out_format" value="5" format="blastxml"/> - </change_format> - </data> - </outputs> - <tests> - <test> - <param name="query" value="rhodopsin_nucs.fasta" ftype="fasta" /> - <param name="db_opts_selector" value="file" /> - <param name="subject" value="three_human_mRNA.fasta" ftype="fasta" /> - <param name="database" value="" /> - <param name="evalue_cutoff" value="1e-40" /> - <param name="out_format" value="6" /> - <param name="adv_opts_selector" value="basic" /> - <output name="output1" file="tblastx_rhodopsin_vs_three_human.tabular" ftype="tabular" /> - </test> - </tests> - <help> - -.. class:: warningmark - -**Note**. Database searches may take a substantial amount of time. -For large input datasets it is advisable to allow overnight processing. 
- ------ - -**What it does** - -Search a *translated nucleotide database* using a *protein query*, -using the NCBI BLAST+ tblastx command line tool. - -.. class:: warningmark - -You can also search against a FASTA file of subject nucleotide -sequences. This is *not* advised because it is slower (only one -CPU is used), but more importantly gives e-values for pairwise -searches (very small e-values which will look overly signficiant). -In most cases you should instead turn the other FASTA file into a -database first using *makeblastdb* and search against that. - ------ - -**Output format** - -Because Galaxy focuses on processing tabular data, the default output of this -tool is tabular. The standard BLAST+ tabular output contains 12 columns: - -====== ========= ============================================ -Column NCBI name Description ------- --------- -------------------------------------------- - 1 qseqid Query Seq-id (ID of your sequence) - 2 sseqid Subject Seq-id (ID of the database hit) - 3 pident Percentage of identical matches - 4 length Alignment length - 5 mismatch Number of mismatches - 6 gapopen Number of gap openings - 7 qstart Start of alignment in query - 8 qend End of alignment in query - 9 sstart Start of alignment in subject (database hit) - 10 send End of alignment in subject (database hit) - 11 evalue Expectation value (E-value) - 12 bitscore Bit score -====== ========= ============================================ - -The BLAST+ tools can optionally output additional columns of information, -but this takes longer to calculate. Most (but not all) of these columns are -included by selecting the extended tabular output. The extra columns are -included *after* the standard 12 columns. This is so that you can write -workflow filtering steps that accept either the 12 or 24 column tabular -BLAST output. Galaxy now uses this extended 24 column output by default. 
- -====== ============= =========================================== -Column NCBI name Description ------- ------------- ------------------------------------------- - 13 sallseqid All subject Seq-id(s), separated by a ';' - 14 score Raw score - 15 nident Number of identical matches - 16 positive Number of positive-scoring matches - 17 gaps Total number of gaps - 18 ppos Percentage of positive-scoring matches - 19 qframe Query frame - 20 sframe Subject frame - 21 qseq Aligned part of query sequence - 22 sseq Aligned part of subject sequence - 23 qlen Query sequence length - 24 slen Subject sequence length -====== ============= =========================================== - -The third option is BLAST XML output, which is designed to be parsed by -another program, and is understood by some Galaxy tools. - -You can also choose several plain text or HTML output formats which are designed to be read by a person (not by another program). -The HTML versions use basic webpage formatting and can include links to the hits on the NCBI website. -The pairwise output (the default on the NCBI BLAST website) shows each match as a pairwise alignment with the query. -The two query anchored outputs show a multiple sequence alignment between the query and all the matches, -and differ in how insertions are shown (marked as insertions or with gap characters added to the other sequences). - -------- - -**References** - -Altschul et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997. Nucleic Acids Res. 25:3389-3402. - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus - </help> -</tool>
--- a/tools/ncbi_blast_plus/repository_dependencies.xml Wed May 29 10:03:48 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,5 +0,0 @@
-<?xml version="1.0"?>
-<repositories description="This requires the BLAST datatype definitions (e.g. the BLAST XML format).">
-<!-- Revision 4:f9a7783ed7b6 on the main (and test) tool shed is v0.0.14 which added BLAST databases -->
-<repository changeset_revision="f9a7783ed7b6" name="blast_datatypes" owner="devteam" toolshed="http://testtoolshed.g2.bx.psu.edu" />
-</repositories>
--- a/tools/ncbi_blast_plus/tool_dependencies.xml Wed May 29 10:03:48 2013 -0400
+++ /dev/null Thu Jan 01 00:00:00 1970 +0000
@@ -1,20 +0,0 @@
-<?xml version="1.0"?>
-<tool_dependency>
-    <package name="blast+" version="2.2.26+">
-        <install version="1.0">
-            <actions>
-                <action type="download_by_url">ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.26/ncbi-blast-2.2.26+-src.tar.gz</action>
-                <action type="shell_command">cd c++ && ./configure --prefix=$INSTALL_DIR && make && make install</action>
-                <action type="set_environment">
-                    <environment_variable name="PATH" action="prepend_to">$INSTALL_DIR/bin</environment_variable>
-                </action>
-            </actions>
-        </install>
-        <readme>
-Downloads and compiles BLAST+ from the NCBI, which assumes you have
-all the required build dependencies installed. See:
-http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
-        </readme>
-    </package>
-</tool_dependency>
-