# HG changeset patch # User peterjc # Date 1431510943 14400 # Node ID efeff76741b85421367b896a1f15f2181654764d # Parent 4abe8d59a438648095ffe3687c7beb6338da2679 planemo upload for repository https://github.com/peterjc/pico_galaxy/tools/mira4_assembler/ commit ed415dd625233aa2c2880c14ee6ac4b73aa57a93-dirty diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/README.rst --- a/tools/mira4/README.rst Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,165 +0,0 @@ -Galaxy wrapper for the MIRA assembly program (v4.0) -=================================================== - -This tool is copyright 2011-2014 by Peter Cock, The James Hutton Institute -(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. -See the licence text below (MIT licence). - -This tool is a short Python script (to collect the MIRA output and move it -to where Galaxy expects the files) and associated Galaxy wrapper XML file. - -It is available from the Galaxy Tool Shed at: -http://toolshed.g2.bx.psu.edu/view/peterjc/mira4_assembler - -It uses a Galaxy datatype definition 'mira' for the MIRA Assembly Format, -http://toolshed.g2.bx.psu.edu/view/peterjc/mira_datatypes - -A separate wrapper for MIRA v3.4 is available from the Galaxy Tool Shed at: -http://toolshed.g2.bx.psu.edu/view/peterjc/mira_assembler - -Automated Installation -====================== - -This should be straightforward. Via the Tool Shed, Galaxy should automatically -install the 'mira' datatype, samtools, and download and install the precompiled -binary for MIRA v4.0.2 for the Galaxy wrapper, and run any tests. - -For MIRA 4, the Galaxy wrapper has been split in two, allowing separate -cluster settings for de novo usage (high RAM) and mapping (lower RAM). -Consult the Galaxy adminstration documentation for your cluster setup. - -WARNING: For larger tasks, be aware that MIRA can require vast amounts -of RAM and run-times of over a week are possible. This tool wrapper makes -no attempt to spot and reject such large jobs. - - -Manual Installation -=================== - -First install the 'mira' datatype for Galaxy, available here: - -* http://toolshed.g2.bx.psu.edu/view/peterjc/mira_datatypes - -There are four Galaxy files to install: - -* ``mira4_de_novo.xml`` (the Galaxy tool definition for de novo usage) -* ``mira4_mapping.xml`` (the Galaxy tool definition for mapping usage) -* ``mira4_convert.xml`` (the Galaxy tool definition for converting MIRA files) -* ``mira4_bait.xml`` (the Galaxy tool definition for mirabait) -* ``mira4.py`` (the Python wrapper script) -* ``mira4_convert.py`` (the Python wrapper script for miraconvert) -* ``mira4_bait.py`` (the Python wrapper script for mirabait) -* ``mira4_validator.py`` (the XML parameter validation script) - -The suggested location is a new ``tools/mira4`` folder. You will also need to -modify the ``tools_conf.xml`` file to tell Galaxy to offer the tool, and also do -this to ``tools_conf.xml.sample`` in order to run the tests:: - - - - -You will also need to install MIRA, we used version 4.0.2, and define the -environment variable ``$MIRA4`` pointing at the folder containing the binaries. -See: - -* http://chevreux.org/projects_mira.html -* http://sourceforge.net/projects/mira-assembler/ - -You may wish to use different cluster setups for the de novo and mapping -tools, see above. - -You will also need to install samtools (for generating a BAM file from MIRA's -SAM output). - -After copying (or symlinking) the ``test-data`` files under Galaxy's ``test-data`` -folder, you can run the tests with:: - - $ ./run_functional_tests.sh -id mira_4_0_bait - $ ./run_functional_tests.sh -id mira_4_0_de_novo - $ ./run_functional_tests.sh -id mira_4_0_mapping - $ ./run_functional_tests.sh -id mira_4_0_convert - - -History -======= - -======= ====================================================================== -Version Changes -------- ---------------------------------------------------------------------- -v0.0.1 - Initial version (prototype for MIRA 4.0 RC4, based on wrapper for v3.4) -v0.0.2 - Include BAM output (using ``miraconvert`` and ``samtools``). - - Updated to target MIRA 4.0.1 - - Simplified XML to apply input format to output data. - - Sets temporary folder at run time to respect environment variables - (``$TMPDIR``, ``$TEMP``, or ``$TMP`` in that order). This was - previously hard coded as ``/tmp``. -v0.0.3 - Updated to target MIRA 4.0.2 -v0.0.4 - Using optparse for the Python wrapper script API - - Made MAF and BAM outputs optional - - Include wrapper for ``miraconvert`` -======= ====================================================================== - - -Developers -========== - -Development is on a dedicated GitHub repository: -https://github.com/peterjc/pico_galaxy/tree/master/tools/mira4 - -For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use -the following command from the Galaxy root folder:: - - $ tar -czf mira4_wrapper.tar.gz tools/mira4/README.rst tools/mira4/mira4_de_novo.xml tools/mira4/mira4_mapping.xml tools/mira4/mira4_bait.xml tools/mira4/mira4_convert.xml tools/mira4/mira4.py tools/mira4/mira4_make_bam.py tools/mira4/mira4_validator.py tools/mira4/mira4_convert.py tools/mira4/mira4_bait.py tools/mira4/tool_dependencies.xml tools/mira4/repository_dependencies.xml test-data/U13small_m.fastq test-data/U13small_m.mira4_de_novo.fasta test-data/tvc_mini.fastq test-data/tvc_contigs.fasta test-data/tvc_map_ref_strain.fasta test-data/tvc_map_same_strain.fasta test-data/tvc_bait.fasta test-data/tvc_mini_bait_pos.fastq test-data/tvc_mini_bait_strict.fastq test-data/tvc_mini_bait_neg.fastq test-data/ecoli.fastq test-data/ecoli.mira4_de_novo.fasta test-data/header.mira test-data/empty_file.dat - -Check this worked:: - - $ tar -tzf mira4_wrapper.tar.gz - tools/mira4/README.rst - tools/mira4/mira4_de_novo.xml - tools/mira4/mira4_mapping.xml - tools/mira4/mira4_bait.xml - tools/mira4/mira4_convert.xml - tools/mira4/mira4.py - tools/mira4/mira4_make_bam.py - tools/mira4/mira4_validator.py - tools/mira4/mira4_convert.py - tools/mira4/mira4_bait.py - tools/mira4/tool_dependencies.xml - tools/mira4/repository_dependencies.xml - test-data/U13small_m.fastq - test-data/U13small_m.mira4_de_novo.fasta - test-data/tvc_mini.fastq - test-data/tvc_contigs.fasta - test-data/tvc_map_ref_strain.fasta - test-data/tvc_map_same_strain.fasta - test-data/tvc_bait.fasta - test-data/tvc_mini_bait_pos.fastq - test-data/tvc_mini_bait_strict.fastq - test-data/tvc_mini_bait_neg.fastq - test-data/ecoli.fastq - test-data/ecoli.mira4_de_novo.fasta - test-data/header.mira - test-data/empty_file.dat - - - -Licence (MIT) -============= - -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: - -The above copyright notice and this permission notice shall be included in -all copies or substantial portions of the Software. - -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN -THE SOFTWARE. diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4.py --- a/tools/mira4/mira4.py Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,313 +0,0 @@ -#!/usr/bin/env python -"""A simple wrapper script to call MIRA and collect its output. -""" -import os -import sys -import subprocess -import shutil -import time -import tempfile -from optparse import OptionParser - -#Do we need any PYTHONPATH magic? -from mira4_make_bam import make_bam - -WRAPPER_VER = "0.0.4" #Keep in sync with the XML file - -def stop_err(msg, err=1): - sys.stderr.write(msg+"\n") - sys.exit(err) - - -def get_version(mira_binary): - """Run MIRA to find its version number""" - # At the commend line I would use: mira -v | head -n 1 - # however there is some pipe error when doing that here. - cmd = [mira_binary, "-v"] - try: - child = subprocess.Popen(cmd, - stdout=subprocess.PIPE, - stderr=subprocess.STDOUT) - except Exception, err: - sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (" ".join(cmd), err)) - sys.exit(1) - ver, tmp = child.communicate() - del child - return ver.split("\n", 1)[0].strip() - -#Parse Command Line -usage = """Galaxy MIRA4 wrapper script v%s - use as follows: - -$ python mira4.py ... - -This will run the MIRA binary and collect its output files as directed. -""" % WRAPPER_VER -parser = OptionParser(usage=usage) -parser.add_option("-m", "--manifest", dest="manifest", - default=None, metavar="FILE", - help="MIRA manifest filename") -parser.add_option("--maf", dest="maf", - default="-", metavar="FILE", - help="MIRA MAF output filename") -parser.add_option("--bam", dest="bam", - default="-", metavar="FILE", - help="Unpadded BAM output filename") -parser.add_option("--fasta", dest="fasta", - default="-", metavar="FILE", - help="Unpadded FASTA output filename") -parser.add_option("--log", dest="log", - default="-", metavar="FILE", - help="MIRA logging output filename") -parser.add_option("-v", "--version", dest="version", - default=False, action="store_true", - help="Show version and quit") -options, args = parser.parse_args() -manifest = options.manifest -out_maf = options.maf -out_bam = options.bam -out_fasta = options.fasta -out_log = options.log - -try: - mira_path = os.environ["MIRA4"] -except KeyError: - stop_err("Environment variable $MIRA4 not set") -mira_binary = os.path.join(mira_path, "mira") -if not os.path.isfile(mira_binary): - stop_err("Missing mira under $MIRA4, %r\nFolder contained: %s" - % (mira_binary, ", ".join(os.listdir(mira_path)))) -mira_convert = os.path.join(mira_path, "miraconvert") -if not os.path.isfile(mira_convert): - stop_err("Missing miraconvert under $MIRA4, %r\nFolder contained: %s" - % (mira_convert, ", ".join(os.listdir(mira_path)))) - -mira_ver = get_version(mira_binary) -if not mira_ver.strip().startswith("4.0"): - stop_err("This wrapper is for MIRA V4.0, not:\n%s\n%s" % (mira_ver, mira_binary)) -mira_convert_ver = get_version(mira_convert) -if not mira_convert_ver.strip().startswith("4.0"): - stop_err("This wrapper is for MIRA V4.0, not:\n%s\n%s" % (mira_ver, mira_convert)) -if options.version: - print "%s, MIRA wrapper version %s" % (mira_ver, WRAPPER_VER) - if mira_ver != mira_convert_ver: - print "WARNING: miraconvert %s" % mira_convert_ver - sys.exit(0) - -if not manifest: - stop_err("Manifest is required") -elif not os.path.isfile(manifest): - stop_err("Missing input MIRA manifest file: %r" % manifest) - - -try: - threads = int(os.environ.get("GALAXY_SLOTS", "1")) -except ValueError: - threads = 1 -assert 1 <= threads, threads - - -def override_temp(manifest): - """Override ``-DI:trt=/tmp`` in manifest with environment variable. - - Currently MIRA 4 does not allow envronment variables like ``$TMP`` - inside the manifest, which is a problem if you need to override - the default at run time. - - The tool XML will ``/tmp`` and we replace that here with - ``tempfile.gettempdir()`` which will respect $TMPDIR, $TEMP, $TMP - as explained in the Python standard library documentation: - http://docs.python.org/2/library/tempfile.html#tempfile.tempdir - - By default MIRA 4 would write its temporary files within the output - folder, which is a problem if that is a network drive. - """ - handle = open(manifest, "r") - text = handle.read() - handle.close() - - #At time of writing, this is at the end of a file, - #but could be followed by a space in future... - text = text.replace("-DI:trt=/tmp", "-DI:trt=" + tempfile.gettempdir()) - - #Want to try to ensure this gets written to disk before MIRA attempts - #to open it - any networked file system may impose a delay... - handle = open(manifest, "w") - handle.write(text) - handle.flush() - os.fsync(handle.fileno()) - handle.close() - - -def log_manifest(manifest): - """Write the manifest file to stderr.""" - sys.stderr.write("\n%s\nManifest file\n%s\n" % ("="*60, "="*60)) - with open(manifest) as h: - for line in h: - sys.stderr.write(line) - sys.stderr.write("\n%s\nEnd of manifest\n%s\n" % ("="*60, "="*60)) - - -def collect_output(temp, name, handle): - """Moves files to the output filenames (global variables).""" - n3 = (temp, name, name, name) - f = "%s/%s_assembly/%s_d_results" % (temp, name, name) - if not os.path.isdir(f): - log_manifest(manifest) - stop_err("Missing output folder") - if not os.listdir(f): - log_manifest(manifest) - stop_err("Empty output folder") - missing = [] - - old_maf = "%s/%s_out.maf" % (f, name) - if not os.path.isfile(old_maf): - #Triggered extractLargeContigs.sh? - old_maf = "%s/%s_LargeContigs_out.maf" % (f, name) - - #De novo or single strain mapping, - old_fasta = "%s/%s_out.unpadded.fasta" % (f, name) - ref_fasta = "%s/%s_out.padded.fasta" % (f, name) - if not os.path.isfile(old_fasta): - #Mapping (StrainX versus reference) or de novo - old_fasta = "%s/%s_out_StrainX.unpadded.fasta" % (f, name) - ref_fasta = "%s/%s_out_StrainX.padded.fasta" % (f, name) - if not os.path.isfile(old_fasta): - old_fasta = "%s/%s_out_ReferenceStrain.unpadded.fasta" % (f, name) - ref_fasta = "%s/%s_out_ReferenceStrain.padded.fasta" % (f, name) - - - missing = False - for old, new in [(old_maf, out_maf), - (old_fasta, out_fasta)]: - if not os.path.isfile(old): - missing = True - elif not new or new == "-": - handle.write("Ignoring %s\n" % old) - else: - handle.write("Capturing %s\n" % old) - shutil.move(old, new) - if missing: - log_manifest(manifest) - sys.stderr.write("Contents of %r:\n" % f) - for filename in sorted(os.listdir(f)): - sys.stderr.write("%s\n" % filename) - - #For mapping mode, probably most people would expect a BAM file - #using the reference FASTA file... - if out_bam and out_bam != "-": - if out_maf and out_maf != "-": - msg = make_bam(mira_convert, out_maf, ref_fasta, out_bam, handle) - else: - #Not collecting the MAF file, use original location - msg = make_bam(mira_convert, old_maf, ref_fasta, out_bam, handle) - if msg: - stop_err(msg) - -def clean_up(temp, name): - folder = "%s/%s_assembly" % (temp, name) - if os.path.isdir(folder): - shutil.rmtree(folder) - -#TODO - Run MIRA in /tmp or a configurable directory? -#Currently Galaxy puts us somewhere safe like: -#/opt/galaxy-dist/database/job_working_directory/846/ -temp = "." - -name = "MIRA" - -override_temp(manifest) - -start_time = time.time() -cmd_list = [mira_binary, "-t", str(threads), manifest] -cmd = " ".join(cmd_list) - -assert os.path.isdir(temp) -d = "%s_assembly" % name -#This can fail on my development machine if stale folders exist -#under Galaxy's .../database/job_working_directory/ tree: -assert not os.path.isdir(d), "Path %r already exists:\n%s" % (d, os.path.abspath(d)) -try: - #Check path access - os.mkdir(d) -except Exception, err: - log_manifest(manifest) - sys.stderr.write("Error making directory %s\n%s" % (d, err)) - sys.exit(1) - -#print os.path.abspath(".") -#print cmd - -if out_log and out_log != "-": - handle = open(out_log, "w") -else: - handle = open(os.devnull, "w") -handle.write("======================== MIRA manifest (instructions) ========================\n") -m = open(manifest, "rU") -for line in m: - handle.write(line) -m.close() -del m -handle.write("\n") -handle.write("============================ Starting MIRA now ===============================\n") -handle.flush() -try: - #Run MIRA - child = subprocess.Popen(cmd_list, - stdout=handle, - stderr=subprocess.STDOUT) -except Exception, err: - log_manifest(manifest) - sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) - #TODO - call clean up? - handle.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) - handle.close() - sys.exit(1) -#Use .communicate as can get deadlocks with .wait(), -stdout, stderr = child.communicate() -assert not stdout and not stderr #Should be empty as sent to handle -run_time = time.time() - start_time -return_code = child.returncode -handle.write("\n") -handle.write("============================ MIRA has finished ===============================\n") -handle.write("MIRA took %0.2f hours\n" % (run_time / 3600.0)) -if return_code: - print "MIRA took %0.2f hours" % (run_time / 3600.0) - handle.write("Return error code %i from command:\n" % return_code) - handle.write(cmd + "\n") - handle.close() - clean_up(temp, name) - log_manifest(manifest) - stop_err("Return error code %i from command:\n%s" % (return_code, cmd), - return_code) -handle.flush() - -if os.path.isfile("MIRA_assembly/MIRA_d_results/ec.log"): - handle.write("\n") - handle.write("====================== Extract Large Contigs failed ==========================\n") - e = open("MIRA_assembly/MIRA_d_results/ec.log", "rU") - for line in e: - handle.write(line) - e.close() - handle.write("============================ (end of ec.log) =================================\n") - handle.flush() - -#print "Collecting output..." -start_time = time.time() -collect_output(temp, name, handle) -collect_time = time.time() - start_time -handle.write("MIRA took %0.2f hours; collecting output %0.2f minutes\n" % (run_time / 3600.0, collect_time / 60.0)) -print("MIRA took %0.2f hours; collecting output %0.2f minutes\n" % (run_time / 3600.0, collect_time / 60.0)) - -if os.path.isfile("MIRA_assembly/MIRA_d_results/ec.log"): - #Treat as an error, but doing this AFTER collect_output - sys.stderr.write("Extract Large Contigs failed\n") - handle.write("Extract Large Contigs failed\n") - handle.close() - sys.exit(1) - -#print "Cleaning up..." -clean_up(temp, name) - -handle.write("\nDone\n") -handle.close() -print("Done") diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4_bait.py --- a/tools/mira4/mira4_bait.py Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,115 +0,0 @@ -#!/usr/bin/env python -"""A simple wrapper script to call MIRA4's mirabait and collect its output. -""" -import os -import sys -import subprocess -import shutil -import time - -WRAPPER_VER = "0.0.1" #Keep in sync with the XML file - -def stop_err(msg, err=1): - sys.stderr.write(msg+"\n") - sys.exit(err) - - -def get_version(mira_binary): - """Run MIRA to find its version number""" - # At the commend line I would use: mira -v | head -n 1 - # however there is some pipe error when doing that here. - cmd = [mira_binary, "-v"] - try: - child = subprocess.Popen(cmd, - stdout=subprocess.PIPE, - stderr=subprocess.STDOUT) - except Exception, err: - sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (" ".join(cmd), err)) - sys.exit(1) - ver, tmp = child.communicate() - del child - #Workaround for -v not working in mirabait 4.0RC4 - if "invalid option" in ver.split("\n", 1)[0]: - for line in ver.split("\n", 1): - if " version " in line: - line = line.split() - return line[line.index("version")+1].rstrip(")") - stop_err("Could not determine MIRA version:\n%s" % ver) - return ver.split("\n", 1)[0] - -try: - mira_path = os.environ["MIRA4"] -except KeyError: - stop_err("Environment variable $MIRA4 not set") -mira_binary = os.path.join(mira_path, "mirabait") -if not os.path.isfile(mira_binary): - stop_err("Missing mirabait under $MIRA4, %r\nFolder contained: %s" - % (mira_binary, ", ".join(os.listdir(mira_path)))) -mira_ver = get_version(mira_binary) -if not mira_ver.strip().startswith("4.0"): - stop_err("This wrapper is for MIRA V4.0, not:\n%s" % mira_ver) -if "-v" in sys.argv or "--version" in sys.argv: - print "%s, MIRA wrapper version %s" % (mira_ver, WRAPPER_VER) - sys.exit(0) - - -format, output_choice, strand_choice, kmer_length, min_occurance, bait_file, in_file, out_file = sys.argv[1:] - -if format.startswith("fastq"): - format = "fastq" -elif format == "mira": - format = "maf" -elif format != "fasta": - stop_err("Was not expected format %r" % format) - -assert out_file.endswith(".dat") -out_file_stem = out_file[:-4] - -cmd_list = [mira_binary, "-f", format, "-t", format, - "-k", kmer_length, "-n", min_occurance, - bait_file, in_file, out_file_stem] -if output_choice == "pos": - pass -elif output_choice == "neg": - #Invert the selection... - cmd_list.insert(1, "-i") -else: - stop_err("Output choice should be 'pos' or 'neg', not %r" % output_choice) -if strand_choice == "both": - pass -elif strand_choice == "fwd": - #Ingore reverse strand... - cmd_list.insert(1, "-r") -else: - stop_err("Strand choice should be 'both' or 'fwd', not %r" % strand_choice) - -cmd = " ".join(cmd_list) -#print cmd -start_time = time.time() -try: - #Run MIRA - child = subprocess.Popen(cmd_list, - stdout=subprocess.PIPE, - stderr=subprocess.STDOUT) -except Exception, err: - log_manifest(manifest) - sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) - sys.exit(1) -#Use .communicate as can get deadlocks with .wait(), -stdout, stderr = child.communicate() -assert stderr is None # Due to way we ran with subprocess -run_time = time.time() - start_time -return_code = child.returncode -print "mirabait took %0.2f minutes" % (run_time / 60.0) - -if return_code: - sys.stderr.write(stdout) - stop_err("Return error code %i from command:\n%s" % (return_code, cmd), - return_code) - -#Capture output -out_tmp = out_file_stem + "." + format -if not os.path.isfile(out_tmp): - sys.stderr.write(stdout) - stop_err("Missing output file from mirabait: %s" % out_tmp) -shutil.move(out_tmp, out_file) diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4_bait.xml --- a/tools/mira4/mira4_bait.xml Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,112 +0,0 @@ - - Filter reads using kmer matches - - mirabait - MIRA - - mira4_bait.py --version - -mira4_bait.py $input_reads.ext $output_choice $strand_choice $kmer_length $min_occurence "$bait_file" "$input_reads" "$output_reads" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -**What it does** - -Runs the ``mirabait`` utility from MIRA v4.0 to filter your input reads -according to whether or not they contain perfect kmer matches to your -bait file. By default this looks for 31-mers (kmers or *k*-mers where -the fragment length *k* is 31), and only requires a single matching kmer. - -The ``mirabait`` utility is useful in many applications and pipelines -outside of using the main MIRA tool for assembly or mapping. - -.. class:: warningmark - -Note ``mirabait`` cannot be used on protein (amino acid) sequences. - -**Example Usage** - -To remove over abundant entries like rRNA sequences, run ``mirabait`` with -known rRNA sequences as the bait and select the *negative* matches. - -To do targeted assembly by fishing out reads belonging to a gene and just -assemble these, run ``mirabait`` with the gene of interest as the bait and -select the *positive* matches. - -To iteratively reconstruct mitochondria you could start by fishing out reads -matching any known mitochondrial sequence, assembly those, and repeat. - - -**Notes on paired read** - -.. class:: warningmark - -While MIRA4 is aware of many read naming conventions to identify paired read -partners, the ``mirabait`` tool considers each read in isolation. Applying -it to paired read files may leave you with orphaned reads. - - -**Citation** - -If you use this Galaxy tool in work leading to a scientific publication please -cite the following papers: - -Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). -Galaxy tools and workflows for sequence analysis with applications -in molecular plant pathology. PeerJ 1:e167 -http://dx.doi.org/10.7717/peerj.167 - -Bastien Chevreux, Thomas Wetter and Sándor Suhai (1999). -Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. -Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56. -http://www.bioinfo.de/isb/gcb99/talks/chevreux/main.html - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/mira4_assembler - - diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4_convert.py --- a/tools/mira4/mira4_convert.py Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,225 +0,0 @@ -#!/usr/bin/env python -"""A simple wrapper script to call MIRA and collect its output. - -This focuses on the miraconvert binary. -""" -import os -import sys -import subprocess -import shutil -import time -import tempfile -from optparse import OptionParser -try: - from io import BytesIO -except ImportError: - #Should we worry about Python 2.5 or older? - from StringIO import StringIO as BytesIO - -#Do we need any PYTHONPATH magic? -from mira4_make_bam import depad - -WRAPPER_VER = "0.0.5" #Keep in sync with the XML file - -def stop_err(msg, err=1): - sys.stderr.write(msg+"\n") - sys.exit(err) - -def run(cmd): - #Avoid using shell=True when we call subprocess to ensure if the Python - #script is killed, so too is the child process. - try: - child = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) - except Exception, err: - stop_err("Error invoking command:\n%s\n\n%s\n" % (" ".join(cmd), err)) - #Use .communicate as can get deadlocks with .wait(), - stdout, stderr = child.communicate() - return_code = child.returncode - if return_code: - if stderr and stdout: - stop_err("Return code %i from command:\n%s\n\n%s\n\n%s" % (return_code, err, stdout, stderr)) - else: - stop_err("Return code %i from command:\n%s\n%s" % (return_code, err, stderr)) - -def get_version(mira_binary): - """Run MIRA to find its version number""" - # At the commend line I would use: mira -v | head -n 1 - # however there is some pipe error when doing that here. - cmd = [mira_binary, "-v"] - try: - child = subprocess.Popen(cmd, - stdout=subprocess.PIPE, - stderr=subprocess.STDOUT) - except Exception, err: - sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (" ".join(cmd), err)) - sys.exit(1) - ver, tmp = child.communicate() - del child - return ver.split("\n", 1)[0].strip() - -#Parse Command Line -usage = """Galaxy MIRA4 wrapper script v%s - use as follows: - -$ python mira4_convert.py ... - -This will run the MIRA miraconvert binary and collect its output files as directed. -""" % WRAPPER_VER -parser = OptionParser(usage=usage) -parser.add_option("--input", dest="input", - default=None, metavar="FILE", - help="MIRA input filename") -parser.add_option("-x", "--min_length", dest="min_length", - default="0", - help="Minimum contig length") -parser.add_option("-y", "--min_cover", dest="min_cover", - default="0", - help="Minimum average contig coverage") -parser.add_option("-z", "--min_reads", dest="min_reads", - default="0", - help="Minimum reads per contig") -parser.add_option("--maf", dest="maf", - default="", metavar="FILE", - help="MIRA MAF output filename") -parser.add_option("--ace", dest="ace", - default="", metavar="FILE", - help="ACE output filename") -parser.add_option("--bam", dest="bam", - default="", metavar="FILE", - help="Unpadded BAM output filename") -parser.add_option("--fasta", dest="fasta", - default="", metavar="FILE", - help="Unpadded FASTA output filename") -parser.add_option("--cstats", dest="cstats", - default="", metavar="FILE", - help="Contig statistics filename") -parser.add_option("-v", "--version", dest="version", - default=False, action="store_true", - help="Show version and quit") -options, args = parser.parse_args() -if args: - stop_err("Expected options (e.g. --input example.maf), not arguments") - -input_maf = options.input -out_maf = options.maf -out_bam = options.bam -out_fasta = options.fasta -out_ace = options.ace -out_cstats = options.cstats - -try: - mira_path = os.environ["MIRA4"] -except KeyError: - stop_err("Environment variable $MIRA4 not set") -mira_convert = os.path.join(mira_path, "miraconvert") -if not os.path.isfile(mira_convert): - stop_err("Missing miraconvert under $MIRA4, %r\nFolder contained: %s" - % (mira_convert, ", ".join(os.listdir(mira_path)))) - -mira_convert_ver = get_version(mira_convert) -if not mira_convert_ver.strip().startswith("4.0"): - stop_err("This wrapper is for MIRA V4.0, not:\n%s\n%s" % (mira_ver, mira_convert)) -if options.version: - print "%s, MIRA wrapper version %s" % (mira_convert_ver, WRAPPER_VER) - sys.exit(0) - -if not input_maf: - stop_err("Input MIRA file is required") -elif not os.path.isfile(input_maf): - stop_err("Missing input MIRA file: %r" % input_maf) - -if not (out_maf or out_bam or out_fasta or out_ace or out_cstats): - stop_err("No output requested") - - -def check_min_int(value, name): - try: - i = int(value) - except: - stop_err("Bad %s setting, %r" % (name, value)) - if i < 0: - stop_err("Negative %s setting, %r" % (name, value)) - return i - -min_length = check_min_int(options.min_length, "minimum length") -min_cover = check_min_int(options.min_cover, "minimum cover") -min_reads = check_min_int(options.min_reads, "minimum reads") - -#TODO - Run MIRA in /tmp or a configurable directory? -#Currently Galaxy puts us somewhere safe like: -#/opt/galaxy-dist/database/job_working_directory/846/ -temp = "." - - -cmd_list = [mira_convert] -if min_length: - cmd_list.extend(["-x", str(min_length)]) -if min_cover: - cmd_list.extend(["-y", str(min_cover)]) -if min_reads: - cmd_list.extend(["-z", str(min_reads)]) -cmd_list.extend(["-f", "maf", input_maf, os.path.join(temp, "converted")]) -if out_maf: - cmd_list.append("maf") -if out_bam: - cmd_list.append("samnbb") - if not out_fasta: - #Need this for samtools depad - out_fasta = os.path.join(temp, "depadded.fasta") -if out_fasta: - cmd_list.append("fasta") -if out_ace: - cmd_list.append("ace") -if out_cstats: - cmd_list.append("cstats") -run(cmd_list) - -def collect(old, new): - if not os.path.isfile(old): - stop_err("Missing expected output file %s" % old) - shutil.move(old, new) - -if out_maf: - collect(os.path.join(temp, "converted.maf"), out_maf) -if out_fasta: - #Can we look at the MAF file to see if there are multiple strains? - old = os.path.join(temp, "converted_AllStrains.unpadded.fasta") - if os.path.isfile(old): - collect(old, out_fasta) - else: - #Might the output be filtered down to zero contigs? - old = os.path.join(temp, "converted.fasta") - if not os.path.isfile(old): - stop_err("Missing expected output FASTA file") - elif os.path.getsize(old) == 0: - print("Warning - no contigs (harsh filters?)") - collect(old, out_fasta) - else: - stop_err("Missing expected output FASTA file (only generic file present)") -if out_ace: - collect(os.path.join(temp, "converted.maf"), out_ace) -if out_cstats: - collect(os.path.join(temp, "converted_info_contigstats.txt"), out_cstats) - -if out_bam: - assert os.path.isfile(out_fasta) - old = os.path.join(temp, "converted.samnbb") - if not os.path.isfile(old): - old = os.path.join(temp, "converted.sam") - if not os.path.isfile(old): - stop_err("Missing expected intermediate file %s" % old) - h = BytesIO() - msg = depad(out_fasta, old, out_bam, h) - if msg: - print(msg) - print(h.getvalue()) - h.close() - sys.exit(1) - h.close() - if out_fasta == os.path.join(temp, "depadded.fasta"): - #Not asked for by Galaxy, no longer needed - os.remove(out_fasta) - -if min_length or min_cover or min_reads: - print("Filtered.") -else: - print("Converted.") diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4_convert.xml --- a/tools/mira4/mira4_convert.xml Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,114 +0,0 @@ - - Convert MIRA assembly to FASTA/SAM/BAM - - miraconvert - MIRA - - mira4_convert.py --version - -mira4_convert.py ---input "$mira_file" ---min_length $min_length ---min_cover $min_cover ---min_reads $min_reads -#if str($maf_wanted)=="true": ---maf "$out_maf" -#end if -#if str($fasta_wanted)=="true": ---fasta "$out_fasta" -#end if -#if str($bam_wanted)=="true": ---bam "$out_bam" -#end if -##Don't yet have a Galaxy datatype defined for ace: -## #if str($ace_wanted)=="true": -## --ace "$out_ace" -## #end if -#if str($cstats_wanted)=="true": ---cstats "$out_cstats" -#end if - - - - - - - - - - - - - - - - - - - - - maf_wanted is True - - - fasta_wanted is True - - - bam_wanted is True - - - - cstats_wanted is True - - - - - - -**What it does** - -Runs the ``miraconvert`` utility from MIRA v4.0 to filter and/or convert -a MIRA Assembly Format file produced by a *mapping* or *de novo* assembly. - -**Example Usage** - -You want to remove all the low coverage contigs from a transcriptome -assembly to focus on those with higher coverage. - -You want to convert your MIRA assembly into SAM/BAM to run a standard -SNP finding tool. - -You've lost the FASTA consensus from your MIRA assembly and need to -regenerate it. - - -**Citation** - -If you use this Galaxy tool in work leading to a scientific publication please -cite the following papers: - -Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). -Galaxy tools and workflows for sequence analysis with applications -in molecular plant pathology. PeerJ 1:e167 -http://dx.doi.org/10.7717/peerj.167 - -Bastien Chevreux, Thomas Wetter and Sándor Suhai (1999). -Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. -Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56. -http://www.bioinfo.de/isb/gcb99/talks/chevreux/main.html - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/mira4_assembler - - diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4_de_novo.xml --- a/tools/mira4/mira4_de_novo.xml Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,263 +0,0 @@ - - Takes Sanger, Roche 454, Solexa/Illumina, Ion Torrent and PacBio reads - - mira - miraconvert - MIRA - samtools - samtools - - mira4.py --version - mira4.py ---manifest "$manifest" -#if str($maf_wanted)=="true": ---maf "$out_maf" -#end if -#if str($bam_wanted)=="true": ---bam "$out_bam" -#end if ---fasta "$out_fasta" ---log "$out_log" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - bam_wanted is True - - - maf_wanted is True - - - - - - -project = MIRA -job = denovo,${job_type},${job_quality} -parameters = -NW:cmrnl=no -DI:trt=/tmp -OUT:orc=no -## -GE:not is short for -GENERAL:number_of_threads and using one (1) -## can be useful for repeatability of assemblies and bug hunting. -## This is overriden by the command line -t switch which is easier -## to set from within Galaxy. -## -## -NW:cmrnl is short for -NAG_AND_WARN:check_maxreadnamelength -## and without this MIRA aborts with read names over 40 characters -## due to limitations of some downstream tools. -## -## -DI:trt is short for -DIRECTORY:tmp_redirected_to and should -## point to a local hard drive (not something like NFS on network). -## We replace /tmp with an environment variable via mira4.py -## -## -OUT:orc=no is short for -OUTPUT:output_result_caf=no -## which turns off an output file we don't want anyway. - -#for $rg in $read_group - -##This bar goes into the manifest as a comment line -#------------------------------------------------------------------------------ - -readgroup -technology = ${rg.technology} -##Record the segment placement (if any) -#if str($rg.segments.type) == "paired" -segment_placement = ${rg.segments.placement} -segment_naming = ${rg.segments.naming} -#if str($rg.segments.min_size) != "" or str($rg.segments.max_size) != "" -##If our min/max validation failed I trust MIRA to give an error message... -template_size = $rg.segments.min_size $rg.segments.max_size -#end if -#end if -##if str($rg.segments.type) == "none" -##MIRA4 manual says use segment_placement = unknown or ? for unpaired data -##but this stopped working in MIRA 4.0 RC5 and 4.0 (final). See: -##http://www.freelists.org/post/mira_talk/Unpaired-reads-and-segment-placement--or-unknown -##segment_placement = ? -##end if -##MIRA will accept multiple filenames on one data line, or multiple data lines -#for $f in $rg.filenames -##Must now map Galaxy datatypes to MIRA file types... -#if $f.ext.startswith("fastq") -##MIRA doesn't like fastqsanger etc, just plain old fastq: -data = fastq::$f -#elif $f.ext == "mira" -##We're calling *.maf the "mira" format in Galaxy (name space collision) -data = maf::$f -#else -##MIRA is happy with fasta as name, -data = ${f.ext}::$f -#end if -#end for -#end for - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -**What it does** - -Runs MIRA v4.0 in de novo mode, collects the output, generates a sorted BAM -file, and then throws away all the temporary files. - -MIRA is an open source assembly tool capable of handling sequence data from -a range of platforms (Sanger capillary, Solexa/Illumina, Roche 454, Ion Torrent -and also PacBio). - -It is particularly suited to small genomes such as bacteria. - - -**Notes on paired reads** - -.. class:: warningmark - -MIRA uses read naming conventions to identify paired read partners -(and does not care about their order in the input files). In most cases, -the Solexa/Illumina setting is fine. For Sanger capillary sequencing, -you may need to rename your reads to match one of the standard conventions -supported by MIRA. For Roche 454 or Ion Torrent the appropriate settings -depend on how the FASTQ file was produced: - -* If using Roche's ``sffinfo`` or older versions of ``sff_extract`` - to convert SFF files to FASTQ, your reads will probably have the - ``---> <---`` orientation and use the ``.f`` and ``.r`` - suffixes (FR naming). - -* If using a recent version of ``sff_extract``, then the ``/1`` and ``/2`` - suffixes are used (Solexa/Illumina style naming) and the original - ``2---> 1--->`` orientation is preserved. - -The reason for this is the raw data for Roche 454 and Ion Torrent paired-end -libraries sequences a circularised fragment such that the raw data begins -with the end of the fragment, a linker, then the start of the fragment. -This means both the start and end are sequenced from the same strand, and -have the orientation ``2---> 1--->``. However, in order to use the data -with traditional tools expecting Sanger capillary style ``---> <---`` -orientation it was common to reverse complement one of the pair to mimic this. - - -**Citation** - -If you use this Galaxy tool in work leading to a scientific publication please -cite the following papers: - -Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). -Galaxy tools and workflows for sequence analysis with applications -in molecular plant pathology. PeerJ 1:e167 -http://dx.doi.org/10.7717/peerj.167 - -Bastien Chevreux, Thomas Wetter and Sándor Suhai (1999). -Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. -Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56. -http://www.bioinfo.de/isb/gcb99/talks/chevreux/main.html - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/mira4_assembler - - diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4_make_bam.py --- a/tools/mira4/mira4_make_bam.py Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,92 +0,0 @@ -#!/usr/bin/env python -"""Wrapper script using miraconvert & samtools to get BAM from MIRA. -""" -import os -import sys -import shutil -import subprocess -import tempfile - -def stop_err(msg, err=1): - sys.stderr.write(msg+"\n") - sys.exit(err) - -def run(cmd, log_handle): - try: - child = subprocess.Popen(cmd, shell=True, - stdout=subprocess.PIPE, - stderr=subprocess.STDOUT) - except Exception, err: - sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) - #TODO - call clean up? - log_handle.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) - sys.exit(1) - #Use .communicate as can get deadlocks with .wait(), - stdout, stderr = child.communicate() - assert not stderr #Should be empty as sent to stdout - if len(stdout) > 10000: - #miraconvert can be very verbose (is holding stdout in RAM a problem?) - stdout = stdout.split("\n") - stdout = stdout[:10] + ["...", "", "..."] + stdout[-10:] - stdout = "\n".join(stdout) - log_handle.write(stdout) - return child.returncode - -def depad(fasta_file, sam_file, bam_file, log_handle): - log_handle.write("\n================= Converting MIRA assembly from SAM to BAM ===================\n") - #Also doing SAM to (uncompressed) BAM during depad - bam_stem = bam_file + ".tmp" # Have write permissions and want final file in this folder - cmd = 'samtools depad -S -u -T "%s" "%s" | samtools sort - "%s"' % (fasta_file, sam_file, bam_stem) - return_code = run(cmd, log_handle) - if return_code: - return "Error %i from command:\n%s" % (return_code, cmd) - if not os.path.isfile(bam_stem + ".bam"): - return "samtools depad or sort failed to produce BAM file" - - log_handle.write("\n====================== Indexing MIRA assembly BAM file =======================\n") - cmd = 'samtools index "%s.bam"' % bam_stem - return_code = run(cmd, log_handle) - if return_code: - return "Error %i from command:\n%s" % (return_code, cmd) - if not os.path.isfile(bam_stem + ".bam.bai"): - return "samtools indexing of BAM file failed to produce BAI file" - - shutil.move(bam_stem + ".bam", bam_file) - os.remove(bam_stem + ".bam.bai") #Let Galaxy handle that... - - -def make_bam(mira_convert, maf_file, fasta_file, bam_file, log_handle): - if not os.path.isfile(mira_convert): - return "Missing binary %r" % mira_convert - if not os.path.isfile(maf_file): - return "Missing input MIRA file: %r" % maf_file - if not os.path.isfile(fasta_file): - return "Missing padded FASTA file: %r" % fasta_file - - log_handle.write("\n====================== Converting MIRA assembly to SAM =======================\n") - tmp_dir = tempfile.mkdtemp() - sam_file = os.path.join(tmp_dir, "x.sam") - - # Note add nbb to the template name, possible MIRA 4.0 RC4 bug - cmd = '"%s" -f maf -t samnbb "%s" "%snbb"' % (mira_convert, maf_file, sam_file) - return_code = run(cmd, log_handle) - if return_code: - return "Error %i from command:\n%s" % (return_code, cmd) - if not os.path.isfile(sam_file): - return "Conversion from MIRA to SAM failed" - - #Also doing SAM to (uncompressed) BAM during depad - msg = depad(fasta_file, sam_file, bam_file, log_handle) - if msg: - return msg - - os.remove(sam_file) - os.rmdir(tmp_dir) - - return None #Good :) - -if __name__ == "__main__": - mira_convert, maf_file, fasta_file, bam_file = sys.argv[1:] - msg = make_bam(mira_convert, maf_file, fasta_file, bam_file, sys.stdout) - if msg: - stop_err(msg) diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4_mapping.xml --- a/tools/mira4/mira4_mapping.xml Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,267 +0,0 @@ - - Maps Sanger, Roche 454, Solexa/Illumina, Ion Torrent and PacBio reads - - mira - miraconvert - MIRA - samtools - samtools - - mira4.py --version - mira4.py ---manifest "$manifest" -#if str($maf_wanted) == "true": ---maf "$out_maf" -#end if -#if str($bam_wanted) == "true": ---bam "$out_bam" -#end if ---fasta "$out_fasta" ---log "$out_log" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - bam_wanted is True - - - maf_wanted is True - - - - - -project = MIRA -job = mapping,${job_type},${job_quality} -parameters = -NW:cmrnl=no -DI:trt=/tmp -OUT:orc=no -## -GE:not is short for -GENERAL:number_of_threads and using one (1) -## can be useful for repeatability of assemblies and bug hunting. -## This is overriden by the command line -t switch which is easier -## to set from within Galaxy. -## -## -NW:cmrnl is short for -NAG_AND_WARN:check_maxreadnamelength -## and without this MIRA aborts with read names over 40 characters -## due to limitations of some downstream tools. -## -## -DI:trt is short for -DIRECTORY:tmp_redirected_to and should -## point to a local hard drive (not something like NFS on network). -## We replace /tmp with an environment variable via mira4.py -## -## -OUT:orc=no is short for -OUTPUT:output_result_caf=no -## which turns off an output file we don't want anyway. - -##This bar goes into the manifest as a comment line -#------------------------------------------------------------------------------ - -readgroup -is_reference -#if str($strain_setup)=="same" -strain = StrainX -#end if -#for $f in $references -##Must now map Galaxy datatypes to MIRA file types... -#if $f.ext.startswith("fastq") -##MIRA doesn't like fastqsanger etc, just plain old fastq: -data = fastq::$f -#elif $f.ext == "mira" -##We're calling *.maf the "mira" format in Galaxy (name space collision) -data = maf::$f -#elif $f.ext == "fasta" -##We're calling MIRA with the file type as "fna" as otherwise it wants quals -data = fna::$f -#else -##Currently don't expect anything else... -data = ${f.ext}::$f -#end if -#end for -#for $rg in $read_group - -##This bar goes into the manifest as a comment line -#------------------------------------------------------------------------------ - -readgroup -technology = ${rg.technology} -#if str($strain_setup)=="same" -##This is perhaps redundant as MIRA defaults to StrainX for the reads: -strain = StrainX -#end if -##Record the segment placement (if any) -#if str($rg.segments.type) == "paired" -segment_placement = ${rg.segments.placement} -segment_naming = ${rg.segments.naming} -#end if -##if str($rg.segments.type) == "none" -##MIRA4 manual says use segment_placement = unknown or ? for unpaired data -##but this stopped working in MIRA 4.0 RC5 and 4.0 (final). See: -##http://www.freelists.org/post/mira_talk/Unpaired-reads-and-segment-placement--or-unknown -##segment_placement = ? -##end if -##MIRA will accept multiple filenames on one data line, or multiple data lines -#for $f in $rg.filenames -##Must now map Galaxy datatypes to MIRA file types... -#if $f.ext.startswith("fastq") -##MIRA doesn't like fastqsanger etc, just plain old fastq: -data = fastq::$f -#elif $f.ext == "mira" -##We're calling *.maf the "mira" format in Galaxy (name space collision) -data = maf::$f -#else -##Currently don't expect anything else... -data = ${f.ext}::$f -#end if -#end for -#end for - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -**What it does** - -Runs MIRA v4.0 in mapping mode, collects the output, generates a sorted BAM -file, and throws away all the temporary files. - -MIRA is an open source assembly tool capable of handling sequence data from -a range of platforms (Sanger capillary, Solexa/Illumina, Roche 454, Ion Torrent -and also PacBio). - -It is particularly suited to small genomes such as bacteria. - - -**Notes on paired reads** - -.. class:: warningmark - -MIRA uses read naming conventions to identify paired read partners -(and does not care about their order in the input files). In most cases, -the Solexa/Illumina setting is fine. For Sanger capillary sequencing, -you may need to rename your reads to match one of the standard conventions -supported by MIRA. For Roche 454 or Ion Torrent the appropriate settings -depend on how the FASTQ file was produced: - -* If using Roche's ``sffinfo`` or older versions of ``sff_extract`` - to convert SFF files to FASTQ, your reads will probably have the - ``---> <---`` orientation and use the ``.f`` and ``.r`` - suffixes (FR naming). - -* If using a recent version of ``sff_extract``, then the ``/1`` and ``/2`` - suffixes are used (Solexa/Illumina style naming) and the original - ``2---> 1--->`` orientation is preserved. - -The reason for this is the raw data for Roche 454 and Ion Torrent paired-end -libraries sequences a circularised fragment such that the raw data begins -with the end of the fragment, a linker, then the start of the fragment. -This means both the start and end are sequenced from the same strand, and -have the orientation ``2---> 1--->``. However, in order to use the data -with traditional tools expecting Sanger capillary style ``---> <---`` -orientation it was common to reverse complement one of the pair to mimic this. - - -**Citation** - -If you use this Galaxy tool in work leading to a scientific publication please -cite the following papers: - -Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). -Galaxy tools and workflows for sequence analysis with applications -in molecular plant pathology. PeerJ 1:e167 -http://dx.doi.org/10.7717/peerj.167 - -Bastien Chevreux, Thomas Wetter and Sándor Suhai (1999). -Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. -Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56. -http://www.bioinfo.de/isb/gcb99/talks/chevreux/main.html - -This wrapper is available to install into other Galaxy Instances via the Galaxy -Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/mira4_assembler - - diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/mira4_validator.py --- a/tools/mira4/mira4_validator.py Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,64 +0,0 @@ -#Called from the Galaxy Tool XML file -#import sys - -def validate_input(trans, error_map, param_values, page_param_map): - """Validates the min_size/max_size user input, before execution.""" - err_list = [] - for read_group in param_values["read_group"]: - err = dict() - segments = read_group["segments"] - if str(segments["type"]) != "paired": - err_list.append(dict()) - continue - - min_size = str(segments["min_size"]).strip() - max_size = str(segments["max_size"]).strip() - #sys.stderr.write("DEBUG min_size=%r, max_size=%r\n" % (min_size, max_size)) - - #Somehow Galaxy seems to turn an empty field into string "None"... - if min_size=="None": - min_size = "" - if max_size=="None": - max_size = "" - - if min_size=="" and max_size=="": - #Both missing is good - pass - elif min_size=="": - err["min_size"] = "Minimum size required if maximum size given" - elif max_size=="": - err["max_size"] = "Maximum size required if minimum size given" - - if min_size: - try: - min_size_int = int(min_size) - if min_size_int < 0: - err["min_size"] = "Minumum size must not be negative (%i)" % min_size_int - min_size = None # Avoid doing comparison below - except ValueError: - err["min_size"] = "Minimum size is not an integer (%s)" % min_size - min_size = None # Avoid doing comparison below - - if max_size: - try: - max_size_int = int(max_size) - if max_size_int< 0: - err["max_size"] = "Maximum size must not be negative (%i)" % max_size_int - max_size = None # Avoid doing comparison below - except ValueError: - err["max_size"] = "Maximum size is not an integer (%s)" % max_size - max_size = None # Avoid doing comparison below - - if min_size and max_size and min_size_int > max_size_int: - msg = "Minimum size must be less than maximum size (%i vs %i)" % (min_size_int, max_size_int) - err["min_size"] = msg - err["max_size"] = msg - - if err: - err_list.append({"segments":err}) - else: - err_list.append(dict()) - - if any(err_list): - #Return an error map only if any readgroup gave errors - error_map["read_group"] = err_list diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/repository_dependencies.xml --- a/tools/mira4/repository_dependencies.xml Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,4 +0,0 @@ - - - - diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4/tool_dependencies.xml --- a/tools/mira4/tool_dependencies.xml Tue Oct 28 08:29:59 2014 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,55 +0,0 @@ - - - - - - - - - - - http://downloads.sourceforge.net/project/mira-assembler/MIRA/stable/mira_4.0.2_darwin13.1.0_x86_64_static.tar.bz2 - - bin - $INSTALL_DIR - - - - - http://downloads.sourceforge.net/project/mira-assembler/MIRA/stable/mira_4.0.2_linux-gnu_x86_64_static.tar.bz2 - - bin - $INSTALL_DIR - - - - - echo "ERROR: Automated installation on your operating system and CPU architecture combination is not yet supported." - echo "Your machine details (the output from 'uname' and 'arch'):" - uname - arch - echo "If pre-compiled MIRA binaries are now available for this, please report this" - echo "via https://github.com/peterjc/pico_galaxyt/issues - thank you!" - false - - - - - $INSTALL_DIR - - - $INSTALL_DIR - - - - -Downloads MIRA v4.0.2 from Sourceforge, requesting Bastien's precompiled binaries -for 64 bit (x86_64) Linux or Mac OS X. Other platforms where compilation from -source would be required (e.g. 32 bit Linux) are not supported by this automated -installation script. - -http://chevreux.org/projects_mira.html -http://sourceforge.net/projects/mira-assembler/ - - - diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/README.rst --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/README.rst Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,176 @@ +Galaxy wrapper for the MIRA assembly program (v4.0) +=================================================== + +This tool is copyright 2011-2014 by Peter Cock, The James Hutton Institute +(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved. +See the licence text below (MIT licence). + +This tool is a short Python script (to collect the MIRA output and move it +to where Galaxy expects the files) and associated Galaxy wrapper XML file. + +It is available from the Galaxy Tool Shed at: +http://toolshed.g2.bx.psu.edu/view/peterjc/mira4_assembler + +It uses a Galaxy datatype definition 'mira' for the MIRA Assembly Format, +http://toolshed.g2.bx.psu.edu/view/peterjc/mira_datatypes + +A separate wrapper for MIRA v3.4 is available from the Galaxy Tool Shed at: +http://toolshed.g2.bx.psu.edu/view/peterjc/mira_assembler + +Automated Installation +====================== + +This should be straightforward. Via the Tool Shed, Galaxy should automatically +install the 'mira' datatype, samtools, and download and install the precompiled +binary for MIRA v4.0.2 for the Galaxy wrapper, and run any tests. + +For MIRA 4, the Galaxy wrapper has been split in two, allowing separate +cluster settings for de novo usage (high RAM) and mapping (lower RAM). +Consult the Galaxy adminstration documentation for your cluster setup. + +WARNING: For larger tasks, be aware that MIRA can require vast amounts +of RAM and run-times of over a week are possible. This tool wrapper makes +no attempt to spot and reject such large jobs. + + +Manual Installation +=================== + +First install the 'mira' datatype for Galaxy, available here: + +* http://toolshed.g2.bx.psu.edu/view/peterjc/mira_datatypes + +There are four Galaxy files to install: + +* ``mira4_de_novo.xml`` (the Galaxy tool definition for de novo usage) +* ``mira4_mapping.xml`` (the Galaxy tool definition for mapping usage) +* ``mira4_convert.xml`` (the Galaxy tool definition for converting MIRA files) +* ``mira4_bait.xml`` (the Galaxy tool definition for mirabait) +* ``mira4.py`` (the Python wrapper script) +* ``mira4_convert.py`` (the Python wrapper script for miraconvert) +* ``mira4_bait.py`` (the Python wrapper script for mirabait) +* ``mira4_validator.py`` (the XML parameter validation script) + +The suggested location is a new ``tools/mira4`` folder. You will also need to +modify the ``tools_conf.xml`` file to tell Galaxy to offer the tool:: + + + + +You will also need to install MIRA, we used version 4.0.2, and define the +environment variable ``$MIRA4`` pointing at the folder containing the binaries. +See: + +* http://chevreux.org/projects_mira.html +* http://sourceforge.net/projects/mira-assembler/ + +You may wish to use different cluster setups for the de novo and mapping +tools, see above. + +You will also need to install samtools (for generating a BAM file from MIRA's +SAM output). + +If you wish to run the unit tests, also move/copy the ``test-data/`` files +under Galaxy's ``test-data/`` folder. Then:: + + $ ./run_tests.sh -id mira_4_0_bait + $ ./run_tests.sh -id mira_4_0_de_novo + $ ./run_tests.sh -id mira_4_0_mapping + $ ./run_tests.sh -id mira_4_0_convert + + +History +======= + +======= ====================================================================== +Version Changes +------- ---------------------------------------------------------------------- +v0.0.1 - Initial version (prototype for MIRA 4.0 RC4, based on wrapper for v3.4) +v0.0.2 - Include BAM output (using ``miraconvert`` and ``samtools``). + - Updated to target MIRA 4.0.1 + - Simplified XML to apply input format to output data. + - Sets temporary folder at run time to respect environment variables + (``$TMPDIR``, ``$TEMP``, or ``$TMP`` in that order). This was + previously hard coded as ``/tmp``. +v0.0.3 - Updated to target MIRA 4.0.2 +v0.0.4 - Using ``optparse`` for the Python wrapper script API + - Made MAF and BAM outputs optional + - Include wrapper for ``miraconvert`` +v0.0.5 - Tool definition now embeds citation information. +v0.0.6 - Fixed error handling in ``mira4_convert.py``. +v0.0.7 - Renamed folder (internal change only). + - Reorder XML elements (internal change only). + - Use the ``format_source=...`` tag in the MIRA bait wrapper. + - Planemo for Tool Shed upload (``.shed.yml``, internal change only). +======= ====================================================================== + + +Developers +========== + +Development is on a dedicated GitHub repository: +https://github.com/peterjc/pico_galaxy/tree/master/tools/mira4_assembler + +For pushing a release to the test or main "Galaxy Tool Shed", use the following +Planemo commands (which requires you have set your Tool Shed access details in +``~/.planemo.yml`` and that you have access rights on the Tool Shed):: + + $ planemo shed_upload --shed_target testtoolshed --check_diff ~/repositories/pico_galaxy/tools/mira4_assembler/ + ... + +or:: + + $ planemo shed_upload --shed_target toolshed --check_diff ~/repositories/pico_galaxy/tools/mira4_assembler/ + ... + +To just build and check the tar ball, use:: + + $ planemo shed_upload --tar_only ~/repositories/pico_galaxy/tools/mira4_assembler/ + ... + $ tar -tzf shed_upload.tar.gz + test-data/U13small_m.fastq + test-data/U13small_m.mira4_de_novo.fasta + test-data/ecoli.fastq + test-data/ecoli.mira4_de_novo.fasta + test-data/empty_file.dat + test-data/header.mira + test-data/tvc_mini.fastq + test-data/tvc_contigs.fasta + test-data/tvc_map_ref_strain.fasta + test-data/tvc_map_same_strain.fasta + test-data/tvc_bait.fasta + test-data/tvc_mini_bait_neg.fastq + test-data/tvc_mini_bait_pos.fastq + test-data/tvc_mini_bait_strict.fastq + tools/mira4_assembler/README.rst + tools/mira4_assembler/mira4.py + tools/mira4_assembler/mira4_bait.py + tools/mira4_assembler/mira4_convert.py + tools/mira4_assembler/mira4_de_novo.xml + tools/mira4_assembler/mira4_make_bam.py + tools/mira4_assembler/mira4_mapping.xml + tools/mira4_assembler/mira4_validator.py + tools/mira4_assembler/repository_dependencies.xml + tools/mira4_assembler/tool_dependencies.xml + + +Licence (MIT) +============= + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in +all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN +THE SOFTWARE. diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/mira4.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/mira4.py Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,313 @@ +#!/usr/bin/env python +"""A simple wrapper script to call MIRA and collect its output. +""" +import os +import sys +import subprocess +import shutil +import time +import tempfile +from optparse import OptionParser + +#Do we need any PYTHONPATH magic? +from mira4_make_bam import make_bam + +WRAPPER_VER = "0.0.4" #Keep in sync with the XML file + +def sys_exit(msg, err=1): + sys.stderr.write(msg+"\n") + sys.exit(err) + + +def get_version(mira_binary): + """Run MIRA to find its version number""" + # At the commend line I would use: mira -v | head -n 1 + # however there is some pipe error when doing that here. + cmd = [mira_binary, "-v"] + try: + child = subprocess.Popen(cmd, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT) + except Exception, err: + sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (" ".join(cmd), err)) + sys.exit(1) + ver, tmp = child.communicate() + del child + return ver.split("\n", 1)[0].strip() + +#Parse Command Line +usage = """Galaxy MIRA4 wrapper script v%s - use as follows: + +$ python mira4.py ... + +This will run the MIRA binary and collect its output files as directed. +""" % WRAPPER_VER +parser = OptionParser(usage=usage) +parser.add_option("-m", "--manifest", dest="manifest", + default=None, metavar="FILE", + help="MIRA manifest filename") +parser.add_option("--maf", dest="maf", + default="-", metavar="FILE", + help="MIRA MAF output filename") +parser.add_option("--bam", dest="bam", + default="-", metavar="FILE", + help="Unpadded BAM output filename") +parser.add_option("--fasta", dest="fasta", + default="-", metavar="FILE", + help="Unpadded FASTA output filename") +parser.add_option("--log", dest="log", + default="-", metavar="FILE", + help="MIRA logging output filename") +parser.add_option("-v", "--version", dest="version", + default=False, action="store_true", + help="Show version and quit") +options, args = parser.parse_args() +manifest = options.manifest +out_maf = options.maf +out_bam = options.bam +out_fasta = options.fasta +out_log = options.log + +try: + mira_path = os.environ["MIRA4"] +except KeyError: + sys_exit("Environment variable $MIRA4 not set") +mira_binary = os.path.join(mira_path, "mira") +if not os.path.isfile(mira_binary): + sys_exit("Missing mira under $MIRA4, %r\nFolder contained: %s" + % (mira_binary, ", ".join(os.listdir(mira_path)))) +mira_convert = os.path.join(mira_path, "miraconvert") +if not os.path.isfile(mira_convert): + sys_exit("Missing miraconvert under $MIRA4, %r\nFolder contained: %s" + % (mira_convert, ", ".join(os.listdir(mira_path)))) + +mira_ver = get_version(mira_binary) +if not mira_ver.strip().startswith("4.0"): + sys_exit("This wrapper is for MIRA V4.0, not:\n%s\n%s" % (mira_ver, mira_binary)) +mira_convert_ver = get_version(mira_convert) +if not mira_convert_ver.strip().startswith("4.0"): + sys_exit("This wrapper is for MIRA V4.0, not:\n%s\n%s" % (mira_ver, mira_convert)) +if options.version: + print "%s, MIRA wrapper version %s" % (mira_ver, WRAPPER_VER) + if mira_ver != mira_convert_ver: + print "WARNING: miraconvert %s" % mira_convert_ver + sys.exit(0) + +if not manifest: + sys_exit("Manifest is required") +elif not os.path.isfile(manifest): + sys_exit("Missing input MIRA manifest file: %r" % manifest) + + +try: + threads = int(os.environ.get("GALAXY_SLOTS", "1")) +except ValueError: + threads = 1 +assert 1 <= threads, threads + + +def override_temp(manifest): + """Override ``-DI:trt=/tmp`` in manifest with environment variable. + + Currently MIRA 4 does not allow envronment variables like ``$TMP`` + inside the manifest, which is a problem if you need to override + the default at run time. + + The tool XML will ``/tmp`` and we replace that here with + ``tempfile.gettempdir()`` which will respect $TMPDIR, $TEMP, $TMP + as explained in the Python standard library documentation: + http://docs.python.org/2/library/tempfile.html#tempfile.tempdir + + By default MIRA 4 would write its temporary files within the output + folder, which is a problem if that is a network drive. + """ + handle = open(manifest, "r") + text = handle.read() + handle.close() + + #At time of writing, this is at the end of a file, + #but could be followed by a space in future... + text = text.replace("-DI:trt=/tmp", "-DI:trt=" + tempfile.gettempdir()) + + #Want to try to ensure this gets written to disk before MIRA attempts + #to open it - any networked file system may impose a delay... + handle = open(manifest, "w") + handle.write(text) + handle.flush() + os.fsync(handle.fileno()) + handle.close() + + +def log_manifest(manifest): + """Write the manifest file to stderr.""" + sys.stderr.write("\n%s\nManifest file\n%s\n" % ("="*60, "="*60)) + with open(manifest) as h: + for line in h: + sys.stderr.write(line) + sys.stderr.write("\n%s\nEnd of manifest\n%s\n" % ("="*60, "="*60)) + + +def collect_output(temp, name, handle): + """Moves files to the output filenames (global variables).""" + n3 = (temp, name, name, name) + f = "%s/%s_assembly/%s_d_results" % (temp, name, name) + if not os.path.isdir(f): + log_manifest(manifest) + sys_exit("Missing output folder") + if not os.listdir(f): + log_manifest(manifest) + sys_exit("Empty output folder") + missing = [] + + old_maf = "%s/%s_out.maf" % (f, name) + if not os.path.isfile(old_maf): + #Triggered extractLargeContigs.sh? + old_maf = "%s/%s_LargeContigs_out.maf" % (f, name) + + #De novo or single strain mapping, + old_fasta = "%s/%s_out.unpadded.fasta" % (f, name) + ref_fasta = "%s/%s_out.padded.fasta" % (f, name) + if not os.path.isfile(old_fasta): + #Mapping (StrainX versus reference) or de novo + old_fasta = "%s/%s_out_StrainX.unpadded.fasta" % (f, name) + ref_fasta = "%s/%s_out_StrainX.padded.fasta" % (f, name) + if not os.path.isfile(old_fasta): + old_fasta = "%s/%s_out_ReferenceStrain.unpadded.fasta" % (f, name) + ref_fasta = "%s/%s_out_ReferenceStrain.padded.fasta" % (f, name) + + + missing = False + for old, new in [(old_maf, out_maf), + (old_fasta, out_fasta)]: + if not os.path.isfile(old): + missing = True + elif not new or new == "-": + handle.write("Ignoring %s\n" % old) + else: + handle.write("Capturing %s\n" % old) + shutil.move(old, new) + if missing: + log_manifest(manifest) + sys.stderr.write("Contents of %r:\n" % f) + for filename in sorted(os.listdir(f)): + sys.stderr.write("%s\n" % filename) + + #For mapping mode, probably most people would expect a BAM file + #using the reference FASTA file... + if out_bam and out_bam != "-": + if out_maf and out_maf != "-": + msg = make_bam(mira_convert, out_maf, ref_fasta, out_bam, handle) + else: + #Not collecting the MAF file, use original location + msg = make_bam(mira_convert, old_maf, ref_fasta, out_bam, handle) + if msg: + sys_exit(msg) + +def clean_up(temp, name): + folder = "%s/%s_assembly" % (temp, name) + if os.path.isdir(folder): + shutil.rmtree(folder) + +#TODO - Run MIRA in /tmp or a configurable directory? +#Currently Galaxy puts us somewhere safe like: +#/opt/galaxy-dist/database/job_working_directory/846/ +temp = "." + +name = "MIRA" + +override_temp(manifest) + +start_time = time.time() +cmd_list = [mira_binary, "-t", str(threads), manifest] +cmd = " ".join(cmd_list) + +assert os.path.isdir(temp) +d = "%s_assembly" % name +#This can fail on my development machine if stale folders exist +#under Galaxy's .../database/job_working_directory/ tree: +assert not os.path.isdir(d), "Path %r already exists:\n%s" % (d, os.path.abspath(d)) +try: + #Check path access + os.mkdir(d) +except Exception, err: + log_manifest(manifest) + sys.stderr.write("Error making directory %s\n%s" % (d, err)) + sys.exit(1) + +#print os.path.abspath(".") +#print cmd + +if out_log and out_log != "-": + handle = open(out_log, "w") +else: + handle = open(os.devnull, "w") +handle.write("======================== MIRA manifest (instructions) ========================\n") +m = open(manifest, "rU") +for line in m: + handle.write(line) +m.close() +del m +handle.write("\n") +handle.write("============================ Starting MIRA now ===============================\n") +handle.flush() +try: + #Run MIRA + child = subprocess.Popen(cmd_list, + stdout=handle, + stderr=subprocess.STDOUT) +except Exception, err: + log_manifest(manifest) + sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) + #TODO - call clean up? + handle.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) + handle.close() + sys.exit(1) +#Use .communicate as can get deadlocks with .wait(), +stdout, stderr = child.communicate() +assert not stdout and not stderr #Should be empty as sent to handle +run_time = time.time() - start_time +return_code = child.returncode +handle.write("\n") +handle.write("============================ MIRA has finished ===============================\n") +handle.write("MIRA took %0.2f hours\n" % (run_time / 3600.0)) +if return_code: + print "MIRA took %0.2f hours" % (run_time / 3600.0) + handle.write("Return error code %i from command:\n" % return_code) + handle.write(cmd + "\n") + handle.close() + clean_up(temp, name) + log_manifest(manifest) + sys_exit("Return error code %i from command:\n%s" % (return_code, cmd), + return_code) +handle.flush() + +if os.path.isfile("MIRA_assembly/MIRA_d_results/ec.log"): + handle.write("\n") + handle.write("====================== Extract Large Contigs failed ==========================\n") + e = open("MIRA_assembly/MIRA_d_results/ec.log", "rU") + for line in e: + handle.write(line) + e.close() + handle.write("============================ (end of ec.log) =================================\n") + handle.flush() + +#print "Collecting output..." +start_time = time.time() +collect_output(temp, name, handle) +collect_time = time.time() - start_time +handle.write("MIRA took %0.2f hours; collecting output %0.2f minutes\n" % (run_time / 3600.0, collect_time / 60.0)) +print("MIRA took %0.2f hours; collecting output %0.2f minutes\n" % (run_time / 3600.0, collect_time / 60.0)) + +if os.path.isfile("MIRA_assembly/MIRA_d_results/ec.log"): + #Treat as an error, but doing this AFTER collect_output + sys.stderr.write("Extract Large Contigs failed\n") + handle.write("Extract Large Contigs failed\n") + handle.close() + sys.exit(1) + +#print "Cleaning up..." +clean_up(temp, name) + +handle.write("\nDone\n") +handle.close() +print("Done") diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/mira4_bait.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/mira4_bait.py Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,114 @@ +#!/usr/bin/env python +"""A simple wrapper script to call MIRA4's mirabait and collect its output. +""" +import os +import sys +import subprocess +import shutil +import time + +WRAPPER_VER = "0.0.5" #Keep in sync with the XML file + +def sys_exit(msg, err=1): + sys.stderr.write(msg+"\n") + sys.exit(err) + + +def get_version(mira_binary): + """Run MIRA to find its version number""" + # At the commend line I would use: mira -v | head -n 1 + # however there is some pipe error when doing that here. + cmd = [mira_binary, "-v"] + try: + child = subprocess.Popen(cmd, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT) + except Exception, err: + sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (" ".join(cmd), err)) + sys.exit(1) + ver, tmp = child.communicate() + del child + #Workaround for -v not working in mirabait 4.0RC4 + if "invalid option" in ver.split("\n", 1)[0]: + for line in ver.split("\n", 1): + if " version " in line: + line = line.split() + return line[line.index("version")+1].rstrip(")") + sys_exit("Could not determine MIRA version:\n%s" % ver) + return ver.split("\n", 1)[0] + +try: + mira_path = os.environ["MIRA4"] +except KeyError: + sys_exit("Environment variable $MIRA4 not set") +mira_binary = os.path.join(mira_path, "mirabait") +if not os.path.isfile(mira_binary): + sys_exit("Missing mirabait under $MIRA4, %r\nFolder contained: %s" + % (mira_binary, ", ".join(os.listdir(mira_path)))) +mira_ver = get_version(mira_binary) +if not mira_ver.strip().startswith("4.0"): + sys_exit("This wrapper is for MIRA V4.0, not:\n%s" % mira_ver) +if "-v" in sys.argv or "--version" in sys.argv: + print "%s, MIRA wrapper version %s" % (mira_ver, WRAPPER_VER) + sys.exit(0) + + +format, output_choice, strand_choice, kmer_length, min_occurance, bait_file, in_file, out_file = sys.argv[1:] + +if format.startswith("fastq"): + format = "fastq" +elif format == "mira": + format = "maf" +elif format != "fasta": + sys_exit("Was not expected format %r" % format) + +assert out_file.endswith(".dat") +out_file_stem = out_file[:-4] + +cmd_list = [mira_binary, "-f", format, "-t", format, + "-k", kmer_length, "-n", min_occurance, + bait_file, in_file, out_file_stem] +if output_choice == "pos": + pass +elif output_choice == "neg": + #Invert the selection... + cmd_list.insert(1, "-i") +else: + sys_exit("Output choice should be 'pos' or 'neg', not %r" % output_choice) +if strand_choice == "both": + pass +elif strand_choice == "fwd": + #Ingore reverse strand... + cmd_list.insert(1, "-r") +else: + sys_exit("Strand choice should be 'both' or 'fwd', not %r" % strand_choice) + +cmd = " ".join(cmd_list) +#print cmd +start_time = time.time() +try: + #Run MIRA + child = subprocess.Popen(cmd_list, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT) +except Exception, err: + sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) + sys.exit(1) +#Use .communicate as can get deadlocks with .wait(), +stdout, stderr = child.communicate() +assert stderr is None # Due to way we ran with subprocess +run_time = time.time() - start_time +return_code = child.returncode +print "mirabait took %0.2f minutes" % (run_time / 60.0) + +if return_code: + sys.stderr.write(stdout) + sys_exit("Return error code %i from command:\n%s" % (return_code, cmd), + return_code) + +#Capture output +out_tmp = out_file_stem + "." + format +if not os.path.isfile(out_tmp): + sys.stderr.write(stdout) + sys_exit("Missing output file from mirabait: %s" % out_tmp) +shutil.move(out_tmp, out_file) diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/mira4_convert.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/mira4_convert.py Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,226 @@ +#!/usr/bin/env python +"""A simple wrapper script to call MIRA and collect its output. + +This focuses on the miraconvert binary. +""" +import os +import sys +import subprocess +import shutil +import time +import tempfile +from optparse import OptionParser +try: + from io import BytesIO +except ImportError: + #Should we worry about Python 2.5 or older? + from StringIO import StringIO as BytesIO + +#Do we need any PYTHONPATH magic? +from mira4_make_bam import depad + +WRAPPER_VER = "0.0.7" # Keep in sync with the XML file + +def sys_exit(msg, err=1): + sys.stderr.write(msg+"\n") + sys.exit(err) + +def run(cmd): + #Avoid using shell=True when we call subprocess to ensure if the Python + #script is killed, so too is the child process. + try: + child = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + except Exception, err: + sys_exit("Error invoking command:\n%s\n\n%s\n" % (" ".join(cmd), err)) + #Use .communicate as can get deadlocks with .wait(), + stdout, stderr = child.communicate() + return_code = child.returncode + if return_code: + cmd_str = " ".join(cmd) # doesn't quote spaces etc + if stderr and stdout: + sys_exit("Return code %i from command:\n%s\n\n%s\n\n%s" % (return_code, cmd_str, stdout, stderr)) + else: + sys_exit("Return code %i from command:\n%s\n%s" % (return_code, cmd_str, stderr)) + +def get_version(mira_binary): + """Run MIRA to find its version number""" + # At the commend line I would use: mira -v | head -n 1 + # however there is some pipe error when doing that here. + cmd = [mira_binary, "-v"] + try: + child = subprocess.Popen(cmd, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT) + except Exception, err: + sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (" ".join(cmd), err)) + sys.exit(1) + ver, tmp = child.communicate() + del child + return ver.split("\n", 1)[0].strip() + +#Parse Command Line +usage = """Galaxy MIRA4 wrapper script v%s - use as follows: + +$ python mira4_convert.py ... + +This will run the MIRA miraconvert binary and collect its output files as directed. +""" % WRAPPER_VER +parser = OptionParser(usage=usage) +parser.add_option("--input", dest="input", + default=None, metavar="FILE", + help="MIRA input filename") +parser.add_option("-x", "--min_length", dest="min_length", + default="0", + help="Minimum contig length") +parser.add_option("-y", "--min_cover", dest="min_cover", + default="0", + help="Minimum average contig coverage") +parser.add_option("-z", "--min_reads", dest="min_reads", + default="0", + help="Minimum reads per contig") +parser.add_option("--maf", dest="maf", + default="", metavar="FILE", + help="MIRA MAF output filename") +parser.add_option("--ace", dest="ace", + default="", metavar="FILE", + help="ACE output filename") +parser.add_option("--bam", dest="bam", + default="", metavar="FILE", + help="Unpadded BAM output filename") +parser.add_option("--fasta", dest="fasta", + default="", metavar="FILE", + help="Unpadded FASTA output filename") +parser.add_option("--cstats", dest="cstats", + default="", metavar="FILE", + help="Contig statistics filename") +parser.add_option("-v", "--version", dest="version", + default=False, action="store_true", + help="Show version and quit") +options, args = parser.parse_args() +if args: + sys_exit("Expected options (e.g. --input example.maf), not arguments") + +input_maf = options.input +out_maf = options.maf +out_bam = options.bam +out_fasta = options.fasta +out_ace = options.ace +out_cstats = options.cstats + +try: + mira_path = os.environ["MIRA4"] +except KeyError: + sys_exit("Environment variable $MIRA4 not set") +mira_convert = os.path.join(mira_path, "miraconvert") +if not os.path.isfile(mira_convert): + sys_exit("Missing miraconvert under $MIRA4, %r\nFolder contained: %s" + % (mira_convert, ", ".join(os.listdir(mira_path)))) + +mira_convert_ver = get_version(mira_convert) +if not mira_convert_ver.strip().startswith("4.0"): + sys_exit("This wrapper is for MIRA V4.0, not:\n%s\n%s" % (mira_convert_ver, mira_convert)) +if options.version: + print("%s, MIRA wrapper version %s" % (mira_convert_ver, WRAPPER_VER)) + sys.exit(0) + +if not input_maf: + sys_exit("Input MIRA file is required") +elif not os.path.isfile(input_maf): + sys_exit("Missing input MIRA file: %r" % input_maf) + +if not (out_maf or out_bam or out_fasta or out_ace or out_cstats): + sys_exit("No output requested") + + +def check_min_int(value, name): + try: + i = int(value) + except: + sys_exit("Bad %s setting, %r" % (name, value)) + if i < 0: + sys_exit("Negative %s setting, %r" % (name, value)) + return i + +min_length = check_min_int(options.min_length, "minimum length") +min_cover = check_min_int(options.min_cover, "minimum cover") +min_reads = check_min_int(options.min_reads, "minimum reads") + +#TODO - Run MIRA in /tmp or a configurable directory? +#Currently Galaxy puts us somewhere safe like: +#/opt/galaxy-dist/database/job_working_directory/846/ +temp = "." + + +cmd_list = [mira_convert] +if min_length: + cmd_list.extend(["-x", str(min_length)]) +if min_cover: + cmd_list.extend(["-y", str(min_cover)]) +if min_reads: + cmd_list.extend(["-z", str(min_reads)]) +cmd_list.extend(["-f", "maf", input_maf, os.path.join(temp, "converted")]) +if out_maf: + cmd_list.append("maf") +if out_bam: + cmd_list.append("samnbb") + if not out_fasta: + #Need this for samtools depad + out_fasta = os.path.join(temp, "depadded.fasta") +if out_fasta: + cmd_list.append("fasta") +if out_ace: + cmd_list.append("ace") +if out_cstats: + cmd_list.append("cstats") +run(cmd_list) + +def collect(old, new): + if not os.path.isfile(old): + sys_exit("Missing expected output file %s" % old) + shutil.move(old, new) + +if out_maf: + collect(os.path.join(temp, "converted.maf"), out_maf) +if out_fasta: + #Can we look at the MAF file to see if there are multiple strains? + old = os.path.join(temp, "converted_AllStrains.unpadded.fasta") + if os.path.isfile(old): + collect(old, out_fasta) + else: + #Might the output be filtered down to zero contigs? + old = os.path.join(temp, "converted.fasta") + if not os.path.isfile(old): + sys_exit("Missing expected output FASTA file") + elif os.path.getsize(old) == 0: + print("Warning - no contigs (harsh filters?)") + collect(old, out_fasta) + else: + sys_exit("Missing expected output FASTA file (only generic file present)") +if out_ace: + collect(os.path.join(temp, "converted.maf"), out_ace) +if out_cstats: + collect(os.path.join(temp, "converted_info_contigstats.txt"), out_cstats) + +if out_bam: + assert os.path.isfile(out_fasta) + old = os.path.join(temp, "converted.samnbb") + if not os.path.isfile(old): + old = os.path.join(temp, "converted.sam") + if not os.path.isfile(old): + sys_exit("Missing expected intermediate file %s" % old) + h = BytesIO() + msg = depad(out_fasta, old, out_bam, h) + if msg: + print(msg) + print(h.getvalue()) + h.close() + sys.exit(1) + h.close() + if out_fasta == os.path.join(temp, "depadded.fasta"): + #Not asked for by Galaxy, no longer needed + os.remove(out_fasta) + +if min_length or min_cover or min_reads: + print("Filtered.") +else: + print("Converted.") diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/mira4_de_novo.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/mira4_de_novo.xml Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,275 @@ + + Takes Sanger, Roche 454, Solexa/Illumina, Ion Torrent and PacBio reads + + mira + miraconvert + MIRA + samtools + samtools + + + + + + + + mira4.py --version + mira4.py +--manifest "$manifest" +#if str($maf_wanted)=="true": +--maf "$out_maf" +#end if +#if str($bam_wanted)=="true": +--bam "$out_bam" +#end if +--fasta "$out_fasta" +--log "$out_log" + + + +project = MIRA +job = denovo,${job_type},${job_quality} +parameters = -NW:cmrnl=no -DI:trt=/tmp -OUT:orc=no +## -GE:not is short for -GENERAL:number_of_threads and using one (1) +## can be useful for repeatability of assemblies and bug hunting. +## This is overriden by the command line -t switch which is easier +## to set from within Galaxy. +## +## -NW:cmrnl is short for -NAG_AND_WARN:check_maxreadnamelength +## and without this MIRA aborts with read names over 40 characters +## due to limitations of some downstream tools. +## +## -DI:trt is short for -DIRECTORY:tmp_redirected_to and should +## point to a local hard drive (not something like NFS on network). +## We replace /tmp with an environment variable via mira4.py +## +## -OUT:orc=no is short for -OUTPUT:output_result_caf=no +## which turns off an output file we don't want anyway. + +#for $rg in $read_group + +##This bar goes into the manifest as a comment line +#------------------------------------------------------------------------------ + +readgroup +technology = ${rg.technology} +##Record the segment placement (if any) +#if str($rg.segments.type) == "paired" +segment_placement = ${rg.segments.placement} +segment_naming = ${rg.segments.naming} +#if str($rg.segments.min_size) != "" or str($rg.segments.max_size) != "" +##If our min/max validation failed I trust MIRA to give an error message... +template_size = $rg.segments.min_size $rg.segments.max_size +#end if +#end if +##if str($rg.segments.type) == "none" +##MIRA4 manual says use segment_placement = unknown or ? for unpaired data +##but this stopped working in MIRA 4.0 RC5 and 4.0 (final). See: +##http://www.freelists.org/post/mira_talk/Unpaired-reads-and-segment-placement--or-unknown +##segment_placement = ? +##end if +##MIRA will accept multiple filenames on one data line, or multiple data lines +#for $f in $rg.filenames +##Must now map Galaxy datatypes to MIRA file types... +#if $f.ext.startswith("fastq") +##MIRA doesn't like fastqsanger etc, just plain old fastq: +data = fastq::$f +#elif $f.ext == "mira" +##We're calling *.maf the "mira" format in Galaxy (name space collision) +data = maf::$f +#else +##MIRA is happy with fasta as name, +data = ${f.ext}::$f +#end if +#end for +#end for + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + bam_wanted is True + + + maf_wanted is True + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +**What it does** + +Runs MIRA v4.0 in de novo mode, collects the output, generates a sorted BAM +file, and then throws away all the temporary files. + +MIRA is an open source assembly tool capable of handling sequence data from +a range of platforms (Sanger capillary, Solexa/Illumina, Roche 454, Ion Torrent +and also PacBio). + +It is particularly suited to small genomes such as bacteria. + + +**Notes on paired reads** + +.. class:: warningmark + +MIRA uses read naming conventions to identify paired read partners +(and does not care about their order in the input files). In most cases, +the Solexa/Illumina setting is fine. For Sanger capillary sequencing, +you may need to rename your reads to match one of the standard conventions +supported by MIRA. For Roche 454 or Ion Torrent the appropriate settings +depend on how the FASTQ file was produced: + +* If using Roche's ``sffinfo`` or older versions of ``sff_extract`` + to convert SFF files to FASTQ, your reads will probably have the + ``---> <---`` orientation and use the ``.f`` and ``.r`` + suffixes (FR naming). + +* If using a recent version of ``sff_extract``, then the ``/1`` and ``/2`` + suffixes are used (Solexa/Illumina style naming) and the original + ``2---> 1--->`` orientation is preserved. + +The reason for this is the raw data for Roche 454 and Ion Torrent paired-end +libraries sequences a circularised fragment such that the raw data begins +with the end of the fragment, a linker, then the start of the fragment. +This means both the start and end are sequenced from the same strand, and +have the orientation ``2---> 1--->``. However, in order to use the data +with traditional tools expecting Sanger capillary style ``---> <---`` +orientation it was common to reverse complement one of the pair to mimic this. + + +**Citation** + +If you use this Galaxy tool in work leading to a scientific publication please +cite the following papers: + +Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). +Galaxy tools and workflows for sequence analysis with applications +in molecular plant pathology. PeerJ 1:e167 +http://dx.doi.org/10.7717/peerj.167 + +Bastien Chevreux, Thomas Wetter and Sándor Suhai (1999). +Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. +Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56. +http://www.bioinfo.de/isb/gcb99/talks/chevreux/main.html + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/mira4_assembler + + + 10.7717/peerj.167 + @ARTICLE{Chevreux1999-mira3, + author = {B. Chevreux and T. Wetter and S. Suhai}, + year = {1999}, + title = {Genome Sequence Assembly Using Trace Signals and Additional Sequence Information}, + journal = {Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB)} + volume = {99}, + pages = {45-56}, + url = {http://www.bioinfo.de/isb/gcb99/talks/chevreux/main.html} + } + + diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/mira4_make_bam.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/mira4_make_bam.py Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,92 @@ +#!/usr/bin/env python +"""Wrapper script using miraconvert & samtools to get BAM from MIRA. +""" +import os +import sys +import shutil +import subprocess +import tempfile + +def sys_exit(msg, err=1): + sys.stderr.write(msg+"\n") + sys.exit(err) + +def run(cmd, log_handle): + try: + child = subprocess.Popen(cmd, shell=True, + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT) + except Exception, err: + sys.stderr.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) + #TODO - call clean up? + log_handle.write("Error invoking command:\n%s\n\n%s\n" % (cmd, err)) + sys.exit(1) + #Use .communicate as can get deadlocks with .wait(), + stdout, stderr = child.communicate() + assert not stderr #Should be empty as sent to stdout + if len(stdout) > 10000: + #miraconvert can be very verbose (is holding stdout in RAM a problem?) + stdout = stdout.split("\n") + stdout = stdout[:10] + ["...", "", "..."] + stdout[-10:] + stdout = "\n".join(stdout) + log_handle.write(stdout) + return child.returncode + +def depad(fasta_file, sam_file, bam_file, log_handle): + log_handle.write("\n================= Converting MIRA assembly from SAM to BAM ===================\n") + #Also doing SAM to (uncompressed) BAM during depad + bam_stem = bam_file + ".tmp" # Have write permissions and want final file in this folder + cmd = 'samtools depad -S -u -T "%s" "%s" | samtools sort - "%s"' % (fasta_file, sam_file, bam_stem) + return_code = run(cmd, log_handle) + if return_code: + return "Error %i from command:\n%s" % (return_code, cmd) + if not os.path.isfile(bam_stem + ".bam"): + return "samtools depad or sort failed to produce BAM file" + + log_handle.write("\n====================== Indexing MIRA assembly BAM file =======================\n") + cmd = 'samtools index "%s.bam"' % bam_stem + return_code = run(cmd, log_handle) + if return_code: + return "Error %i from command:\n%s" % (return_code, cmd) + if not os.path.isfile(bam_stem + ".bam.bai"): + return "samtools indexing of BAM file failed to produce BAI file" + + shutil.move(bam_stem + ".bam", bam_file) + os.remove(bam_stem + ".bam.bai") #Let Galaxy handle that... + + +def make_bam(mira_convert, maf_file, fasta_file, bam_file, log_handle): + if not os.path.isfile(mira_convert): + return "Missing binary %r" % mira_convert + if not os.path.isfile(maf_file): + return "Missing input MIRA file: %r" % maf_file + if not os.path.isfile(fasta_file): + return "Missing padded FASTA file: %r" % fasta_file + + log_handle.write("\n====================== Converting MIRA assembly to SAM =======================\n") + tmp_dir = tempfile.mkdtemp() + sam_file = os.path.join(tmp_dir, "x.sam") + + # Note add nbb to the template name, possible MIRA 4.0 RC4 bug + cmd = '"%s" -f maf -t samnbb "%s" "%snbb"' % (mira_convert, maf_file, sam_file) + return_code = run(cmd, log_handle) + if return_code: + return "Error %i from command:\n%s" % (return_code, cmd) + if not os.path.isfile(sam_file): + return "Conversion from MIRA to SAM failed" + + #Also doing SAM to (uncompressed) BAM during depad + msg = depad(fasta_file, sam_file, bam_file, log_handle) + if msg: + return msg + + os.remove(sam_file) + os.rmdir(tmp_dir) + + return None #Good :) + +if __name__ == "__main__": + mira_convert, maf_file, fasta_file, bam_file = sys.argv[1:] + msg = make_bam(mira_convert, maf_file, fasta_file, bam_file, sys.stdout) + if msg: + sys_exit(msg) diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/mira4_mapping.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/mira4_mapping.xml Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,279 @@ + + Maps Sanger, Roche 454, Solexa/Illumina, Ion Torrent and PacBio reads + + mira + miraconvert + MIRA + samtools + samtools + + + + + + + mira4.py --version + mira4.py +--manifest "$manifest" +#if str($maf_wanted) == "true": +--maf "$out_maf" +#end if +#if str($bam_wanted) == "true": +--bam "$out_bam" +#end if +--fasta "$out_fasta" +--log "$out_log" + + + +project = MIRA +job = mapping,${job_type},${job_quality} +parameters = -NW:cmrnl=no -DI:trt=/tmp -OUT:orc=no +## -GE:not is short for -GENERAL:number_of_threads and using one (1) +## can be useful for repeatability of assemblies and bug hunting. +## This is overriden by the command line -t switch which is easier +## to set from within Galaxy. +## +## -NW:cmrnl is short for -NAG_AND_WARN:check_maxreadnamelength +## and without this MIRA aborts with read names over 40 characters +## due to limitations of some downstream tools. +## +## -DI:trt is short for -DIRECTORY:tmp_redirected_to and should +## point to a local hard drive (not something like NFS on network). +## We replace /tmp with an environment variable via mira4.py +## +## -OUT:orc=no is short for -OUTPUT:output_result_caf=no +## which turns off an output file we don't want anyway. + +##This bar goes into the manifest as a comment line +#------------------------------------------------------------------------------ + +readgroup +is_reference +#if str($strain_setup)=="same" +strain = StrainX +#end if +#for $f in $references +##Must now map Galaxy datatypes to MIRA file types... +#if $f.ext.startswith("fastq") +##MIRA doesn't like fastqsanger etc, just plain old fastq: +data = fastq::$f +#elif $f.ext == "mira" +##We're calling *.maf the "mira" format in Galaxy (name space collision) +data = maf::$f +#elif $f.ext == "fasta" +##We're calling MIRA with the file type as "fna" as otherwise it wants quals +data = fna::$f +#else +##Currently don't expect anything else... +data = ${f.ext}::$f +#end if +#end for +#for $rg in $read_group + +##This bar goes into the manifest as a comment line +#------------------------------------------------------------------------------ + +readgroup +technology = ${rg.technology} +#if str($strain_setup)=="same" +##This is perhaps redundant as MIRA defaults to StrainX for the reads: +strain = StrainX +#end if +##Record the segment placement (if any) +#if str($rg.segments.type) == "paired" +segment_placement = ${rg.segments.placement} +segment_naming = ${rg.segments.naming} +#end if +##if str($rg.segments.type) == "none" +##MIRA4 manual says use segment_placement = unknown or ? for unpaired data +##but this stopped working in MIRA 4.0 RC5 and 4.0 (final). See: +##http://www.freelists.org/post/mira_talk/Unpaired-reads-and-segment-placement--or-unknown +##segment_placement = ? +##end if +##MIRA will accept multiple filenames on one data line, or multiple data lines +#for $f in $rg.filenames +##Must now map Galaxy datatypes to MIRA file types... +#if $f.ext.startswith("fastq") +##MIRA doesn't like fastqsanger etc, just plain old fastq: +data = fastq::$f +#elif $f.ext == "mira" +##We're calling *.maf the "mira" format in Galaxy (name space collision) +data = maf::$f +#else +##Currently don't expect anything else... +data = ${f.ext}::$f +#end if +#end for +#end for + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + bam_wanted is True + + + maf_wanted is True + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +**What it does** + +Runs MIRA v4.0 in mapping mode, collects the output, generates a sorted BAM +file, and throws away all the temporary files. + +MIRA is an open source assembly tool capable of handling sequence data from +a range of platforms (Sanger capillary, Solexa/Illumina, Roche 454, Ion Torrent +and also PacBio). + +It is particularly suited to small genomes such as bacteria. + + +**Notes on paired reads** + +.. class:: warningmark + +MIRA uses read naming conventions to identify paired read partners +(and does not care about their order in the input files). In most cases, +the Solexa/Illumina setting is fine. For Sanger capillary sequencing, +you may need to rename your reads to match one of the standard conventions +supported by MIRA. For Roche 454 or Ion Torrent the appropriate settings +depend on how the FASTQ file was produced: + +* If using Roche's ``sffinfo`` or older versions of ``sff_extract`` + to convert SFF files to FASTQ, your reads will probably have the + ``---> <---`` orientation and use the ``.f`` and ``.r`` + suffixes (FR naming). + +* If using a recent version of ``sff_extract``, then the ``/1`` and ``/2`` + suffixes are used (Solexa/Illumina style naming) and the original + ``2---> 1--->`` orientation is preserved. + +The reason for this is the raw data for Roche 454 and Ion Torrent paired-end +libraries sequences a circularised fragment such that the raw data begins +with the end of the fragment, a linker, then the start of the fragment. +This means both the start and end are sequenced from the same strand, and +have the orientation ``2---> 1--->``. However, in order to use the data +with traditional tools expecting Sanger capillary style ``---> <---`` +orientation it was common to reverse complement one of the pair to mimic this. + + +**Citation** + +If you use this Galaxy tool in work leading to a scientific publication please +cite the following papers: + +Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). +Galaxy tools and workflows for sequence analysis with applications +in molecular plant pathology. PeerJ 1:e167 +http://dx.doi.org/10.7717/peerj.167 + +Bastien Chevreux, Thomas Wetter and Sándor Suhai (1999). +Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. +Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56. +http://www.bioinfo.de/isb/gcb99/talks/chevreux/main.html + +This wrapper is available to install into other Galaxy Instances via the Galaxy +Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/mira4_assembler + + + 10.7717/peerj.167 + @ARTICLE{Chevreux1999-mira3, + author = {B. Chevreux and T. Wetter and S. Suhai}, + year = {1999}, + title = {Genome Sequence Assembly Using Trace Signals and Additional Sequence Information}, + journal = {Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB)} + volume = {99}, + pages = {45-56}, + url = {http://www.bioinfo.de/isb/gcb99/talks/chevreux/main.html} + } + + diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/mira4_validator.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/mira4_validator.py Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,64 @@ +#Called from the Galaxy Tool XML file +#import sys + +def validate_input(trans, error_map, param_values, page_param_map): + """Validates the min_size/max_size user input, before execution.""" + err_list = [] + for read_group in param_values["read_group"]: + err = dict() + segments = read_group["segments"] + if str(segments["type"]) != "paired": + err_list.append(dict()) + continue + + min_size = str(segments["min_size"]).strip() + max_size = str(segments["max_size"]).strip() + #sys.stderr.write("DEBUG min_size=%r, max_size=%r\n" % (min_size, max_size)) + + #Somehow Galaxy seems to turn an empty field into string "None"... + if min_size=="None": + min_size = "" + if max_size=="None": + max_size = "" + + if min_size=="" and max_size=="": + #Both missing is good + pass + elif min_size=="": + err["min_size"] = "Minimum size required if maximum size given" + elif max_size=="": + err["max_size"] = "Maximum size required if minimum size given" + + if min_size: + try: + min_size_int = int(min_size) + if min_size_int < 0: + err["min_size"] = "Minumum size must not be negative (%i)" % min_size_int + min_size = None # Avoid doing comparison below + except ValueError: + err["min_size"] = "Minimum size is not an integer (%s)" % min_size + min_size = None # Avoid doing comparison below + + if max_size: + try: + max_size_int = int(max_size) + if max_size_int< 0: + err["max_size"] = "Maximum size must not be negative (%i)" % max_size_int + max_size = None # Avoid doing comparison below + except ValueError: + err["max_size"] = "Maximum size is not an integer (%s)" % max_size + max_size = None # Avoid doing comparison below + + if min_size and max_size and min_size_int > max_size_int: + msg = "Minimum size must be less than maximum size (%i vs %i)" % (min_size_int, max_size_int) + err["min_size"] = msg + err["max_size"] = msg + + if err: + err_list.append({"segments":err}) + else: + err_list.append(dict()) + + if any(err_list): + #Return an error map only if any readgroup gave errors + error_map["read_group"] = err_list diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/repository_dependencies.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/repository_dependencies.xml Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,4 @@ + + + + diff -r 4abe8d59a438 -r efeff76741b8 tools/mira4_assembler/tool_dependencies.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tools/mira4_assembler/tool_dependencies.xml Wed May 13 05:55:43 2015 -0400 @@ -0,0 +1,55 @@ + + + + + + + + + + + http://downloads.sourceforge.net/project/mira-assembler/MIRA/stable/mira_4.0.2_darwin13.1.0_x86_64_static.tar.bz2 + + bin + $INSTALL_DIR + + + + + http://downloads.sourceforge.net/project/mira-assembler/MIRA/stable/mira_4.0.2_linux-gnu_x86_64_static.tar.bz2 + + bin + $INSTALL_DIR + + + + + echo "ERROR: Automated installation on your operating system and CPU architecture combination is not yet supported." + echo "Your machine details (the output from 'uname' and 'arch'):" + uname + arch + echo "If pre-compiled MIRA binaries are now available for this, please report this" + echo "via https://github.com/peterjc/pico_galaxyt/issues - thank you!" + false + + + + + $INSTALL_DIR + + + $INSTALL_DIR + + + + +Downloads MIRA v4.0.2 from Sourceforge, requesting Bastien's precompiled binaries +for 64 bit (x86_64) Linux or Mac OS X. Other platforms where compilation from +source would be required (e.g. 32 bit Linux) are not supported by this automated +installation script. + +http://chevreux.org/projects_mira.html +http://sourceforge.net/projects/mira-assembler/ + + +