changeset 0:6e1152287f2e draft

Uploaded v0.0.2 preview 1
author peterjc
date Tue, 18 Feb 2014 08:43:05 -0500
parents
children bd2b013e0d0d
files test-data/demo_nuc_align.fasta test-data/demo_nucs.fasta test-data/demo_prot_align.fasta tools/align_back_trans/README.rst tools/align_back_trans/align_back_trans.py tools/align_back_trans/align_back_trans.xml tools/align_back_trans/tool_dependencies.xml
diffstat 7 files changed, 368 insertions(+), 0 deletions(-)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/demo_nuc_align.fasta	Tue Feb 18 08:43:05 2014 -0500
@@ -0,0 +1,6 @@
+>Alpha
+GGUGAGGAACGA
+>Beta
+GGCGGG---CGU
+>Gamma
+GGU------CGG
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/demo_nucs.fasta	Tue Feb 18 08:43:05 2014 -0500
@@ -0,0 +1,6 @@
+>Alpha
+GGUGAGGAACGA
+>Beta
+GGCGGGCGU
+>Gamma
+GGUCGG
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/demo_prot_align.fasta	Tue Feb 18 08:43:05 2014 -0500
@@ -0,0 +1,6 @@
+>Alpha
+DEER
+>Beta
+DE-R
+>Gamma
+D--R
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/align_back_trans/README.rst	Tue Feb 18 08:43:05 2014 -0500
@@ -0,0 +1,105 @@
+Galaxy tool to back-translate a protein alignment to nucleotides
+================================================================
+
+This tool is copyright 2012-2014 by Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
+See the licence text below (MIT licence).
+
+This tool is a short Python script (using Biopython library functions) which
+loads a protein alignment and a matching nucleotide FASTA file of unaligned
+sequences. The nucleotides are threaded onto the protein alignment to produce
+a codon-aware nucleotide alignment, which can be viewed as a back-translation.
+
+This tool is available from the Galaxy Tool Shed at:
+
+* http://toolshed.g2.bx.psu.edu/view/peterjc/align_back_trans
+
+
+Automated Installation
+======================
+
+This should be straightforward using the Galaxy Tool Shed, which should be
+able to automatically install the dependency on Biopython, and then install
+this tool and run its unit tests.
+
+
+Manual Installation
+===================
+
+There are just two files to install to use this tool from within Galaxy:
+
+* ``align_back_trans.py`` (the Python script)
+* ``align_back_trans.xml`` (the Galaxy tool definition)
+
+The suggested location is in a dedicated ``tools/align_back_trans`` folder.
+
+You will also need to modify the ``tool_conf.xml`` file to tell Galaxy to offer
+the tool. One suggested location is in the filters section. Simply add the line::
+
+    <tool file="align_back_trans/align_back_trans.xml" />
+
+You will also need to install Biopython 1.54 or later. If you want to run
+the unit tests, also add this line to ``tool_conf.xml.sample`` and copy the
+sample FASTA files into the ``test-data`` directory. Then::
+
+    ./run_functional_tests.sh -id align_back_trans
+
+That's it.
+
+
+History
+=======
+
+======= ======================================================================
+Version Changes
+------- ----------------------------------------------------------------------
+v0.0.1  - Initial version, based on a previously written Python script
+======= ======================================================================
+
+
+Developers
+==========
+
+This script was initially developed on this repository:
+https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py
+
+With the addition of a Galaxy wrapper, development moved here:
+https://github.com/peterjc/pico_galaxy/tree/master/tools/align_back_trans
+
+For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use
+the following command from the Galaxy root folder::
+
+    $ tar -czf align_back_trans.tar.gz tools/align_back_trans/README.rst tools/align_back_trans/align_back_trans.py tools/align_back_trans/align_back_trans.xml tools/align_back_trans/tool_dependencies.xml test-data/demo_nucs.fasta test-data/demo_prot_align.fasta test-data/demo_nuc_align.fasta
+
+Check this worked::
+
+    $ tar -tzf align_back_trans.tar.gz
+    tools/align_back_trans/README.rst
+    tools/align_back_trans/align_back_trans.py
+    tools/align_back_trans/align_back_trans.xml
+    tools/align_back_trans/tool_dependencies.xml
+    test-data/demo_nucs.fasta
+    test-data/demo_prot_align.fasta
+    test-data/demo_nuc_align.fasta
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/align_back_trans/align_back_trans.py	Tue Feb 18 08:43:05 2014 -0500
@@ -0,0 +1,125 @@
+#!/usr/bin/env python
+"""Back-translate a protein alignment to nucleotides
+
+This tool is a short Python script (using Biopython library functions) to
+load a protein alignment, and matching nucleotide FASTA file of unaligned
+sequences, in order to produce a codon aware nucleotide alignment - which
+can be viewed as a back translation.
+
+The development repository for this tool is here:
+
+* https://github.com/peterjc/pico_galaxy/tree/master/tools/align_back_trans  
+
+This tool is available with a Galaxy wrapper from the Galaxy Tool Shed at:
+
+* http://toolshed.g2.bx.psu.edu/view/peterjc/align_back_trans
+
+See accompanying text file for licence details (MIT licence).
+
+This is version 0.0.2 of the script.
+"""
+
+import sys
+from Bio.Seq import Seq
+from Bio.Alphabet import generic_dna, generic_protein
+from Bio.Align import MultipleSeqAlignment
+from Bio import SeqIO
+from Bio import AlignIO
+
+if "-v" in sys.argv or "--version" in sys.argv:
+    print "v0.0.2"
+    sys.exit(0)
+
+def stop_err(msg, error_level=1):
+    """Print error message to stderr and quit with given error level."""
+    sys.stderr.write("%s\n" % msg)
+    sys.exit(error_level)
+
+def sequence_back_translate(aligned_protein_record, unaligned_nucleotide_record, gap):
+    #TODO - Separate arguments for protein gap and nucleotide gap?
+    if not gap or len(gap) != 1:
+        raise ValueError("Please supply a single gap character")
+
+    alpha = unaligned_nucleotide_record.seq.alphabet
+    if hasattr(alpha, "gap_char"):
+        gap_codon = alpha.gap_char * 3
+        assert len(gap_codon) == 3
+    else:
+        from Bio.Alphabet import Gapped
+        alpha = Gapped(alpha, gap)
+        gap_codon = gap*3
+
+    if len(aligned_protein_record.seq.ungap(gap))*3 != len(unaligned_nucleotide_record.seq):
+        stop_err("Inconsistent lengths for %s, ungapped protein %i, "
+                 "tripled %i vs ungapped nucleotide %i" %
+                 (aligned_protein_record.id, len(aligned_protein_record.seq.ungap(gap)),
+                  len(aligned_protein_record.seq.ungap(gap))*3,
+                  len(unaligned_nucleotide_record.seq)))
+
+    seq = []
+    nuc = str(unaligned_nucleotide_record.seq)
+    for amino_acid in aligned_protein_record.seq:
+        if amino_acid == gap:
+            seq.append(gap_codon)
+        else:
+            seq.append(nuc[:3])
+            nuc = nuc[3:]
+    assert not nuc, "Nucleotide sequence for %r longer than protein %s" \
+        % (unaligned_nucleotide_record.id, aligned_protein_record.id)
+
+    aligned_nuc = unaligned_nucleotide_record[:] #copy for most annotation
+    aligned_nuc.letter_annotations = {} #clear this
+    aligned_nuc.seq = Seq("".join(seq), alpha) #replace this
+    assert len(aligned_protein_record.seq) * 3 == len(aligned_nuc)
+    return aligned_nuc
+
+def alignment_back_translate(protein_alignment, nucleotide_records, key_function=None, gap=None):
+    """Thread nucleotide sequences onto a protein alignment."""
+    #TODO - Separate arguments for protein and nucleotide gap characters?
+    if key_function is None:
+        key_function = lambda x: x
+    if gap is None:
+        gap = "-"
+
+    aligned = []
+    for protein in protein_alignment:
+        try:
+            nucleotide = nucleotide_records[key_function(protein.id)]
+        except KeyError:
+            raise ValueError("Could not find nucleotide sequence for protein %r" \
+                             % protein.id)
+        aligned.append(sequence_back_translate(protein, nucleotide, gap))
+    return MultipleSeqAlignment(aligned)
+
+
+if len(sys.argv) == 4:
+    align_format, prot_align_file, nuc_fasta_file = sys.argv[1:]
+    nuc_align_file = sys.stdout
+elif len(sys.argv) == 5:
+    align_format, prot_align_file, nuc_fasta_file, nuc_align_file = sys.argv[1:]
+else:
+    stop_err("""This is a Python script for 'back-translating' a protein alignment.
+
+It requires three or four arguments:
+- alignment format (e.g. fasta, clustal),
+- aligned protein file (in specified format),
+- unaligned nucleotide file (in fasta format),
+- aligned nucleotide output file (in same format), optional.
+
+The nucleotide alignment is printed to stdout if no output filename is given.
+
+Example usage:
+
+$ python align_back_trans.py fasta demo_prot_align.fasta demo_nucs.fasta demo_nuc_align.fasta
+
+Warning: If the output file already exists, it will be overwritten.
+
+This script is available with sample data and a Galaxy wrapper here:
+https://github.com/peterjc/pico_galaxy/tree/master/tools/align_back_trans
+http://toolshed.g2.bx.psu.edu/view/peterjc/align_back_trans
+""")
+
+prot_align = AlignIO.read(prot_align_file, align_format, alphabet=generic_protein)
+nuc_dict = SeqIO.index(nuc_fasta_file, "fasta")
+nuc_align = alignment_back_translate(prot_align, nuc_dict, gap="-")
+AlignIO.write(nuc_align, nuc_align_file, align_format)
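Note on the threading step in sequence_back_translate above: the core logic does not depend on Biopython and can be sketched with plain strings. This is a minimal illustration under stated assumptions, not the script's own code. The function name thread_codons is hypothetical, and it raises ValueError where the script calls stop_err. A string-only version may also be handy for experimenting on newer Biopython releases, since the Bio.Alphabet module imported by the script was removed in Biopython 1.78.

```python
def thread_codons(aligned_protein, nucleotides, gap="-"):
    """Thread an unaligned nucleotide string onto one gapped protein row.

    Hypothetical helper for illustration; both arguments are plain strings.
    """
    ungapped_length = len(aligned_protein.replace(gap, ""))
    if ungapped_length * 3 != len(nucleotides):
        raise ValueError(
            "Ungapped protein length %i (tripled %i) does not match "
            "nucleotide length %i"
            % (ungapped_length, ungapped_length * 3, len(nucleotides))
        )
    codons = []
    position = 0  # index of the next unused nucleotide
    for residue in aligned_protein:
        if residue == gap:
            # A protein gap becomes a three-letter gap in the output
            codons.append(gap * 3)
        else:
            # A residue consumes the next codon from the nucleotide string
            codons.append(nucleotides[position:position + 3])
            position += 3
    return "".join(codons)


if __name__ == "__main__":
    # Same records as the demo files in test-data
    proteins = {"Alpha": "DEER", "Beta": "DE-R", "Gamma": "D--R"}
    nucleotides = {"Alpha": "GGUGAGGAACGA", "Beta": "GGCGGGCGU", "Gamma": "GGUCGG"}
    for name in ("Alpha", "Beta", "Gamma"):
        print(">%s" % name)
        print(thread_codons(proteins[name], nucleotides[name]))
```

Run on the demo records, this reproduces the rows of test-data/demo_nuc_align.fasta. Like the script, it only threads codons by position and does not check that each codon translates to the aligned residue.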
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/align_back_trans/align_back_trans.xml	Tue Feb 18 08:43:05 2014 -0500
@@ -0,0 +1,114 @@
+<tool id="align_back_trans" name="Thread nucleotides onto a protein alignment (back-translation)" version="0.0.2">
+    <description>Gives a codon aware alignment</description>
+    <requirements>
+        <requirement type="package" version="1.63">biopython</requirement>
+        <requirement type="python-module">Bio</requirement>
+    </requirements>
+    <version_command interpreter="python">align_back_trans.py --version</version_command>
+    <command interpreter="python">
+align_back_trans.py $prot_align.ext $prot_align $nuc_file $out_nuc_align
+    </command>
+    <stdio>
+        <!-- Anything other than zero is an error -->
+        <exit_code range="1:" />
+        <exit_code range=":-1" />
+    </stdio>
+    <inputs>
+        <param name="prot_align" type="data" format="fasta,phylip,clustal" label="Aligned protein file" help="Multiple sequence alignment file in FASTA, ClustalW or PHYLIP format." />
+	<!-- TODO? Could verify the translation
+        <param name="table" type="select" label="Genetic code" help="Tables from the NCBI, these determine the start and stop codons">
+            <option value="1">1. Standard</option>
+            <option value="2">2. Vertebrate Mitochondrial</option>
+            <option value="3">3. Yeast Mitochondrial</option>
+            <option value="4">4. Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma</option>
+            <option value="5">5. Invertebrate Mitochondrial</option>
+            <option value="6">6. Ciliate Macronuclear and Dasycladacean</option>
+            <option value="9">9. Echinoderm Mitochondrial</option>
+            <option value="10">10. Euplotid Nuclear</option>
+            <option value="11">11. Bacterial</option>
+            <option value="12">12. Alternative Yeast Nuclear</option>
+            <option value="13">13. Ascidian Mitochondrial</option>
+            <option value="14">14. Flatworm Mitochondrial</option>
+            <option value="15">15. Blepharisma Macronuclear</option>
+            <option value="16">16. Chlorophycean Mitochondrial</option>
+            <option value="21">21. Trematode Mitochondrial</option>
+            <option value="22">22. Scenedesmus obliquus</option>
+            <option value="23">23. Thraustochytrium Mitochondrial</option>
+        </param>
+	-->
+        <param name="nuc_file" type="data" format="fasta" label="Unaligned nucleotide sequences" help="FASTA format, using same identifiers as your protein alignment" />
+    </inputs>
+    <outputs>
+        <data name="out_nuc_align" format="fasta" label="${prot_align.name} (back-translated)">
+            <!-- TODO - Replace this with format="input:prot_align" if/when that works -->
+            <change_format>
+                <when input_dataset="prot_align" attribute="extension" value="clustal" format="clustal" />
+                <when input_dataset="prot_align" attribute="extension" value="phylip" format="phylip" />
+            </change_format>
+        </data>
+    </outputs>
+    <tests>
+        <test>
+            <param name="prot_align" value="demo_prot_align.fasta" />
+            <param name="nuc_file" value="demo_nucs.fasta" />
+            <output name="out_nuc_align" file="demo_nuc_align.fasta" />
+        </test>
+    </tests>
+    <help>
+**What it does**
+
+Takes an input file of aligned protein sequences (typically FASTA or Clustal
+format), and a matching file of unaligned nucleotide sequences (FASTA format,
+using the same identifiers), and threads the nucleotide sequences onto the
+protein alignment to produce a codon aware nucleotide alignment - which can
+be viewed as a back translation.
+
+Note - the provided nucleotide sequences should be exactly three times the
+length of the protein sequences (excluding the gaps).
+
+
+**Example**
+
+Given this protein alignment in FASTA format::
+
+    >Alpha
+    DEER
+    >Beta
+    DE-R
+    >Gamma
+    D--R
+
+and this matching unaligned nucleotide FASTA file::
+
+    >Alpha
+    GGUGAGGAACGA
+    >Beta
+    GGCGGGCGU
+    >Gamma
+    GGUCGG
+
+the tool would return this nucleotide alignment::
+
+    >Alpha
+    GGUGAGGAACGA
+    >Beta
+    GGCGGG---CGU
+    >Gamma
+    GGU------CGG
+
+Notice that all the gaps are multiples of three in length.
+
+
+**Citation**
+
+This tool uses Biopython, so if you use this Galaxy tool in work leading to a
+scientific publication please cite the following paper:
+
+Cock et al (2009). Biopython: freely available Python tools for computational
+molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
+http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
+
+This tool is available to install into other Galaxy Instances via the Galaxy
+Tool Shed at http://toolshed.g2.bx.psu.edu/view/peterjc/align_back_trans
+    </help>
+</tool>
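Note on the length requirement stated in the help text above: the inputs can be checked before running the tool. The sketch below is one possible pre-flight check on plain dictionaries of sequence strings. The function names find_length_mismatches and gaps_are_whole_codons are hypothetical and are not part of the tool.

```python
import re


def find_length_mismatches(protein_rows, nucleotide_seqs, gap="-"):
    """Return (identifier, expected, actual) for each inconsistent record.

    Hypothetical helper: both arguments map identifiers to plain strings.
    A missing nucleotide record is treated as having length zero.
    """
    problems = []
    for name, row in protein_rows.items():
        expected = 3 * len(row.replace(gap, ""))
        actual = len(nucleotide_seqs.get(name, ""))
        if expected != actual:
            problems.append((name, expected, actual))
    return problems


def gaps_are_whole_codons(aligned_nucleotides, gap="-"):
    """True if every run of gap characters has a length divisible by three."""
    runs = re.findall(re.escape(gap) + "+", aligned_nucleotides)
    return all(len(run) % 3 == 0 for run in runs)


if __name__ == "__main__":
    # Same records as the example in the help text
    proteins = {"Alpha": "DEER", "Beta": "DE-R", "Gamma": "D--R"}
    nucleotides = {"Alpha": "GGUGAGGAACGA", "Beta": "GGCGGGCGU", "Gamma": "GGUCGG"}
    print(find_length_mismatches(proteins, nucleotides))
    print(gaps_are_whole_codons("GGU------CGG"))
```

The first check mirrors the "Inconsistent lengths" test in align_back_trans.py, and the second corresponds to the remark that all gaps in the output are multiples of three in length.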
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/align_back_trans/tool_dependencies.xml	Tue Feb 18 08:43:05 2014 -0500
@@ -0,0 +1,6 @@
+<?xml version="1.0"?>
+<tool_dependency>
+    <package name="biopython" version="1.63">
+        <repository changeset_revision="d8b200f1f5a5" name="package_biopython_1_63" owner="biopython" toolshed="http://testtoolshed.g2.bx.psu.edu" />
+    </package>
+</tool_dependency>