changeset 1:cfae9b16ab65 draft
Uploaded v0.0.4
author | peterjc |
---|---|
date | Fri, 12 Apr 2013 06:20:24 -0400 |
parents | 2b27279adeff |
children | 694208ea6c34 |
files | test-data/k12_hypothetical.fasta test-data/k12_hypothetical.tabular test-data/k12_ten_proteins.fasta tools/filters/seq_select_by_id.py tools/filters/seq_select_by_id.txt tools/filters/seq_select_by_id.xml |
diffstat | 6 files changed, 92 insertions(+), 12 deletions(-) |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/k12_hypothetical.fasta Fri Apr 12 06:20:24 2013 -0400
@@ -0,0 +1,3 @@
+>gi|16127999|ref|NP_414546.1| hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655]
+MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL
+HGPPPPPRHHKKAPHDHHGGHGPGKHHR
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/k12_hypothetical.tabular Fri Apr 12 06:20:24 2013 -0400
@@ -0,0 +1,2 @@
+#ID	Description
+gi|16127999|ref|NP_414546.1|	hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655]
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/k12_ten_proteins.fasta Fri Apr 12 06:20:24 2013 -0400
@@ -0,0 +1,60 @@
+>gi|16127995|ref|NP_414542.1| thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
+MKRISTTITTTITITTGNGAG
+>gi|16127996|ref|NP_414543.1| fused aspartokinase I and homoserine dehydrogenase I [Escherichia coli str. K-12 substr. MG1655]
+MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI
+FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
+RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS
+AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC
+LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
+QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
+ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
+LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV
+ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM
+KFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE
+IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK
+VKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
+>gi|16127997|ref|NP_414544.1| homoserine kinase [Escherichia coli str. K-12 substr. MG1655]
+MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWE
+RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY
+DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF
+IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA
+DWLGKNYLQNQEGFVHICRLDTAGARVLEN
+>gi|16127998|ref|NP_414545.1| threonine synthase [Escherichia coli str. K-12 substr. MG1655]
+MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLDFVTRSAKILSAFIGDEIPQE
+ILEERVRAAFAFPAPVANVESDVGCLELFHGPTLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAA
+VAHAFYGLPNVKVVILYPRGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNS
+ANSINISRLLAQICYYFEAVAQLPQETRNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVNDTVP
+RFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDETTQQTMRELKELGYTS
+EPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGETLDLPKELAERADLPLLSHNLPADFAAL
+RKLMMNHQ
+>gi|16127999|ref|NP_414546.1| hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655]
+MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL
+HGPPPPPRHHKKAPHDHHGGHGPGKHHR
+>gi|16128000|ref|NP_414547.1| peroxide resistance protein, lowers intracellular iron [Escherichia coli str. K-12 substr. MG1655]
+MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLMRISDKLAGINAARFHDWQPD
+FTPANARQAILAFKGDVYTGLQAETFSEDDFDFAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARG
+KDLYQFWGDIITNKLNEALAAQGDNVVINLASDEYFKSVKPKKLNAEIIKPVFLDEKNGKFKIISFYAKK
+ARGLMSRFIIENRLTKPEQLTGFNSEGYFFDEDSSSNGELVFKRYEQR
+>gi|16128001|ref|NP_414548.1| putative transporter [Escherichia coli str. K-12 substr. MG1655]
+MPDFFSFINSVLWGSVMIYLLFGAGCWFTFRTGFVQFRYIRQFGKSLKNSIHPQPGGLTSFQSLCTSLAA
+RVGSGNLAGVALAITAGGPGAVFWMWVAAFIGMATSFAECSLAQLYKERDVNGQFRGGPAWYMARGLGMR
+WMGVLFAVFLLIAYGIIFSGVQANAVARALSFSFDFPPLVTGIILAVFTLLAITRGLHGVARLMQGFVPL
+MAIIWVLTSLVICVMNIGQLPHVIWSIFESAFGWQEAAGGAAGYTLSQAITNGFQRSMFSNEAGMGSTPN
+AAAAAASWPPHPAAQGIVQMIGIFIDTLVICTASAMLILLAGNGTTYMPLEGIQLIQKAMRVLMGSWGAE
+FVTLVVILFAFSSIVANYIYAENNLFFLRLNNPKAIWCLRICTFATVIGGTLLSLPLMWQLADIIMACMA
+ITNLTAILLLSPVVHTIASDYLRQRKLGVRPVFDPLRYPDIGRQLSPDAWDDVSQE
+>gi|16128002|ref|NP_414549.1| transaldolase B [Escherichia coli str. K-12 substr. MG1655]
+MTDKLTSLRQYTTVVADTGDIAAMKLYQPQDATTNPSLILNAAQIPEYRKLIDDAVAWAKQQSNDRAQQI
+VDATDKLAVNIGLEILKLVPGRISTEVDARLSYDTEASIAKAKRLIKLYNDAGISNDRILIKLASTWQGI
+RAAEQLEKEGINCNLTLLFSFAQARACAEAGVFLISPFVGRILDWYKANTDKKEYAPAEDPGVVSVSEIY
+QYYKEHGYETVVMGASFRNIGEILELAGCDRLTIAPALLKELAESEGAIERKLSYTGEVKARPARITESE
+FLWQHNQDPMAVDKLAEGIRKFAIDQEKLEKMIGDLL
+>gi|16128003|ref|NP_414550.1| molybdochelatase incorporating molybdenum into molybdopterin [Escherichia coli str. K-12 substr. MG1655]
+MNTLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDEQAIIEQTLCELVDEMSCHLV
+LTTGGTGPARRDVTPDATLAVADREMPGFGEQMRQISLHFVPTAILSRQVGVIRKQALILNLPGQPKSIK
+ETLEGVKDAEGNVVVHGIFASVPYCIQLLEGPYVETAPEVVAAFRPKSARRDVSE
+>gi|16128004|ref|NP_414551.1| inner membrane protein, Grp1_Fun34_YaaH family [Escherichia coli str. K-12 substr. MG1655]
+MGNTKLANPAPLGLMGFGMTTILLNLHNVGYFALDGIILAMGIFYGGIAQIFAGLLEYKKGNTFGLTAFT
+SYGSFWLTLVAILLMPKLGLTDAPNAQFLGVYLGLWGVFTLFMFFGTLKGARVLQFVFFSLTVLFALLAI
+GNIAGNAAIIHFAGWIGLICGASAIYLAMGEVLNEQFGRTVLPIGESH
+
--- a/tools/filters/seq_select_by_id.py Tue Jun 07 17:13:04 2011 -0400
+++ b/tools/filters/seq_select_by_id.py Fri Apr 12 06:20:24 2013 -0400
@@ -16,11 +16,11 @@
 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
 
-This script is copyright 2011 by Peter Cock, The James Hutton Institute UK.
+This script is copyright 2011-2013 by Peter Cock, The James Hutton Institute UK.
 All rights reserved. See accompanying text file for licence details (MIT/BSD
 style).
 
-This is version 0.0.1 of the script.
+This is version 0.0.4 of the script.
 """
 import sys
 
@@ -28,6 +28,10 @@
     sys.stderr.write(msg.rstrip() + "\n")
     sys.exit(err)
 
+if "-v" in sys.argv or "--version" in sys.argv:
+    print "v0.0.4"
+    sys.exit(0)
+
 #Parse Command Line
 try:
     tabular_file, col_arg, in_file, seq_format, out_file = sys.argv[1:]
@@ -39,7 +43,7 @@
     else:
         column = int(col_arg)-1
 except ValueError:
-    stop_err("Expected column number, got %s" % cols_arg)
+    stop_err("Expected column number, got %s" % col_arg)
 
 if seq_format == "fastqcssanger":
     stop_err("Colorspace FASTQ not supported.")
@@ -65,7 +69,7 @@
     """Read tabular file and record all specified identifiers."""
     handle = open(tabular_file, "rU")
     for line in handle:
-        if not line.startswith("#"):
+        if line.strip() and not line.startswith("#"):
             yield line.rstrip("\n").split("\t")[col].strip()
     handle.close()
 
@@ -105,7 +109,7 @@
         except KeyError, err:
             out_handle.close()
             if name not in records:
-                stop_err("Identifier %s not found in sequence file" % name)
+                stop_err("Identifier %r not found in sequence file" % name)
             else:
                 raise err
     out_handle.close()
@@ -119,7 +123,7 @@
             out_handle.write(records.get_raw(name))
         except KeyError:
             out_handle.close()
-            stop_err("Identifier %s not found in sequence file" % name)
+            stop_err("Identifier %r not found in sequence file" % name)
         count += 1
     out_handle.close()
 
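The change to the identifier parser in seq_select_by_id.py makes it skip blank lines as well as `#` comment lines. A minimal Python 3 sketch of that behaviour follows. It assumes a standalone helper called `parse_ids` (a hypothetical name, not taken from the script) that reads from any iterable of lines instead of opening the tabular file itself.

```python
def parse_ids(lines, col=0):
    """Yield the identifier found in column `col` of each data line.

    Hypothetical helper for illustration only; the script itself opens
    the tabular file and iterates over the file handle.
    """
    for line in lines:
        # Skip blank (or whitespace-only) lines and '#' comment lines.
        if line.strip() and not line.startswith("#"):
            yield line.rstrip("\n").split("\t")[col].strip()


if __name__ == "__main__":
    example = [
        "#ID\tDescription\n",
        "\n",  # a blank line, which the old check would turn into an empty ID
        "gi|16127999|ref|NP_414546.1|\thypothetical protein b0005\n",
    ]
    print(list(parse_ids(example)))
```

Without the `line.strip()` test, the blank line would yield an empty string as an identifier, which would then fail the lookup in the sequence file.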
--- a/tools/filters/seq_select_by_id.txt Tue Jun 07 17:13:04 2011 -0400
+++ b/tools/filters/seq_select_by_id.txt Fri Apr 12 06:20:24 2013 -0400
@@ -1,5 +1,5 @@
-Galaxy tool to select FASTA, FASTQ or SFF sequences by ID
-=========================================================
+Galaxy tool to select FASTA, QUAL, FASTQ or SFF sequences by ID
+===============================================================
 
 This tool is copyright 2011 by Peter Cock, The James Hutton Institute
 (formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
@@ -26,7 +26,7 @@
 You will also need to modify the tools_conf.xml file to tell Galaxy to offer the
 tool. One suggested location is in the filters section. Simply add the line:
 
-<tool file="filters/sff_select_by_id.xml" />
+<tool file="filters/seq_select_by_id.xml" />
 
 You will also need to install Biopython 1.54 or later. That's it.
 
@@ -35,6 +35,9 @@
 =======
 
 v0.0.1 - Initial version.
+v0.0.3 - Ignore blank lines in input.
+v0.0.4 - Record script version when run from Galaxy.
+       - Basic unit test included.
 
 
 Developers
@@ -43,10 +46,10 @@
 This script and related tools are being developed on the following hg branch:
 http://bitbucket.org/peterjc/galaxy-central/src/tools
 
-For making the "Galaxy Tool Shed" http://community.g2.bx.psu.edu/ tarball use
+For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use
 the following command from the Galaxy root folder:
 
-tar -czf seq_select_by_id.tar.gz tools/filters/seq_select_by_id.*
+$ tar -czf seq_select_by_id.tar.gz tools/filters/seq_select_by_id.* test-data/k12_ten_proteins.fasta test-data/k12_hypothetical.fasta test-data/k12_hypothetical.tabular
 
 Check this worked:
 
@@ -54,6 +57,9 @@
 filter/seq_select_by_id.py
 filter/seq_select_by_id.txt
 filter/seq_select_by_id.xml
+test-data/k12_ten_proteins.fasta
+test-data/k12_hypothetical.fasta
+test-data/k12_hypothetical.tabular
 
 
 Licence (MIT/BSD style)
--- a/tools/filters/seq_select_by_id.xml Tue Jun 07 17:13:04 2011 -0400
+++ b/tools/filters/seq_select_by_id.xml Fri Apr 12 06:20:24 2013 -0400
@@ -1,5 +1,6 @@
-<tool id="seq_select_by_id" name="Select sequences by ID" version="0.0.1">
+<tool id="seq_select_by_id" name="Select sequences by ID" version="0.0.4">
     <description>from a tabular file</description>
+    <version_command interpreter="python">seq_select_by_id.py --version</version_command>
     <command interpreter="python">
 seq_select_by_id.py $input_tabular $column $input_file $input_file.ext $output_file
     </command>
@@ -22,6 +23,10 @@
         </data>
     </outputs>
     <tests>
+        <param name="input_file" value="test-data/k12_ten_proteins.fasta" ftype="fasta" />
+        <param name="input_tabular" value="k12_hypothetical.tabular" ftype="tabular" />
+        <param name="column" value="1" />
+        <output name="output_file" file="test-data/k12_hypothetical.fasta" ftype="fasta" />
     </tests>
     <requirements>
         <requirement type="python-module">Bio</requirement>
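The test parameters added to seq_select_by_id.xml take identifiers from column 1 of a tabular file, pull the matching records out of a FASTA file, and compare the result with an expected FASTA file. The selection step can be sketched in Python 3 as below. This is only an illustration under stated assumptions: it uses a plain dictionary in place of the Biopython index the script relies on, and `read_fasta` and `select_by_id` are hypothetical names that do not appear in the script.

```python
def read_fasta(lines):
    """Map each record's identifier (first word after '>') to its raw text.

    Hypothetical stand-in for an indexed sequence file; lines that occur
    before the first '>' header are ignored.
    """
    records = {}
    name = None
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            records[name] = ""
        if name is not None:
            records[name] += line
    return records


def select_by_id(records, ids):
    """Return the raw text of the requested records, in the order of `ids`."""
    selected = []
    for name in ids:
        if name not in records:
            # Mirrors the script's message, using %r so odd identifiers are visible.
            raise KeyError("Identifier %r not found in sequence file" % name)
        selected.append(records[name])
    return "".join(selected)


if __name__ == "__main__":
    fasta = [">seq1 first\n", "MKR\n", ">seq2 second\n", "MKK\n", "HGP\n"]
    print(select_by_id(read_fasta(fasta), ["seq2"]), end="")
```

Formatting the missing identifier with `%r`, as the changeset does, shows stray whitespace or an empty string in the error message, which `%s` would hide.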