changeset 1:56e6144f44aa draft
Uploaded v0.0.4
author | peterjc |
---|---|
date | Fri, 12 Apr 2013 06:29:52 -0400 |
parents | 44891766cf9b |
children | 21a065d5f0e2 |
files | test-data/k12_hypothetical.fasta test-data/k12_hypothetical.tabular test-data/k12_ten_proteins.fasta tools/filters/seq_filter_by_id.py tools/filters/seq_filter_by_id.txt tools/filters/seq_filter_by_id.xml |
diffstat | 6 files changed, 102 insertions(+), 14 deletions(-) |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/k12_hypothetical.fasta Fri Apr 12 06:29:52 2013 -0400
@@ -0,0 +1,3 @@
+>gi|16127999|ref|NP_414546.1| hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655]
+MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL
+HGPPPPPRHHKKAPHDHHGGHGPGKHHR
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/k12_hypothetical.tabular Fri Apr 12 06:29:52 2013 -0400
@@ -0,0 +1,2 @@
+#ID	Description
+gi|16127999|ref|NP_414546.1|	hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655]
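The tabular test file above has a `#`-prefixed header row followed by one data row whose first column holds the identifier. As a rough illustration of how identifiers could be collected from such a file, here is a minimal Python 3 sketch. `read_ids` is a hypothetical helper, not a function from seq_filter_by_id.py, and both the tab delimiter and the skipping of `#` lines are assumptions about how the file should be read.

```python
import io


def read_ids(handle, columns=(0,)):
    """Collect identifiers from zero-based columns of a tab-separated file."""
    ids = set()
    for line in handle:
        # Assumption: blank lines and lines starting with '#' carry no IDs.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        for col in columns:
            if col < len(fields) and fields[col].strip():
                ids.add(fields[col].strip())
    return ids


# In-memory stand-in for a small tabular file (description shortened).
example = io.StringIO(
    "#ID\tDescription\n"
    "gi|16127999|ref|NP_414546.1|\thypothetical protein b0005\n"
)
print(read_ids(example))
```

Note that the Galaxy `columns` parameter in the tool XML is one-based (`value="1"` in the test), whereas this sketch uses zero-based indices.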
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/k12_ten_proteins.fasta Fri Apr 12 06:29:52 2013 -0400
@@ -0,0 +1,60 @@
+>gi|16127995|ref|NP_414542.1| thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]
+MKRISTTITTTITITTGNGAG
+>gi|16127996|ref|NP_414543.1| fused aspartokinase I and homoserine dehydrogenase I [Escherichia coli str. K-12 substr. MG1655]
+MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI
+FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
+RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS
+AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC
+LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
+QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
+ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
+LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV
+ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM
+KFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE
+IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK
+VKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
+>gi|16127997|ref|NP_414544.1| homoserine kinase [Escherichia coli str. K-12 substr. MG1655]
+MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWE
+RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY
+DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF
+IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA
+DWLGKNYLQNQEGFVHICRLDTAGARVLEN
+>gi|16127998|ref|NP_414545.1| threonine synthase [Escherichia coli str. K-12 substr. MG1655]
+MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLDFVTRSAKILSAFIGDEIPQE
+ILEERVRAAFAFPAPVANVESDVGCLELFHGPTLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAA
+VAHAFYGLPNVKVVILYPRGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNS
+ANSINISRLLAQICYYFEAVAQLPQETRNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVNDTVP
+RFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDETTQQTMRELKELGYTS
+EPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGETLDLPKELAERADLPLLSHNLPADFAAL
+RKLMMNHQ
+>gi|16127999|ref|NP_414546.1| hypothetical protein b0005 [Escherichia coli str. K-12 substr. MG1655]
+MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL
+HGPPPPPRHHKKAPHDHHGGHGPGKHHR
+>gi|16128000|ref|NP_414547.1| peroxide resistance protein, lowers intracellular iron [Escherichia coli str. K-12 substr. MG1655]
+MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLMRISDKLAGINAARFHDWQPD
+FTPANARQAILAFKGDVYTGLQAETFSEDDFDFAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARG
+KDLYQFWGDIITNKLNEALAAQGDNVVINLASDEYFKSVKPKKLNAEIIKPVFLDEKNGKFKIISFYAKK
+ARGLMSRFIIENRLTKPEQLTGFNSEGYFFDEDSSSNGELVFKRYEQR
+>gi|16128001|ref|NP_414548.1| putative transporter [Escherichia coli str. K-12 substr. MG1655]
+MPDFFSFINSVLWGSVMIYLLFGAGCWFTFRTGFVQFRYIRQFGKSLKNSIHPQPGGLTSFQSLCTSLAA
+RVGSGNLAGVALAITAGGPGAVFWMWVAAFIGMATSFAECSLAQLYKERDVNGQFRGGPAWYMARGLGMR
+WMGVLFAVFLLIAYGIIFSGVQANAVARALSFSFDFPPLVTGIILAVFTLLAITRGLHGVARLMQGFVPL
+MAIIWVLTSLVICVMNIGQLPHVIWSIFESAFGWQEAAGGAAGYTLSQAITNGFQRSMFSNEAGMGSTPN
+AAAAAASWPPHPAAQGIVQMIGIFIDTLVICTASAMLILLAGNGTTYMPLEGIQLIQKAMRVLMGSWGAE
+FVTLVVILFAFSSIVANYIYAENNLFFLRLNNPKAIWCLRICTFATVIGGTLLSLPLMWQLADIIMACMA
+ITNLTAILLLSPVVHTIASDYLRQRKLGVRPVFDPLRYPDIGRQLSPDAWDDVSQE
+>gi|16128002|ref|NP_414549.1| transaldolase B [Escherichia coli str. K-12 substr. MG1655]
+MTDKLTSLRQYTTVVADTGDIAAMKLYQPQDATTNPSLILNAAQIPEYRKLIDDAVAWAKQQSNDRAQQI
+VDATDKLAVNIGLEILKLVPGRISTEVDARLSYDTEASIAKAKRLIKLYNDAGISNDRILIKLASTWQGI
+RAAEQLEKEGINCNLTLLFSFAQARACAEAGVFLISPFVGRILDWYKANTDKKEYAPAEDPGVVSVSEIY
+QYYKEHGYETVVMGASFRNIGEILELAGCDRLTIAPALLKELAESEGAIERKLSYTGEVKARPARITESE
+FLWQHNQDPMAVDKLAEGIRKFAIDQEKLEKMIGDLL
+>gi|16128003|ref|NP_414550.1| molybdochelatase incorporating molybdenum into molybdopterin [Escherichia coli str. K-12 substr. MG1655]
+MNTLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDEQAIIEQTLCELVDEMSCHLV
+LTTGGTGPARRDVTPDATLAVADREMPGFGEQMRQISLHFVPTAILSRQVGVIRKQALILNLPGQPKSIK
+ETLEGVKDAEGNVVVHGIFASVPYCIQLLEGPYVETAPEVVAAFRPKSARRDVSE
+>gi|16128004|ref|NP_414551.1| inner membrane protein, Grp1_Fun34_YaaH family [Escherichia coli str. K-12 substr. MG1655]
+MGNTKLANPAPLGLMGFGMTTILLNLHNVGYFALDGIILAMGIFYGGIAQIFAGLLEYKKGNTFGLTAFT
+SYGSFWLTLVAILLMPKLGLTDAPNAQFLGVYLGLWGVFTLFMFFGTLKGARVLQFVFFSLTVLFALLAI
+GNIAGNAAIIHFAGWIGLICGASAIYLAMGEVLNEQFGRTVLPIGESH
+
--- a/tools/filters/seq_filter_by_id.py Tue Jun 07 16:38:54 2011 -0400
+++ b/tools/filters/seq_filter_by_id.py Fri Apr 12 06:29:52 2013 -0400
@@ -25,10 +25,11 @@
 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
 
-This script is copyright 2010 by Peter Cock, SCRI, UK. All rights reserved.
+This script is copyright 2010-2013 by Peter Cock, The James Hutton Institute
+(formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
 See accompanying text file for licence details (MIT/BSD style).
 
-This is version 0.0.1 of the script.
+This is version 0.0.3 of the script, use -v or --version to get the version.
 """
 import sys
 
@@ -36,6 +37,10 @@
     sys.stderr.write(msg.rstrip() + "\n")
     sys.exit(err)
 
+if "-v" in sys.argv or "--version" in sys.argv:
+    print "v0.0.3"
+    sys.exit(0)
+
 #Parse Command Line
 try:
     tabular_file, cols_arg, in_file, seq_format, out_positive_file, out_negative_file = sys.argv[1:]
@@ -123,7 +128,7 @@
         positive_writer = fastaWriter(open(out_positive_file, "w"))
         negative_writer = fastaWriter(open(out_negative_file, "w"))
         for record in reader:
-            #The [1:] is because the fastaReader leaves the > on the identifer.
+            #The [1:] is because the fastaReader leaves the > on the identifier.
             if record.identifier and record.identifier.split()[0][1:] in ids:
                 positive_writer.write(record)
             else:
@@ -134,7 +139,7 @@
         print "Generating matching FASTA file"
         positive_writer = fastaWriter(open(out_positive_file, "w"))
         for record in reader:
-            #The [1:] is because the fastaReader leaves the > on the identifer.
+            #The [1:] is because the fastaReader leaves the > on the identifier.
             if record.identifier and record.identifier.split()[0][1:] in ids:
                 positive_writer.write(record)
         positive_writer.close()
@@ -142,10 +147,11 @@
         print "Generating non-matching FASTA file"
         negative_writer = fastaWriter(open(out_negative_file, "w"))
         for record in reader:
-            #The [1:] is because the fastaReader leaves the > on the identifer.
+            #The [1:] is because the fastaReader leaves the > on the identifier.
             if not record.identifier or record.identifier.split()[0][1:] not in ids:
                 negative_writer.write(record)
         negative_writer.close()
+    reader.close()
 elif seq_format.lower().startswith("fastq"):
     #Write filtered FASTQ file based on IDs from tabular file
     from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
@@ -155,7 +161,7 @@
         positive_writer = fastqWriter(open(out_positive_file, "w"))
         negative_writer = fastqWriter(open(out_negative_file, "w"))
         for record in reader:
-            #The [1:] is because the fastaReader leaves the @ on the identifer.
+            #The [1:] is because the fastaReader leaves the > on the identifier.
             if record.identifier and record.identifier.split()[0][1:] in ids:
                 positive_writer.write(record)
             else:
@@ -166,7 +172,7 @@
         print "Generating matching FASTQ file"
         positive_writer = fastqWriter(open(out_positive_file, "w"))
         for record in reader:
-            #The [1:] is because the fastaReader leaves the @ on the identifer.
+            #The [1:] is because the fastaReader leaves the > on the identifier.
             if record.identifier and record.identifier.split()[0][1:] in ids:
                 positive_writer.write(record)
         positive_writer.close()
@@ -174,9 +180,10 @@
         print "Generating non-matching FASTQ file"
         negative_writer = fastqWriter(open(out_negative_file, "w"))
         for record in reader:
-            #The [1:] is because the fastaReader leaves the @ on the identifer.
+            #The [1:] is because the fastaReader leaves the > on the identifier.
             if not record.identifier or record.identifier.split()[0][1:] not in ids:
                 negative_writer.write(record)
         negative_writer.close()
+    reader.close()
 else:
     stop_err("Unsupported file type %r" % seq_format)
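The hunks above all rely on the same test: take the first whitespace-delimited word of the record's title line, drop its leading marker character with `[1:]`, and check whether the result is in the set of wanted IDs. The sketch below illustrates that idea in standalone Python 3, without the Galaxy `fastaReader`/`fastaWriter` classes. `parse_fasta`, `record_id` and `split_by_id` are hypothetical names introduced for illustration only, and the parser is deliberately simplistic. In FASTA the marker is `>`; in FASTQ the title line starts with `@`, so the same slice applies there.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs; the header keeps its leading '>'."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def record_id(header):
    """First word of the header with its marker character removed."""
    words = header.split()
    return words[0][1:] if words else ""


def split_by_id(records, ids):
    """Partition records into (matching, non_matching) lists."""
    matching, non_matching = [], []
    for header, seq in records:
        target = matching if record_id(header) in ids else non_matching
        target.append((header, seq))
    return matching, non_matching


# Tiny made-up input; real use would read a file and write two outputs.
fasta = io.StringIO(">seq1 first\nACGT\nAC\n>seq2 second\nGGCC\n")
hits, misses = split_by_id(parse_fasta(fasta), {"seq2"})
print([h for h, _ in hits])
print([h for h, _ in misses])
```

Holding the IDs in a `set` keeps each membership test cheap, which matters when the sequence file is large.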
--- a/tools/filters/seq_filter_by_id.txt Tue Jun 07 16:38:54 2011 -0400
+++ b/tools/filters/seq_filter_by_id.txt Fri Apr 12 06:29:52 2013 -0400
@@ -1,7 +1,8 @@
 Galaxy tool to filter FASTA, FASTQ or SFF sequences by ID
 =========================================================
 
-This tool is copyright 2010 by Peter Cock, SCRI, UK. All rights reserved.
+This tool is copyright 2010-2011 by Peter Cock, The James Hutton Institute
+(formerly SCRI, Scottish Crop Research Institute), UK. All rights reserved.
 See the licence text below.
 
 This tool is a short Python script (using both the Galaxy and Biopython library
@@ -10,6 +11,10 @@
 include filtering based on search results from a tool like NCBI BLAST before
 assembly.
 
+
+Installation
+============
+
 There are just two files to install:
 
 * seq_filter_by_id.py (the Python script)
@@ -30,6 +35,8 @@
 =======
 
 v0.0.1 - Initial version, combining three separate scripts for each file format.
+v0.0.4 - Record script version when run from Galaxy.
+       - Basic unit test included.
 
 
 Developers
@@ -41,10 +48,10 @@
 This incorporates the previously used hg branch:
 http://bitbucket.org/peterjc/galaxy-central/src/fasta_filter
 
-For making the "Galaxy Tool Shed" http://community.g2.bx.psu.edu/ tarball use
+For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use
 the following command from the Galaxy root folder:
 
-tar -czf seq_filter_by_id.tar.gz tools/filters/seq_filter_by_id.*
+$ tar -czf seq_filter_by_id.tar.gz tools/filters/seq_filter_by_id.* test-data/k12_ten_proteins.fasta test-data/k12_hypothetical.fasta test-data/k12_hypothetical.tabular
 
 Check this worked:
 
@@ -52,6 +59,9 @@
 filter/seq_filter_by_id.py
 filter/seq_filter_by_id.txt
 filter/seq_filter_by_id.xml
+test-data/k12_ten_proteins.fasta
+test-data/k12_hypothetical.fasta
+test-data/k12_hypothetical.tabular
 
 
 Licence (MIT/BSD style)
--- a/tools/filters/seq_filter_by_id.xml Tue Jun 07 16:38:54 2011 -0400
+++ b/tools/filters/seq_filter_by_id.xml Fri Apr 12 06:29:52 2013 -0400
@@ -1,5 +1,6 @@
-<tool id="seq_filter_by_id" name="Filter sequences by ID" version="0.0.1">
+<tool id="seq_filter_by_id" name="Filter sequences by ID" version="0.0.4">
   <description>from a tabular file</description>
+  <version_command interpreter="python">seq_filter_by_id.py --version</version_command>
   <command interpreter="python">
 seq_filter_by_id.py $input_tabular $columns $input_file $input_file.ext
 #if $output_choice_cond.output_choice=="both"
@@ -11,7 +12,7 @@
 #end if
   </command>
   <inputs>
-    <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file to filter on the identifiers" description="FASTA, FASTQ, or SFF format." />
+    <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file to filter on the identifiers" help="FASTA, FASTQ, or SFF format." />
     <param name="input_tabular" type="data" format="tabular" label="Tabular file containing sequence identifiers"/>
     <param name="columns" type="data_column" data_ref="input_tabular" multiple="True" numerical="False" label="Column(s) containing sequence identifiers" help="Multi-select list - hold the appropriate key while clicking to select multiple columns">
       <validator type="no_options" message="Pick at least one column"/>
@@ -55,6 +56,11 @@
     </data>
   </outputs>
   <tests>
+    <param name="input_file" value="test-data/k12_ten_proteins.fasta" ftype="fasta" />
+    <param name="input_tabular" value="k12_hypothetical.tabular" ftype="tabular" />
+    <param name="columns" value="1" />
+    <param name="output_choice_cond" value="pos" />
+    <output name="output_pos" file="test-data/k12_hypothetical.fasta" ftype="fasta" />
   </tests>
   <requirements>
     <requirement type="python-module">Bio</requirement>
@@ -80,7 +86,7 @@
 reads without BLAST matches (i.e. those which do not match your contaminant
 database).
 
-You may have a file of FASTA sequences which has been run some some analysis
+You may have a file of FASTA sequences which has been used with some analysis
 tool giving tabular output, which has then been filtered on some criteria.
 You can then use this tool to divide the original FASTA file into those entries
 matching or not matching your criteria (those with or without their identifier