Galaxy | (sandbox for testing) |

Changeset 4:09a4ee5d12fd (2015-03-06)

Previous changeset 3:dc55e58fa890 (2014-11-21) Next changeset 5:557afc1343bb (2015-03-06)

Commit message:
Uploaded v0.2.0, adding desired count method
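The "desired count" method named in the commit message can be sketched as below. This is a minimal illustration under stated assumptions, not the code from this changeset: `sample_evenly` is a hypothetical helper, it assumes the total number of records was already counted in a first pass, and it uses an integer accumulator instead of the floating-point threshold that appears in the sample_seqs.py diff. It may therefore pick slightly different records than the tool does.

```python
def sample_evenly(records, total, n):
    """Yield exactly n of the total records, spread evenly, in input order.

    Hypothetical helper for illustration only. `total` must be the true
    number of items in `records` (e.g. from a first counting pass).
    """
    if n < 1 or n > total:
        raise ValueError("Requested %i records, but input has %i" % (n, total))
    # Integer accumulator: starting at `total` means the first record is
    # always taken, and exactly n records are taken over the whole input.
    acc = total
    for record in records:
        acc += n
        if acc > total:
            acc -= total
            yield record


if __name__ == "__main__":
    # Pick 3 of 10 records, evenly spaced through the input.
    print(list(sample_evenly(range(10), 10, 3)))
```

Because the selection depends only on the record position, running the same settings on two separate paired files would pick matching positions from each, which is the reproducibility property the tool's help text describes.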

modified:
tools/sample_seqs/README.rst
tools/sample_seqs/sample_seqs.py
tools/sample_seqs/sample_seqs.xml

added:
test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff
test-data/ecoli.sample_C10.fastq
test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta
test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta

diff -r dc55e58fa890 -r 09a4ee5d12fd test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff

Binary file test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff has changed

diff -r dc55e58fa890 -r 09a4ee5d12fd test-data/ecoli.sample_C10.fastq
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/ecoli.sample_C10.fastq Fri Mar 06 04:54:03 2015 -0500

@@ -0,0 +1,40 @@
+@frag_1
+AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTC
++
+##%')+.024JMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_504
+TACGTTCGGCATCGCTGATATTGGGTAAAGCATCCT
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_1008
+GTCGCAGGTATAGACCCCGTCAACGTCCGTCCAAAT
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_1512
+ATCACCTACCACCGAGATAATGGCCAGCCGTTCCGT
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_2016
+GAAACCTTCGCGCAGGAAGTCGGCATATTGATCCGC
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_2520
+TCCTTCATCACGGGCCTTCGCCACGCGCGCGGCAAA
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_3024
+TCCAGGGTCATCGCCACTGGAATTTGCTTACCCAGT
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_3528
+ACCGCGCCGATTTCCGCGACCGCCTGCCGCGCCTGC
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_4032
+CGACCGCCGAAATCTTTAAATGCCAGCGTTGGCCCG
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_4536
+GGCACGGTATCGTTCACGTTGGTCGCAGCAATAAAA
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

diff -r dc55e58fa890 -r 09a4ee5d12fd test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta Fri Mar 06 04:54:03 2015 -0500

@@ -0,0 +1,119 @@
+>Streptococcus_suis|ORF1 length 457 aa, 1374 bp, from 1..1374 of Streptococcus_suis
+MNQEQLFWQRFIELAKVNFKPSIYDFYVADAKLLGINQQVANIFLNRPFKKDFWEKNFEE
+LMIAASFESYGEPLTIQYQFTEDEQEIRNTTNTRSSIVHQVQTLEPATPQETFKPVHSDI
+KSQYTFANFVQGDNNHWAKAAALAVSDNLGELYNPLFIFGGPGLGKTHILNAIGNKVLAD
+NPQARIKYVSSETFINEFLEHLRLNDMESFKKTYRNLDLLLIDDIQSLRNKATTQEEFFH
+TFNALHEKNKQIVLTSDRNPDHLDNLEERLVTRFKWGLTSEITPPDFETRIAILRNKCEN
+LPYNFTNETLSYLAGQFDSNVRDLEGALKDIHLIATMRQLSEISVEVAAEAIRSRKQTNP
+QNMVIPIEKIQTEVGNFYGVSLKELKGSKRVQHIVHARQVAMFLAREMTDNSLPKIGKEF
+GNRDHTTVMHAYNKIKTLLLDDENLEIEITSIKNKLR
+>Streptococcus_suis|ORF2 length 385 aa, 1158 bp, from 1507..2664 of Streptococcus_suis
+IINKGESMIQFSINKNIFLQALSITKRAISTKNAIPILSTVKITVTSEGITLTGSNGQIS
+IEHFISIQDENAGLLISSPGSILLEAGFFINVVSSMPDLVLDFNEIEQKQIVLTSGKSEI
+TLKGKEAEQYPRLQEVPTSKPLVLETKVLKQTINETAFAASTQESRPILTGVHFVLTENK
+NLKTVATDSHRMSQRKLVLDTSGDDFNVVIPSRSLREFTAVFTDDIETVEVFFSNNQILF
+RSEHISFYTRLLEGTYPDTDRLIPTEFKTTAIFDTANLRHSMERARLLSNATQNGTVKLE
+IANNVVSAHVNSPEVGRVNEELDTVEVSGEDLVISFNPTYLIEALKATTSEQVKISFISS
+VRPFTLIPNNEGEDFIQLVTPVRTN
+>Streptococcus_suis|ORF291 length 760 aa, 2283 bp, from complement(184307..186589) of Streptococcus_suis
+KRGEFMRFNQFSFIKKETSVYLQELDTLGFQLIPDASSKTNLETFVRKCHFLTANTDFAL
+SNMIAEWDTDLLTFFQSDRELTDQIFYQVAFQLLGFVPGMDYTDVMDFVEKSNFPIVYGD
+IIDNLYQLLNTRTKSGNTLIDQLVSDDLIPEDNHYHFFNGKSMATFSTKNLIREVVYVET
+PVDTAGTGQTDIVKLSILRPHFDGKIPAVITNSPYHQGVNDVASDKALHKMEGELAEKQV
+GTIQVKQASITKLDLDQRNLPVSPATEKLGHITSYSLNDYFLARGFASLHVSGVGTLGST
+GYMTSGDYQQVEGYKAVIDWLNGRTKAYTDHTRSLEVKADWANGKVATTGLSYLGTMSNA
+LATTGVDGLEVIIAEAGISSWYDYYRENGLVTSPGGYPGEDLDSLTALTYSKSLQAGDFL
+RNKAAYEKGLAAERAALDRTSGDYNQYWHDRNYLLHADRVKCEVVFTHGSQDWNVKPIHV
+WNMFHALPSHIKKHLFFHNGAHVYMNNWQSIDFRESMNALLSQKLLGYENNYQLPTVIWQ
+DNSGEQTWTTLDTFGGENETVLPLGTGSQTVANQYTQEDFERYGKSYSAFHQDLYAGKAN
+QISIELPVTEGLLLNGQVTLKLRVASSVAKGLLSAQLLDKGNKKRLAPIPAPKARLSLDN
+GRYHAQENLVELPYVEMPQRLVTKGFMNLQNRTDLMTVEEVVPGQWMNLTWKLQPTIYQL
+KKGDVLELILYTTDFECTVRDNSQWQIHLDLSQSQLILPH
+>Streptococcus_suis|ORF292 length 216 aa, 651 bp, from 185183..185833 of Streptococcus_suis
+AVGKDHLTLDPISVEQIIAVMPVLIVVTAGAVQGSTLGSQSFFVGCFIAEEVTCLQTLGV
+GQGGQAVQIFAWIATRAGHQPVFTVVVIPRGNPCFCDDDFQSVHASCCQGIGHGTEIRQS
+RRRYLTIGPIGLDLKRASVVCVGLGATVQPVNHRFIALHLLVVARCHVARRAQRANTRHM
+EAGKAASEEVVIEGVRSNVPQFFSSRADRQVPLVQV
+>Streptococcus_suis|ORF583 length 391 aa, 1176 bp, from 397805..398980 of Streptococcus_suis
+RKKMKKQFELIATAAAGLEAVVGREIRNLGYECQVENGRVRFQGDVKSIIETNIWLRSAD
+RIKIIVGQFPAKTFEELFQGVFNLDWENYLPLGCKFPISKAKCVKSKLHNEPSVQAISKK
+AVVKKLQKHFSRPEGVPLQEMGAEFKIEVSILKDVATVMIDTTGSSLFKRGYRVEKGGAP
+IKENMAAAILQLSNWYPDKPLIDPTCGSGTFCIEAAMLAKNIAPGLKRSFAFEEWPWVED
+QLVVALRKEAQASIKTDLVLDITGSDIDARMIEIAKKNAFAAGVEQDIVFKQMRVQDLRT
+DKINGVIISNPPYGERLLDDEAIVTLYREMGETFEPLKTWSKFILTSDELFETRFGQQAD
+KKRKLYNGTLKVDLYQFFGQRVKRQVQEVQG
+>Streptococcus_suis|ORF584 length 487 aa, 1464 bp, from 398981..400444 of Streptococcus_suis
+EDIVGEKNSHHLPLDEEKVLDFEVAKDLTIEEAVKKHKEIEAGVTEDDGLLDRYIKQHRA
+EIESQKFETKINHLPLVEVADEEKNQGHESAEEVEANESSLTEVSEEIAPIVEELSVTPM
+ETLEETVIASTVAMEGLSSVADDSSLELEEDETEDLDHSEGADRDQKKKFYFWSAVGLSM
+IGVMATALVWMNSVNKSNTATSSSSTSTSQTSSTASSSTDANVTAFEQLYNSFFTDSSLT
+KLKNSEFGKLAELKVLLEKLDKNSDSYTKAKEQYDHLEKAIAAIQAINGQFDKEVVVNGE
+IDTTATVKSGESLSATTTGISAVDSLLASVVNFGRSQQEVASATVASEAAVTRNQGADET
+VSTGVPATTEVASTTVSGSTTDFGIAVPAGVVLQRDRSRVPYNQAMIDDVNNEAWNFNPG
+ILENIVTISQQRGYITGNQYILEKVNIINGNGYYNMFKPDGTYLFSINCKTGYFVGNGAG
+HSDALDY
+>Streptococcus_suis|ORF873 length 343 aa, 1032 bp, from 605439..606470 of Streptococcus_suis
+TLGEETMTNVFKGRHFLAEKDFTRAELEWLIDFSAHLKDLKKRNIPHRYLEGKNIALLFE
+KTSTRTRAAFTVASIDLGAHPEYLGANDIQLGKKESTEDTAKVLGRMFDGIEFRGFSQKM
+VEELAEFSGVPVWNGLTDAWHPTQMLADYLTVKENFGKLEGLTLVYCGDGRNNVANSLLV
+TGAILGVNVHIFSPKELFPEEEVVALAEGFAKESGARVLITDNADEAVKGADVLYTDVWV
+SMGEEDKFAERVALLKPYQVNMELVKKAENENLIFLHCLPAFHDTNTVYGKDVAEKFGVE
+EMEVTDEVFRSKYARHFDQAENRMHTIKAVMAATLGDPFVPRV
+>Streptococcus_suis|ORF874 length 113 aa, 342 bp, from complement(605625..605966) of Streptococcus_suis
+VSNIVTAITTVNQSQAFQLAKVFFDSQVVRQHLSWVPCICQTIPYWHTGEFCQFFHHFLT
+ETTEFNTVEHTSQNFSSIFCRFFLTKLDVICTKIFWMGTKVNRCYCEGSTSTC
+>Streptococcus_suis|ORF1165 length 105 aa, 318 bp, from 811613..811930 of Streptococcus_suis
+AYNESVKRKECHLMKQVNMSKIINYLTILGLLILLSAFFLDNWIRDWFFPSSWGNVATML
+ILPLLGALILILSIYYKKLWTGLISIFLIISFPLIFGIGYFIFGP
+>Streptococcus_suis|ORF1166 length 125 aa, 378 bp, from 811867..812244 of Streptococcus_suis
+YLLNNLISSDIRYWLLYIWPLEGVVMNLTLLKRLNLVLYGIAIFLFVMLFLPIGQWFDIV
+NVNFKLTFFIIPFFGLASLPTAIYTKNVRQILLSVLLVALYFILFSLITALSGLFHLNFY
+SFFFK
+>Streptococcus_suis|ORF1455 length 114 aa, 345 bp, from 1026973..1027317 of Streptococcus_suis
+SCKLSLHIRWESWMGQGFYCYRFKLIHLRTNSNPFSFFRHLNSHFQHLRNEWTVMLPDSV
+LDQDISTSHCRCHHKGTRFDTILHHLMFCASQFFYTSNRNRLCTCPLNFCPHFV
+>Streptococcus_suis|ORF1456 length 116 aa, 351 bp, from complement(1027944..1028294) of Streptococcus_suis
+YGNACNSRPPTCDKSYSCWETLIYMGLNLVQFHFLISWYNGNMVISILQFFSHILFIYLA
+HHLLVTTVDWSRWLKVTGDNQRKINLLILFLAIALGYLVSTFFLELLMMGRSFANM
+>Streptococcus_suis|ORF1747 length 335 aa, 1008 bp, from complement(1225218..1226225) of Streptococcus_suis
+RMLNTDDTVTIYDVAREAGVSMATVSRVVNGNKNVKENTRKKVLEVIDRLDYRPNAVARG
+LASKKTTTVGVVIPNIANAYFATLAKGIDDIADMYKYNIVLANSDENDEKEINVVNTLFS
+KQVDGIIFMGYHLTDKIRAEFSRSRTPIVLAGTVDLEHQLPSVNIDYAAASVDAVNLLAK
+NNKKIAFVSGPLVDDINGKVRFAGYKQGLKDNGIEFNEGLVFESKYKYEEGYALAERILN
+AGATAAYVAEDEIAAGLLNGVSDMGIKVPEDFEIITSDDSLVTKFTRPNLTSINQPLYDI
+GAIAMRMLTKIMHKEELENREVVLNHGIKVRKSTK
+>Streptococcus_suis|ORF1748 length 377 aa, 1134 bp, from 1226384..1227517 of Streptococcus_suis
+TKISLFLPLHARKVSTMSKLHHVKSYLEANKMDLAIFSDPVSIYYLTGYHSDPHERHMML
+FVMPDHDSLLFLPALDVERAVATVDFPVAGYMDSENPWQIIKSKLPQKSFSAICAEFDNL
+NLTRYHGLQSIFSQPFSDITPLINTMKLIKSRDEIEKMLVAGEFADKAMQVGFNNISLDV
+TETDIIAQIEFEMKKQGISKMSFETMVLTGDNAANPHGIPSTNKIENNALLLFDLGVEAL
+GYTSDMTRTVAVGKPDQFKKDIYNLTLEAHMAAVNMIKPGVTAGEIDYAARSVIEKAGYG
+EYFNHRLGHGLGMSVHEFPSIMEGNDLVIEEGMCFSVEPGIYIPGKVGVRIEDCGYVTKN
+GFEVFTKTPKELLYFEG
+>Streptococcus_suis|ORF2037 length 234 aa, 705 bp, from complement(1422380..1423084) of Streptococcus_suis
+KSMTKTALITGVSSGIGLAQAGIFLENGWRVFGIDLASKPDLAGDFHFLQLDLTGDLSPV
+FSWCQSVDVLCNTAGILDDYRPHLDISEDELAQIFAVNFFAVTRLTRPYLQQMVDRQSGI
+IINMCSIASSLAGGGGSAYTASKHALAGFTKQLALDYAKDKVQIFGIAPGAVQTGMTQKD
+FEPGGLADWVADQTPIGRWTQPSEIAELTFMLATGKLASMQGQIITIDGGWSLK
+>Streptococcus_suis|ORF2038 length 112 aa, 339 bp, from 1422849..1423187 of Streptococcus_suis
+SSKMPAVLQRTSTDWHQEKTGDKSPVRSSCRKWKSPAKSGLLARSIPKTRQPFSKKIPAC
+ARPMPLETPVMRAVLVMDFYPVGRKDIARGRAPHGEAFTLAGHVDEEIGRRL
+>Streptococcus_suis|ORF2329 length 160 aa, 483 bp, from 1612284..1612766 of Streptococcus_suis
+LIETNWFHHLTGQEGLDVLFFHNLGFRITDQLYLEVRKFHLLQGLSQLLRRWSQESRVKG
+ARYIERNHPLDTCFLQQFNRLIHCSHLASDDDLGWCVVVGWGNNPRGNSRTDFFNQVDIC
+VENSNHLTSPCWRSQFHIFTTLSNQGNRIFKGQSSRCHQS
+>Streptococcus_suis|ORF2330 length 329 aa, 990 bp, from complement(1613050..1614039) of Streptococcus_suis
+ARKKDEGIMKTKITELLDIKYPIFQGGMAWVADGDLAGAVSNAGGLGIIGGGNAPKEVVK
+ANIDKVKSITDKPFGVNIMLLSPFADDIVDLVIEEGVKVVTTGAGNPGKYMERLHAAGIT
+VIPVVPSVALAKRMEKLGVDAVIAEGMEAGGHIGKLTTMTLVRQVVEAVSIPVIAAGGIA
+DGAGAAAAFMLGAEAVQVGTRFVVATESNAHQAYKEKVLKAKDIDTTVSASIVGHPVRAI
+KNKLSSAYAAAEKDFLAGKISADAIEELGAGALRNAVVDGDVTNGSVMAGQIAGLVSKEE
+SCEDILKDIYYGAAKVIREEASRWASVGE
+>Streptococcus_suis|ORF2619 length 107 aa, 324 bp, from 1802386..1802709 of Streptococcus_suis
+QLCVGSNPINSLFRRNFFVCCISSQSSCYVHTMWFVGIIVEIIVARYIIIAMGNFQCVCP
+CRRWSNVLNFRNDTIIQPHVFVLNIQTGVNDCNHHSATICLIFRTCF
+>Streptococcus_suis|ORF2620 length 192 aa, 579 bp, from complement(1803558..1804136) of Streptococcus_suis
+RLKIPCFQRKEVTMYDSFDKGWFVLQTYSGYENKVKENLLQRAHTYNMLENILRVEIPTQ
+TVQVEKNGEVKEVEENRFPGYVLVEMVMTDEAWFVVRNTPNVTGFVGSHGNRSKPTPLLE
+EEIRQILVSMGQTVQEFDIDVKVGDTVRIIDGAFTDYTGKITEIDNNKVKMVISMFGNDT
+IAEVNLSQIAEL

diff -r dc55e58fa890 -r 09a4ee5d12fd test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta Fri Mar 06 04:54:03 2015 -0500

@@ -0,0 +1,50 @@
+>Streptococcus_suis|ORF1 length 457 aa, 1374 bp, from 1..1374 of Streptococcus_suis
+MNQEQLFWQRFIELAKVNFKPSIYDFYVADAKLLGINQQVANIFLNRPFKKDFWEKNFEE
+LMIAASFESYGEPLTIQYQFTEDEQEIRNTTNTRSSIVHQVQTLEPATPQETFKPVHSDI
+KSQYTFANFVQGDNNHWAKAAALAVSDNLGELYNPLFIFGGPGLGKTHILNAIGNKVLAD
+NPQARIKYVSSETFINEFLEHLRLNDMESFKKTYRNLDLLLIDDIQSLRNKATTQEEFFH
+TFNALHEKNKQIVLTSDRNPDHLDNLEERLVTRFKWGLTSEITPPDFETRIAILRNKCEN
+LPYNFTNETLSYLAGQFDSNVRDLEGALKDIHLIATMRQLSEISVEVAAEAIRSRKQTNP
+QNMVIPIEKIQTEVGNFYGVSLKELKGSKRVQHIVHARQVAMFLAREMTDNSLPKIGKEF
+GNRDHTTVMHAYNKIKTLLLDDENLEIEITSIKNKLR
+>Streptococcus_suis|ORF292 length 216 aa, 651 bp, from 185183..185833 of Streptococcus_suis
+AVGKDHLTLDPISVEQIIAVMPVLIVVTAGAVQGSTLGSQSFFVGCFIAEEVTCLQTLGV
+GQGGQAVQIFAWIATRAGHQPVFTVVVIPRGNPCFCDDDFQSVHASCCQGIGHGTEIRQS
+RRRYLTIGPIGLDLKRASVVCVGLGATVQPVNHRFIALHLLVVARCHVARRAQRANTRHM
+EAGKAASEEVVIEGVRSNVPQFFSSRADRQVPLVQV
+>Streptococcus_suis|ORF583 length 391 aa, 1176 bp, from 397805..398980 of Streptococcus_suis
+RKKMKKQFELIATAAAGLEAVVGREIRNLGYECQVENGRVRFQGDVKSIIETNIWLRSAD
+RIKIIVGQFPAKTFEELFQGVFNLDWENYLPLGCKFPISKAKCVKSKLHNEPSVQAISKK
+AVVKKLQKHFSRPEGVPLQEMGAEFKIEVSILKDVATVMIDTTGSSLFKRGYRVEKGGAP
+IKENMAAAILQLSNWYPDKPLIDPTCGSGTFCIEAAMLAKNIAPGLKRSFAFEEWPWVED
+QLVVALRKEAQASIKTDLVLDITGSDIDARMIEIAKKNAFAAGVEQDIVFKQMRVQDLRT
+DKINGVIISNPPYGERLLDDEAIVTLYREMGETFEPLKTWSKFILTSDELFETRFGQQAD
+KKRKLYNGTLKVDLYQFFGQRVKRQVQEVQG
+>Streptococcus_suis|ORF874 length 113 aa, 342 bp, from complement(605625..605966) of Streptococcus_suis
+VSNIVTAITTVNQSQAFQLAKVFFDSQVVRQHLSWVPCICQTIPYWHTGEFCQFFHHFLT
+ETTEFNTVEHTSQNFSSIFCRFFLTKLDVICTKIFWMGTKVNRCYCEGSTSTC
+>Streptococcus_suis|ORF1165 length 105 aa, 318 bp, from 811613..811930 of Streptococcus_suis
+AYNESVKRKECHLMKQVNMSKIINYLTILGLLILLSAFFLDNWIRDWFFPSSWGNVATML
+ILPLLGALILILSIYYKKLWTGLISIFLIISFPLIFGIGYFIFGP
+>Streptococcus_suis|ORF1456 length 116 aa, 351 bp, from complement(1027944..1028294) of Streptococcus_suis
+YGNACNSRPPTCDKSYSCWETLIYMGLNLVQFHFLISWYNGNMVISILQFFSHILFIYLA
+HHLLVTTVDWSRWLKVTGDNQRKINLLILFLAIALGYLVSTFFLELLMMGRSFANM
+>Streptococcus_suis|ORF1747 length 335 aa, 1008 bp, from complement(1225218..1226225) of Streptococcus_suis
+RMLNTDDTVTIYDVAREAGVSMATVSRVVNGNKNVKENTRKKVLEVIDRLDYRPNAVARG
+LASKKTTTVGVVIPNIANAYFATLAKGIDDIADMYKYNIVLANSDENDEKEINVVNTLFS
+KQVDGIIFMGYHLTDKIRAEFSRSRTPIVLAGTVDLEHQLPSVNIDYAAASVDAVNLLAK
+NNKKIAFVSGPLVDDINGKVRFAGYKQGLKDNGIEFNEGLVFESKYKYEEGYALAERILN
+AGATAAYVAEDEIAAGLLNGVSDMGIKVPEDFEIITSDDSLVTKFTRPNLTSINQPLYDI
+GAIAMRMLTKIMHKEELENREVVLNHGIKVRKSTK
+>Streptococcus_suis|ORF2038 length 112 aa, 339 bp, from 1422849..1423187 of Streptococcus_suis
+SSKMPAVLQRTSTDWHQEKTGDKSPVRSSCRKWKSPAKSGLLARSIPKTRQPFSKKIPAC
+ARPMPLETPVMRAVLVMDFYPVGRKDIARGRAPHGEAFTLAGHVDEEIGRRL
+>Streptococcus_suis|ORF2329 length 160 aa, 483 bp, from 1612284..1612766 of Streptococcus_suis
+LIETNWFHHLTGQEGLDVLFFHNLGFRITDQLYLEVRKFHLLQGLSQLLRRWSQESRVKG
+ARYIERNHPLDTCFLQQFNRLIHCSHLASDDDLGWCVVVGWGNNPRGNSRTDFFNQVDIC
+VENSNHLTSPCWRSQFHIFTTLSNQGNRIFKGQSSRCHQS
+>Streptococcus_suis|ORF2620 length 192 aa, 579 bp, from complement(1803558..1804136) of Streptococcus_suis
+RLKIPCFQRKEVTMYDSFDKGWFVLQTYSGYENKVKENLLQRAHTYNMLENILRVEIPTQ
+TVQVEKNGEVKEVEENRFPGYVLVEMVMTDEAWFVVRNTPNVTGFVGSHGNRSKPTPLLE
+EEIRQILVSMGQTVQEFDIDVKVGDTVRIIDGAFTDYTGKITEIDNNKVKMVISMFGNDT
+IAEVNLSQIAEL

diff -r dc55e58fa890 -r 09a4ee5d12fd tools/sample_seqs/README.rst
--- a/tools/sample_seqs/README.rst Fri Nov 21 08:30:03 2014 -0500
+++ b/tools/sample_seqs/README.rst Fri Mar 06 04:54:03 2015 -0500

@@ -59,6 +59,10 @@
v0.1.1  - Using optparse to provide a proper command line API.
v0.1.2  - Interleaved mode for working with paired records.
         - Tool definition now embeds citation information.
+v0.2.0  - Option to give number of sequences (or pairs) desired.
+          This works by first counting all your sequences, then calculating
+          the percentage required in order to sample them uniformly (evenly).
+          This makes two passes through the input and is therefore slower.
======= ======================================================================

@@ -71,7 +75,7 @@
For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use
the following command from the Galaxy root folder::

-    $ tar -czf sample_seqs.tar.gz tools/sample_seqs/README.rst tools/sample_seqs/sample_seqs.py tools/sample_seqs/sample_seqs.xml tools/sample_seqs/tool_dependencies.xml test-data/ecoli.fastq test-data/ecoli.sample_N100.fastq test-data/ecoli.pair_sample_N100.fastq test-data/get_orf_input.Suis_ORF.prot.fasta test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta test-data/MID4_GLZRM4E04_rnd30_frclip.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff
+    $ tar -czf sample_seqs.tar.gz tools/sample_seqs/README.rst tools/sample_seqs/sample_seqs.py tools/sample_seqs/sample_seqs.xml tools/sample_seqs/tool_dependencies.xml test-data/ecoli.fastq test-data/ecoli.sample_N100.fastq test-data/ecoli.pair_sample_N100.fastq test-data/ecoli.sample_C10.fastq test-data/get_orf_input.Suis_ORF.prot.fasta test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta test-data/MID4_GLZRM4E04_rnd30_frclip.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff

Check this worked::

@@ -83,13 +87,17 @@
     test-data/ecoli.fastq
     test-data/ecoli.sample_N100.fastq
     test-data/ecoli.pair_sample_N100.fastq
+    test-data/ecoli.sample_C10.fastq
     test-data/get_orf_input.Suis_ORF.prot.fasta
     test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta
     test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta
+    test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta
+    test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta
     test-data/MID4_GLZRM4E04_rnd30_frclip.sff
     test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff
     test-data/MID4_GLZRM4E04_rnd30_pair_sample.sff
     test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff
+    test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff

Licence (MIT)

diff -r dc55e58fa890 -r 09a4ee5d12fd tools/sample_seqs/sample_seqs.py
--- a/tools/sample_seqs/sample_seqs.py Fri Nov 21 08:30:03 2014 -0500
+++ b/tools/sample_seqs/sample_seqs.py Fri Mar 06 04:54:03 2015 -0500

@@ -2,14 +2,14 @@
 """Sub-sample sequence from a FASTA, FASTQ or SFF file.
 
 This tool is a short Python script which requires Biopython 1.62 or later
-for SFF file support. If you use this tool in scientific work leading to a
+for sequence parsing. If you use this tool in scientific work leading to a
 publication, please cite the Biopython application note:
 
 Cock et al 2009. Biopython: freely available Python tools for computational
 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.
 
-This script is copyright 2014 by Peter Cock, The James Hutton Institute
+This script is copyright 2014-2015 by Peter Cock, The James Hutton Institute
 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
 See accompanying text file for licence details (MIT license).
 
@@ -20,7 +20,7 @@
 from optparse import OptionParser
 
 
-def stop_err(msg, err=1):
+def sys_exit(msg, err=1):
     sys.stderr.write(msg.rstrip() + "\n")
     sys.exit(err)
 
@@ -32,6 +32,9 @@
 e.g. Sample 20% of the reads:
 
 $ python sample_seqs.py -i my_seq.fastq -f fastq -p 20.0 -o sample.fastq
+
+This samples uniformly through the file, rather than at random, and therefore
+should be reproducible.
 """
 parser = OptionParser(usage=usage)
 parser.add_option('-i', '--input', dest='input',
@@ -49,6 +52,9 @@
 parser.add_option('-n', '--everyn', dest='everyn',
                   default=None,
                   help='Take every N-th read')
+parser.add_option('-c', '--count', dest='count',
+                  default=None,
+                  help='Take exactly N reads')
 parser.add_option("--interleaved", dest="interleaved",
                   default=False, action="store_true",
                   help="Input is interleaved reads, preserve the pairings")
@@ -58,31 +64,74 @@
 options, args = parser.parse_args()
 
 if options.version:
-    print("v0.1.2")
+    print("v0.2.0")
     sys.exit(0)
 
-seq_format = options.format
 in_file = options.input
 out_file = options.output
 interleaved = options.interleaved
 
 if not in_file:
-    stop_err("Require an input filename")
+    sys_exit("Require an input filename")
 if in_file != "/dev/stdin" and not os.path.isfile(in_file):
-    stop_err("Missing input file %r" % in_file)
+    sys_exit("Missing input file %r" % in_file)
 if not out_file:
-    stop_err("Require an output filename")
+    sys_exit("Require an output filename")
+if not options.format:
+    sys_exit("Require the sequence format")
+seq_format = options.format.lower()
+
+
+def count_fasta(filename):
+    from Bio.SeqIO.FastaIO import SimpleFastaParser
+    count = 0
+    with open(filename) as handle:
+        for title, seq in SimpleFastaParser(handle):
+            count += 1
+    return count
+
+
+def count_fastq(filename):
+    from Bio.SeqIO.QualityIO import FastqGeneralIterator
+    count = 0
+    with open(filename) as handle:
+        for title, seq, qual in FastqGeneralIterator(handle):
+            count += 1
+    return count
+
+
+def count_sff(filename):
+    from Bio import SeqIO
+    # If the SFF file has a built in index (which is normal),
+    # this will be parsed and is quicker than scanning
+    # the whole file.
+    return len(SeqIO.index(filename, "sff"))
+
+
+def count_sequences(filename, format):
+    if seq_format == "sff":
+        return count_sff(filename)
+    elif seq_format == "fasta":
+        return count_fasta(filename)
+    elif seq_format.startswith("fastq"):
+        return count_fastq(filename)
+    else:
+        sys_exit("Unsupported file type %r" % seq_format)
 
 
 if options.percent and options.everyn:
-    stop_err("Cannot combine -p and -n options")
+    sys_exit("Cannot combine -p and -n options")
+elif options.everyn and options.count:
+    sys_exit("Cannot combine -n and -c options")
+elif options.percent and options.count:
+    sys_exit("Cannot combine -p and -c options")
 elif options.everyn:
     try:
         N = ..
[.. truncated ..]
.. record
+                elif total - count + 1 <= N - taken:
+                    # remaining records (including this one) <= what we still need.
+                    # This is a safety check for floating point edge cases where
+                    # we need to take all remaining sequences to meet target
+                    taken += 1
+                    yield record
+            assert taken == N, "Picked %i, wanted %i" % (taken, N)
 else:
-    stop_err("Must use either -n or -p")
+    sys_exit("Must use either -n, -p or -c")
 
 
 def pair(iterator):
@@ -180,48 +297,30 @@
                     pos_handle.write(record)
     return count
 
-try:
-    from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
-    def fastq_filter(in_file, out_file, iterator_filter, inter):
-        count = 0
-        #from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
-        reader = fastqReader(open(in_file, "rU"))
-        writer = fastqWriter(open(out_file, "w"))
-        if inter:
-            for r1, r2 in iterator_filter(pair(reader)):
-                count += 1
-                writer.write(r1)
-                writer.write(r2)
-        else:
-            for record in iterator_filter(reader):
-                count += 1
-                writer.write(record)
-        writer.close()
-        reader.close()
-        return count
-except ImportError:
-    from Bio.SeqIO.QualityIO import FastqGeneralIterator
-    def fastq_filter(in_file, out_file, iterator_filter, inter):
-        count = 0
-        with open(in_file) as in_handle:
-            with open(out_file, "w") as pos_handle:
-                if inter:
-                    for r1, r2 in iterator_filter(pair(FastqGeneralIterator(in_handle))):
-                        count += 1
-                        pos_handle.write("@%s\n%s\n+\n%s\n" % r1)
-                        pos_handle.write("@%s\n%s\n+\n%s\n" % r2)
-                else:
-                    for title, seq, qual in iterator_filter(FastqGeneralIterator(in_handle)):
-                        count += 1
-                        pos_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
-        return count
+
+from Bio.SeqIO.QualityIO import FastqGeneralIterator
+def fastq_filter(in_file, out_file, iterator_filter, inter):
+    count = 0
+    with open(in_file) as in_handle:
+        with open(out_file, "w") as pos_handle:
+            if inter:
+                for r1, r2 in iterator_filter(pair(FastqGeneralIterator(in_handle))):
+                    count += 1
+                    pos_handle.write("@%s\n%s\n+\n%s\n" % r1)
+                    pos_handle.write("@%s\n%s\n+\n%s\n" % r2)
+            else:
+                for title, seq, qual in iterator_filter(FastqGeneralIterator(in_handle)):
+                    count += 1
+                    pos_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
+    return count
+
 
 def sff_filter(in_file, out_file, iterator_filter, inter):
     count = 0
     try:
         from Bio.SeqIO.SffIO import SffIterator, SffWriter
     except ImportError:
-        stop_err("SFF filtering requires Biopython 1.54 or later")
+        sys_exit("SFF filtering requires Biopython 1.54 or later")
     try:
         from Bio.SeqIO.SffIO import ReadRocheXmlManifest
     except ImportError:
@@ -246,14 +345,14 @@
             #count = writer.write_file(SffIterator(in_handle))
     return count
 
-if seq_format.lower()=="sff":
+if seq_format == "sff":
     count = sff_filter(in_file, out_file, sampler, interleaved)
-elif seq_format.lower()=="fasta":
+elif seq_format == "fasta":
     count = fasta_filter(in_file, out_file, sampler, interleaved)
-elif seq_format.lower().startswith("fastq"):
+elif seq_format.startswith("fastq"):
     count = fastq_filter(in_file, out_file, sampler, interleaved)
 else:
-    stop_err("Unsupported file type %r" % seq_format)
+    sys_exit("Unsupported file type %r" % seq_format)
 
 if interleaved:
     sys.stderr.write("Selected %i pairs\n" % count)

diff -r dc55e58fa890 -r 09a4ee5d12fd tools/sample_seqs/sample_seqs.xml
--- a/tools/sample_seqs/sample_seqs.xml Fri Nov 21 08:30:03 2014 -0500
+++ b/tools/sample_seqs/sample_seqs.xml Fri Mar 06 04:54:03 2015 -0500

@@ -1,4 +1,4 @@
-<tool id="sample_seqs" name="Sub-sample sequences files" version="0.1.2">
+<tool id="sample_seqs" name="Sub-sample sequences files" version="0.2.0">
     <description>e.g. to reduce coverage</description>
     <requirements>
         <requirement type="package" version="1.63">biopython</requirement>
@@ -9,9 +9,10 @@
 sample_seqs.py -f "$input_file.ext" -i "$input_file" -o "$output_file"
 #if str($sampling.type) == "everyNth":
 -n "${sampling.every_n}"
+#elif str($sampling.type) == "percentage":
+-p "${sampling.percent}"
 #else
-##elif str($sampling.type) == "percentage":
--p "${sampling.percent}"
+-c "${sampling.count}"
 #end if
 #if $interleaved
 --interleaved
@@ -26,8 +27,9 @@
         <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file" help="FASTA, FASTQ, or SFF format." />
         <conditional name="sampling">
             <param name="type" type="select" label="Sub-sampling approach">
-                <option value="everyNth">Take every N-th sequence (e.g. every fifth sequence)</option>
-                <option value="percentage">Take some percentage of the sequences (e.g. 20% will take every fifth sequence)</option>
+                <option value="everyNth">Take every N-th sequence (or pair, e.g. every fifth sequence)</option>
+                <option value="percentage">Take some percentage of the sequences (or pairs, e.g. 20% will take every fifth sequence)</option>
+                <option value="desired_count">Take exactly N sequences (or pairs, e.g. 1000 sequences)</option>
 
             </param>
             <when value="everyNth">
@@ -36,8 +38,11 @@
             <when value="percentage">
                 <param name="percent" value="20.0" type="float" min="0" max="100" label="Percentage" help="Between 0 and 100, e.g. 20% will take every 5th sequence" />
             </when>
+            <when value="desired_count">
+                <param name="count" value="1000" type="integer" min="1" label="N" help="Number of unique sequences to pick (between 1 and the total number in the input file)" />
+            </when>
         </conditional>
-        <param name="interleaved" type="boolean" label="Interleaved paired reads" help="Tick to preserve interleaved pairs on output" />
+        <param name="interleaved" type="boolean" label="Interleaved paired reads" help="This mode keeps paired reads together (e.g. take every 5th read pair)" />
     </inputs>
     <outputs>
         <data name="output_file" format="input" metadata_source="input_file" label="${input_file.name} (sub-sampled)"/>
@@ -82,12 +87,37 @@
             <output name="output_file" file="get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta" />
         </test>
         <test>
+            <param name="input_file" value="get_orf_input.Suis_ORF.prot.fasta" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="2910" />
+            <output name="output_file" file="get_orf_input.Suis_ORF.prot.fasta" />
+        </test>
+        <test>
+            <param name="input_file" value="get_orf_input.Suis_ORF.prot.fasta" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="10" />
+            <param name="interleaved" value="true" />
+            <output name="output_file" file="get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta" />
+        </test>
+        <test>
             <param name="input_file" value="ecoli.fastq" />
             <param name="type" value="percentage" />
             <param name="percent" value="1.0" />
             <output name="output_file" file="ecoli.sample_N100.fastq" />
         </test>
         <test>
+            <param name="input_file" value="ecoli.fastq" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="10" />
+            <output name="output_file" file="ecoli.sample_C10.fastq" />
+        </test>
+[.. truncated ..]
             <param name="percent" value="20.0" />
@@ -100,27 +130,55 @@
             <param name="interleaved" value="true" />
             <output name="output_file" file="MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff" ftype="sff"/>
         </test>
+        <test>
+            <param name="input_file" value="MID4_GLZRM4E04_rnd30.sff" ftype="sff" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="30" />
+            <output name="output_file" file="MID4_GLZRM4E04_rnd30.sff" ftype="sff"/>
+        </test>
+        <test>
+            <param name="input_file" value="MID4_GLZRM4E04_rnd30_frclip.sff" ftype="sff" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="1" />
+            <output name="output_file" file="MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff" ftype="sff"/>
+        </test>
     </tests>
     <help>
 **What it does**
 
 Takes an input file of sequences (typically FASTA or FASTQ, but also
 Standard Flowgram Format (SFF) is supported), and returns a new sequence
-file sub-sampling from this (in the same format).
+file sub-sampling uniformly from this (in the same format, preserving the
+input order and selecting sequences evenly through the input file).
 
-Several sampling modes are supported, all designed to be non-random. This
-allows reproducibility, and also works on paired sequence files. Also
-note that by sampling uniformly through the file, this avoids any bias
-should reads in any part of the file are of lesser quality (e.g. one part
-of the slide).
+Several sampling modes are supported, all designed to do non-random
+uniform sampling (i.e. evenly through the input file). This allows
+reproducibility, and also works on paired sequence files (run the tool
+twice, once on each file using the same settings).
 
-The simplest mode is to take every N-th sequence, for example taking
+By sampling uniformly (evenly) through the file, this avoids any bias
+should reads in any part of the file be of lesser quality (e.g. for
+high throughput sequencing the reads at the start and end of the file
+can be of lower quality).
+
+The simplest mode is to take every *N*-th sequence, for example taking
 every 2nd sequence would sample half the file - while taking every 5th
 sequence would take 20% of the file.
 
+The target count method picks *N* sequences from the input file, which
+again will be distributed uniformly (evenly) through the file. This works
+by first counting the number of records, then calculating the desired
+percentage of sequences to take. Note if your input file has exactly
+*N* sequences this selects them all (effectively copying the input file).
+If your input file has fewer than *N* sequences, this is treated as an
+error.
+
 If you tick the interleaved option, the file is processed as pairs of
-records - taking for example using 20% would take every 5th pair of
-records. This ensures your read pairs are preserved. Note this does not
+records to ensure your read pairs are not separated by sampling.
+For example using 20% would take every 5th pair of records, or you
+could request 1000 read pairs.
+
+.. class:: warningmark Note interleaved/pair mode does *not*
 actually check your read names match a known pair naming scheme!
 
 **Example Usage**
@@ -133,8 +191,12 @@
 
 Similarly, if you had some Illumina paired end data interleaved into one
 file with an estimated x200 coverage, you would run this tool in
-interleaved mode. Taking every 3rd read pair. This would reduce the
-estimated coverage to about x66, while preserving the read pairing.
+interleaved mode, taking every 3rd read pair. This would again reduce
+the estimated coverage to about x66, while preserving the read pairing.
+
+Suppose you have a transcriptome assembly, and wish to look at the
+species distribution of the top BLAST hits for an initial quality check.
+Rather than using all your sequences, you could pick 1000 only for this.
 
 **Citation**
 
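The interleaved mode described in the help text treats each two consecutive records as one unit, so that sampling never separates a read pair. The body of the tool's own `pair` function is not shown in this diff, so the following is only a sketch of the idea using a hypothetical helper, `pair_up`; like the tool, it does not check that the read names actually match a pair naming scheme.

```python
def pair_up(iterator):
    """Group consecutive records into (first, second) tuples.

    Hypothetical helper for illustration only; it is an assumption about
    how interleaved records could be grouped, not the tool's own code.
    """
    it = iter(iterator)
    for first in it:
        try:
            second = next(it)
        except StopIteration:
            # An odd number of records cannot be split into pairs.
            raise ValueError("Odd number of records in interleaved input")
        yield (first, second)


if __name__ == "__main__":
    reads = ["r1/1", "r1/2", "r2/1", "r2/2"]
    for left, right in pair_up(reads):
        print(left, right)
```

A sampler applied to the output of such a helper selects whole pairs, which is why a count of 10 in interleaved mode means 10 pairs rather than 10 individual sequences.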