Binary file test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff has changed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/ecoli.sample_C10.fastq	Fri Mar 06 04:54:03 2015 -0500
@@ -0,0 +1,40 @@
+@frag_1
+AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTC
++
+##%')+.024JMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_504
+TACGTTCGGCATCGCTGATATTGGGTAAAGCATCCT
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_1008
+GTCGCAGGTATAGACCCCGTCAACGTCCGTCCAAAT
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_1512
+ATCACCTACCACCGAGATAATGGCCAGCCGTTCCGT
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_2016
+GAAACCTTCGCGCAGGAAGTCGGCATATTGATCCGC
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_2520
+TCCTTCATCACGGGCCTTCGCCACGCGCGCGGCAAA
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_3024
+TCCAGGGTCATCGCCACTGGAATTTGCTTACCCAGT
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_3528
+ACCGCGCCGATTTCCGCGACCGCCTGCCGCGCCTGC
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_4032
+CGACCGCCGAAATCTTTAAATGCCAGCGTTGGCCCG
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
+@frag_4536
+GGCACGGTATCGTTCACGTTGGTCGCAGCAATAAAA
++
+MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta	Fri Mar 06 04:54:03 2015 -0500
@@ -0,0 +1,119 @@
+>Streptococcus_suis|ORF1 length 457 aa, 1374 bp, from 1..1374 of Streptococcus_suis
+MNQEQLFWQRFIELAKVNFKPSIYDFYVADAKLLGINQQVANIFLNRPFKKDFWEKNFEE
+LMIAASFESYGEPLTIQYQFTEDEQEIRNTTNTRSSIVHQVQTLEPATPQETFKPVHSDI
+KSQYTFANFVQGDNNHWAKAAALAVSDNLGELYNPLFIFGGPGLGKTHILNAIGNKVLAD
+NPQARIKYVSSETFINEFLEHLRLNDMESFKKTYRNLDLLLIDDIQSLRNKATTQEEFFH
+TFNALHEKNKQIVLTSDRNPDHLDNLEERLVTRFKWGLTSEITPPDFETRIAILRNKCEN
+LPYNFTNETLSYLAGQFDSNVRDLEGALKDIHLIATMRQLSEISVEVAAEAIRSRKQTNP
+QNMVIPIEKIQTEVGNFYGVSLKELKGSKRVQHIVHARQVAMFLAREMTDNSLPKIGKEF
+GNRDHTTVMHAYNKIKTLLLDDENLEIEITSIKNKLR
+>Streptococcus_suis|ORF2 length 385 aa, 1158 bp, from 1507..2664 of Streptococcus_suis
+IINKGESMIQFSINKNIFLQALSITKRAISTKNAIPILSTVKITVTSEGITLTGSNGQIS
+IEHFISIQDENAGLLISSPGSILLEAGFFINVVSSMPDLVLDFNEIEQKQIVLTSGKSEI
+TLKGKEAEQYPRLQEVPTSKPLVLETKVLKQTINETAFAASTQESRPILTGVHFVLTENK
+NLKTVATDSHRMSQRKLVLDTSGDDFNVVIPSRSLREFTAVFTDDIETVEVFFSNNQILF
+RSEHISFYTRLLEGTYPDTDRLIPTEFKTTAIFDTANLRHSMERARLLSNATQNGTVKLE
+IANNVVSAHVNSPEVGRVNEELDTVEVSGEDLVISFNPTYLIEALKATTSEQVKISFISS
+VRPFTLIPNNEGEDFIQLVTPVRTN
+>Streptococcus_suis|ORF291 length 760 aa, 2283 bp, from complement(184307..186589) of Streptococcus_suis
+KRGEFMRFNQFSFIKKETSVYLQELDTLGFQLIPDASSKTNLETFVRKCHFLTANTDFAL
+SNMIAEWDTDLLTFFQSDRELTDQIFYQVAFQLLGFVPGMDYTDVMDFVEKSNFPIVYGD
+IIDNLYQLLNTRTKSGNTLIDQLVSDDLIPEDNHYHFFNGKSMATFSTKNLIREVVYVET
+PVDTAGTGQTDIVKLSILRPHFDGKIPAVITNSPYHQGVNDVASDKALHKMEGELAEKQV
+GTIQVKQASITKLDLDQRNLPVSPATEKLGHITSYSLNDYFLARGFASLHVSGVGTLGST
+GYMTSGDYQQVEGYKAVIDWLNGRTKAYTDHTRSLEVKADWANGKVATTGLSYLGTMSNA
+LATTGVDGLEVIIAEAGISSWYDYYRENGLVTSPGGYPGEDLDSLTALTYSKSLQAGDFL
+RNKAAYEKGLAAERAALDRTSGDYNQYWHDRNYLLHADRVKCEVVFTHGSQDWNVKPIHV
+WNMFHALPSHIKKHLFFHNGAHVYMNNWQSIDFRESMNALLSQKLLGYENNYQLPTVIWQ
+DNSGEQTWTTLDTFGGENETVLPLGTGSQTVANQYTQEDFERYGKSYSAFHQDLYAGKAN
+QISIELPVTEGLLLNGQVTLKLRVASSVAKGLLSAQLLDKGNKKRLAPIPAPKARLSLDN
+GRYHAQENLVELPYVEMPQRLVTKGFMNLQNRTDLMTVEEVVPGQWMNLTWKLQPTIYQL
+KKGDVLELILYTTDFECTVRDNSQWQIHLDLSQSQLILPH
+>Streptococcus_suis|ORF292 length 216 aa, 651 bp, from 185183..185833 of Streptococcus_suis
+AVGKDHLTLDPISVEQIIAVMPVLIVVTAGAVQGSTLGSQSFFVGCFIAEEVTCLQTLGV
+GQGGQAVQIFAWIATRAGHQPVFTVVVIPRGNPCFCDDDFQSVHASCCQGIGHGTEIRQS
+RRRYLTIGPIGLDLKRASVVCVGLGATVQPVNHRFIALHLLVVARCHVARRAQRANTRHM
+EAGKAASEEVVIEGVRSNVPQFFSSRADRQVPLVQV
+>Streptococcus_suis|ORF583 length 391 aa, 1176 bp, from 397805..398980 of Streptococcus_suis
+RKKMKKQFELIATAAAGLEAVVGREIRNLGYECQVENGRVRFQGDVKSIIETNIWLRSAD
+RIKIIVGQFPAKTFEELFQGVFNLDWENYLPLGCKFPISKAKCVKSKLHNEPSVQAISKK
+AVVKKLQKHFSRPEGVPLQEMGAEFKIEVSILKDVATVMIDTTGSSLFKRGYRVEKGGAP
+IKENMAAAILQLSNWYPDKPLIDPTCGSGTFCIEAAMLAKNIAPGLKRSFAFEEWPWVED
+QLVVALRKEAQASIKTDLVLDITGSDIDARMIEIAKKNAFAAGVEQDIVFKQMRVQDLRT
+DKINGVIISNPPYGERLLDDEAIVTLYREMGETFEPLKTWSKFILTSDELFETRFGQQAD
+KKRKLYNGTLKVDLYQFFGQRVKRQVQEVQG
+>Streptococcus_suis|ORF584 length 487 aa, 1464 bp, from 398981..400444 of Streptococcus_suis
+EDIVGEKNSHHLPLDEEKVLDFEVAKDLTIEEAVKKHKEIEAGVTEDDGLLDRYIKQHRA
+EIESQKFETKINHLPLVEVADEEKNQGHESAEEVEANESSLTEVSEEIAPIVEELSVTPM
+ETLEETVIASTVAMEGLSSVADDSSLELEEDETEDLDHSEGADRDQKKKFYFWSAVGLSM
+IGVMATALVWMNSVNKSNTATSSSSTSTSQTSSTASSSTDANVTAFEQLYNSFFTDSSLT
+KLKNSEFGKLAELKVLLEKLDKNSDSYTKAKEQYDHLEKAIAAIQAINGQFDKEVVVNGE
+IDTTATVKSGESLSATTTGISAVDSLLASVVNFGRSQQEVASATVASEAAVTRNQGADET
+VSTGVPATTEVASTTVSGSTTDFGIAVPAGVVLQRDRSRVPYNQAMIDDVNNEAWNFNPG
+ILENIVTISQQRGYITGNQYILEKVNIINGNGYYNMFKPDGTYLFSINCKTGYFVGNGAG
+HSDALDY
+>Streptococcus_suis|ORF873 length 343 aa, 1032 bp, from 605439..606470 of Streptococcus_suis
+TLGEETMTNVFKGRHFLAEKDFTRAELEWLIDFSAHLKDLKKRNIPHRYLEGKNIALLFE
+KTSTRTRAAFTVASIDLGAHPEYLGANDIQLGKKESTEDTAKVLGRMFDGIEFRGFSQKM
+VEELAEFSGVPVWNGLTDAWHPTQMLADYLTVKENFGKLEGLTLVYCGDGRNNVANSLLV
+TGAILGVNVHIFSPKELFPEEEVVALAEGFAKESGARVLITDNADEAVKGADVLYTDVWV
+SMGEEDKFAERVALLKPYQVNMELVKKAENENLIFLHCLPAFHDTNTVYGKDVAEKFGVE
+EMEVTDEVFRSKYARHFDQAENRMHTIKAVMAATLGDPFVPRV
+>Streptococcus_suis|ORF874 length 113 aa, 342 bp, from complement(605625..605966) of Streptococcus_suis
+VSNIVTAITTVNQSQAFQLAKVFFDSQVVRQHLSWVPCICQTIPYWHTGEFCQFFHHFLT
+ETTEFNTVEHTSQNFSSIFCRFFLTKLDVICTKIFWMGTKVNRCYCEGSTSTC
+>Streptococcus_suis|ORF1165 length 105 aa, 318 bp, from 811613..811930 of Streptococcus_suis
+AYNESVKRKECHLMKQVNMSKIINYLTILGLLILLSAFFLDNWIRDWFFPSSWGNVATML
+ILPLLGALILILSIYYKKLWTGLISIFLIISFPLIFGIGYFIFGP
+>Streptococcus_suis|ORF1166 length 125 aa, 378 bp, from 811867..812244 of Streptococcus_suis
+YLLNNLISSDIRYWLLYIWPLEGVVMNLTLLKRLNLVLYGIAIFLFVMLFLPIGQWFDIV
+NVNFKLTFFIIPFFGLASLPTAIYTKNVRQILLSVLLVALYFILFSLITALSGLFHLNFY
+SFFFK
+>Streptococcus_suis|ORF1455 length 114 aa, 345 bp, from 1026973..1027317 of Streptococcus_suis
+SCKLSLHIRWESWMGQGFYCYRFKLIHLRTNSNPFSFFRHLNSHFQHLRNEWTVMLPDSV
+LDQDISTSHCRCHHKGTRFDTILHHLMFCASQFFYTSNRNRLCTCPLNFCPHFV
+>Streptococcus_suis|ORF1456 length 116 aa, 351 bp, from complement(1027944..1028294) of Streptococcus_suis
+YGNACNSRPPTCDKSYSCWETLIYMGLNLVQFHFLISWYNGNMVISILQFFSHILFIYLA
+HHLLVTTVDWSRWLKVTGDNQRKINLLILFLAIALGYLVSTFFLELLMMGRSFANM
+>Streptococcus_suis|ORF1747 length 335 aa, 1008 bp, from complement(1225218..1226225) of Streptococcus_suis
+RMLNTDDTVTIYDVAREAGVSMATVSRVVNGNKNVKENTRKKVLEVIDRLDYRPNAVARG
+LASKKTTTVGVVIPNIANAYFATLAKGIDDIADMYKYNIVLANSDENDEKEINVVNTLFS
+KQVDGIIFMGYHLTDKIRAEFSRSRTPIVLAGTVDLEHQLPSVNIDYAAASVDAVNLLAK
+NNKKIAFVSGPLVDDINGKVRFAGYKQGLKDNGIEFNEGLVFESKYKYEEGYALAERILN
+AGATAAYVAEDEIAAGLLNGVSDMGIKVPEDFEIITSDDSLVTKFTRPNLTSINQPLYDI
+GAIAMRMLTKIMHKEELENREVVLNHGIKVRKSTK
+>Streptococcus_suis|ORF1748 length 377 aa, 1134 bp, from 1226384..1227517 of Streptococcus_suis
+TKISLFLPLHARKVSTMSKLHHVKSYLEANKMDLAIFSDPVSIYYLTGYHSDPHERHMML
+FVMPDHDSLLFLPALDVERAVATVDFPVAGYMDSENPWQIIKSKLPQKSFSAICAEFDNL
+NLTRYHGLQSIFSQPFSDITPLINTMKLIKSRDEIEKMLVAGEFADKAMQVGFNNISLDV
+TETDIIAQIEFEMKKQGISKMSFETMVLTGDNAANPHGIPSTNKIENNALLLFDLGVEAL
+GYTSDMTRTVAVGKPDQFKKDIYNLTLEAHMAAVNMIKPGVTAGEIDYAARSVIEKAGYG
+EYFNHRLGHGLGMSVHEFPSIMEGNDLVIEEGMCFSVEPGIYIPGKVGVRIEDCGYVTKN
+GFEVFTKTPKELLYFEG
+>Streptococcus_suis|ORF2037 length 234 aa, 705 bp, from complement(1422380..1423084) of Streptococcus_suis
+KSMTKTALITGVSSGIGLAQAGIFLENGWRVFGIDLASKPDLAGDFHFLQLDLTGDLSPV
+FSWCQSVDVLCNTAGILDDYRPHLDISEDELAQIFAVNFFAVTRLTRPYLQQMVDRQSGI
+IINMCSIASSLAGGGGSAYTASKHALAGFTKQLALDYAKDKVQIFGIAPGAVQTGMTQKD
+FEPGGLADWVADQTPIGRWTQPSEIAELTFMLATGKLASMQGQIITIDGGWSLK
+>Streptococcus_suis|ORF2038 length 112 aa, 339 bp, from 1422849..1423187 of Streptococcus_suis
+SSKMPAVLQRTSTDWHQEKTGDKSPVRSSCRKWKSPAKSGLLARSIPKTRQPFSKKIPAC
+ARPMPLETPVMRAVLVMDFYPVGRKDIARGRAPHGEAFTLAGHVDEEIGRRL
+>Streptococcus_suis|ORF2329 length 160 aa, 483 bp, from 1612284..1612766 of Streptococcus_suis
+LIETNWFHHLTGQEGLDVLFFHNLGFRITDQLYLEVRKFHLLQGLSQLLRRWSQESRVKG
+ARYIERNHPLDTCFLQQFNRLIHCSHLASDDDLGWCVVVGWGNNPRGNSRTDFFNQVDIC
+VENSNHLTSPCWRSQFHIFTTLSNQGNRIFKGQSSRCHQS
+>Streptococcus_suis|ORF2330 length 329 aa, 990 bp, from complement(1613050..1614039) of Streptococcus_suis
+ARKKDEGIMKTKITELLDIKYPIFQGGMAWVADGDLAGAVSNAGGLGIIGGGNAPKEVVK
+ANIDKVKSITDKPFGVNIMLLSPFADDIVDLVIEEGVKVVTTGAGNPGKYMERLHAAGIT
+VIPVVPSVALAKRMEKLGVDAVIAEGMEAGGHIGKLTTMTLVRQVVEAVSIPVIAAGGIA
+DGAGAAAAFMLGAEAVQVGTRFVVATESNAHQAYKEKVLKAKDIDTTVSASIVGHPVRAI
+KNKLSSAYAAAEKDFLAGKISADAIEELGAGALRNAVVDGDVTNGSVMAGQIAGLVSKEE
+SCEDILKDIYYGAAKVIREEASRWASVGE
+>Streptococcus_suis|ORF2619 length 107 aa, 324 bp, from 1802386..1802709 of Streptococcus_suis
+QLCVGSNPINSLFRRNFFVCCISSQSSCYVHTMWFVGIIVEIIVARYIIIAMGNFQCVCP
+CRRWSNVLNFRNDTIIQPHVFVLNIQTGVNDCNHHSATICLIFRTCF
+>Streptococcus_suis|ORF2620 length 192 aa, 579 bp, from complement(1803558..1804136) of Streptococcus_suis
+RLKIPCFQRKEVTMYDSFDKGWFVLQTYSGYENKVKENLLQRAHTYNMLENILRVEIPTQ
+TVQVEKNGEVKEVEENRFPGYVLVEMVMTDEAWFVVRNTPNVTGFVGSHGNRSKPTPLLE
+EEIRQILVSMGQTVQEFDIDVKVGDTVRIIDGAFTDYTGKITEIDNNKVKMVISMFGNDT
+IAEVNLSQIAEL
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta	Fri Mar 06 04:54:03 2015 -0500
@@ -0,0 +1,50 @@
+>Streptococcus_suis|ORF1 length 457 aa, 1374 bp, from 1..1374 of Streptococcus_suis
+MNQEQLFWQRFIELAKVNFKPSIYDFYVADAKLLGINQQVANIFLNRPFKKDFWEKNFEE
+LMIAASFESYGEPLTIQYQFTEDEQEIRNTTNTRSSIVHQVQTLEPATPQETFKPVHSDI
+KSQYTFANFVQGDNNHWAKAAALAVSDNLGELYNPLFIFGGPGLGKTHILNAIGNKVLAD
+NPQARIKYVSSETFINEFLEHLRLNDMESFKKTYRNLDLLLIDDIQSLRNKATTQEEFFH
+TFNALHEKNKQIVLTSDRNPDHLDNLEERLVTRFKWGLTSEITPPDFETRIAILRNKCEN
+LPYNFTNETLSYLAGQFDSNVRDLEGALKDIHLIATMRQLSEISVEVAAEAIRSRKQTNP
+QNMVIPIEKIQTEVGNFYGVSLKELKGSKRVQHIVHARQVAMFLAREMTDNSLPKIGKEF
+GNRDHTTVMHAYNKIKTLLLDDENLEIEITSIKNKLR
+>Streptococcus_suis|ORF292 length 216 aa, 651 bp, from 185183..185833 of Streptococcus_suis
+AVGKDHLTLDPISVEQIIAVMPVLIVVTAGAVQGSTLGSQSFFVGCFIAEEVTCLQTLGV
+GQGGQAVQIFAWIATRAGHQPVFTVVVIPRGNPCFCDDDFQSVHASCCQGIGHGTEIRQS
+RRRYLTIGPIGLDLKRASVVCVGLGATVQPVNHRFIALHLLVVARCHVARRAQRANTRHM
+EAGKAASEEVVIEGVRSNVPQFFSSRADRQVPLVQV
+>Streptococcus_suis|ORF583 length 391 aa, 1176 bp, from 397805..398980 of Streptococcus_suis
+RKKMKKQFELIATAAAGLEAVVGREIRNLGYECQVENGRVRFQGDVKSIIETNIWLRSAD
+RIKIIVGQFPAKTFEELFQGVFNLDWENYLPLGCKFPISKAKCVKSKLHNEPSVQAISKK
+AVVKKLQKHFSRPEGVPLQEMGAEFKIEVSILKDVATVMIDTTGSSLFKRGYRVEKGGAP
+IKENMAAAILQLSNWYPDKPLIDPTCGSGTFCIEAAMLAKNIAPGLKRSFAFEEWPWVED
+QLVVALRKEAQASIKTDLVLDITGSDIDARMIEIAKKNAFAAGVEQDIVFKQMRVQDLRT
+DKINGVIISNPPYGERLLDDEAIVTLYREMGETFEPLKTWSKFILTSDELFETRFGQQAD
+KKRKLYNGTLKVDLYQFFGQRVKRQVQEVQG
+>Streptococcus_suis|ORF874 length 113 aa, 342 bp, from complement(605625..605966) of Streptococcus_suis
+VSNIVTAITTVNQSQAFQLAKVFFDSQVVRQHLSWVPCICQTIPYWHTGEFCQFFHHFLT
+ETTEFNTVEHTSQNFSSIFCRFFLTKLDVICTKIFWMGTKVNRCYCEGSTSTC
+>Streptococcus_suis|ORF1165 length 105 aa, 318 bp, from 811613..811930 of Streptococcus_suis
+AYNESVKRKECHLMKQVNMSKIINYLTILGLLILLSAFFLDNWIRDWFFPSSWGNVATML
+ILPLLGALILILSIYYKKLWTGLISIFLIISFPLIFGIGYFIFGP
+>Streptococcus_suis|ORF1456 length 116 aa, 351 bp, from complement(1027944..1028294) of Streptococcus_suis
+YGNACNSRPPTCDKSYSCWETLIYMGLNLVQFHFLISWYNGNMVISILQFFSHILFIYLA
+HHLLVTTVDWSRWLKVTGDNQRKINLLILFLAIALGYLVSTFFLELLMMGRSFANM
+>Streptococcus_suis|ORF1747 length 335 aa, 1008 bp, from complement(1225218..1226225) of Streptococcus_suis
+RMLNTDDTVTIYDVAREAGVSMATVSRVVNGNKNVKENTRKKVLEVIDRLDYRPNAVARG
+LASKKTTTVGVVIPNIANAYFATLAKGIDDIADMYKYNIVLANSDENDEKEINVVNTLFS
+KQVDGIIFMGYHLTDKIRAEFSRSRTPIVLAGTVDLEHQLPSVNIDYAAASVDAVNLLAK
+NNKKIAFVSGPLVDDINGKVRFAGYKQGLKDNGIEFNEGLVFESKYKYEEGYALAERILN
+AGATAAYVAEDEIAAGLLNGVSDMGIKVPEDFEIITSDDSLVTKFTRPNLTSINQPLYDI
+GAIAMRMLTKIMHKEELENREVVLNHGIKVRKSTK
+>Streptococcus_suis|ORF2038 length 112 aa, 339 bp, from 1422849..1423187 of Streptococcus_suis
+SSKMPAVLQRTSTDWHQEKTGDKSPVRSSCRKWKSPAKSGLLARSIPKTRQPFSKKIPAC
+ARPMPLETPVMRAVLVMDFYPVGRKDIARGRAPHGEAFTLAGHVDEEIGRRL
+>Streptococcus_suis|ORF2329 length 160 aa, 483 bp, from 1612284..1612766 of Streptococcus_suis
+LIETNWFHHLTGQEGLDVLFFHNLGFRITDQLYLEVRKFHLLQGLSQLLRRWSQESRVKG
+ARYIERNHPLDTCFLQQFNRLIHCSHLASDDDLGWCVVVGWGNNPRGNSRTDFFNQVDIC
+VENSNHLTSPCWRSQFHIFTTLSNQGNRIFKGQSSRCHQS
+>Streptococcus_suis|ORF2620 length 192 aa, 579 bp, from complement(1803558..1804136) of Streptococcus_suis
+RLKIPCFQRKEVTMYDSFDKGWFVLQTYSGYENKVKENLLQRAHTYNMLENILRVEIPTQ
+TVQVEKNGEVKEVEENRFPGYVLVEMVMTDEAWFVVRNTPNVTGFVGSHGNRSKPTPLLE
+EEIRQILVSMGQTVQEFDIDVKVGDTVRIIDGAFTDYTGKITEIDNNKVKMVISMFGNDT
+IAEVNLSQIAEL
--- a/tools/sample_seqs/README.rst	Fri Nov 21 08:30:03 2014 -0500
+++ b/tools/sample_seqs/README.rst	Fri Mar 06 04:54:03 2015 -0500
@@ -59,6 +59,10 @@
 v0.1.1  - Using optparse to provide a proper command line API.
 v0.1.2  - Interleaved mode for working with paired records.
         - Tool definition now embeds citation information.
+v0.2.0  - Option to give number of sequences (or pairs) desired.
+          This works by first counting all your sequences, then calculating
+          the percentage required in order to sample them uniformly (evenly).
+          This makes two passes through the input and is therefore slower.
 ======= ======================================================================


@@ -71,7 +75,7 @@
 For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use
 the following command from the Galaxy root folder::

-    $ tar -czf sample_seqs.tar.gz tools/sample_seqs/README.rst tools/sample_seqs/sample_seqs.py tools/sample_seqs/sample_seqs.xml tools/sample_seqs/tool_dependencies.xml test-data/ecoli.fastq test-data/ecoli.sample_N100.fastq test-data/ecoli.pair_sample_N100.fastq test-data/get_orf_input.Suis_ORF.prot.fasta test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta test-data/MID4_GLZRM4E04_rnd30_frclip.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff
+    $ tar -czf sample_seqs.tar.gz tools/sample_seqs/README.rst tools/sample_seqs/sample_seqs.py tools/sample_seqs/sample_seqs.xml tools/sample_seqs/tool_dependencies.xml test-data/ecoli.fastq test-data/ecoli.sample_N100.fastq test-data/ecoli.pair_sample_N100.fastq test-data/ecoli.sample_C10.fastq test-data/get_orf_input.Suis_ORF.prot.fasta test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta test-data/MID4_GLZRM4E04_rnd30_frclip.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff

 Check this worked::

@@ -83,13 +87,17 @@
     test-data/ecoli.fastq
     test-data/ecoli.sample_N100.fastq
     test-data/ecoli.pair_sample_N100.fastq
+    test-data/ecoli.sample_C10.fastq
     test-data/get_orf_input.Suis_ORF.prot.fasta
     test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta
     test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta
+    test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta
+    test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta
     test-data/MID4_GLZRM4E04_rnd30_frclip.sff
     test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff
     test-data/MID4_GLZRM4E04_rnd30_pair_sample.sff
     test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff
+    test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff


 Licence (MIT)
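Review note on the v0.2.0 README entry: the two-pass idea it describes (count everything, then derive the fraction to keep) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the tool's code: `count_fasta_records` and `sampling_fraction` are hypothetical helper names, and records are counted by plain `>` header lines rather than with Biopython.

```python
import io


def count_fasta_records(handle):
    """First pass: count FASTA records by counting '>' header lines."""
    return sum(1 for line in handle if line.startswith(">"))


def sampling_fraction(wanted, total):
    """Fraction of records to keep so `wanted` of `total` are taken evenly."""
    if wanted < 1 or wanted > total:
        raise ValueError("Requested %i sequences, but file only has %i"
                         % (wanted, total))
    return float(wanted) / float(total)


# Tiny in-memory FASTA file with four records (one sequence is wrapped).
example = io.StringIO(">a\nACGT\n>b\nGG\nTT\n>c\nA\n>d\nC\n")
total = count_fasta_records(example)
fraction = sampling_fraction(2, total)
```

The second pass then re-reads the input and keeps records at that fraction, which is why the count mode is slower than the every-N-th and percentage modes.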
--- a/tools/sample_seqs/sample_seqs.py	Fri Nov 21 08:30:03 2014 -0500
+++ b/tools/sample_seqs/sample_seqs.py	Fri Mar 06 04:54:03 2015 -0500
@@ -2,14 +2,14 @@
 """Sub-sample sequence from a FASTA, FASTQ or SFF file.

 This tool is a short Python script which requires Biopython 1.62 or later
-for SFF file support. If you use this tool in scientific work leading to a
+for sequence parsing. If you use this tool in scientific work leading to a
 publication, please cite the Biopython application note:

 Cock et al 2009. Biopython: freely available Python tools for computational
 molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3.
 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878.

-This script is copyright 2014 by Peter Cock, The James Hutton Institute
+This script is copyright 2014-2015 by Peter Cock, The James Hutton Institute
 (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved.
 See accompanying text file for licence details (MIT license).

@@ -20,7 +20,7 @@
 from optparse import OptionParser


-def stop_err(msg, err=1):
+def sys_exit(msg, err=1):
     sys.stderr.write(msg.rstrip() + "\n")
     sys.exit(err)

@@ -32,6 +32,9 @@
 e.g. Sample 20% of the reads:

 $ python sample_seqs.py -i my_seq.fastq -f fastq -p 20.0 -o sample.fastq
+
+This samples uniformly through the file, rather than at random, and therefore
+should be reproducible.
 """
 parser = OptionParser(usage=usage)
 parser.add_option('-i', '--input', dest='input',
@@ -49,6 +52,9 @@
 parser.add_option('-n', '--everyn', dest='everyn',
                   default=None,
                   help='Take every N-th read')
+parser.add_option('-c', '--count', dest='count',
+                  default=None,
+                  help='Take exactly N reads')
 parser.add_option("--interleaved", dest="interleaved",
                   default=False, action="store_true",
                   help="Input is interleaved reads, preserve the pairings")
@@ -58,31 +64,74 @@
 options, args = parser.parse_args()

 if options.version:
-    print("v0.1.2")
+    print("v0.2.0")
     sys.exit(0)

-seq_format = options.format
 in_file = options.input
 out_file = options.output
 interleaved = options.interleaved

 if not in_file:
-    stop_err("Require an input filename")
+    sys_exit("Require an input filename")
 if in_file != "/dev/stdin" and not os.path.isfile(in_file):
-    stop_err("Missing input file %r" % in_file)
+    sys_exit("Missing input file %r" % in_file)
 if not out_file:
-    stop_err("Require an output filename")
+    sys_exit("Require an output filename")
+if not options.format:
+    sys_exit("Require the sequence format")
+seq_format = options.format.lower()
+
+
+def count_fasta(filename):
+    from Bio.SeqIO.FastaIO import SimpleFastaParser
+    count = 0
+    with open(filename) as handle:
+        for title, seq in SimpleFastaParser(handle):
+            count += 1
+    return count
+
+
+def count_fastq(filename):
+    from Bio.SeqIO.QualityIO import FastqGeneralIterator
+    count = 0
+    with open(filename) as handle:
+        for title, seq, qual in FastqGeneralIterator(handle):
+            count += 1
+    return count
+
+
+def count_sff(filename):
+    from Bio import SeqIO
+    # If the SFF file has a built in index (which is normal),
+    # this will be parsed and is quicker than scanning
+    # the whole file.
+    return len(SeqIO.index(filename, "sff"))
+
+
+def count_sequences(filename, seq_format):
+    if seq_format == "sff":
+        return count_sff(filename)
+    elif seq_format == "fasta":
+        return count_fasta(filename)
+    elif seq_format.startswith("fastq"):
+        return count_fastq(filename)
+    else:
+        sys_exit("Unsupported file type %r" % seq_format)


 if options.percent and options.everyn:
-    stop_err("Cannot combine -p and -n options")
+    sys_exit("Cannot combine -p and -n options")
+elif options.everyn and options.count:
+    sys_exit("Cannot combine -n and -c options")
+elif options.percent and options.count:
+    sys_exit("Cannot combine -p and -c options")
 elif options.everyn:
     try:
         N = int(options.everyn)
     except:
-        stop_err("Bad N argument %r" % options.everyn)
+        sys_exit("Bad -n argument %r" % options.everyn)
     if N < 2:
-        stop_err("Bad N argument %r" % options.everyn)
+        sys_exit("Bad -n argument %r" % options.everyn)
     if (N % 10) == 1:
         sys.stderr.write("Sampling every %ist sequence\n" % N)
     elif (N % 10) == 2:
@@ -102,9 +151,9 @@
     try:
         percent = float(options.percent) / 100.0
     except:
-        stop_err("Bad percent argument %r" % options.percent)
+        sys_exit("Bad -p percent argument %r" % options.percent)
     if percent <= 0.0 or 1.0 <= percent:
-        stop_err("Bad percent argument %r" % options.percent)
+        sys_exit("Bad -p percent argument %r" % options.percent)
     sys.stderr.write("Sampling %0.3f%% of sequences\n" % (100.0 * percent))
     def sampler(iterator):
         global percent
@@ -115,8 +164,76 @@
             if percent * count > taken:
                 taken += 1
                 yield record
+elif options.count:
+    try:
+        N = int(options.count)
+    except:
+        sys_exit("Bad -c count argument %r" % options.count)
+    if N < 1:
+        sys_exit("Bad -c count argument %r" % options.count)
+    total = count_sequences(in_file, seq_format)
+    print("Input file has %i sequences" % total)
+    if interleaved:
+        # Paired
+        if total % 2:
+            sys_exit("Paired mode, but input file has an odd number of sequences: %i"
+                     % total)
+        elif N > total // 2:
+            sys_exit("Requested %i sequence pairs, but file only has %i pairs (%i sequences)."
+                     % (N, total // 2, total))
+        total = total // 2
+        if N == 1:
+            sys.stderr.write("Sampling just first sequence pair!\n")
+        elif N == total:
+            sys.stderr.write("Taking all the sequence pairs\n")
+        else:
+            sys.stderr.write("Sampling %i sequence pairs\n" % N)
+    else:
+        # Not paired
+        if total < N:
+            sys_exit("Requested %i sequences, but file only has %i." % (N, total))
+        if N == 1:
+            sys.stderr.write("Sampling just first sequence!\n")
+        elif N == total:
+            sys.stderr.write("Taking all the sequences\n")
+        else:
+            sys.stderr.write("Sampling %i sequences\n" % N)
+    if N == total:
+        def sampler(iterator):
+            """Dummy filter to filter nothing, taking everything."""
+            global N
+            taken = 0
+            for record in iterator:
+                taken += 1
+                yield record
+            assert taken == N, "Picked %i, wanted %i" % (taken, N)
+    else:
+        def sampler(iterator):
+            # Mimic the percentage sampler, with double check on final count
+            global N, total
+            # Do we need a floating point fudge factor epsilon?
+            # i.e. What if percentage comes out slightly too low, and
+            # we could end up missing last few desired sequences?
+            percentage = float(N) / float(total)
+            #print("DEBUG: Want %i out of %i sequences/pairs, as a percentage %0.2f"
+            #      % (N, total, percentage * 100.0))
+            count = 0
+            taken = 0
+            for record in iterator:
+                count += 1
+                # Do we need the extra upper bound?
+                if percentage * count > taken and taken < N:
+                    taken += 1
+                    yield record
+                elif total - count + 1 <= N - taken:
+                    # remaining records (including this one) <= what we still need.
+                    # This is a safety check for floating point edge cases where
+                    # we need to take all remaining sequences to meet target
+                    taken += 1
+                    yield record
+            assert taken == N, "Picked %i, wanted %i" % (taken, N)
 else:
-    stop_err("Must use either -n or -p")
+    sys_exit("Must use either -n, -p or -c")


 def pair(iterator):
@@ -180,48 +297,30 @@
                     pos_handle.write(record)
     return count

-try:
-    from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
-    def fastq_filter(in_file, out_file, iterator_filter, inter):
-        count = 0
-        #from galaxy_utils.sequence.fastq import fastqReader, fastqWriter
-        reader = fastqReader(open(in_file, "rU"))
-        writer = fastqWriter(open(out_file, "w"))
-        if inter:
-            for r1, r2 in iterator_filter(pair(reader)):
-                count += 1
-                writer.write(r1)
-                writer.write(r2)
-        else:
-            for record in iterator_filter(reader):
-                count += 1
-                writer.write(record)
-        writer.close()
-        reader.close()
-        return count
-except ImportError:
-    from Bio.SeqIO.QualityIO import FastqGeneralIterator
-    def fastq_filter(in_file, out_file, iterator_filter, inter):
-        count = 0
-        with open(in_file) as in_handle:
-            with open(out_file, "w") as pos_handle:
-                if inter:
-                    for r1, r2 in iterator_filter(pair(FastqGeneralIterator(in_handle))):
-                        count += 1
-                        pos_handle.write("@%s\n%s\n+\n%s\n" % r1)
-                        pos_handle.write("@%s\n%s\n+\n%s\n" % r2)
-                else:
-                    for title, seq, qual in iterator_filter(FastqGeneralIterator(in_handle)):
-                        count += 1
-                        pos_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
-        return count
+
+from Bio.SeqIO.QualityIO import FastqGeneralIterator
+def fastq_filter(in_file, out_file, iterator_filter, inter):
+    count = 0
+    with open(in_file) as in_handle:
+        with open(out_file, "w") as pos_handle:
+            if inter:
+                for r1, r2 in iterator_filter(pair(FastqGeneralIterator(in_handle))):
+                    count += 1
+                    pos_handle.write("@%s\n%s\n+\n%s\n" % r1)
+                    pos_handle.write("@%s\n%s\n+\n%s\n" % r2)
+            else:
+                for title, seq, qual in iterator_filter(FastqGeneralIterator(in_handle)):
+                    count += 1
+                    pos_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
+    return count
+

 def sff_filter(in_file, out_file, iterator_filter, inter):
     count = 0
     try:
         from Bio.SeqIO.SffIO import SffIterator, SffWriter
     except ImportError:
-        stop_err("SFF filtering requires Biopython 1.54 or later")
+        sys_exit("SFF filtering requires Biopython 1.54 or later")
     try:
         from Bio.SeqIO.SffIO import ReadRocheXmlManifest
     except ImportError:
@@ -246,14 +345,14 @@
                 #count = writer.write_file(SffIterator(in_handle))
     return count

-if seq_format.lower()=="sff":
+if seq_format == "sff":
     count = sff_filter(in_file, out_file, sampler, interleaved)
-elif seq_format.lower()=="fasta":
+elif seq_format == "fasta":
     count = fasta_filter(in_file, out_file, sampler, interleaved)
-elif seq_format.lower().startswith("fastq"):
+elif seq_format.startswith("fastq"):
     count = fastq_filter(in_file, out_file, sampler, interleaved)
 else:
-    stop_err("Unsupported file type %r" % seq_format)
+    sys_exit("Unsupported file type %r" % seq_format)

 if interleaved:
     sys.stderr.write("Selected %i pairs\n" % count)
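Review note on the new `-c` sampler: its logic can be exercised on its own with a standalone generator like the sketch below. This is an illustrative rewrite, not the patch's code. `sample_count` is a hypothetical name, and it takes `wanted` and `total` as arguments instead of reading the `N` and `total` globals. The fallback branch mirrors the patch's safety check: once the records left (including the current one) are no more than the number still needed, every remaining record is taken.

```python
def sample_count(records, wanted, total):
    """Yield exactly `wanted` of `total` records, spread evenly, in input order."""
    if not 1 <= wanted <= total:
        raise ValueError("Requested %i records, but only %i available"
                         % (wanted, total))
    fraction = float(wanted) / float(total)
    seen = 0
    taken = 0
    for record in records:
        seen += 1
        remaining = total - seen + 1  # records left, including this one
        if fraction * seen > taken and taken < wanted:
            # Same rule as the percentage sampler, capped at the target.
            taken += 1
            yield record
        elif remaining <= wanted - taken:
            # Floating point fell slightly short: take everything left.
            taken += 1
            yield record


picked = list(sample_count(range(100), 10, 100))
```

Because a record is forced through whenever the shortfall equals the number of records left, the generator cannot finish with fewer than `wanted` items, which is the property the patch asserts with `assert taken == N`.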
--- a/tools/sample_seqs/sample_seqs.xml	Fri Nov 21 08:30:03 2014 -0500
+++ b/tools/sample_seqs/sample_seqs.xml	Fri Mar 06 04:54:03 2015 -0500
@@ -1,4 +1,4 @@
-<tool id="sample_seqs" name="Sub-sample sequences files" version="0.1.2">
+<tool id="sample_seqs" name="Sub-sample sequences files" version="0.2.0">
     <description>e.g. to reduce coverage</description>
     <requirements>
         <requirement type="package" version="1.63">biopython</requirement>
@@ -9,9 +9,10 @@
 sample_seqs.py -f "$input_file.ext" -i "$input_file" -o "$output_file"
 #if str($sampling.type) == "everyNth":
 -n "${sampling.every_n}"
+#elif str($sampling.type) == "percentage":
+-p "${sampling.percent}"
 #else
-##elif str($sampling.type) == "percentage":
--p "${sampling.percent}"
+-c "${sampling.count}"
 #end if
 #if $interleaved
 --interleaved
@@ -26,8 +27,9 @@
         <param name="input_file" type="data" format="fasta,fastq,sff" label="Sequence file" help="FASTA, FASTQ, or SFF format." />
         <conditional name="sampling">
             <param name="type" type="select" label="Sub-sampling approach">
-                <option value="everyNth">Take every N-th sequence (e.g. every fifth sequence)</option>
-                <option value="percentage">Take some percentage of the sequences (e.g. 20% will take every fifth sequence)</option>
+                <option value="everyNth">Take every N-th sequence (or pair, e.g. every fifth sequence)</option>
+                <option value="percentage">Take some percentage of the sequences (or pairs, e.g. 20% will take every fifth sequence)</option>
+                <option value="desired_count">Take exactly N sequences (or pairs, e.g. 1000 sequences)</option>
                 <!-- TODO - target coverage etc -->
             </param>
             <when value="everyNth">
@@ -36,8 +38,11 @@
             <when value="percentage">
                 <param name="percent" value="20.0" type="float" min="0" max="100" label="Percentage" help="Between 0 and 100, e.g. 20% will take every 5th sequence" />
             </when>
+            <when value="desired_count">
+                <param name="count" value="1000" type="integer" min="1" label="N" help="Number of unique sequences to pick (between 1 and the total number in the input file)" />
+            </when>
         </conditional>
-        <param name="interleaved" type="boolean" label="Interleaved paired reads" help="Tick to preserve interleaved pairs on output" />
+        <param name="interleaved" type="boolean" label="Interleaved paired reads" help="This mode keeps paired reads together (e.g. take every 5th read pair)" />
     </inputs>
     <outputs>
         <data name="output_file" format="input" metadata_source="input_file" label="${input_file.name} (sub-sampled)"/>
@@ -82,12 +87,37 @@
             <output name="output_file" file="get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta" />
         </test>
         <test>
+            <param name="input_file" value="get_orf_input.Suis_ORF.prot.fasta" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="2910" />
+            <output name="output_file" file="get_orf_input.Suis_ORF.prot.fasta" />
+        </test>
+        <test>
+            <param name="input_file" value="get_orf_input.Suis_ORF.prot.fasta" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="10" />
+            <param name="interleaved" value="true" />
+            <output name="output_file" file="get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta" />
+        </test>
+        <test>
             <param name="input_file" value="ecoli.fastq" />
             <param name="type" value="percentage" />
             <param name="percent" value="1.0" />
             <output name="output_file" file="ecoli.sample_N100.fastq" />
         </test>
         <test>
+            <param name="input_file" value="ecoli.fastq" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="10" />
+            <output name="output_file" file="ecoli.sample_C10.fastq" />
+        </test>
+        <test>
+            <param name="input_file" value="ecoli.sample_C10.fastq" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="10" />
+            <output name="output_file" file="ecoli.sample_C10.fastq" />
+        </test>
+        <test>
             <param name="input_file" value="MID4_GLZRM4E04_rnd30_frclip.sff" ftype="sff" />
             <param name="type" value="percentage" />
             <param name="percent" value="20.0" />
@@ -100,27 +130,55 @@
             <param name="interleaved" value="true" />
             <output name="output_file" file="MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff" ftype="sff"/>
         </test>
+        <test>
+            <param name="input_file" value="MID4_GLZRM4E04_rnd30.sff" ftype="sff" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="30" />
+            <output name="output_file" file="MID4_GLZRM4E04_rnd30.sff" ftype="sff"/>
+        </test>
+        <test>
+            <param name="input_file" value="MID4_GLZRM4E04_rnd30_frclip.sff" ftype="sff" />
+            <param name="type" value="desired_count" />
+            <param name="count" value="1" />
+            <output name="output_file" file="MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff" ftype="sff"/>
+        </test>
     </tests>
     <help>
 **What it does**

 Takes an input file of sequences (typically FASTA or FASTQ, but also
 Standard Flowgram Format (SFF) is supported), and returns a new sequence
-file sub-sampling from this (in the same format).
+file sub-sampling uniformly from this (in the same format, preserving the
+input order and selecting sequences evenly through the input file).

-Several sampling modes are supported, all designed to be non-random. This
-allows reproducibility, and also works on paired sequence files. Also
-note that by sampling uniformly through the file, this avoids any bias
-should reads in any part of the file are of lesser quality (e.g. one part
-of the slide).
+Several sampling modes are supported, all designed to do non-random
+uniform sampling (i.e. evenly through the input file). This allows
+reproducibility, and also works on paired sequence files (run the tool
+twice, once on each file using the same settings).

-The simplest mode is to take every N-th sequence, for example taking
+By sampling uniformly (evenly) through the file, this avoids any bias
+should reads in any part of the file be of lesser quality (e.g. for
+high throughput sequencing the reads at the start and end of the file
+can be of lower quality).
+
+The simplest mode is to take every *N*-th sequence, for example taking
 every 2nd sequence would sample half the file - while taking every 5th
 sequence would take 20% of the file.

+The target count method picks *N* sequences from the input file, which
+again will be distributed uniformly (evenly) through the file. This works
+by first counting the number of records, then calculating the desired
+percentage of sequences to take. Note if your input file has exactly
+*N* sequences this selects them all (effectively copying the input file).
+If your input file has fewer than *N* sequences, this is treated as an
+error.
+
 If you tick the interleaved option, the file is processed as pairs of
-records - taking for example using 20% would take every 5th pair of
-records. This ensures your read pairs are preserved. Note this does not
+records to ensure your read pairs are not separated by sampling.
+For example using 20% would take every 5th pair of records, or you
+could request 1000 read pairs.
+
+.. class:: warningmark Note interleaved/pair mode does *not*
 actually check your read names match a known pair naming scheme!

 **Example Usage**
@@ -133,8 +191,12 @@

 Similarly, if you had some Illumina paired end data interleaved into one
 file with an estimated x200 coverage, you would run this tool in
-interleaved mode. Taking every 3rd read pair. This would reduce the
-estimated coverage to about x66, while preserving the read pairing.
+interleaved mode, taking every 3rd read pair. This would again reduce
+the estimated coverage to about x66, while preserving the read pairing.
+
+Suppose you have a transcriptome assembly, and wish to look at the
+species distribution of the top BLAST hits for an initial quality check.
+Rather than using all your sequences, you could pick just 1000 for this.

 **Citation**