# HG changeset patch # User peterjc # Date 1425635643 18000 # Node ID 09a4ee5d12fd8611c91ae51693094f15ea289833 # Parent dc55e58fa890cc0b5b0f9f813dff0725a42837c4 Uploaded v0.2.0, adding desired count method diff -r dc55e58fa890 -r 09a4ee5d12fd test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff Binary file test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff has changed diff -r dc55e58fa890 -r 09a4ee5d12fd test-data/ecoli.sample_C10.fastq --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/ecoli.sample_C10.fastq Fri Mar 06 04:54:03 2015 -0500 @@ -0,0 +1,40 @@ +@frag_1 +AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTC ++ +##%')+.024JMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_504 +TACGTTCGGCATCGCTGATATTGGGTAAAGCATCCT ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_1008 +GTCGCAGGTATAGACCCCGTCAACGTCCGTCCAAAT ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_1512 +ATCACCTACCACCGAGATAATGGCCAGCCGTTCCGT ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_2016 +GAAACCTTCGCGCAGGAAGTCGGCATATTGATCCGC ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_2520 +TCCTTCATCACGGGCCTTCGCCACGCGCGCGGCAAA ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_3024 +TCCAGGGTCATCGCCACTGGAATTTGCTTACCCAGT ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_3528 +ACCGCGCCGATTTCCGCGACCGCCTGCCGCGCCTGC ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_4032 +CGACCGCCGAAATCTTTAAATGCCAGCGTTGGCCCG ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM +@frag_4536 +GGCACGGTATCGTTCACGTTGGTCGCAGCAATAAAA ++ +MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM diff -r dc55e58fa890 -r 09a4ee5d12fd test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta Fri Mar 06 04:54:03 2015 -0500 @@ -0,0 +1,119 @@ +>Streptococcus_suis|ORF1 length 457 aa, 1374 bp, from 1..1374 of Streptococcus_suis +MNQEQLFWQRFIELAKVNFKPSIYDFYVADAKLLGINQQVANIFLNRPFKKDFWEKNFEE +LMIAASFESYGEPLTIQYQFTEDEQEIRNTTNTRSSIVHQVQTLEPATPQETFKPVHSDI 
+KSQYTFANFVQGDNNHWAKAAALAVSDNLGELYNPLFIFGGPGLGKTHILNAIGNKVLAD +NPQARIKYVSSETFINEFLEHLRLNDMESFKKTYRNLDLLLIDDIQSLRNKATTQEEFFH +TFNALHEKNKQIVLTSDRNPDHLDNLEERLVTRFKWGLTSEITPPDFETRIAILRNKCEN +LPYNFTNETLSYLAGQFDSNVRDLEGALKDIHLIATMRQLSEISVEVAAEAIRSRKQTNP +QNMVIPIEKIQTEVGNFYGVSLKELKGSKRVQHIVHARQVAMFLAREMTDNSLPKIGKEF +GNRDHTTVMHAYNKIKTLLLDDENLEIEITSIKNKLR +>Streptococcus_suis|ORF2 length 385 aa, 1158 bp, from 1507..2664 of Streptococcus_suis +IINKGESMIQFSINKNIFLQALSITKRAISTKNAIPILSTVKITVTSEGITLTGSNGQIS +IEHFISIQDENAGLLISSPGSILLEAGFFINVVSSMPDLVLDFNEIEQKQIVLTSGKSEI +TLKGKEAEQYPRLQEVPTSKPLVLETKVLKQTINETAFAASTQESRPILTGVHFVLTENK +NLKTVATDSHRMSQRKLVLDTSGDDFNVVIPSRSLREFTAVFTDDIETVEVFFSNNQILF +RSEHISFYTRLLEGTYPDTDRLIPTEFKTTAIFDTANLRHSMERARLLSNATQNGTVKLE +IANNVVSAHVNSPEVGRVNEELDTVEVSGEDLVISFNPTYLIEALKATTSEQVKISFISS +VRPFTLIPNNEGEDFIQLVTPVRTN +>Streptococcus_suis|ORF291 length 760 aa, 2283 bp, from complement(184307..186589) of Streptococcus_suis +KRGEFMRFNQFSFIKKETSVYLQELDTLGFQLIPDASSKTNLETFVRKCHFLTANTDFAL +SNMIAEWDTDLLTFFQSDRELTDQIFYQVAFQLLGFVPGMDYTDVMDFVEKSNFPIVYGD +IIDNLYQLLNTRTKSGNTLIDQLVSDDLIPEDNHYHFFNGKSMATFSTKNLIREVVYVET +PVDTAGTGQTDIVKLSILRPHFDGKIPAVITNSPYHQGVNDVASDKALHKMEGELAEKQV +GTIQVKQASITKLDLDQRNLPVSPATEKLGHITSYSLNDYFLARGFASLHVSGVGTLGST +GYMTSGDYQQVEGYKAVIDWLNGRTKAYTDHTRSLEVKADWANGKVATTGLSYLGTMSNA +LATTGVDGLEVIIAEAGISSWYDYYRENGLVTSPGGYPGEDLDSLTALTYSKSLQAGDFL +RNKAAYEKGLAAERAALDRTSGDYNQYWHDRNYLLHADRVKCEVVFTHGSQDWNVKPIHV +WNMFHALPSHIKKHLFFHNGAHVYMNNWQSIDFRESMNALLSQKLLGYENNYQLPTVIWQ +DNSGEQTWTTLDTFGGENETVLPLGTGSQTVANQYTQEDFERYGKSYSAFHQDLYAGKAN +QISIELPVTEGLLLNGQVTLKLRVASSVAKGLLSAQLLDKGNKKRLAPIPAPKARLSLDN +GRYHAQENLVELPYVEMPQRLVTKGFMNLQNRTDLMTVEEVVPGQWMNLTWKLQPTIYQL +KKGDVLELILYTTDFECTVRDNSQWQIHLDLSQSQLILPH +>Streptococcus_suis|ORF292 length 216 aa, 651 bp, from 185183..185833 of Streptococcus_suis +AVGKDHLTLDPISVEQIIAVMPVLIVVTAGAVQGSTLGSQSFFVGCFIAEEVTCLQTLGV +GQGGQAVQIFAWIATRAGHQPVFTVVVIPRGNPCFCDDDFQSVHASCCQGIGHGTEIRQS 
+RRRYLTIGPIGLDLKRASVVCVGLGATVQPVNHRFIALHLLVVARCHVARRAQRANTRHM +EAGKAASEEVVIEGVRSNVPQFFSSRADRQVPLVQV +>Streptococcus_suis|ORF583 length 391 aa, 1176 bp, from 397805..398980 of Streptococcus_suis +RKKMKKQFELIATAAAGLEAVVGREIRNLGYECQVENGRVRFQGDVKSIIETNIWLRSAD +RIKIIVGQFPAKTFEELFQGVFNLDWENYLPLGCKFPISKAKCVKSKLHNEPSVQAISKK +AVVKKLQKHFSRPEGVPLQEMGAEFKIEVSILKDVATVMIDTTGSSLFKRGYRVEKGGAP +IKENMAAAILQLSNWYPDKPLIDPTCGSGTFCIEAAMLAKNIAPGLKRSFAFEEWPWVED +QLVVALRKEAQASIKTDLVLDITGSDIDARMIEIAKKNAFAAGVEQDIVFKQMRVQDLRT +DKINGVIISNPPYGERLLDDEAIVTLYREMGETFEPLKTWSKFILTSDELFETRFGQQAD +KKRKLYNGTLKVDLYQFFGQRVKRQVQEVQG +>Streptococcus_suis|ORF584 length 487 aa, 1464 bp, from 398981..400444 of Streptococcus_suis +EDIVGEKNSHHLPLDEEKVLDFEVAKDLTIEEAVKKHKEIEAGVTEDDGLLDRYIKQHRA +EIESQKFETKINHLPLVEVADEEKNQGHESAEEVEANESSLTEVSEEIAPIVEELSVTPM +ETLEETVIASTVAMEGLSSVADDSSLELEEDETEDLDHSEGADRDQKKKFYFWSAVGLSM +IGVMATALVWMNSVNKSNTATSSSSTSTSQTSSTASSSTDANVTAFEQLYNSFFTDSSLT +KLKNSEFGKLAELKVLLEKLDKNSDSYTKAKEQYDHLEKAIAAIQAINGQFDKEVVVNGE +IDTTATVKSGESLSATTTGISAVDSLLASVVNFGRSQQEVASATVASEAAVTRNQGADET +VSTGVPATTEVASTTVSGSTTDFGIAVPAGVVLQRDRSRVPYNQAMIDDVNNEAWNFNPG +ILENIVTISQQRGYITGNQYILEKVNIINGNGYYNMFKPDGTYLFSINCKTGYFVGNGAG +HSDALDY +>Streptococcus_suis|ORF873 length 343 aa, 1032 bp, from 605439..606470 of Streptococcus_suis +TLGEETMTNVFKGRHFLAEKDFTRAELEWLIDFSAHLKDLKKRNIPHRYLEGKNIALLFE +KTSTRTRAAFTVASIDLGAHPEYLGANDIQLGKKESTEDTAKVLGRMFDGIEFRGFSQKM +VEELAEFSGVPVWNGLTDAWHPTQMLADYLTVKENFGKLEGLTLVYCGDGRNNVANSLLV +TGAILGVNVHIFSPKELFPEEEVVALAEGFAKESGARVLITDNADEAVKGADVLYTDVWV +SMGEEDKFAERVALLKPYQVNMELVKKAENENLIFLHCLPAFHDTNTVYGKDVAEKFGVE +EMEVTDEVFRSKYARHFDQAENRMHTIKAVMAATLGDPFVPRV +>Streptococcus_suis|ORF874 length 113 aa, 342 bp, from complement(605625..605966) of Streptococcus_suis +VSNIVTAITTVNQSQAFQLAKVFFDSQVVRQHLSWVPCICQTIPYWHTGEFCQFFHHFLT +ETTEFNTVEHTSQNFSSIFCRFFLTKLDVICTKIFWMGTKVNRCYCEGSTSTC +>Streptococcus_suis|ORF1165 length 105 aa, 318 bp, from 811613..811930 of Streptococcus_suis 
+AYNESVKRKECHLMKQVNMSKIINYLTILGLLILLSAFFLDNWIRDWFFPSSWGNVATML +ILPLLGALILILSIYYKKLWTGLISIFLIISFPLIFGIGYFIFGP +>Streptococcus_suis|ORF1166 length 125 aa, 378 bp, from 811867..812244 of Streptococcus_suis +YLLNNLISSDIRYWLLYIWPLEGVVMNLTLLKRLNLVLYGIAIFLFVMLFLPIGQWFDIV +NVNFKLTFFIIPFFGLASLPTAIYTKNVRQILLSVLLVALYFILFSLITALSGLFHLNFY +SFFFK +>Streptococcus_suis|ORF1455 length 114 aa, 345 bp, from 1026973..1027317 of Streptococcus_suis +SCKLSLHIRWESWMGQGFYCYRFKLIHLRTNSNPFSFFRHLNSHFQHLRNEWTVMLPDSV +LDQDISTSHCRCHHKGTRFDTILHHLMFCASQFFYTSNRNRLCTCPLNFCPHFV +>Streptococcus_suis|ORF1456 length 116 aa, 351 bp, from complement(1027944..1028294) of Streptococcus_suis +YGNACNSRPPTCDKSYSCWETLIYMGLNLVQFHFLISWYNGNMVISILQFFSHILFIYLA +HHLLVTTVDWSRWLKVTGDNQRKINLLILFLAIALGYLVSTFFLELLMMGRSFANM +>Streptococcus_suis|ORF1747 length 335 aa, 1008 bp, from complement(1225218..1226225) of Streptococcus_suis +RMLNTDDTVTIYDVAREAGVSMATVSRVVNGNKNVKENTRKKVLEVIDRLDYRPNAVARG +LASKKTTTVGVVIPNIANAYFATLAKGIDDIADMYKYNIVLANSDENDEKEINVVNTLFS +KQVDGIIFMGYHLTDKIRAEFSRSRTPIVLAGTVDLEHQLPSVNIDYAAASVDAVNLLAK +NNKKIAFVSGPLVDDINGKVRFAGYKQGLKDNGIEFNEGLVFESKYKYEEGYALAERILN +AGATAAYVAEDEIAAGLLNGVSDMGIKVPEDFEIITSDDSLVTKFTRPNLTSINQPLYDI +GAIAMRMLTKIMHKEELENREVVLNHGIKVRKSTK +>Streptococcus_suis|ORF1748 length 377 aa, 1134 bp, from 1226384..1227517 of Streptococcus_suis +TKISLFLPLHARKVSTMSKLHHVKSYLEANKMDLAIFSDPVSIYYLTGYHSDPHERHMML +FVMPDHDSLLFLPALDVERAVATVDFPVAGYMDSENPWQIIKSKLPQKSFSAICAEFDNL +NLTRYHGLQSIFSQPFSDITPLINTMKLIKSRDEIEKMLVAGEFADKAMQVGFNNISLDV +TETDIIAQIEFEMKKQGISKMSFETMVLTGDNAANPHGIPSTNKIENNALLLFDLGVEAL +GYTSDMTRTVAVGKPDQFKKDIYNLTLEAHMAAVNMIKPGVTAGEIDYAARSVIEKAGYG +EYFNHRLGHGLGMSVHEFPSIMEGNDLVIEEGMCFSVEPGIYIPGKVGVRIEDCGYVTKN +GFEVFTKTPKELLYFEG +>Streptococcus_suis|ORF2037 length 234 aa, 705 bp, from complement(1422380..1423084) of Streptococcus_suis +KSMTKTALITGVSSGIGLAQAGIFLENGWRVFGIDLASKPDLAGDFHFLQLDLTGDLSPV +FSWCQSVDVLCNTAGILDDYRPHLDISEDELAQIFAVNFFAVTRLTRPYLQQMVDRQSGI 
+IINMCSIASSLAGGGGSAYTASKHALAGFTKQLALDYAKDKVQIFGIAPGAVQTGMTQKD +FEPGGLADWVADQTPIGRWTQPSEIAELTFMLATGKLASMQGQIITIDGGWSLK +>Streptococcus_suis|ORF2038 length 112 aa, 339 bp, from 1422849..1423187 of Streptococcus_suis +SSKMPAVLQRTSTDWHQEKTGDKSPVRSSCRKWKSPAKSGLLARSIPKTRQPFSKKIPAC +ARPMPLETPVMRAVLVMDFYPVGRKDIARGRAPHGEAFTLAGHVDEEIGRRL +>Streptococcus_suis|ORF2329 length 160 aa, 483 bp, from 1612284..1612766 of Streptococcus_suis +LIETNWFHHLTGQEGLDVLFFHNLGFRITDQLYLEVRKFHLLQGLSQLLRRWSQESRVKG +ARYIERNHPLDTCFLQQFNRLIHCSHLASDDDLGWCVVVGWGNNPRGNSRTDFFNQVDIC +VENSNHLTSPCWRSQFHIFTTLSNQGNRIFKGQSSRCHQS +>Streptococcus_suis|ORF2330 length 329 aa, 990 bp, from complement(1613050..1614039) of Streptococcus_suis +ARKKDEGIMKTKITELLDIKYPIFQGGMAWVADGDLAGAVSNAGGLGIIGGGNAPKEVVK +ANIDKVKSITDKPFGVNIMLLSPFADDIVDLVIEEGVKVVTTGAGNPGKYMERLHAAGIT +VIPVVPSVALAKRMEKLGVDAVIAEGMEAGGHIGKLTTMTLVRQVVEAVSIPVIAAGGIA +DGAGAAAAFMLGAEAVQVGTRFVVATESNAHQAYKEKVLKAKDIDTTVSASIVGHPVRAI +KNKLSSAYAAAEKDFLAGKISADAIEELGAGALRNAVVDGDVTNGSVMAGQIAGLVSKEE +SCEDILKDIYYGAAKVIREEASRWASVGE +>Streptococcus_suis|ORF2619 length 107 aa, 324 bp, from 1802386..1802709 of Streptococcus_suis +QLCVGSNPINSLFRRNFFVCCISSQSSCYVHTMWFVGIIVEIIVARYIIIAMGNFQCVCP +CRRWSNVLNFRNDTIIQPHVFVLNIQTGVNDCNHHSATICLIFRTCF +>Streptococcus_suis|ORF2620 length 192 aa, 579 bp, from complement(1803558..1804136) of Streptococcus_suis +RLKIPCFQRKEVTMYDSFDKGWFVLQTYSGYENKVKENLLQRAHTYNMLENILRVEIPTQ +TVQVEKNGEVKEVEENRFPGYVLVEMVMTDEAWFVVRNTPNVTGFVGSHGNRSKPTPLLE +EEIRQILVSMGQTVQEFDIDVKVGDTVRIIDGAFTDYTGKITEIDNNKVKMVISMFGNDT +IAEVNLSQIAEL diff -r dc55e58fa890 -r 09a4ee5d12fd test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta Fri Mar 06 04:54:03 2015 -0500 @@ -0,0 +1,50 @@ +>Streptococcus_suis|ORF1 length 457 aa, 1374 bp, from 1..1374 of Streptococcus_suis +MNQEQLFWQRFIELAKVNFKPSIYDFYVADAKLLGINQQVANIFLNRPFKKDFWEKNFEE 
+LMIAASFESYGEPLTIQYQFTEDEQEIRNTTNTRSSIVHQVQTLEPATPQETFKPVHSDI +KSQYTFANFVQGDNNHWAKAAALAVSDNLGELYNPLFIFGGPGLGKTHILNAIGNKVLAD +NPQARIKYVSSETFINEFLEHLRLNDMESFKKTYRNLDLLLIDDIQSLRNKATTQEEFFH +TFNALHEKNKQIVLTSDRNPDHLDNLEERLVTRFKWGLTSEITPPDFETRIAILRNKCEN +LPYNFTNETLSYLAGQFDSNVRDLEGALKDIHLIATMRQLSEISVEVAAEAIRSRKQTNP +QNMVIPIEKIQTEVGNFYGVSLKELKGSKRVQHIVHARQVAMFLAREMTDNSLPKIGKEF +GNRDHTTVMHAYNKIKTLLLDDENLEIEITSIKNKLR +>Streptococcus_suis|ORF292 length 216 aa, 651 bp, from 185183..185833 of Streptococcus_suis +AVGKDHLTLDPISVEQIIAVMPVLIVVTAGAVQGSTLGSQSFFVGCFIAEEVTCLQTLGV +GQGGQAVQIFAWIATRAGHQPVFTVVVIPRGNPCFCDDDFQSVHASCCQGIGHGTEIRQS +RRRYLTIGPIGLDLKRASVVCVGLGATVQPVNHRFIALHLLVVARCHVARRAQRANTRHM +EAGKAASEEVVIEGVRSNVPQFFSSRADRQVPLVQV +>Streptococcus_suis|ORF583 length 391 aa, 1176 bp, from 397805..398980 of Streptococcus_suis +RKKMKKQFELIATAAAGLEAVVGREIRNLGYECQVENGRVRFQGDVKSIIETNIWLRSAD +RIKIIVGQFPAKTFEELFQGVFNLDWENYLPLGCKFPISKAKCVKSKLHNEPSVQAISKK +AVVKKLQKHFSRPEGVPLQEMGAEFKIEVSILKDVATVMIDTTGSSLFKRGYRVEKGGAP +IKENMAAAILQLSNWYPDKPLIDPTCGSGTFCIEAAMLAKNIAPGLKRSFAFEEWPWVED +QLVVALRKEAQASIKTDLVLDITGSDIDARMIEIAKKNAFAAGVEQDIVFKQMRVQDLRT +DKINGVIISNPPYGERLLDDEAIVTLYREMGETFEPLKTWSKFILTSDELFETRFGQQAD +KKRKLYNGTLKVDLYQFFGQRVKRQVQEVQG +>Streptococcus_suis|ORF874 length 113 aa, 342 bp, from complement(605625..605966) of Streptococcus_suis +VSNIVTAITTVNQSQAFQLAKVFFDSQVVRQHLSWVPCICQTIPYWHTGEFCQFFHHFLT +ETTEFNTVEHTSQNFSSIFCRFFLTKLDVICTKIFWMGTKVNRCYCEGSTSTC +>Streptococcus_suis|ORF1165 length 105 aa, 318 bp, from 811613..811930 of Streptococcus_suis +AYNESVKRKECHLMKQVNMSKIINYLTILGLLILLSAFFLDNWIRDWFFPSSWGNVATML +ILPLLGALILILSIYYKKLWTGLISIFLIISFPLIFGIGYFIFGP +>Streptococcus_suis|ORF1456 length 116 aa, 351 bp, from complement(1027944..1028294) of Streptococcus_suis +YGNACNSRPPTCDKSYSCWETLIYMGLNLVQFHFLISWYNGNMVISILQFFSHILFIYLA +HHLLVTTVDWSRWLKVTGDNQRKINLLILFLAIALGYLVSTFFLELLMMGRSFANM +>Streptococcus_suis|ORF1747 length 335 aa, 1008 bp, from complement(1225218..1226225) of Streptococcus_suis 
+RMLNTDDTVTIYDVAREAGVSMATVSRVVNGNKNVKENTRKKVLEVIDRLDYRPNAVARG +LASKKTTTVGVVIPNIANAYFATLAKGIDDIADMYKYNIVLANSDENDEKEINVVNTLFS +KQVDGIIFMGYHLTDKIRAEFSRSRTPIVLAGTVDLEHQLPSVNIDYAAASVDAVNLLAK +NNKKIAFVSGPLVDDINGKVRFAGYKQGLKDNGIEFNEGLVFESKYKYEEGYALAERILN +AGATAAYVAEDEIAAGLLNGVSDMGIKVPEDFEIITSDDSLVTKFTRPNLTSINQPLYDI +GAIAMRMLTKIMHKEELENREVVLNHGIKVRKSTK +>Streptococcus_suis|ORF2038 length 112 aa, 339 bp, from 1422849..1423187 of Streptococcus_suis +SSKMPAVLQRTSTDWHQEKTGDKSPVRSSCRKWKSPAKSGLLARSIPKTRQPFSKKIPAC +ARPMPLETPVMRAVLVMDFYPVGRKDIARGRAPHGEAFTLAGHVDEEIGRRL +>Streptococcus_suis|ORF2329 length 160 aa, 483 bp, from 1612284..1612766 of Streptococcus_suis +LIETNWFHHLTGQEGLDVLFFHNLGFRITDQLYLEVRKFHLLQGLSQLLRRWSQESRVKG +ARYIERNHPLDTCFLQQFNRLIHCSHLASDDDLGWCVVVGWGNNPRGNSRTDFFNQVDIC +VENSNHLTSPCWRSQFHIFTTLSNQGNRIFKGQSSRCHQS +>Streptococcus_suis|ORF2620 length 192 aa, 579 bp, from complement(1803558..1804136) of Streptococcus_suis +RLKIPCFQRKEVTMYDSFDKGWFVLQTYSGYENKVKENLLQRAHTYNMLENILRVEIPTQ +TVQVEKNGEVKEVEENRFPGYVLVEMVMTDEAWFVVRNTPNVTGFVGSHGNRSKPTPLLE +EEIRQILVSMGQTVQEFDIDVKVGDTVRIIDGAFTDYTGKITEIDNNKVKMVISMFGNDT +IAEVNLSQIAEL diff -r dc55e58fa890 -r 09a4ee5d12fd tools/sample_seqs/README.rst --- a/tools/sample_seqs/README.rst Fri Nov 21 08:30:03 2014 -0500 +++ b/tools/sample_seqs/README.rst Fri Mar 06 04:54:03 2015 -0500 @@ -59,6 +59,10 @@ v0.1.1 - Using optparse to provide a proper command line API. v0.1.2 - Interleaved mode for working with paired records. - Tool definition now embeds citation information. +v0.2.0 - Option to give number of sequences (or pairs) desired. + This works by first counting all your sequences, then calculates + the percentage required in order to sample them uniformly (evenly). + This makes two passes through the input and is therefore slower. 
======= ====================================================================== @@ -71,7 +75,7 @@ For making the "Galaxy Tool Shed" http://toolshed.g2.bx.psu.edu/ tarball use the following command from the Galaxy root folder:: - $ tar -czf sample_seqs.tar.gz tools/sample_seqs/README.rst tools/sample_seqs/sample_seqs.py tools/sample_seqs/sample_seqs.xml tools/sample_seqs/tool_dependencies.xml test-data/ecoli.fastq test-data/ecoli.sample_N100.fastq test-data/ecoli.pair_sample_N100.fastq test-data/get_orf_input.Suis_ORF.prot.fasta test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta test-data/MID4_GLZRM4E04_rnd30_frclip.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff + $ tar -czf sample_seqs.tar.gz tools/sample_seqs/README.rst tools/sample_seqs/sample_seqs.py tools/sample_seqs/sample_seqs.xml tools/sample_seqs/tool_dependencies.xml test-data/ecoli.fastq test-data/ecoli.sample_N100.fastq test-data/ecoli.pair_sample_N100.fastq test-data/ecoli.sample_C10.fastq test-data/get_orf_input.Suis_ORF.prot.fasta test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta test-data/MID4_GLZRM4E04_rnd30_frclip.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff Check this worked:: @@ -83,13 +87,17 @@ test-data/ecoli.fastq test-data/ecoli.sample_N100.fastq test-data/ecoli.pair_sample_N100.fastq + test-data/ecoli.sample_C10.fastq test-data/get_orf_input.Suis_ORF.prot.fasta test-data/get_orf_input.Suis_ORF.prot.sample_N100.fasta test-data/get_orf_input.Suis_ORF.prot.pair_sample_N100.fasta + test-data/get_orf_input.Suis_ORF.prot.sample_C10.fasta + 
test-data/get_orf_input.Suis_ORF.prot.pair_sample_C10.fasta test-data/MID4_GLZRM4E04_rnd30_frclip.sff test-data/MID4_GLZRM4E04_rnd30_frclip.sample_N5.sff test-data/MID4_GLZRM4E04_rnd30_pair_sample.sff test-data/MID4_GLZRM4E04_rnd30_frclip.pair_sample_N5.sff + test-data/MID4_GLZRM4E04_rnd30_frclip.sample_C1.sff Licence (MIT) diff -r dc55e58fa890 -r 09a4ee5d12fd tools/sample_seqs/sample_seqs.py --- a/tools/sample_seqs/sample_seqs.py Fri Nov 21 08:30:03 2014 -0500 +++ b/tools/sample_seqs/sample_seqs.py Fri Mar 06 04:54:03 2015 -0500 @@ -2,14 +2,14 @@ """Sub-sample sequence from a FASTA, FASTQ or SFF file. This tool is a short Python script which requires Biopython 1.62 or later -for SFF file support. If you use this tool in scientific work leading to a +for sequence parsing. If you use this tool in scientific work leading to a publication, please cite the Biopython application note: Cock et al 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25(11) 1422-3. http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878. -This script is copyright 2014 by Peter Cock, The James Hutton Institute +This script is copyright 2014-2015 by Peter Cock, The James Hutton Institute (formerly the Scottish Crop Research Institute, SCRI), UK. All rights reserved. See accompanying text file for licence details (MIT license). @@ -20,7 +20,7 @@ from optparse import OptionParser -def stop_err(msg, err=1): +def sys_exit(msg, err=1): sys.stderr.write(msg.rstrip() + "\n") sys.exit(err) @@ -32,6 +32,9 @@ e.g. Sample 20% of the reads: $ python sample_seqs.py -i my_seq.fastq -f fastq -p 20.0 -o sample.fastq + +This samples uniformly through the file, rather than at random, and therefore +should be reproducible. 
""" parser = OptionParser(usage=usage) parser.add_option('-i', '--input', dest='input', @@ -49,6 +52,9 @@ parser.add_option('-n', '--everyn', dest='everyn', default=None, help='Take every N-th read') +parser.add_option('-c', '--count', dest='count', + default=None, + help='Take exactly N reads') parser.add_option("--interleaved", dest="interleaved", default=False, action="store_true", help="Input is interleaved reads, preserve the pairings") @@ -58,31 +64,74 @@ options, args = parser.parse_args() if options.version: - print("v0.1.2") + print("v0.2.0") sys.exit(0) -seq_format = options.format in_file = options.input out_file = options.output interleaved = options.interleaved if not in_file: - stop_err("Require an input filename") + sys_exit("Require an input filename") if in_file != "/dev/stdin" and not os.path.isfile(in_file): - stop_err("Missing input file %r" % in_file) + sys_exit("Missing input file %r" % in_file) if not out_file: - stop_err("Require an output filename") + sys_exit("Require an output filename") +if not options.format: + sys_exit("Require the sequence format") +seq_format = options.format.lower() + + +def count_fasta(filename): + from Bio.SeqIO.FastaIO import SimpleFastaParser + count = 0 + with open(filename) as handle: + for title, seq in SimpleFastaParser(handle): + count += 1 + return count + + +def count_fastq(filename): + from Bio.SeqIO.QualityIO import FastqGeneralIterator + count = 0 + with open(filename) as handle: + for title, seq, qual in FastqGeneralIterator(handle): + count += 1 + return count + + +def count_sff(filename): + from Bio import SeqIO + # If the SFF file has a built in index (which is normal), + # this will be parsed and is the quicker than scanning + # the whole file. 
+ return len(SeqIO.index(filename, "sff")) + + +def count_sequences(filename, seq_format): + if seq_format == "sff": + return count_sff(filename) + elif seq_format == "fasta": + return count_fasta(filename) + elif seq_format.startswith("fastq"): + return count_fastq(filename) + else: + sys_exit("Unsupported file type %r" % seq_format) if options.percent and options.everyn: - stop_err("Cannot combine -p and -n options") + sys_exit("Cannot combine -p and -n options") +elif options.everyn and options.count: + sys_exit("Cannot combine -n and -c options") +elif options.percent and options.count: + sys_exit("Cannot combine -p and -c options") elif options.everyn: try: N = int(options.everyn) except: - stop_err("Bad N argument %r" % options.everyn) + sys_exit("Bad -n argument %r" % options.everyn) if N < 2: - stop_err("Bad N argument %r" % options.everyn) + sys_exit("Bad -n argument %r" % options.everyn) if (N % 10) == 1: sys.stderr.write("Sampling every %ist sequence\n" % N) elif (N % 10) == 2: @@ -102,9 +151,9 @@ try: percent = float(options.percent) / 100.0 except: - stop_err("Bad percent argument %r" % options.percent) + sys_exit("Bad -p percent argument %r" % options.percent) if percent <= 0.0 or 1.0 <= percent: - stop_err("Bad percent argument %r" % options.percent) + sys_exit("Bad -p percent argument %r" % options.percent) sys.stderr.write("Sampling %0.3f%% of sequences\n" % (100.0 * percent)) def sampler(iterator): global percent @@ -115,8 +164,76 @@ if percent * count > taken: taken += 1 yield record +elif options.count: + try: + N = int(options.count) + except: + sys_exit("Bad -c count argument %r" % options.count) + if N < 1: + sys_exit("Bad -c count argument %r" % options.count) + total = count_sequences(in_file, seq_format) + print("Input file has %i sequences" % total) + if interleaved: + # Paired + if total % 2: + sys_exit("Paired mode, but input file has an odd number of sequences: %i" + % total) + elif N > total // 2: + sys_exit("Requested %i sequence pairs, 
but file only has %i pairs (%i sequences)." + % (N, total // 2, total)) + total = total // 2 + if N == 1: + sys.stderr.write("Sampling just first sequence pair!\n") + elif N == total: + sys.stderr.write("Taking all the sequence pairs\n") + else: + sys.stderr.write("Sampling %i sequence pairs\n" % N) + else: + # Not paired + if total < N: + sys_exit("Requested %i sequences, but file only has %i." % (N, total)) + if N == 1: + sys.stderr.write("Sampling just first sequence!\n") + elif N == total: + sys.stderr.write("Taking all the sequences\n") + else: + sys.stderr.write("Sampling %i sequences\n" % N) + if N == total: + def sampler(iterator): + """Dummy filter to filter nothing, taking everything.""" + global N + taken = 0 + for record in iterator: + taken += 1 + yield record + assert taken == N, "Picked %i, wanted %i" % (taken, N) + else: + def sampler(iterator): + # Mimic the percentage sampler, with double check on final count + global N, total + # Do we need a floating point fudge factor epsilon? + # i.e. What if percentage comes out slightly too low, and + # we could end up missing last few desired sequences? + percentage = float(N) / float(total) + #print("DEBUG: Want %i out of %i sequences/pairs, as a percentage %0.2f" + # % (N, total, percentage * 100.0)) + count = 0 + taken = 0 + for record in iterator: + count += 1 + # Do we need the extra upper bound? + if percentage * count > taken and taken < N: + taken += 1 + yield record + elif total - count + 1 <= N - taken: + # remaining records (including this one) <= what we still need. 
+ # This is a safety check for floating point edge cases where + # we need to take all remaining sequences to meet target + taken += 1 + yield record + assert taken == N, "Picked %i, wanted %i" % (taken, N) else: - stop_err("Must use either -n or -p") + sys_exit("Must use either -n, -p or -c") def pair(iterator): @@ -180,48 +297,30 @@ pos_handle.write(record) return count -try: - from galaxy_utils.sequence.fastq import fastqReader, fastqWriter - def fastq_filter(in_file, out_file, iterator_filter, inter): - count = 0 - #from galaxy_utils.sequence.fastq import fastqReader, fastqWriter - reader = fastqReader(open(in_file, "rU")) - writer = fastqWriter(open(out_file, "w")) - if inter: - for r1, r2 in iterator_filter(pair(reader)): - count += 1 - writer.write(r1) - writer.write(r2) - else: - for record in iterator_filter(reader): - count += 1 - writer.write(record) - writer.close() - reader.close() - return count -except ImportError: - from Bio.SeqIO.QualityIO import FastqGeneralIterator - def fastq_filter(in_file, out_file, iterator_filter, inter): - count = 0 - with open(in_file) as in_handle: - with open(out_file, "w") as pos_handle: - if inter: - for r1, r2 in iterator_filter(pair(FastqGeneralIterator(in_handle))): - count += 1 - pos_handle.write("@%s\n%s\n+\n%s\n" % r1) - pos_handle.write("@%s\n%s\n+\n%s\n" % r2) - else: - for title, seq, qual in iterator_filter(FastqGeneralIterator(in_handle)): - count += 1 - pos_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) - return count + +from Bio.SeqIO.QualityIO import FastqGeneralIterator +def fastq_filter(in_file, out_file, iterator_filter, inter): + count = 0 + with open(in_file) as in_handle: + with open(out_file, "w") as pos_handle: + if inter: + for r1, r2 in iterator_filter(pair(FastqGeneralIterator(in_handle))): + count += 1 + pos_handle.write("@%s\n%s\n+\n%s\n" % r1) + pos_handle.write("@%s\n%s\n+\n%s\n" % r2) + else: + for title, seq, qual in iterator_filter(FastqGeneralIterator(in_handle)): + count += 1 + 
pos_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual)) + return count + def sff_filter(in_file, out_file, iterator_filter, inter): count = 0 try: from Bio.SeqIO.SffIO import SffIterator, SffWriter except ImportError: - stop_err("SFF filtering requires Biopython 1.54 or later") + sys_exit("SFF filtering requires Biopython 1.54 or later") try: from Bio.SeqIO.SffIO import ReadRocheXmlManifest except ImportError: @@ -246,14 +345,14 @@ #count = writer.write_file(SffIterator(in_handle)) return count -if seq_format.lower()=="sff": +if seq_format == "sff": count = sff_filter(in_file, out_file, sampler, interleaved) -elif seq_format.lower()=="fasta": +elif seq_format == "fasta": count = fasta_filter(in_file, out_file, sampler, interleaved) -elif seq_format.lower().startswith("fastq"): +elif seq_format.startswith("fastq"): count = fastq_filter(in_file, out_file, sampler, interleaved) else: - stop_err("Unsupported file type %r" % seq_format) + sys_exit("Unsupported file type %r" % seq_format) if interleaved: sys.stderr.write("Selected %i pairs\n" % count) diff -r dc55e58fa890 -r 09a4ee5d12fd tools/sample_seqs/sample_seqs.xml --- a/tools/sample_seqs/sample_seqs.xml Fri Nov 21 08:30:03 2014 -0500 +++ b/tools/sample_seqs/sample_seqs.xml Fri Mar 06 04:54:03 2015 -0500 @@ -1,4 +1,4 @@ - + e.g. 
to reduce coverage biopython @@ -9,9 +9,10 @@ sample_seqs.py -f "$input_file.ext" -i "$input_file" -o "$output_file" #if str($sampling.type) == "everyNth": -n "${sampling.every_n}" +#elif str($sampling.type) == "percentage": +-p "${sampling.percent}" #else -##elif str($sampling.type) == "percentage": --p "${sampling.percent}" +-c "${sampling.count}" #end if #if $interleaved --interleaved @@ -26,8 +27,9 @@ - - + + + @@ -36,8 +38,11 @@ + + + - + @@ -82,12 +87,37 @@ + + + + + + + + + + + + + + + + + + + + + + + + + @@ -100,27 +130,55 @@ + + + + + + + + + + + + **What it does** Takes an input file of sequences (typically FASTA or FASTQ, but also Standard Flowgram Format (SFF) is supported), and returns a new sequence -file sub-sampling from this (in the same format). +file sub-sampling uniformly from this (in the same format, preserving the +input order and selecting sequences evenly through the input file). -Several sampling modes are supported, all designed to be non-random. This -allows reproducibility, and also works on paired sequence files. Also -note that by sampling uniformly through the file, this avoids any bias -should reads in any part of the file are of lesser quality (e.g. one part -of the slide). +Several sampling modes are supported, all designed to do non-random +uniform sampling (i.e. evenly through the input file). This allows +reproducibility, and also works on paired sequence files (run the tool +twice, once on each file using the same settings). -The simplest mode is to take every N-th sequence, for example taking +By sampling uniformly (evenly) through the file, this avoids any bias +should reads in any part of the file be of lesser quality (e.g. for +high throughput sequencing the reads at the start and end of the file +can be of lower quality). + +The simplest mode is to take every *N*-th sequence, for example taking every 2nd sequence would sample half the file - while taking every 5th sequence would take 20% of the file. 
+The target count method picks *N* sequences from the input file, which +again will be distributed uniformly (evenly) through the file. This works +by first counting the number of records, then calculating the desired +percentage of sequences to take. Note if your input file has exactly +*N* sequences this selects them all (effectively copying the input file). +If your input file has fewer than *N* sequences, this is treated as an +error. + If you tick the interleaved option, the file is processed as pairs of -records - taking for example using 20% would take every 5th pair of -records. This ensures your read pairs are preserved. Note this does not +records to ensure your read pairs are not separated by sampling. +For example using 20% would take every 5th pair of records, or you +could request 1000 read pairs. + +.. class:: warningmark Note interleaved/paired mode does *not* actually check your read names match a known pair naming scheme! **Example Usage** @@ -133,8 +191,12 @@ Similarly, if you had some Illumina paired end data interleaved into one file with an estimated x200 coverage, you would run this tool in -interleaved mode. Taking every 3rd read pair. This would reduce the -estimated coverage to about x66, while preserving the read pairing. +interleaved mode, taking every 3rd read pair. This would again reduce +the estimated coverage to about x66, while preserving the read pairing. + +Suppose you have a transcriptome assembly, and wish to look at the +species distribution of the top BLAST hits for an initial quality check. +Rather than using all your sequences, you could pick only 1000 for this. **Citation**